Day 30: Phase Analysis & Review

🎯 Enterprise Objective

Congratulations. You have reached Day 30. Today is the ultimate test of your Data Analytics skills. You will ingest messy data, clean it, merge it, aggregate it, and visualize the story it tells. This is exactly what professional Data Analysts do every day.

📋 Strategic Overview

#	Topic	Concept
1	Ingestion	Loading & Merging
2	Cleaning	NaNs, Types, & Outliers
3	Insight	Aggregation & Vis

1. Phase 2 Capstone : Exploratory Data Analysis

🔍 What is it?

You have mastered Pandas and Seaborn. You can clean missing data, merge tables, group statistics, and visualize trends. Today is the Phase 2 Capstone. You will perform a complete Exploratory Data Analysis (EDA) pipeline on a raw, messy dataset.

🎯 The Mission: Customer Churn Analysis

You will simulate loading a messy Customer CSV and a Transaction CSV. You will merge them, clean the data, calculate Customer Lifetime Value (CLV), and visualize which demographics are churning.

💼 Why Data Analysts Care

• Portfolio Piece: A complete end-to-end Jupyter Notebook analyzing data is the standard requirement for Data Analyst job interviews

• Real-world simulation: Data is never in one table, and it is never clean

⚠️ Jumping to Code

Don't start writing pd.read_csv() blindly. Write a markdown cell outlining your 5-step plan: 1. Load, 2. Clean, 3. Merge, 4. Feature Engineering, 5. Visualization.

In [ ]:

🧪 Concept Checks: `Capstone`

Q1. Plan the Cleaning Phase. What 3 Pandas methods will you use to handle missing data and wrong types?

In [ ]:

Q2. Plan the Merge Phase. Why would you use a LEFT JOIN (Customers -> Txns) instead of an INNER JOIN?

In [ ]:

Q3. Plan the Feature Engineering Phase. How will you calculate Total_Spent per user? (Hint: groupby().transform("sum")).

In [ ]:

Q4. Plan the Vis Phase. What Seaborn plot is best to show the distribution of Age? What about Total_Spent by Gender?

In [ ]:

Q5. Set up your notebook: Import pandas as pd, numpy as np, seaborn as sns, and matplotlib.pyplot as plt.

In [ ]:

2. Data Generation : Creating the Simulation

🔍 What is it?

Since we don't have external CSV files, we will generate synthetic data using NumPy and Pandas. Generating fake data is an excellent skill for testing pipelines.

We need two tables:

1. customers: ID, Age (with some NaNs), Region, Churned (Boolean).

2. transactions: Txn_ID, Cust_ID, Amount (with some '$' strings), Date.

💼 Why Data Analysts Care

• Testing Logic: If your code works on this synthetic data, it will work on the real data database extract.

In [ ]:

🧪 Concept Checks: `Simulation`

Q1. Execute the customer generation code above. Run cust_df.info() to see the missing values.

In [ ]:

Q2. Generate a txn_df with 500 rows. Txn_ID (1 to 500), Cust_ID (random 1 to 100).

In [ ]:

Q3. Add an Amount column to txn_df. Make 90% of them floats, and 10% of them messy strings like "$50".

In [ ]:

Q4. Add a Date column using pd.date_range() and np.random.choice().

In [ ]:

Q5. Verify both DataFrames exist and inspect their heads.

In [ ]:

3. Execution & Reporting : The Final Deliverable

🔍 What is it?

Execute the pipeline. Clean the tables, merge them, and answer the core business question: What factors drive Customer Churn?

You must produce at least 3 visualizations:

1. A Correlation Heatmap of numeric features.

2. A Barplot comparing Total Revenue by Region.

3. A Boxplot comparing Age distribution between Churned and Retained users.

In [ ]:

🧪 Concept Checks: `Execution`

Q1. Clean Amount in txn_df (remove $ and cast to float). Clean Age in cust_df (fill NaN with median).

In [ ]:

Q2. Merge cust_df and txn_df on ID using a LEFT JOIN.

In [ ]:

Q3. Create the Correlation Heatmap. Is Age correlated with Amount?

In [ ]:

Q4. Create the Barplot: x="Region", y="Amount", estimator=sum. Which region brings the most revenue?

In [ ]:

Q5. Create the Boxplot: x="Churned", y="Age". Do older users churn more?

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Feature Engineering): Create a new column Transaction_Count for each user. (Hint: group txn_df by Cust_ID, calculate .size(), then merge this metric into cust_df).

In [ ]:

Task 2 (Time Series Vis): Convert Date to a datetime object. Extract the Month. Group by Month and sum the Amount. Plot a Seaborn Lineplot of Revenue over Time.

In [ ]:

Task 3 (Cohort Analysis): Use pd.cut() to bin Age into 3 groups: 'Youth', 'Adult', 'Senior'. Create a Countplot (sns.countplot) with x='Age_Group' and hue='Churned' to see which demographic churns most.

In [ ]:

Task 4 (Outlier Removal): Before plotting revenue, find the 99th percentile of Amount using .quantile(0.99). Filter out any rows where Amount is strictly greater than this value to remove massive outliers.

In [ ]:

Task 5 (Executive Summary): Create a markdown cell at the very bottom of your notebook. Write 3 bullet points summarizing your findings (e.g., 'Seniors in the South region have the highest churn rate'). This is what executives actually read.

In [ ]:

💻 Pure Coding Interview Questions

Q1.

Walk me through your EDA process when you receive a brand new, undocumented dataset.

In [ ]:

Q2.

How do you decide whether to drop a column with 20% missing data, vs imputing the values?

In [ ]:

Q3.

Explain the difference between a Left Join and an Inner Join, and why you used one over the other in this project.

In [ ]:

Q4.

What does a Correlation coefficient of -0.85 tell you? How did you visualize this?

In [ ]:

Q5.

When grouping by Customer, what is the difference between summing their transactions vs counting their transactions?

In [ ]:

Q6.

Why is it important to check for duplicates before aggregating revenue?

In [ ]:

Q7.

What plot would you use to show the distribution of a single continuous variable (like Salary)?

In [ ]:

Q8.

What plot is best to compare the median and outliers of Salary across 5 different departments?

In [ ]:

Q9.

Explain what 'Data Leakage' is in the context of predictive modeling.

In [ ]:

Q10.

How do you handle a highly imbalanced dataset? (e.g., 99% retained, 1% churned).

In [ ]:

Q11.

What does `pd.to_datetime(errors='coerce')` do, and why is it safer than default parsing?

In [ ]:

Q12.

Write code to find the top 5 most valuable customers based on Total Spend.

In [ ]:

Q13.

How would you optimize this pipeline if the Transactions file was 50GB and couldn't fit in RAM? (Hint: chunking or Dask/PySpark).

In [ ]:

Q14.

Explain the 'Split-Apply-Combine' strategy you used to calculate Regional revenue.

In [ ]:

Q15.

What is the difference between `.agg(['mean', 'sum'])` and `.transform('mean')`?

In [ ]:

Q16.

How do you rename columns dynamically after a GroupBy operation flattens the index?

In [ ]:

Q17.

Why is Seaborn built on top of Matplotlib rather than replacing it?

In [ ]:

Q18.

Explain the `hue` parameter in Seaborn. How did it help your Churn analysis?

In [ ]:

Q19.

If you noticed a massive spike in revenue on one specific day in your Lineplot, how would you investigate it using Pandas?

In [ ]:

Q20.

What are some common pitfalls when using Pie Charts, and why do Data Scientists prefer Bar charts?

In [ ]:

Q21.

How would you save your final cleaned DataFrame to a CSV file without the index?

In [ ]:

Q22.

Explain how you would connect this Pandas script directly to a SQL database instead of reading CSVs.

In [ ]:

Q23.

What is feature engineering? Give an example of a feature you engineered in this project.

In [ ]:

Q24.

How do you present technical findings to a non-technical stakeholder?

In [ ]:

Q25.

Congratulations on completing the 30 Days of Python and Data Analytics. What was the hardest concept for you to grasp, and how did you overcome it?

In [ ]:

📊 Day 30 Executive Summary

#	Topic	Key Takeaway
1	Process	EDA is iterative. Clean, plot, discover an error, clean again.
2	Code	Chain operations logically. Don't mutate state wildly.
3	Value	The code doesn't matter. The business insight you derive is what matters.

✅ Instructor's End-of-Day Checklist

• [ ] I can clean and merge raw datasets.

• [ ] I can perform Exploratory Data Analysis.

• [ ] I can visualize and communicate business insights.

📊 Day 30 : Phase Analysis & Review

🎯 Enterprise Objective

📋 Strategic Overview

1. Phase 2 Capstone : Exploratory Data Analysis

🎯 The Mission: Customer Churn Analysis

💼 Why Data Analysts Care

⚠️ Jumping to Code

🧪 Concept Checks: Capstone

2. Data Generation : Creating the Simulation

💼 Why Data Analysts Care

🧪 Concept Checks: Simulation

3. Execution & Reporting : The Final Deliverable

🧪 Concept Checks: Execution

🛠️ Professional Practice Tasks

💻 Pure Coding Interview Questions

Walk me through your EDA process when you receive a brand new, undocumented dataset.

How do you decide whether to drop a column with 20% missing data, vs imputing the values?

Explain the difference between a Left Join and an Inner Join, and why you used one over the other in this project.

What does a Correlation coefficient of -0.85 tell you? How did you visualize this?

When grouping by Customer, what is the difference between summing their transactions vs counting their transactions?

Why is it important to check for duplicates before aggregating revenue?

What plot would you use to show the distribution of a single continuous variable (like Salary)?

What plot is best to compare the median and outliers of Salary across 5 different departments?

Explain what 'Data Leakage' is in the context of predictive modeling.

How do you handle a highly imbalanced dataset? (e.g., 99% retained, 1% churned).

What does pd.to_datetime(errors='coerce') do, and why is it safer than default parsing?

Write code to find the top 5 most valuable customers based on Total Spend.

How would you optimize this pipeline if the Transactions file was 50GB and couldn't fit in RAM? (Hint: chunking or Dask/PySpark).

Explain the 'Split-Apply-Combine' strategy you used to calculate Regional revenue.

What is the difference between .agg(['mean', 'sum']) and .transform('mean')?

How do you rename columns dynamically after a GroupBy operation flattens the index?

Why is Seaborn built on top of Matplotlib rather than replacing it?

Explain the hue parameter in Seaborn. How did it help your Churn analysis?

If you noticed a massive spike in revenue on one specific day in your Lineplot, how would you investigate it using Pandas?

What are some common pitfalls when using Pie Charts, and why do Data Scientists prefer Bar charts?

How would you save your final cleaned DataFrame to a CSV file without the index?

Explain how you would connect this Pandas script directly to a SQL database instead of reading CSVs.

What is feature engineering? Give an example of a feature you engineered in this project.

How do you present technical findings to a non-technical stakeholder?

Congratulations on completing the 30 Days of Python and Data Analytics. What was the hardest concept for you to grasp, and how did you overcome it?

📊 Day 30 Executive Summary

✅ Instructor's End-of-Day Checklist

🧪 Concept Checks: `Capstone`

🧪 Concept Checks: `Simulation`

🧪 Concept Checks: `Execution`

What does `pd.to_datetime(errors='coerce')` do, and why is it safer than default parsing?

What is the difference between `.agg(['mean', 'sum'])` and `.transform('mean')`?

Explain the `hue` parameter in Seaborn. How did it help your Churn analysis?