Day 29 : Data Visualization
Enterprise Objective
Data is of little use if stakeholders can't understand it. Seaborn turns Pandas aggregations into clear, presentation-ready charts. Today we cover histograms, box plots, and the hue parameter, which encodes an extra variable as color.
Strategic Overview
| # | Topic | Concept |
|---|---|---|
| 1 | Basics | Scatter & Line plots |
| 2 | Distributions | Histograms & Boxplots |
| 3 | Hue | Multi-dimensional color |
1. Seaborn Basics : Statistical Visualization
Seaborn is a data visualization library built on top of Matplotlib. It integrates deeply with Pandas DataFrames and provides beautiful default themes. While Pandas is for crunching numbers, Seaborn is for communicating those numbers to stakeholders.
import seaborn as sns
import matplotlib.pyplot as plt
# Create a basic scatter plot
sns.scatterplot(data=df, x='Age', y='Salary')
plt.show() # Displays the figure; required when running as a script
Why Data Analysts Care
• Storytelling: Turning a table of 10,000 rows into a clear trend line showing revenue growth
• Outlier Detection: Using Box plots to spot anomalous transactions visually
Warning: Forgetting plt.show()
Seaborn draws onto a Matplotlib figure, but in a plain Python script nothing appears until you call matplotlib.pyplot.show(). Jupyter notebooks with an inline backend usually render the figure automatically, and calling plt.show() there is harmless. Import both libraries as a habit.
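To make the point about rendering concrete, here is a minimal sketch. It assumes a small made-up DataFrame standing in for the lesson's df, and it forces the non-interactive "Agg" backend so it can run without a display. It shows that the figure exists in memory as soon as Seaborn draws it, and that showing or saving it is a separate step.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to use an interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data standing in for the lesson's `df`
df = pd.DataFrame({
    "Age": [23, 31, 38, 45, 52],
    "Salary": [38000, 52000, 61000, 75000, 82000],
})

# Seaborn draws onto a Matplotlib Axes and returns it
ax = sns.scatterplot(data=df, x="Age", y="Salary")
ax.set_title("Age vs Salary")

# The figure already exists in memory, even though nothing is displayed yet
print(plt.get_fignums())

# Writing the figure out does not require plt.show(); here it goes to an in-memory buffer
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
png_bytes = buffer.getvalue()

# With an interactive backend, plt.show() would go here instead of closing the figure
plt.close(ax.figure)
```

In a script run from a terminal, dropping the matplotlib.use("Agg") line and replacing the final call with plt.show() should open a plot window.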
Concept Checks: Seaborn Intro
Q1. Import seaborn as sns and matplotlib.pyplot as plt.
Q2. Load Seaborn's built-in dataset: tips = sns.load_dataset("tips"). The first call downloads the data, so it needs internet access. Print tips.head().
Q3. Create a scatterplot mapping x="total_bill" and y="tip". Use data=tips.
Q4. Add a title using plt.title("Bill vs Tip").
Q5. Call plt.show() to render the plot.
2. Distributions & Categoricals : Understanding the Shape
Before modeling, you must understand the shape of your data. Histograms (histplot) show the distribution of a single continuous variable. Bar plots (barplot) and Box plots (boxplot) show the relationship between a categorical variable and a continuous variable.
| Plot Type | Purpose | Seaborn Function |
|---|---|---|
| Histogram | Distribution of 1 numeric col | sns.histplot(data=df, x='Age') |
| Bar Plot | Aggregation (Mean) per category | sns.barplot(data=df, x='Dept', y='Salary') |
| Box Plot | Spread and Outliers per category | sns.boxplot(data=df, x='Dept', y='Salary') |
Why Data Analysts Care
• Audience Demographics: Plotting a histogram of user ages to see if your app is popular with Gen Z or Boomers
• A/B Testing: Using a boxplot to compare the spread of checkout times between Group A and Group B
Pro Tip
By default, sns.barplot calculates the Mean of the y-variable for each category and adds error bars! If you just want to count occurrences, use sns.countplot(data=df, x='Category') instead.
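The difference between the two functions is easiest to see on a tiny table. The following sketch assumes a made-up five-row DataFrame with hypothetical columns Category and Value, and checks the bar heights against the equivalent Pandas aggregations.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: three rows in category A, two rows in category B
df = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B"],
    "Value": [10, 20, 30, 40, 60],
})

fig, (ax_mean, ax_count) = plt.subplots(1, 2, figsize=(10, 4))

# barplot aggregates: each bar is the mean of Value within its category
sns.barplot(data=df, x="Category", y="Value", ax=ax_mean)

# countplot only counts rows per category; no y variable is involved
sns.countplot(data=df, x="Category", ax=ax_count)

# The same numbers computed directly with Pandas
means = df.groupby("Category")["Value"].mean()
counts = df["Category"].value_counts()
print(means)
print(counts)

plt.close(fig)  # replace with plt.show() when running interactively
```

Here the bar plot heights correspond to the group means (20 and 50), while the count plot heights correspond to the row counts (3 and 2).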
Concept Checks: Plot Types
Q1. Using the tips dataset, create a histogram of "total_bill" using sns.histplot().
Q2. Add the argument kde=True to the histogram to overlay a smooth density curve.
Q3. Create a Bar plot showing the average tip per day: x="day", y="tip". Use sns.barplot().
Q4. Create a Count plot to see how many transactions happened on each day: x="day". Use sns.countplot().
Q5. Create a Box plot of total_bill per day to spot outliers. Use sns.boxplot().
3. The Hue Parameter : Adding Dimensions
One of Seaborn's most useful features is the hue parameter. It groups your data by a column (usually categorical) and colors the plot elements by group, so a 2D plot can encode a third variable with a single extra argument. In scatterplots, the style and size parameters can encode further variables in the same way.
# Instantly color-code the scatterplot by Gender
sns.scatterplot(data=df, x='Age', y='Salary', hue='Gender')
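A self-contained version of the call above might look like the following sketch. The DataFrame is invented for illustration, and the "Agg" backend is only there so the code runs without a display. Reading the legend back shows which groups hue created.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data standing in for the lesson's `df`
df = pd.DataFrame({
    "Age": [24, 29, 35, 41, 47, 53],
    "Salary": [41000, 48000, 56000, 67000, 72000, 81000],
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
})

# hue assigns one color per unique value in the Gender column
ax = sns.scatterplot(data=df, x="Age", y="Salary", hue="Gender")

# The legend lists the groups that hue produced
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)

plt.close(ax.figure)  # replace with plt.show() when running interactively
```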
Why Data Analysts Care
• Deep Insights: A scatterplot might show a positive correlation. Adding hue='Region' might reveal that the correlation only exists in Europe!
• Segmentation: Overlaying two histograms (e.g., Male vs Female heights) using hue to see where the distributions overlap
Pro Tip
Combine hue with col in sns.relplot() or sns.catplot() to automatically generate a grid of multiple subplots based on categorical variables (Facet Grids)!
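As a rough illustration of combining hue with col, the sketch below uses sns.relplot() on a made-up DataFrame. The Region and Gender columns and all values are hypothetical, chosen so that Salary rises with Age in one region only. The "Agg" backend is only there so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: Salary rises with Age in Europe but stays flat in Asia
df = pd.DataFrame({
    "Age": [25, 32, 41, 50, 28, 36, 44, 55],
    "Salary": [40000, 52000, 66000, 80000, 45000, 47000, 44000, 46000],
    "Gender": ["Female", "Male", "Female", "Male", "Male", "Female", "Male", "Female"],
    "Region": ["Europe"] * 4 + ["Asia"] * 4,
})

# col creates one subplot per Region; hue colors the points by Gender in each
g = sns.relplot(data=df, x="Age", y="Salary", hue="Gender", col="Region", height=3)

# relplot returns a FacetGrid whose axes are arranged in a 2D array
panel_titles = [ax.get_title() for ax in g.axes.flat]
print(g.axes.shape)
print(panel_titles)

plt.close("all")  # replace with plt.show() when running interactively
```

Unlike sns.scatterplot(), which draws on a single Axes, sns.relplot() manages the whole figure, which is why it can lay out one panel per category.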
Concept Checks: Hue & Colors
Q1. Create a scatterplot of total_bill vs tip. Add hue="smoker" to color the dots.
Q2. Create a Box plot of total_bill grouped by day. Add hue="sex" to split each box into two.
Q3. Create a Histogram of total_bill. Add hue="time" and multiple="stack" to create a stacked histogram.
Q4. Try adding style="time" along with hue="smoker" in a scatterplot. Observe how the dot shapes change.
Q5. Change the color palette by passing palette="husl" to any of the plots above.
Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (The Matrix): Load tips. Use tips.corr(numeric_only=True) to generate a correlation matrix of the numeric columns. Pass this matrix into sns.heatmap(matrix, annot=True, cmap='coolwarm'). Correlation heatmaps are among the most widely used plots in exploratory analysis for Machine Learning.
Task 2 (Pairplot): Run sns.pairplot(tips, hue='sex'). Warning: this takes a few seconds. It draws a scatterplot for every pair of numeric variables, with each variable's distribution on the diagonal.
Task 3 (Time Series Plot): Create a synthetic time-series DataFrame with a Date column and a Sales column. Use sns.lineplot(data=df, x='Date', y='Sales'). Lineplots are best for chronological data.
Task 4 (Customizing Axes): Create a barplot. Before plt.show(), use plt.xticks(rotation=45) to rotate the X-axis labels so they don't overlap. Use plt.ylim(0, 100) to set the Y-axis limits.
Task 5 (Facet Grids): Use sns.catplot(data=tips, x='time', y='total_bill', col='day', kind='box'). Observe how it creates 4 separate plots (one for each day) side-by-side automatically.
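The pattern from Task 1 can be tried without downloading anything. The sketch below assumes a randomly generated DataFrame that only mimics the column names of tips (it is not the real dataset), builds the correlation matrix with Pandas, and passes it to sns.heatmap(). The vmin and vmax arguments are an optional addition that pins the color scale to the full correlation range.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the tips dataset: tip is roughly 15% of the bill plus noise
rng = np.random.default_rng(seed=1)
total_bill = rng.uniform(10, 50, size=100)
df = pd.DataFrame({
    "total_bill": total_bill,
    "tip": total_bill * 0.15 + rng.normal(0, 1, size=100),
    "size": rng.integers(1, 6, size=100),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], size=100),
})

# numeric_only=True skips the text column "day"
matrix = df.corr(numeric_only=True)
print(matrix.round(2))

# annot=True writes each coefficient into its cell
ax = sns.heatmap(matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")

plt.close(ax.figure)  # replace with plt.show() when running interactively
```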
Pure Coding Interview Questions
Q1. What is the relationship between Seaborn and Matplotlib?
Q2. Explain the difference between a Bar Plot and a Histogram.
Q3. What do the 'whiskers' and 'dots' represent in a Box Plot? (Hint: IQR and outliers).
Q4. Why is sns.countplot() often preferred over .groupby().size().plot.bar()?
Q5. Explain the hue parameter in Seaborn. Why is it so powerful for EDA?
Q6. What is a KDE (Kernel Density Estimate) curve?
Q7. How do you resize a Seaborn plot? (Hint: plt.figure(figsize=(10, 6))).
Q8. Write code to generate a Correlation Heatmap using Pandas and Seaborn.
Q9. What is the difference between sns.scatterplot() and sns.relplot()?
Q10. Explain what sns.pairplot() does. Why should you be careful using it on datasets with 100+ columns?
Q11. How do you save a Seaborn plot to a .png file instead of displaying it? (plt.savefig()).
Q12. What does errorbar=None (ci=None in Seaborn versions before 0.12) do in a Seaborn bar plot?
Q13. How do you overlay two different plots (e.g., a lineplot on top of a barplot) in the same figure?
Q14. Explain how to use plt.subplots() to create a 2x2 grid of distinct charts.
Q15. What is a Violin plot (sns.violinplot)? How does it differ from a Box plot?
Q16. How do you change the default color palette globally in Seaborn? (sns.set_palette()).
Q17. Write code to plot a regression line through a scatterplot automatically. (sns.regplot() or lmplot).
Q18. Why is it dangerous to rely solely on aggregate plots (like bar charts) without looking at distributions (like swarm plots)? (Hint: Anscombe's quartet).
Q19. How do you rotate X-axis labels in Matplotlib if they are overlapping?
Q20. Explain how to annotate a specific point on a plot with text. (plt.annotate()).
Q21. What is a Swarm plot (sns.swarmplot)? When does it fail? (Hint: huge datasets).
Q22. How do you set a logarithmic scale for the Y-axis? (plt.yscale('log')).
Q23. What is the purpose of sns.set_theme()? What styles does it offer?
Q24. Write code to create a dual-axis chart (two different Y-axes) using Matplotlib twinx().
Q25. Explain the tradeoff between beautiful visualizations and data processing speed in a production pipeline.
Day 29 Executive Summary
| # | Topic | Key Takeaway |
|---|---|---|
| 1 | EDA | Always visualize distributions before building ML models. |
| 2 | Boxplot | Flags outliers as points more than 1.5 × IQR beyond the quartiles. |
| 3 | Hue | Groups and colors data by a category automatically. |
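By default, a box plot draws its whiskers out to the most extreme points within 1.5 × IQR of the quartiles and marks anything beyond that as an outlier dot. The arithmetic behind that rule can be sketched with Pandas alone. The values below are made up for illustration.

```python
import pandas as pd

# Hypothetical transaction amounts with one obvious anomaly
values = pd.Series([12, 14, 15, 15, 16, 18, 19, 20, 22, 95])

q1 = values.quantile(0.25)   # first quartile (bottom edge of the box)
q3 = values.quantile(0.75)   # third quartile (top edge of the box)
iqr = q3 - q1                # interquartile range (height of the box)

# Points beyond these fences are drawn as individual outlier dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(outliers.tolist())
```

For this sample the quartiles are 15 and 19.75, so the fences fall at 7.875 and 26.875, and only the value 95 lies outside them.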
Instructor's End-of-Day Checklist
• [ ] I can create a scatterplot and lineplot.
• [ ] I can view data distributions using a histogram.
• [ ] I can use the hue parameter to split data.