Day 25: Pandas Cleaning

🎯 Enterprise Objective

Data Analysts spend 80% of their time cleaning data. Missing values crash algorithms. Duplicates ruin financial aggregations. Strings masquerading as numbers prevent math. Today we master the unglamorous but critical art of Data Cleaning.

📋 Strategic Overview

#	Topic	Concept
1	Missing Data	`isna()`, `fillna()`
2	Duplicates	`drop_duplicates()`
3	Types	`astype()`, `.str` accessor

1. Handling Missing Data : The Reality of Data

🔍 What is it?

Real-world data is messy. Missing values are represented as NaN (Not a Number) or None. Pandas provides methods to detect (isna()), remove (dropna()), or replace (fillna()) these missing values.

Method	Action	Common Use
`df.isna()`	Returns Boolean DF	Identifying where data is missing
`df.dropna()`	Drops rows/cols	Removing rows where critical data is null
`df.fillna(val)`	Replaces NaNs	Replacing missing ages with the median age

💼 Why Data Analysts Care

• Data Quality: Machine Learning models will crash if you feed them NaN values. You must handle them first!

• Imputation: Filling missing salaries with the average salary of the employee's department

⚠️ Blindly Dropping

Never run df.dropna() without thinking. If 30% of your data has a missing 'Phone Number', dropping those rows destroys 30% of your valid 'Email' and 'Name' data. Drop the column, or fill it with 'Unknown'.

In [ ]:

🧪 Concept Checks: `Missing Data`

Q1. Run df.isna().sum() on your DataFrame to see which columns have missing values.

In [ ]:

Q2. Use df.dropna(subset=["Age"]) to drop ONLY rows where "Age" is missing. Print it.

In [ ]:

Q3. Fill all NaN values in the entire DataFrame with "Unknown" using df.fillna("Unknown").

In [ ]:

Q4. Fill the "Salary" column with the median salary of that column using df["Salary"].fillna(...).

In [ ]:

Q5. Use df.ffill() (forward fill) on a Series with a missing value. What does it do?

In [ ]:

2. Handling Duplicates : Deduplication

🔍 What is it?

Duplicate rows can skew aggregations and ruin analysis. df.duplicated() returns a boolean mask of duplicate rows, and df.drop_duplicates() removes them. You can specify whether to keep the 'first' occurrence, the 'last', or False (drop all duplicates completely).

# Keep only the first occurrence of a User_ID
df.drop_duplicates(subset=['User_ID'], keep='first')

💼 Why Data Analysts Care

• ETL Pipelines: When pulling data daily, you might accidentally pull the same transactions twice. Deduplication fixes this

• Latest Records: Sorting by 'Update_Date' and keeping the 'last' duplicate ensures you only have the most recent data for each user

🧠 Pro Tip

Always sort your DataFrame BEFORE dropping duplicates if you want to keep the highest/lowest/newest value. E.g., Sort by Date ascending, then drop_duplicates(keep='last').

In [ ]:

🧪 Concept Checks: `Duplicates`

Q1. Create a DataFrame with 3 identical rows. Run df.duplicated() to see the boolean mask.

In [ ]:

Q2. Run df.drop_duplicates() on the DataFrame from Q1 to clean it.

In [ ]:

Q3. Given a DF with ID and Score, sort by Score descending, then drop duplicates on ID keeping the first.

In [ ]:

Q4. What happens if you use keep=False in drop_duplicates()? Try it.

In [ ]:

Q5. Count the number of duplicate IDs by chaining df.duplicated(subset=["ID"]).sum().

In [ ]:

3. Data Type Conversion : Fixing Bad Formatting

🔍 What is it?

When reading from CSVs, numbers might be loaded as strings (e.g., '1,000'). You cannot perform math on strings! The .astype() method converts columns to proper types. For strings requiring cleaning, the .str accessor is your best friend.

Method	Action	Example
`astype(type)`	Cast to type	`df['Age'].astype(int)`
`pd.to_numeric()`	Safe casting	`pd.to_numeric(df['Price'], errors='coerce')`
`.str.replace()`	String cleanup	`df['Price'].str.replace('$', '')`

💼 Why Data Analysts Care

• Currency Parsing: Converting '$1,200.50' into the float 1200.50 so you can sum it

• Memory Optimization: Downcasting an int64 column to int8 to save 8x the RAM on a billion rows

🧠 Pro Tip

If .astype(float) fails because of a weird string like 'N/A', use pd.to_numeric(col, errors='coerce'). It will force the bad strings into NaN so you can proceed!

In [ ]:

🧪 Concept Checks: `Data Types`

Q1. Given df["Age"] = ["25", "30"] (strings), cast it to integer using .astype(int).

In [ ]:

Q2. Given df["Cost"] = ["1,000", "2,500"], remove the comma using .str.replace(",", "").

In [ ]:

Q3. Cast the cleaned Cost column to float.

In [ ]:

Q4. Use pd.to_numeric(..., errors="coerce") on ["10", "bad", "20"]. Print the result.

In [ ]:

Q5. Check the memory usage of a column using df["Col"].memory_usage(). Cast it to float32 and check again.

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (The Complete Pipeline): Create a dirty DataFrame: {'ID': [1,2,2,3], 'Price': ['$10', '$20', '$20', 'NaN'], 'Qty': [1,2,np.nan,3]}. Write a 4-step pipeline: 1) Drop exact duplicates. 2) Clean Price '$' and cast to float. 3) Fill missing Qty with 1. 4) Calculate Total = Price * Qty.

In [ ]:

Task 2 (Threshold Dropping): Create a DF with 5 rows and 3 columns of NaNs/data. Use df.dropna(thresh=2) to drop rows that do not have AT LEAST 2 valid non-NaN values. Print it.

In [ ]:

Task 3 (Categorical Conversion): Given a column df['Size'] = ['S', 'M', 'L', 'S', 'M'] with 100,000 rows. Convert its type to 'category' using .astype('category'). Check df.info() to see memory savings.

In [ ]:

Task 4 (String Extraction): Given df['Code'] = ['Item-123', 'Item-456']. Use .str.split('-').str[1] to extract just the numbers. Cast them to integers.

In [ ]:

Task 5 (Interpolation): Given a Time Series df['Temp'] = [20, np.nan, np.nan, 26]. Use df['Temp'].interpolate() to fill the missing values with a linear progression. Print it.

In [ ]:

💻 Pure Coding Interview Questions

Q1.

Explain the difference between `NaN`, `None`, and `NaT` in Pandas.

In [ ]:

Q2.

What does the `errors='coerce'` argument do in `pd.to_numeric()`?

In [ ]:

Q3.

How do you drop columns that contain MORE than 50% missing values?

In [ ]:

Q4.

Explain the `thresh` parameter in `df.dropna()`.

In [ ]:

Q5.

Why might filling missing values with the Mean be a bad idea for highly skewed data?

In [ ]:

Q6.

Write code to fill missing values in column 'A' with the median of column 'A'.

In [ ]:

Q7.

How do you perform a Forward Fill (`ffill`)? In what scenario (e.g., time series) is it useful?

In [ ]:

Q8.

What is the difference between `df.drop_duplicates()` and `df.duplicated()`?

In [ ]:

Q9.

Write code to find the number of exact duplicate rows in a DataFrame.

In [ ]:

Q10.

How do you keep the LAST occurrence of a duplicate based on a specific subset of columns?

In [ ]:

Q11.

Explain how converting a string column with low cardinality to `category` saves memory.

In [ ]:

Q12.

Write code to remove all whitespace from the beginning and end of a string column using `.str.strip()`.

In [ ]:

Q13.

How do you extract a substring from a column using a regular expression in Pandas? (`.str.extract()`).

In [ ]:

Q14.

What happens if you use `.astype(int)` on a column that contains `NaN` values? (Hint: It fails).

In [ ]:

Q15.

Explain the `Int64` (capital I) nullable integer data type introduced in newer Pandas versions.

In [ ]:

Q16.

Write code to replace all negative values in a DataFrame with `NaN` using `df.where()` or `np.where()`.

In [ ]:

Q17.

How do you rename the index of a DataFrame?

In [ ]:

Q18.

What is Data Imputation? Name two advanced methods beyond simple mean/median filling (e.g., KNN, Regression).

In [ ]:

Q19.

Write code to convert a column containing 'Yes'/'No' strings into boolean `True`/`False`.

In [ ]:

Q20.

How do you parse a column of dates in the format 'YYYY/MM/DD' into Pandas datetime objects?

In [ ]:

Q21.

Explain the `inplace=True` argument. Why is it generally avoided by Pandas core devs?

In [ ]:

Q22.

Write code to calculate the percentage of missing values in every column of a DataFrame.

In [ ]:

Q23.

How do you apply a custom data-cleaning function to every element in a single column?

In [ ]:

Q24.

Explain the difference between `.map()`, `.apply()`, and `.applymap()` in Pandas.

In [ ]:

Q25.

What is the performance implication of using `.str` accessors on a column of 10 million rows?

In [ ]:

📊 Day 25 Executive Summary

#	Topic	Key Takeaway
1	NaNs	Never drop NaNs blindly. Think about why they are missing.
2	Dedupe	Sort by date first to ensure you keep the latest record.
3	Casting	Use `pd.to_numeric(..., errors='coerce')` for messy strings.

✅ Instructor's End-of-Day Checklist

• [ ] I can find and fill missing values.

• [ ] I can drop duplicates while keeping the latest record.

• [ ] I can clean strings and cast them to floats.

🚀 Continue to Day 26 - Pandas GroupBy & Aggregations

📊 Day 25 : Pandas Cleaning

🎯 Enterprise Objective

📋 Strategic Overview

1. Handling Missing Data : The Reality of Data

💼 Why Data Analysts Care

⚠️ Blindly Dropping

🧪 Concept Checks: Missing Data

2. Handling Duplicates : Deduplication

💼 Why Data Analysts Care

🧠 Pro Tip

🧪 Concept Checks: Duplicates

3. Data Type Conversion : Fixing Bad Formatting

💼 Why Data Analysts Care

🧠 Pro Tip

🧪 Concept Checks: Data Types

🛠️ Professional Practice Tasks

💻 Pure Coding Interview Questions

Explain the difference between NaN, None, and NaT in Pandas.

What does the errors='coerce' argument do in pd.to_numeric()?

How do you drop columns that contain MORE than 50% missing values?

Explain the thresh parameter in df.dropna().

Why might filling missing values with the Mean be a bad idea for highly skewed data?

Write code to fill missing values in column 'A' with the median of column 'A'.

How do you perform a Forward Fill (ffill)? In what scenario (e.g., time series) is it useful?

What is the difference between df.drop_duplicates() and df.duplicated()?

Write code to find the number of exact duplicate rows in a DataFrame.

How do you keep the LAST occurrence of a duplicate based on a specific subset of columns?

Explain how converting a string column with low cardinality to category saves memory.

Write code to remove all whitespace from the beginning and end of a string column using .str.strip().

How do you extract a substring from a column using a regular expression in Pandas? (.str.extract()).

What happens if you use .astype(int) on a column that contains NaN values? (Hint: It fails).

Explain the Int64 (capital I) nullable integer data type introduced in newer Pandas versions.

Write code to replace all negative values in a DataFrame with NaN using df.where() or np.where().

How do you rename the index of a DataFrame?

What is Data Imputation? Name two advanced methods beyond simple mean/median filling (e.g., KNN, Regression).

Write code to convert a column containing 'Yes'/'No' strings into boolean True/False.

How do you parse a column of dates in the format 'YYYY/MM/DD' into Pandas datetime objects?

Explain the inplace=True argument. Why is it generally avoided by Pandas core devs?

Write code to calculate the percentage of missing values in every column of a DataFrame.

How do you apply a custom data-cleaning function to every element in a single column?

Explain the difference between .map(), .apply(), and .applymap() in Pandas.

What is the performance implication of using .str accessors on a column of 10 million rows?

📊 Day 25 Executive Summary

✅ Instructor's End-of-Day Checklist

🧪 Concept Checks: `Missing Data`

🧪 Concept Checks: `Duplicates`

🧪 Concept Checks: `Data Types`

Explain the difference between `NaN`, `None`, and `NaT` in Pandas.

What does the `errors='coerce'` argument do in `pd.to_numeric()`?

Explain the `thresh` parameter in `df.dropna()`.

How do you perform a Forward Fill (`ffill`)? In what scenario (e.g., time series) is it useful?

What is the difference between `df.drop_duplicates()` and `df.duplicated()`?

Explain how converting a string column with low cardinality to `category` saves memory.

Write code to remove all whitespace from the beginning and end of a string column using `.str.strip()`.

How do you extract a substring from a column using a regular expression in Pandas? (`.str.extract()`).

What happens if you use `.astype(int)` on a column that contains `NaN` values? (Hint: It fails).

Explain the `Int64` (capital I) nullable integer data type introduced in newer Pandas versions.

Write code to replace all negative values in a DataFrame with `NaN` using `df.where()` or `np.where()`.

Write code to convert a column containing 'Yes'/'No' strings into boolean `True`/`False`.

Explain the `inplace=True` argument. Why is it generally avoided by Pandas core devs?

Explain the difference between `.map()`, `.apply()`, and `.applymap()` in Pandas.

What is the performance implication of using `.str` accessors on a column of 10 million rows?