๐ Day 25 : Pandas Cleaning
๐ฏ Enterprise Objective
Data Analysts spend 80% of their time cleaning data. Missing values crash algorithms. Duplicates ruin financial aggregations. Strings masquerading as numbers prevent math. Today we master the unglamorous but critical art of Data Cleaning.
๐ Strategic Overview
| # | Topic | Concept |
|---|---|---|
| 1 | Missing Data | isna(), fillna() |
| 2 | Duplicates | drop_duplicates() |
| 3 | Types | astype(), .str accessor |
1. Handling Missing Data : The Reality of Data
Real-world data is messy. Missing values are represented as NaN (Not a Number) or None. Pandas provides methods to detect (isna()), remove (dropna()), or replace (fillna()) these missing values.
| Method | Action | Common Use |
|---|---|---|
df.isna() | Returns Boolean DF | Identifying where data is missing |
df.dropna() | Drops rows/cols | Removing rows where critical data is null |
df.fillna(val) | Replaces NaNs | Replacing missing ages with the median age |
๐ผ Why Data Analysts Care
โข Data Quality: Machine Learning models will crash if you feed them NaN values. You must handle them first!
โข Imputation: Filling missing salaries with the average salary of the employee's department
โ ๏ธ Blindly Dropping
Never run df.dropna() without thinking. If 30% of your data has a missing 'Phone Number', dropping those rows destroys 30% of your valid 'Email' and 'Name' data. Drop the column, or fill it with 'Unknown'.
๐งช Concept Checks: Missing Data
Q1. Run df.isna().sum() on your DataFrame to see which columns have missing values.
Q2. Use df.dropna(subset=["Age"]) to drop ONLY rows where "Age" is missing. Print it.
Q3. Fill all NaN values in the entire DataFrame with "Unknown" using df.fillna("Unknown").
Q4. Fill the "Salary" column with the median salary of that column using df["Salary"].fillna(...).
Q5. Use df.ffill() (forward fill) on a Series with a missing value. What does it do?
2. Handling Duplicates : Deduplication
Duplicate rows can skew aggregations and ruin analysis. df.duplicated() returns a boolean mask of duplicate rows, and df.drop_duplicates() removes them. You can specify whether to keep the 'first' occurrence, the 'last', or False (drop all duplicates completely).
# Keep only the first occurrence of a User_ID
df.drop_duplicates(subset=['User_ID'], keep='first')
๐ผ Why Data Analysts Care
โข ETL Pipelines: When pulling data daily, you might accidentally pull the same transactions twice. Deduplication fixes this
โข Latest Records: Sorting by 'Update_Date' and keeping the 'last' duplicate ensures you only have the most recent data for each user
๐ง Pro Tip
Always sort your DataFrame BEFORE dropping duplicates if you want to keep the highest/lowest/newest value. E.g., Sort by Date ascending, then drop_duplicates(keep='last').
๐งช Concept Checks: Duplicates
Q1. Create a DataFrame with 3 identical rows. Run df.duplicated() to see the boolean mask.
Q2. Run df.drop_duplicates() on the DataFrame from Q1 to clean it.
Q3. Given a DF with ID and Score, sort by Score descending, then drop duplicates on ID keeping the first.
Q4. What happens if you use keep=False in drop_duplicates()? Try it.
Q5. Count the number of duplicate IDs by chaining df.duplicated(subset=["ID"]).sum().
3. Data Type Conversion : Fixing Bad Formatting
When reading from CSVs, numbers might be loaded as strings (e.g., '1,000'). You cannot perform math on strings! The .astype() method converts columns to proper types. For strings requiring cleaning, the .str accessor is your best friend.
| Method | Action | Example |
|---|---|---|
astype(type) | Cast to type | df['Age'].astype(int) |
pd.to_numeric() | Safe casting | pd.to_numeric(df['Price'], errors='coerce') |
.str.replace() | String cleanup | df['Price'].str.replace('$', '') |
๐ผ Why Data Analysts Care
โข Currency Parsing: Converting '$1,200.50' into the float 1200.50 so you can sum it
โข Memory Optimization: Downcasting an int64 column to int8 to save 8x the RAM on a billion rows
๐ง Pro Tip
If .astype(float) fails because of a weird string like 'N/A', use pd.to_numeric(col, errors='coerce'). It will force the bad strings into NaN so you can proceed!
๐งช Concept Checks: Data Types
Q1. Given df["Age"] = ["25", "30"] (strings), cast it to integer using .astype(int).
Q2. Given df["Cost"] = ["1,000", "2,500"], remove the comma using .str.replace(",", "").
Q3. Cast the cleaned Cost column to float.
Q4. Use pd.to_numeric(..., errors="coerce") on ["10", "bad", "20"]. Print the result.
Q5. Check the memory usage of a column using df["Col"].memory_usage(). Cast it to float32 and check again.
๐ ๏ธ Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (The Complete Pipeline): Create a dirty DataFrame: {'ID': [1,2,2,3], 'Price': ['$10', '$20', '$20', 'NaN'], 'Qty': [1,2,np.nan,3]}. Write a 4-step pipeline: 1) Drop exact duplicates. 2) Clean Price '$' and cast to float. 3) Fill missing Qty with 1. 4) Calculate Total = Price * Qty.
Task 2 (Threshold Dropping): Create a DF with 5 rows and 3 columns of NaNs/data. Use df.dropna(thresh=2) to drop rows that do not have AT LEAST 2 valid non-NaN values. Print it.
Task 3 (Categorical Conversion): Given a column df['Size'] = ['S', 'M', 'L', 'S', 'M'] with 100,000 rows. Convert its type to 'category' using .astype('category'). Check df.info() to see memory savings.
Task 4 (String Extraction): Given df['Code'] = ['Item-123', 'Item-456']. Use .str.split('-').str[1] to extract just the numbers. Cast them to integers.
Task 5 (Interpolation): Given a Time Series df['Temp'] = [20, np.nan, np.nan, 26]. Use df['Temp'].interpolate() to fill the missing values with a linear progression. Print it.
๐ป Pure Coding Interview Questions
Q1.
Explain the difference between NaN, None, and NaT in Pandas.
Q2.
What does the errors='coerce' argument do in pd.to_numeric()?
Q3.
How do you drop columns that contain MORE than 50% missing values?
Q4.
Explain the thresh parameter in df.dropna().
Q5.
Why might filling missing values with the Mean be a bad idea for highly skewed data?
Q6.
Write code to fill missing values in column 'A' with the median of column 'A'.
Q7.
How do you perform a Forward Fill (ffill)? In what scenario (e.g., time series) is it useful?
Q8.
What is the difference between df.drop_duplicates() and df.duplicated()?
Q9.
Write code to find the number of exact duplicate rows in a DataFrame.
Q10.
How do you keep the LAST occurrence of a duplicate based on a specific subset of columns?
Q11.
Explain how converting a string column with low cardinality to category saves memory.
Q12.
Write code to remove all whitespace from the beginning and end of a string column using .str.strip().
Q13.
How do you extract a substring from a column using a regular expression in Pandas? (.str.extract()).
Q14.
What happens if you use .astype(int) on a column that contains NaN values? (Hint: It fails).
Q15.
Explain the Int64 (capital I) nullable integer data type introduced in newer Pandas versions.
Q16.
Write code to replace all negative values in a DataFrame with NaN using df.where() or np.where().
Q17.
How do you rename the index of a DataFrame?
Q18.
What is Data Imputation? Name two advanced methods beyond simple mean/median filling (e.g., KNN, Regression).
Q19.
Write code to convert a column containing 'Yes'/'No' strings into boolean True/False.
Q20.
How do you parse a column of dates in the format 'YYYY/MM/DD' into Pandas datetime objects?
Q21.
Explain the inplace=True argument. Why is it generally avoided by Pandas core devs?
Q22.
Write code to calculate the percentage of missing values in every column of a DataFrame.
Q23.
How do you apply a custom data-cleaning function to every element in a single column?
Q24.
Explain the difference between .map(), .apply(), and .applymap() in Pandas.
Q25.
What is the performance implication of using .str accessors on a column of 10 million rows?
๐ Day 25 Executive Summary
| # | Topic | Key Takeaway |
|---|---|---|
| 1 | NaNs | Never drop NaNs blindly. Think about why they are missing. |
| 2 | Dedupe | Sort by date first to ensure you keep the latest record. |
| 3 | Casting | Use pd.to_numeric(..., errors='coerce') for messy strings. |
โ Instructor's End-of-Day Checklist
โข [ ] I can find and fill missing values.
โข [ ] I can drop duplicates while keeping the latest record.
โข [ ] I can clean strings and cast them to floats.