๐ Day 27 : Pandas Merging
๐ฏ Enterprise Objective
Data rarely lives in a single CSV file. To answer complex business questions, you must connect relational datasets. Today we master SQL-style Joins via pd.merge(), row stacking via pd.concat(), and index alignment via df.join().
๐ Strategic Overview
| # | Topic | Concept |
|---|---|---|
| 1 | Merging | INNER, LEFT, OUTER Joins |
| 2 | Concat | Stacking rows vertically |
| 3 | Joining | Aligning by Index |
1. Merging (Joins) : Connecting Datasets
In the real world, data is stored in multiple tables. pd.merge() is the exact equivalent of a SQL JOIN. It connects two DataFrames side-by-side based on a common column (the 'key').
| Merge Type (how=) | SQL Equivalent | Result |
|---|---|---|
inner (Default) | INNER JOIN | Keep only rows with matching keys in BOTH tables |
left | LEFT JOIN | Keep all rows from Left table, fill missing right data with NaN |
right | RIGHT JOIN | Keep all rows from Right table |
outer | FULL OUTER JOIN | Keep all rows from both tables, filling NaNs wherever needed |
๐ผ Why Data Analysts Care
โข Enrichment: You have a table of User_IDs. You merge it with a Users table to get their Email and Name.
โ ๏ธ Exploding Joins
If you merge on a column that has duplicates in BOTH tables (a Many-to-Many join), Pandas will create a Cartesian product, multiplying your row count exponentially and crashing your RAM. Ensure your keys are unique in at least one table!
๐งช Concept Checks: Merging
Q1. Create two DataFrames with a common column "ID". Merge them using how="inner".
Q2. Change the merge to how="left". Observe the NaN values for the unmatched row.
Q3. Change the merge to how="outer". Observe that all IDs from both tables are present.
Q4. What happens if the columns have different names? Use left_on="Txn_ID" and right_on="User_ID" to merge them.
Q5. Explain the risk of Many-to-Many joins. What happens to row counts?
2. Concatenation : Stacking Datasets
While merge connects tables side-by-side, pd.concat() stacks tables on top of each other (or side-by-side if you change the axis). It is the equivalent of a SQL UNION. You use it when you have identical schema files (e.g., Jan Sales, Feb Sales) and want one master table.
# Stack dataframes vertically (Rows increase)
master_df = pd.concat([jan_df, feb_df, mar_df], axis=0, ignore_index=True)
๐ผ Why Data Analysts Care
โข Batch Loading: Using a for loop to read 50 CSV files into a list, then running pd.concat(list) to create one massive DataFrame.
๐ง Pro Tip
Always use ignore_index=True when concatenating vertically. Otherwise, the new DataFrame will have duplicate index values (e.g., 0,1,2... 0,1,2...), which will cause nightmares when you try to use .loc later.
๐งช Concept Checks: Concat
Q1. Create two DataFrames with identical columns. Use pd.concat([df1, df2]) to stack them.
Q2. Notice the index duplicates. Add ignore_index=True to fix it. Print the result.
Q3. Change to axis=1 to concatenate side-by-side. Print the result.
Q4. What happens if df1 has a column "A" and df2 has a column "B" and you concat vertically? (Hint: NaNs are generated).
Q5. Append a single row (as a DataFrame) to df1 using pd.concat.
3. Joining on Index : Index Alignment
While merge looks at columns, the .join() method specifically aligns DataFrames based on their Index. It is a shortcut for pd.merge(left_index=True, right_index=True).
# df1 and df2 must share the same index labels
combined = df1.join(df2, how='left')
๐ผ Why Data Analysts Care
โข Time Series: Joining multiple stock price DataFrames where the Date is the index for all of them.
๐ง Pro Tip
If both DataFrames have a column with the exact same name, .join() will crash. You must provide lsuffix='_left' and rsuffix='_right' to differentiate them.
๐งช Concept Checks: Index Join
Q1. Create two DataFrames with matching indexes ["a", "b"]. Use df1.join(df2). Print it.
Q2. Create a column collision: both DFs have a column "Val". Try to join. Catch the ValueError.
Q3. Fix Q2 by adding lsuffix="_left", rsuffix="_right". Print the result.
Q4. What is the pd.merge() equivalent of an index join? (Hint: left_index=True, right_index=True).
Q5. Explain why Time Series analysis heavily relies on index joining.
๐ ๏ธ Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (Employee Database): Create employees DF (EmpID, Name, DeptID) and departments DF (DeptID, DeptName). Use pd.merge() to create a complete table showing Employee Names and their Department Names.
Task 2 (Missing Matches): Create a Left Merge from the task above, but ensure one EmpID has a DeptID that doesn't exist in the departments table. Use isna() on the merged DF to find the employee with no matching department.
Task 3 (File Aggregator): Write a simulated loop: Create 3 identical DataFrames representing Jan, Feb, Mar sales. Append them to a python list. Use pd.concat(list, ignore_index=True) to create the master DataFrame.
Task 4 (Mismatch Concat): Create df1 with columns ['A', 'B'] and df2 with columns ['B', 'C']. Concat them vertically. Notice how Pandas fills missing columns with NaN. Use join='inner' inside concat to keep ONLY shared columns.
Task 5 (Cross Join): Create a DataFrame of Colors and Sizes. Use pd.merge(how='cross') to generate every possible combination of Color and Size (Cartesian product).
๐ป Pure Coding Interview Questions
Q1.
Explain the difference between pd.merge(), pd.concat(), and df.join().
Q2.
What is the difference between an INNER JOIN and a LEFT JOIN?
Q3.
When merging, how do you handle columns that have different names in the left and right tables? (Hint: left_on, right_on).
Q4.
What happens in a Pandas merge if both tables have a column named 'Status' that is NOT the join key? (Hint: suffixes).
Q5.
Explain what a Many-to-Many merge is and why it can cause MemoryErrors.
Q6.
How do you merge a DataFrame on its Index with another DataFrame on a specific Column? (left_index=True, right_on='Col').
Q7.
What does pd.concat(axis=1) do? When would you use it?
Q8.
Why is ignore_index=True critical when using pd.concat(axis=0)?
Q9.
Explain the join='inner' vs join='outer' parameter inside pd.concat().
Q10.
Write code to concatenate a list of 100 CSV files into a single DataFrame efficiently.
Q11.
What is a CROSS JOIN? How do you perform it in Pandas?
Q12.
How do you find rows that exist in the Left table but NOT in the Right table? (Anti-Join using indicator=True).
Q13.
Explain what the indicator=True argument does in pd.merge().
Q14.
How do you merge on multiple columns simultaneously? (on=['Key1', 'Key2']).
Q15.
What is pd.merge_asof()? Why is it crucial for financial tick data or time-series?
Q16.
Explain the performance differences between .merge() and .join().
Q17.
Write code to append a single dictionary as a new row to a DataFrame. Why is df.loc[len(df)] or concat better than .append() (which is deprecated)?
Q18.
How do you merge DataFrames with overlapping columns but selectively choose which columns to keep from the right table?
Q19.
What happens to the data types of integer columns if a merge introduces NaN values? (Hint: upcasts to float).
Q20.
Explain how the new nullable integer type Int64 solves the NaN float upcasting issue in merges.
Q21.
Write code to validate a merge (e.g., ensuring it's strictly One-to-Many) using the validate='1:m' argument.
Q22.
How does Pandas handle matching keys that have different data types (e.g., 1 as int vs '1' as string)?
Q23.
What is the difference between df.update() and pd.merge()?
Q24.
Explain df.combine_first(). When would you use it instead of a merge?
Q25.
Design a pipeline that merges 3 different relational tables (Users, Orders, Products) into one flat denormalized DataFrame.
๐ Day 27 Executive Summary
| # | Topic | Key Takeaway |
|---|---|---|
| 1 | Merge | Exact equivalent to SQL joins. Maps columns to columns. |
| 2 | Concat | Equivalent to SQL UNION. Stacks tables. Always use ignore_index=True. |
| 3 | Indicator | Use indicator=True in merges to debug where data came from. |
โ Instructor's End-of-Day Checklist
โข [ ] I can perform a LEFT JOIN using pd.merge().
โข [ ] I can stack DataFrames vertically using pd.concat().
โข [ ] I understand the difference between Merge and Concat.