Day 27: Pandas Merging

🎯 Enterprise Objective

Data rarely lives in a single CSV file. To answer complex business questions, you must connect relational datasets. Today we master SQL-style Joins via pd.merge(), row stacking via pd.concat(), and index alignment via df.join().

📋 Strategic Overview

#	Topic	Concept
1	Merging	`INNER`, `LEFT`, `OUTER` Joins
2	Concat	Stacking rows vertically
3	Joining	Aligning by Index

1. Merging (Joins) : Connecting Datasets

🔍 What is it?

In the real world, data is stored in multiple tables. pd.merge() is the exact equivalent of a SQL JOIN. It connects two DataFrames side-by-side based on a common column (the 'key').

Merge Type (how=)	SQL Equivalent	Result
`inner` (Default)	INNER JOIN	Keep only rows with matching keys in BOTH tables
`left`	LEFT JOIN	Keep all rows from Left table, fill missing right data with NaN
`right`	RIGHT JOIN	Keep all rows from Right table
`outer`	FULL OUTER JOIN	Keep all rows from both tables, filling NaNs wherever needed

💼 Why Data Analysts Care

• Enrichment: You have a table of User_IDs. You merge it with a Users table to get their Email and Name.

⚠️ Exploding Joins

If you merge on a column that has duplicates in BOTH tables (a Many-to-Many join), Pandas will create a Cartesian product, multiplying your row count exponentially and crashing your RAM. Ensure your keys are unique in at least one table!

In [ ]:

🧪 Concept Checks: `Merging`

Q1. Create two DataFrames with a common column "ID". Merge them using how="inner".

In [ ]:

Q2. Change the merge to how="left". Observe the NaN values for the unmatched row.

In [ ]:

Q3. Change the merge to how="outer". Observe that all IDs from both tables are present.

In [ ]:

Q4. What happens if the columns have different names? Use left_on="Txn_ID" and right_on="User_ID" to merge them.

In [ ]:

Q5. Explain the risk of Many-to-Many joins. What happens to row counts?

In [ ]:

2. Concatenation : Stacking Datasets

🔍 What is it?

While merge connects tables side-by-side, pd.concat() stacks tables on top of each other (or side-by-side if you change the axis). It is the equivalent of a SQL UNION. You use it when you have identical schema files (e.g., Jan Sales, Feb Sales) and want one master table.

# Stack dataframes vertically (Rows increase)
master_df = pd.concat([jan_df, feb_df, mar_df], axis=0, ignore_index=True)

💼 Why Data Analysts Care

• Batch Loading: Using a for loop to read 50 CSV files into a list, then running pd.concat(list) to create one massive DataFrame.

🧠 Pro Tip

Always use ignore_index=True when concatenating vertically. Otherwise, the new DataFrame will have duplicate index values (e.g., 0,1,2... 0,1,2...), which will cause nightmares when you try to use .loc later.

In [ ]:

🧪 Concept Checks: `Concat`

Q1. Create two DataFrames with identical columns. Use pd.concat([df1, df2]) to stack them.

In [ ]:

Q2. Notice the index duplicates. Add ignore_index=True to fix it. Print the result.

In [ ]:

Q3. Change to axis=1 to concatenate side-by-side. Print the result.

In [ ]:

Q4. What happens if df1 has a column "A" and df2 has a column "B" and you concat vertically? (Hint: NaNs are generated).

In [ ]:

Q5. Append a single row (as a DataFrame) to df1 using pd.concat.

In [ ]:

3. Joining on Index : Index Alignment

🔍 What is it?

While merge looks at columns, the .join() method specifically aligns DataFrames based on their Index. It is a shortcut for pd.merge(left_index=True, right_index=True).

# df1 and df2 must share the same index labels
combined = df1.join(df2, how='left')

💼 Why Data Analysts Care

• Time Series: Joining multiple stock price DataFrames where the Date is the index for all of them.

🧠 Pro Tip

If both DataFrames have a column with the exact same name, .join() will crash. You must provide lsuffix='_left' and rsuffix='_right' to differentiate them.

In [ ]:

🧪 Concept Checks: `Index Join`

Q1. Create two DataFrames with matching indexes ["a", "b"]. Use df1.join(df2). Print it.

In [ ]:

Q2. Create a column collision: both DFs have a column "Val". Try to join. Catch the ValueError.

In [ ]:

Q3. Fix Q2 by adding lsuffix="_left", rsuffix="_right". Print the result.

In [ ]:

Q4. What is the pd.merge() equivalent of an index join? (Hint: left_index=True, right_index=True).

In [ ]:

Q5. Explain why Time Series analysis heavily relies on index joining.

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Employee Database): Create employees DF (EmpID, Name, DeptID) and departments DF (DeptID, DeptName). Use pd.merge() to create a complete table showing Employee Names and their Department Names.

In [ ]:

Task 2 (Missing Matches): Create a Left Merge from the task above, but ensure one EmpID has a DeptID that doesn't exist in the departments table. Use isna() on the merged DF to find the employee with no matching department.

In [ ]:

Task 3 (File Aggregator): Write a simulated loop: Create 3 identical DataFrames representing Jan, Feb, Mar sales. Append them to a python list. Use pd.concat(list, ignore_index=True) to create the master DataFrame.

In [ ]:

Task 4 (Mismatch Concat): Create df1 with columns ['A', 'B'] and df2 with columns ['B', 'C']. Concat them vertically. Notice how Pandas fills missing columns with NaN. Use join='inner' inside concat to keep ONLY shared columns.

In [ ]:

Task 5 (Cross Join): Create a DataFrame of Colors and Sizes. Use pd.merge(how='cross') to generate every possible combination of Color and Size (Cartesian product).

In [ ]:

💻 Pure Coding Interview Questions

Q1.

Explain the difference between `pd.merge()`, `pd.concat()`, and `df.join()`.

In [ ]:

Q2.

What is the difference between an INNER JOIN and a LEFT JOIN?

In [ ]:

Q3.

When merging, how do you handle columns that have different names in the left and right tables? (Hint: `left_on`, `right_on`).

In [ ]:

Q4.

What happens in a Pandas merge if both tables have a column named 'Status' that is NOT the join key? (Hint: suffixes).

In [ ]:

Q5.

Explain what a Many-to-Many merge is and why it can cause MemoryErrors.

In [ ]:

Q6.

How do you merge a DataFrame on its Index with another DataFrame on a specific Column? (`left_index=True, right_on='Col'`).

In [ ]:

Q7.

What does `pd.concat(axis=1)` do? When would you use it?

In [ ]:

Q8.

Why is `ignore_index=True` critical when using `pd.concat(axis=0)`?

In [ ]:

Q9.

Explain the `join='inner'` vs `join='outer'` parameter inside `pd.concat()`.

In [ ]:

Q10.

Write code to concatenate a list of 100 CSV files into a single DataFrame efficiently.

In [ ]:

Q11.

What is a CROSS JOIN? How do you perform it in Pandas?

In [ ]:

Q12.

How do you find rows that exist in the Left table but NOT in the Right table? (Anti-Join using `indicator=True`).

In [ ]:

Q13.

Explain what the `indicator=True` argument does in `pd.merge()`.

In [ ]:

Q14.

How do you merge on multiple columns simultaneously? (`on=['Key1', 'Key2']`).

In [ ]:

Q15.

What is `pd.merge_asof()`? Why is it crucial for financial tick data or time-series?

In [ ]:

Q16.

Explain the performance differences between `.merge()` and `.join()`.

In [ ]:

Q17.

Write code to append a single dictionary as a new row to a DataFrame. Why is `df.loc[len(df)]` or `concat` better than `.append()` (which is deprecated)?

In [ ]:

Q18.

How do you merge DataFrames with overlapping columns but selectively choose which columns to keep from the right table?

In [ ]:

Q19.

What happens to the data types of integer columns if a merge introduces `NaN` values? (Hint: upcasts to float).

In [ ]:

Q20.

Explain how the new nullable integer type `Int64` solves the NaN float upcasting issue in merges.

In [ ]:

Q21.

Write code to validate a merge (e.g., ensuring it's strictly One-to-Many) using the `validate='1:m'` argument.

In [ ]:

Q22.

How does Pandas handle matching keys that have different data types (e.g., `1` as int vs `'1'` as string)?

In [ ]:

Q23.

What is the difference between `df.update()` and `pd.merge()`?

In [ ]:

Q24.

Explain `df.combine_first()`. When would you use it instead of a merge?

In [ ]:

Q25.

Design a pipeline that merges 3 different relational tables (Users, Orders, Products) into one flat denormalized DataFrame.

In [ ]:

📊 Day 27 Executive Summary

#	Topic	Key Takeaway
1	Merge	Exact equivalent to SQL joins. Maps columns to columns.
2	Concat	Equivalent to SQL UNION. Stacks tables. Always use `ignore_index=True`.
3	Indicator	Use `indicator=True` in merges to debug where data came from.

✅ Instructor's End-of-Day Checklist

• [ ] I can perform a LEFT JOIN using pd.merge().

• [ ] I can stack DataFrames vertically using pd.concat().

• [ ] I understand the difference between Merge and Concat.

🚀 Continue to Day 28 - Pandas Series & Time Series Analysis

📊 Day 27 : Pandas Merging

🎯 Enterprise Objective

📋 Strategic Overview

1. Merging (Joins) : Connecting Datasets

💼 Why Data Analysts Care

⚠️ Exploding Joins

🧪 Concept Checks: Merging

2. Concatenation : Stacking Datasets

💼 Why Data Analysts Care

🧠 Pro Tip

🧪 Concept Checks: Concat

3. Joining on Index : Index Alignment

💼 Why Data Analysts Care

🧠 Pro Tip

🧪 Concept Checks: Index Join

🛠️ Professional Practice Tasks

💻 Pure Coding Interview Questions

Explain the difference between pd.merge(), pd.concat(), and df.join().

What is the difference between an INNER JOIN and a LEFT JOIN?

When merging, how do you handle columns that have different names in the left and right tables? (Hint: left_on, right_on).

What happens in a Pandas merge if both tables have a column named 'Status' that is NOT the join key? (Hint: suffixes).

Explain what a Many-to-Many merge is and why it can cause MemoryErrors.

How do you merge a DataFrame on its Index with another DataFrame on a specific Column? (left_index=True, right_on='Col').

What does pd.concat(axis=1) do? When would you use it?

Why is ignore_index=True critical when using pd.concat(axis=0)?

Explain the join='inner' vs join='outer' parameter inside pd.concat().

Write code to concatenate a list of 100 CSV files into a single DataFrame efficiently.

What is a CROSS JOIN? How do you perform it in Pandas?

How do you find rows that exist in the Left table but NOT in the Right table? (Anti-Join using indicator=True).

Explain what the indicator=True argument does in pd.merge().

How do you merge on multiple columns simultaneously? (on=['Key1', 'Key2']).

What is pd.merge_asof()? Why is it crucial for financial tick data or time-series?

Explain the performance differences between .merge() and .join().

Write code to append a single dictionary as a new row to a DataFrame. Why is df.loc[len(df)] or concat better than .append() (which is deprecated)?

How do you merge DataFrames with overlapping columns but selectively choose which columns to keep from the right table?

What happens to the data types of integer columns if a merge introduces NaN values? (Hint: upcasts to float).

Explain how the new nullable integer type Int64 solves the NaN float upcasting issue in merges.

Write code to validate a merge (e.g., ensuring it's strictly One-to-Many) using the validate='1:m' argument.

How does Pandas handle matching keys that have different data types (e.g., 1 as int vs '1' as string)?

What is the difference between df.update() and pd.merge()?

Explain df.combine_first(). When would you use it instead of a merge?

Design a pipeline that merges 3 different relational tables (Users, Orders, Products) into one flat denormalized DataFrame.

📊 Day 27 Executive Summary

✅ Instructor's End-of-Day Checklist

🧪 Concept Checks: `Merging`

🧪 Concept Checks: `Concat`

🧪 Concept Checks: `Index Join`

Explain the difference between `pd.merge()`, `pd.concat()`, and `df.join()`.

When merging, how do you handle columns that have different names in the left and right tables? (Hint: `left_on`, `right_on`).

How do you merge a DataFrame on its Index with another DataFrame on a specific Column? (`left_index=True, right_on='Col'`).

What does `pd.concat(axis=1)` do? When would you use it?

Why is `ignore_index=True` critical when using `pd.concat(axis=0)`?

Explain the `join='inner'` vs `join='outer'` parameter inside `pd.concat()`.

How do you find rows that exist in the Left table but NOT in the Right table? (Anti-Join using `indicator=True`).

Explain what the `indicator=True` argument does in `pd.merge()`.

How do you merge on multiple columns simultaneously? (`on=['Key1', 'Key2']`).

What is `pd.merge_asof()`? Why is it crucial for financial tick data or time-series?

Explain the performance differences between `.merge()` and `.join()`.

Write code to append a single dictionary as a new row to a DataFrame. Why is `df.loc[len(df)]` or `concat` better than `.append()` (which is deprecated)?

What happens to the data types of integer columns if a merge introduces `NaN` values? (Hint: upcasts to float).

Explain how the new nullable integer type `Int64` solves the NaN float upcasting issue in merges.

Write code to validate a merge (e.g., ensuring it's strictly One-to-Many) using the `validate='1:m'` argument.

How does Pandas handle matching keys that have different data types (e.g., `1` as int vs `'1'` as string)?

What is the difference between `df.update()` and `pd.merge()`?

Explain `df.combine_first()`. When would you use it instead of a merge?