Day 24: Pandas Selection

🎯 Enterprise Objective

Knowing how to navigate tables is the most heavily tested skill in Data Analyst interviews. Today we master precise row/column extraction using .loc and .iloc, SQL-style filtering using Boolean Masks and .query(), and leaderboard generation via sorting.

📋 Strategic Overview

#	Topic	Concept
1	Loc/Iloc	Exact Indexing
2	Masking/Query	Conditional Filtering
3	Sorting	`sort_values`, `nlargest`

1. Loc and Iloc : Row and Column Selection

🔍 What is it?

Selecting data in Pandas requires precision. .iloc[] selects by integer position (like standard Python lists). .loc[] selects by index label and column name. Both follow the format [rows, columns].

Accessor	Paradigm	Example	Meaning
`.iloc`	Integer Position	`df.iloc[0:5, 0:2]`	First 5 rows, first 2 columns
`.loc`	Label / Name	`df.loc[:, 'Age':'City']`	All rows, columns from Age to City

💼 Why Data Analysts Care

• Feature Selection: Extracting the target variable Y = df['Price'] and the features X = df.iloc[:, :-1] for Machine Learning

⚠️ Inclusive Loc

A major 'gotcha': Slicing with .iloc[0:5] is EXCLUSIVE of 5 (returns 0,1,2,3,4). But slicing with .loc['A':'C'] is INCLUSIVE of 'C' (returns A,B,C).

In [ ]:

🧪 Concept Checks: `Loc / Iloc`

Q1. Given df, use .iloc to select the first 3 rows and the first 2 columns.

In [ ]:

Q2. Use .loc to select all rows (:), but only the "Name" and "Salary" columns.

In [ ]:

Q3. What happens if you try df.iloc[0, "Name"]? Catch the TypeError (iloc requires integers).

In [ ]:

Q4. Use .loc to slice rows from "ID2" to "ID4" inclusive. Print the result.

In [ ]:

Q5. Extract the single scalar value in row 0, column 1 using .iloc[0, 1]. Print it.

In [ ]:

2. Boolean Filtering : Querying Data

🔍 What is it?

You can filter DataFrames exactly like NumPy arrays using Boolean Masks. Create a condition, and pass it into the DataFrame brackets: df[condition]. Pandas also provides the highly readable .query() method for SQL-like string queries.

# Standard Masking
rich_users = df[df['Salary'] > 75000]

# Multiple conditions (Requires parenthesis and bitwise &)
targets = df[(df['Age'] > 30) & (df['City'] == 'NY')]

# The elegant .query() alternative
targets = df.query("Age > 30 and City == 'NY'")

💼 Why Data Analysts Care

• Cohort Extraction: Filtering a massive dataset down to only active users from a specific country

• Date Filtering: df[df['Date'] >= '2024-01-01'] to extract year-to-date performance

🧠 Pro Tip

When using .query(), you can refer to external Python variables by prefixing them with an @ symbol: df.query('Age > @min_age').

In [ ]:

🧪 Concept Checks: `Filtering`

Q1. Create a mask for Age >= 30. Use it to filter df and print the result.

In [ ]:

Q2. Filter df for people who are in "IT" OR (|) "Sales". (Remember parentheses around conditions).

In [ ]:

Q3. Use the .isin() method to achieve the exact same result as Q2. Print it.

In [ ]:

Q4. Use the .query() method to find rows where Age < 40 and Dept == "HR". Print the result.

In [ ]:

Q5. Filter the DataFrame to keep rows where the Name starts with "A" using df[df["Name"].str.startswith("A")].

In [ ]:

3. Sorting and Ranking : Ordering Data

🔍 What is it?

Once data is filtered, we often need to sort it to find the top/bottom performers using .sort_values(). You can sort by multiple columns and specify ascending/descending order.

Method	Purpose	Example
`sort_values(by=)`	Sort rows	`df.sort_values(by='Age', ascending=False)`
`nlargest(n, col)`	Top N rows	`df.nlargest(5, 'Salary')` (Faster than sorting!)
`rank()`	Assign ranks	`df['Salary'].rank(ascending=False)`

💼 Why Data Analysts Care

• Leaderboards: Finding the top 10 highest revenue generating products

• Time Series: Ensuring financial data is strictly sorted chronologically before calculating moving averages

🧠 Pro Tip

If you just need the top 5 values, df.nlargest(5, 'Col') is computationally faster than sorting the entire million-row DataFrame and calling .head(5).

In [ ]:

🧪 Concept Checks: `Sorting`

Q1. Sort df by "Time" in ascending order (fastest to slowest). Print the result.

In [ ]:

Q2. Sort df by "Name" alphabetically. Print it.

In [ ]:

Q3. Use .nsmallest(2, "Time") to find the two fastest times. Print the result.

In [ ]:

Q4. Create a new column "Rank" using df["Score"].rank(ascending=False, method="min"). Print df.

In [ ]:

Q5. Sort the DataFrame by the index (if it was scrambled) using df.sort_index(). Print it.

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Matrix Extraction): Create a 5x5 DataFrame of random numbers. Use .iloc to extract a 3x3 subset from the very middle of the DataFrame.

In [ ]:

Task 2 (Cohort Filter): Create a DataFrame of 10 users with Age and Country. Filter for users Age > 18 AND Country == 'USA'. Extract only their Name column using .loc[mask, 'Name'].

In [ ]:

Task 3 (Dynamic Query): Create a variable min_salary = 60000. Use df.query("Salary > @min_salary") to filter a DataFrame of employees. Print the result.

In [ ]:

Task 4 (Top Performers): Create a DataFrame of 100 random student scores. Find the top 5 scores using .nlargest(). Then sort the whole DataFrame and use .head(5). Verify they match.

In [ ]:

Task 5 (Text Filtering): Create a DataFrame of product names. Filter the DataFrame to only keep rows where the product name contains the word 'Pro' using df['Product'].str.contains('Pro').

In [ ]:

💻 Pure Coding Interview Questions

Q1.

What is the difference between `.loc` and `.iloc`?

In [ ]:

Q2.

Explain why `.iloc[0:2]` returns 2 rows, but `.loc[0:2]` might return 3 rows (if the index is integers).

In [ ]:

Q3.

How do you filter a DataFrame based on values in a list? (Hint: `.isin()`).

In [ ]:

Q4.

Write a boolean mask to filter out rows where a specific column contains `NaN`. (Hint: `.notna()`).

In [ ]:

Q5.

What is the `.query()` method and why might you choose it over standard boolean masking?

In [ ]:

Q6.

How do you reference external variables inside a `.query()` string?

In [ ]:

Q7.

Explain what `SettingWithCopyWarning` is. How do you prevent it using `.copy()`?

In [ ]:

Q8.

Write code to sort a DataFrame by Column A descending, and then Column B ascending.

In [ ]:

Q9.

What is the difference between `sort_values` and `sort_index`?

In [ ]:

Q10.

Why is `nlargest(5, 'A')` generally preferred over `sort_values('A').head(5)`?

In [ ]:

Q11.

How do you select a single scalar value from a DataFrame extremely fast? (Hint: `.at` and `.iat`).

In [ ]:

Q12.

Write code to filter a DataFrame using a regular expression on a string column. (`.str.contains(regex=True)`).

In [ ]:

Q13.

How do you select columns based on their data type? (`df.select_dtypes(include='number')`).

In [ ]:

Q14.

Explain how the `~` operator is used in Pandas boolean masking.

In [ ]:

Q15.

Write code to invert a boolean mask.

In [ ]:

Q16.

How do you update the values of a column only for specific rows using `.loc`?

In [ ]:

Q17.

What happens if you assign a list of values to a new column, but the list length doesn't match the DataFrame length?

In [ ]:

Q18.

Explain the `rank()` method. How does it handle tied values by default?

In [ ]:

Q19.

How do you filter a DataFrame based on the length of a string in a column? (`df['Col'].str.len() > 5`).

In [ ]:

Q20.

Write code to drop all rows where ANY column has a missing value.

In [ ]:

Q21.

How do you reset the index of a DataFrame after filtering it? Why is `drop=True` important?

In [ ]:

Q22.

Explain how boolean masking leverages NumPy's vectorized operations under the hood.

In [ ]:

Q23.

What is a `MultiIndex` (Hierarchical Index)? How do you select data from it using `.loc`?

In [ ]:

Q24.

Write code to select every alternate row in a DataFrame using `.iloc` and step slicing.

In [ ]:

Q25.

How do you randomly shuffle the rows of a DataFrame using `.sample()`?

In [ ]:

📊 Day 24 Executive Summary

#	Topic	Key Takeaway
1	loc/iloc	`iloc` is integer position (exclusive end). `loc` is label name (inclusive end).
2	Masking	`df[df['A'] > 5]` filters rows. Combine with `&` and `	`.
3	Query	`df.query("A > 5")` is highly readable for complex logic.

✅ Instructor's End-of-Day Checklist

• [ ] I can extract rows and columns using .loc and .iloc.

• [ ] I can filter a DataFrame using multiple conditions.

• [ ] I can sort data and find the top N records.

🚀 Continue to Day 25 - Pandas Data Cleaning & Missing Values

📊 Day 24 : Pandas Selection

🎯 Enterprise Objective

📋 Strategic Overview

1. Loc and Iloc : Row and Column Selection

💼 Why Data Analysts Care

⚠️ Inclusive Loc

🧪 Concept Checks: Loc / Iloc

2. Boolean Filtering : Querying Data

💼 Why Data Analysts Care

🧠 Pro Tip

🧪 Concept Checks: Filtering

3. Sorting and Ranking : Ordering Data

💼 Why Data Analysts Care

🧠 Pro Tip

🧪 Concept Checks: Sorting

🛠️ Professional Practice Tasks

💻 Pure Coding Interview Questions

What is the difference between .loc and .iloc?

Explain why .iloc[0:2] returns 2 rows, but .loc[0:2] might return 3 rows (if the index is integers).

How do you filter a DataFrame based on values in a list? (Hint: .isin()).

Write a boolean mask to filter out rows where a specific column contains NaN. (Hint: .notna()).

What is the .query() method and why might you choose it over standard boolean masking?

How do you reference external variables inside a .query() string?

Explain what SettingWithCopyWarning is. How do you prevent it using .copy()?

Write code to sort a DataFrame by Column A descending, and then Column B ascending.

What is the difference between sort_values and sort_index?

Why is nlargest(5, 'A') generally preferred over sort_values('A').head(5)?

How do you select a single scalar value from a DataFrame extremely fast? (Hint: .at and .iat).

Write code to filter a DataFrame using a regular expression on a string column. (.str.contains(regex=True)).

How do you select columns based on their data type? (df.select_dtypes(include='number')).

Explain how the ~ operator is used in Pandas boolean masking.

Write code to invert a boolean mask.

How do you update the values of a column only for specific rows using .loc?

What happens if you assign a list of values to a new column, but the list length doesn't match the DataFrame length?

Explain the rank() method. How does it handle tied values by default?

How do you filter a DataFrame based on the length of a string in a column? (df['Col'].str.len() > 5).

Write code to drop all rows where ANY column has a missing value.

How do you reset the index of a DataFrame after filtering it? Why is drop=True important?

Explain how boolean masking leverages NumPy's vectorized operations under the hood.

What is a MultiIndex (Hierarchical Index)? How do you select data from it using .loc?

Write code to select every alternate row in a DataFrame using .iloc and step slicing.

How do you randomly shuffle the rows of a DataFrame using .sample()?

📊 Day 24 Executive Summary

✅ Instructor's End-of-Day Checklist

🧪 Concept Checks: `Loc / Iloc`

🧪 Concept Checks: `Filtering`

🧪 Concept Checks: `Sorting`

What is the difference between `.loc` and `.iloc`?

Explain why `.iloc[0:2]` returns 2 rows, but `.loc[0:2]` might return 3 rows (if the index is integers).

How do you filter a DataFrame based on values in a list? (Hint: `.isin()`).

Write a boolean mask to filter out rows where a specific column contains `NaN`. (Hint: `.notna()`).

What is the `.query()` method and why might you choose it over standard boolean masking?

How do you reference external variables inside a `.query()` string?

Explain what `SettingWithCopyWarning` is. How do you prevent it using `.copy()`?

What is the difference between `sort_values` and `sort_index`?

Why is `nlargest(5, 'A')` generally preferred over `sort_values('A').head(5)`?

How do you select a single scalar value from a DataFrame extremely fast? (Hint: `.at` and `.iat`).

Write code to filter a DataFrame using a regular expression on a string column. (`.str.contains(regex=True)`).

How do you select columns based on their data type? (`df.select_dtypes(include='number')`).

Explain how the `~` operator is used in Pandas boolean masking.

How do you update the values of a column only for specific rows using `.loc`?

Explain the `rank()` method. How does it handle tied values by default?

How do you filter a DataFrame based on the length of a string in a column? (`df['Col'].str.len() > 5`).

How do you reset the index of a DataFrame after filtering it? Why is `drop=True` important?

What is a `MultiIndex` (Hierarchical Index)? How do you select data from it using `.loc`?

Write code to select every alternate row in a DataFrame using `.iloc` and step slicing.

How do you randomly shuffle the rows of a DataFrame using `.sample()`?