๐ Day 24 : Pandas Selection
๐ฏ Enterprise Objective
Knowing how to navigate tables is the most heavily tested skill in Data Analyst interviews. Today we master precise row/column extraction using .loc and .iloc, SQL-style filtering using Boolean Masks and .query(), and leaderboard generation via sorting.
๐ Strategic Overview
| # | Topic | Concept |
|---|---|---|
| 1 | Loc/Iloc | Exact Indexing |
| 2 | Masking/Query | Conditional Filtering |
| 3 | Sorting | sort_values, nlargest |
1. Loc and Iloc : Row and Column Selection
Selecting data in Pandas requires precision. .iloc[] selects by integer position (like standard Python lists). .loc[] selects by index label and column name. Both follow the format [rows, columns].
| Accessor | Paradigm | Example | Meaning |
|---|---|---|---|
.iloc | Integer Position | df.iloc[0:5, 0:2] | First 5 rows, first 2 columns |
.loc | Label / Name | df.loc[:, 'Age':'City'] | All rows, columns from Age to City |
๐ผ Why Data Analysts Care
โข Feature Selection: Extracting the target variable Y = df['Price'] and the features X = df.iloc[:, :-1] for Machine Learning
โ ๏ธ Inclusive Loc
A major 'gotcha': Slicing with .iloc[0:5] is EXCLUSIVE of 5 (returns 0,1,2,3,4). But slicing with .loc['A':'C'] is INCLUSIVE of 'C' (returns A,B,C).
๐งช Concept Checks: Loc / Iloc
Q1. Given df, use .iloc to select the first 3 rows and the first 2 columns.
Q2. Use .loc to select all rows (:), but only the "Name" and "Salary" columns.
Q3. What happens if you try df.iloc[0, "Name"]? Catch the TypeError (iloc requires integers).
Q4. Use .loc to slice rows from "ID2" to "ID4" inclusive. Print the result.
Q5. Extract the single scalar value in row 0, column 1 using .iloc[0, 1]. Print it.
2. Boolean Filtering : Querying Data
You can filter DataFrames exactly like NumPy arrays using Boolean Masks. Create a condition, and pass it into the DataFrame brackets: df[condition]. Pandas also provides the highly readable .query() method for SQL-like string queries.
# Standard Masking
rich_users = df[df['Salary'] > 75000]
# Multiple conditions (Requires parenthesis and bitwise &)
targets = df[(df['Age'] > 30) & (df['City'] == 'NY')]
# The elegant .query() alternative
targets = df.query("Age > 30 and City == 'NY'")
๐ผ Why Data Analysts Care
โข Cohort Extraction: Filtering a massive dataset down to only active users from a specific country
โข Date Filtering: df[df['Date'] >= '2024-01-01'] to extract year-to-date performance
๐ง Pro Tip
When using .query(), you can refer to external Python variables by prefixing them with an @ symbol: df.query('Age > @min_age').
๐งช Concept Checks: Filtering
Q1. Create a mask for Age >= 30. Use it to filter df and print the result.
Q2. Filter df for people who are in "IT" OR (|) "Sales". (Remember parentheses around conditions).
Q3. Use the .isin() method to achieve the exact same result as Q2. Print it.
Q4. Use the .query() method to find rows where Age < 40 and Dept == "HR". Print the result.
Q5. Filter the DataFrame to keep rows where the Name starts with "A" using df[df["Name"].str.startswith("A")].
3. Sorting and Ranking : Ordering Data
Once data is filtered, we often need to sort it to find the top/bottom performers using .sort_values(). You can sort by multiple columns and specify ascending/descending order.
| Method | Purpose | Example |
|---|---|---|
sort_values(by=) | Sort rows | df.sort_values(by='Age', ascending=False) |
nlargest(n, col) | Top N rows | df.nlargest(5, 'Salary') (Faster than sorting!) |
rank() | Assign ranks | df['Salary'].rank(ascending=False) |
๐ผ Why Data Analysts Care
โข Leaderboards: Finding the top 10 highest revenue generating products
โข Time Series: Ensuring financial data is strictly sorted chronologically before calculating moving averages
๐ง Pro Tip
If you just need the top 5 values, df.nlargest(5, 'Col') is computationally faster than sorting the entire million-row DataFrame and calling .head(5).
๐งช Concept Checks: Sorting
Q1. Sort df by "Time" in ascending order (fastest to slowest). Print the result.
Q2. Sort df by "Name" alphabetically. Print it.
Q3. Use .nsmallest(2, "Time") to find the two fastest times. Print the result.
Q4. Create a new column "Rank" using df["Score"].rank(ascending=False, method="min"). Print df.
Q5. Sort the DataFrame by the index (if it was scrambled) using df.sort_index(). Print it.
๐ ๏ธ Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (Matrix Extraction): Create a 5x5 DataFrame of random numbers. Use .iloc to extract a 3x3 subset from the very middle of the DataFrame.
Task 2 (Cohort Filter): Create a DataFrame of 10 users with Age and Country. Filter for users Age > 18 AND Country == 'USA'. Extract only their Name column using .loc[mask, 'Name'].
Task 3 (Dynamic Query): Create a variable min_salary = 60000. Use df.query("Salary > @min_salary") to filter a DataFrame of employees. Print the result.
Task 4 (Top Performers): Create a DataFrame of 100 random student scores. Find the top 5 scores using .nlargest(). Then sort the whole DataFrame and use .head(5). Verify they match.
Task 5 (Text Filtering): Create a DataFrame of product names. Filter the DataFrame to only keep rows where the product name contains the word 'Pro' using df['Product'].str.contains('Pro').
๐ป Pure Coding Interview Questions
Q1.
What is the difference between .loc and .iloc?
Q2.
Explain why .iloc[0:2] returns 2 rows, but .loc[0:2] might return 3 rows (if the index is integers).
Q3.
How do you filter a DataFrame based on values in a list? (Hint: .isin()).
Q4.
Write a boolean mask to filter out rows where a specific column contains NaN. (Hint: .notna()).
Q5.
What is the .query() method and why might you choose it over standard boolean masking?
Q6.
How do you reference external variables inside a .query() string?
Q7.
Explain what SettingWithCopyWarning is. How do you prevent it using .copy()?
Q8.
Write code to sort a DataFrame by Column A descending, and then Column B ascending.
Q9.
What is the difference between sort_values and sort_index?
Q10.
Why is nlargest(5, 'A') generally preferred over sort_values('A').head(5)?
Q11.
How do you select a single scalar value from a DataFrame extremely fast? (Hint: .at and .iat).
Q12.
Write code to filter a DataFrame using a regular expression on a string column. (.str.contains(regex=True)).
Q13.
How do you select columns based on their data type? (df.select_dtypes(include='number')).
Q14.
Explain how the ~ operator is used in Pandas boolean masking.
Q15.
Write code to invert a boolean mask.
Q16.
How do you update the values of a column only for specific rows using .loc?
Q17.
What happens if you assign a list of values to a new column, but the list length doesn't match the DataFrame length?
Q18.
Explain the rank() method. How does it handle tied values by default?
Q19.
How do you filter a DataFrame based on the length of a string in a column? (df['Col'].str.len() > 5).
Q20.
Write code to drop all rows where ANY column has a missing value.
Q21.
How do you reset the index of a DataFrame after filtering it? Why is drop=True important?
Q22.
Explain how boolean masking leverages NumPy's vectorized operations under the hood.
Q23.
What is a MultiIndex (Hierarchical Index)? How do you select data from it using .loc?
Q24.
Write code to select every alternate row in a DataFrame using .iloc and step slicing.
Q25.
How do you randomly shuffle the rows of a DataFrame using .sample()?
๐ Day 24 Executive Summary
| # | Topic | Key Takeaway | |
|---|---|---|---|
| 1 | loc/iloc | iloc is integer position (exclusive end). loc is label name (inclusive end). | |
| 2 | Masking | df[df['A'] > 5] filters rows. Combine with & and ` | `. |
| 3 | Query | df.query("A > 5") is highly readable for complex logic. |
โ Instructor's End-of-Day Checklist
โข [ ] I can extract rows and columns using .loc and .iloc.
โข [ ] I can filter a DataFrame using multiple conditions.
โข [ ] I can sort data and find the top N records.