📊 Day 23 : Pandas Introduction

🎯 Enterprise Objective

Pandas is the de facto standard Python library for tabular data manipulation. It brings SQL-like operations (filtering, joining, grouping) to data held in memory. Today we learn how to create DataFrames, load external files, perform Exploratory Data Analysis (EDA), and execute vectorized column math.

📋 Strategic Overview

| # | Topic | Concept |
|---|---|---|
| 1 | DataFrames | Tables & Series |
| 2 | EDA | info(), describe() |
| 3 | Operations | Column math & renaming |

1. Pandas Series & DataFrames : Data Structures

🔍 What is it?

Pandas is the standard Python library for tabular data (think Excel or SQL tables in Python). It is built on top of NumPy. A Series is a 1D column with row labels (an index). A DataFrame is a 2D table composed of multiple Series that share the same index.

import pandas as pd
# Creating a DataFrame from a dictionary
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
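
To make the "shared index" idea concrete, here is a minimal sketch that builds a DataFrame from two Series. The row labels `e1` and `e2` are invented for illustration and do not come from the lesson's data.

```python
import pandas as pd

# Two Series with the same (hypothetical) row labels.
names = pd.Series(["Alice", "Bob"], index=["e1", "e2"])
ages = pd.Series([25, 30], index=["e1", "e2"])

# Each Series becomes a column; the labels become the DataFrame's index.
df = pd.DataFrame({"Name": names, "Age": ages})

# Look up a single cell by row label and column name.
print(df.loc["e2", "Age"])  # 30
```

Because both columns carry the same labels, a single row label is enough to address a value in any column.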

💼 Why Data Analysts Care

• SQL Replacement: Pandas can perform JOINs, GROUP BYs, and aggregations directly in Python memory

• Data Cleaning: Pandas ships with many built-in methods to handle missing data and convert between formats
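
As a rough illustration of both bullets, the sketch below fills a missing value, joins two tables, and aggregates by group. The `orders` and `customers` tables and their column names are made up for this example.

```python
import pandas as pd

# Hypothetical tables; the second order has a missing amount.
orders = pd.DataFrame({"customer_id": [1, 2, 1], "amount": [50.0, None, 30.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

# Data cleaning: replace the missing amount with 0.0.
orders["amount"] = orders["amount"].fillna(0.0)

# SQL-style LEFT JOIN on the shared key.
joined = orders.merge(customers, on="customer_id", how="left")

# SQL-style GROUP BY with a SUM aggregation.
totals = joined.groupby("region")["amount"].sum()
print(totals)
```

Whether 0.0 is the right fill value depends on the dataset; it is used here only to keep the example short.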

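⚠️ Looping over DataFrames

Avoid iterating over DataFrame rows with a for loop (e.g., `iterrows()`). Each iteration runs in the Python interpreter and builds a new Series for the row, which throws away Pandas' vectorized performance. Prefer column-level vectorized operations. Fall back to `.apply()` only when no vectorized equivalent exists, because `.apply()` with a Python function still calls that function once per element or row.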
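
The sketch below computes the same result both ways on a small, made-up DataFrame so the two patterns can be compared side by side. On three rows the speed difference is invisible; it grows with the number of rows.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Revenue": [100.0, 250.0, 80.0], "Cost": [60.0, 200.0, 90.0]})

# Slow pattern: a Python-level loop that handles one row at a time.
profit_loop = []
for _, row in df.iterrows():
    profit_loop.append(row["Revenue"] - row["Cost"])

# Preferred pattern: one vectorized subtraction over whole columns.
profit_vec = df["Revenue"] - df["Cost"]

print(profit_vec.tolist())                 # [40.0, 50.0, -10.0]
print(profit_loop == profit_vec.tolist())  # True
```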
In [ ]:

🧪 Concept Checks: DataFrames

Q1. Import pandas as pd. Create a Series from [10, 20, 30] and print it. Notice the index column.

In [ ]:

Q2. Create a DataFrame from a dictionary of lists: {"Product": ["A", "B"], "Price": [10.5, 20.0]}. Print it.

In [ ]:

Q3. Create a DataFrame from a list of dictionaries (JSON style): [{"A": 1, "B": 2}, {"A": 3, "B": 4}]. Print it.

In [ ]:

Q4. Extract the "Price" column from the Q2 DataFrame and print its type(). It should be a Series.

In [ ]:

Q5. Print the df.index and df.columns attributes of your DataFrame.

In [ ]:

2. Reading & Exploring Data : I/O and EDA

🔍 What is it?

In the real world, you don't create DataFrames by hand; you read them from CSVs, SQL, or JSON. Once loaded, you use Exploratory Data Analysis (EDA) methods to understand the shape, data types, and missing values in your dataset.

| Method | Purpose |
|---|---|
| pd.read_csv() | Load data from a CSV file |
| df.head(n) | View the first n rows (default 5) |
| df.info() | Check column types and missing (Null) values |
| df.describe() | Summary statistics (mean, min, max) for numeric columns |
| df.shape | Tuple of (rows, columns) |
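
Here is a small sketch of the methods in the table working together. To keep it self-contained, it reads CSV text from an in-memory buffer (`io.StringIO`) instead of a file on disk, and the three-row dataset is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; the second row has no amount.
csv_text = """order_id,region,amount
1,North,120.5
2,South,
3,North,80.0
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)       # (3, 3)
print(df.head(2))     # first two rows
df.info()             # dtypes and non-null counts per column
print(df.describe())  # summary statistics for the numeric columns
```

In the `df.info()` output, the `amount` column should report 2 non-null values out of 3 rows, which is how the missing entry shows up.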

💼 Why Data Analysts Care

• Initial Audit: df.info() is a good first command to run after loading, because it reveals numeric columns that were accidentally loaded as text

• Data Distribution: df.describe() shows the min, max, and quartiles of each numeric column, so extreme outliers stand out quickly
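
The audit in the first bullet might look like the sketch below. The data is invented, and `pd.to_numeric(..., errors="coerce")` is one possible repair chosen for this example, not something the lesson prescribes.

```python
import io
import pandas as pd

# Hypothetical CSV where one price was typed as text.
csv_text = """item,price
A,10.5
B,unknown
C,20.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# The stray word forces the whole column to load as text, not numbers.
print(df["price"].dtype)
print(pd.api.types.is_numeric_dtype(df["price"]))  # False

# One possible fix: convert to numbers and turn unparseable values into NaN.
cleaned = pd.to_numeric(df["price"], errors="coerce")
print(cleaned.tolist())
```

Coercing silently discards the bad value, so it is worth counting the resulting NaN entries before moving on.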

🧠 Pro Tip

If you have a very wide DataFrame, the printed output of df.head() hides the middle columns behind an ellipsis. To show every column, run pd.set_option('display.max_columns', None).
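
If changing the setting for the whole session feels too broad, `pd.option_context` applies it only inside a `with` block. The sketch below uses a made-up 40-column DataFrame to show the idea.

```python
import pandas as pd

# Hypothetical wide frame: one row, 40 columns named c0..c39.
wide = pd.DataFrame([range(40)], columns=[f"c{i}" for i in range(40)])

# Lift the column limit only inside this block.
with pd.option_context("display.max_columns", None):
    shown = repr(wide.head())  # the same text that print() would display
    print(shown)

# Outside the block, the previous display setting is back in effect.
```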

In [ ]:

🧪 Concept Checks: Exploring Data

Q1. Write the theoretical command to read a file named "sales_data.csv" into a DataFrame df.

In [ ]:

Q2. Create a random DataFrame with 20 rows. Print df.head() and df.tail(3).

In [ ]:

Q3. Run df.info() on your DataFrame. What information does the Non-Null Count column provide?

In [ ]:

Q4. Run df.describe(). What is the 50% row representing? (Hint: The Median).

In [ ]:

Q5. Extract the total number of rows from df.shape and print "Total rows: [N]".

In [ ]:

3. Basic Column Operations : Vectorized Math

🔍 What is it?

Because Pandas is built on NumPy, you can perform math on entire columns instantly without looping. You can easily create new columns by calculating combinations of existing columns.

# Creating a new column
df['Profit'] = df['Revenue'] - df['Cost']

💼 Why Data Analysts Care

• Feature Engineering: Creating a Price_Per_Unit column by dividing Total_Price by Quantity

• Date Math: Calculating days since a purchase by subtracting Purchase_Date from Today
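
Both bullets can be sketched in a few lines. The rows below are invented, and "today" is pinned to a fixed date so the output is reproducible; in real code you would more likely use `pd.Timestamp.today()`.

```python
import pandas as pd

# Hypothetical purchases.
df = pd.DataFrame({
    "Total_Price": [100.0, 45.0, 30.0],
    "Quantity": [4, 3, 2],
    "Purchase_Date": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-02-01"]),
})

# Feature engineering: element-wise division of two columns.
df["Price_Per_Unit"] = df["Total_Price"] / df["Quantity"]

# Date math: subtracting a datetime column gives timedeltas; .dt.days extracts whole days.
today = pd.Timestamp("2024-03-01")  # fixed date, an assumption for this example
df["Days_Since_Purchase"] = (today - df["Purchase_Date"]).dt.days

print(df[["Price_Per_Unit", "Days_Since_Purchase"]])
```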

🧠 Pro Tip

To drop a column, use df.drop(columns=['ColName']). Remember that Pandas methods usually return a NEW DataFrame. To save it, overwrite the variable: df = df.drop(...).
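
A quick sketch of why the reassignment matters, using a made-up two-column DataFrame:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"Revenue": [100, 250], "Cost": [60, 200]})

# The result is discarded, so df itself is unchanged.
df.drop(columns=["Cost"])
print(list(df.columns))  # ['Revenue', 'Cost']

# Reassigning keeps the new DataFrame.
df = df.drop(columns=["Cost"])
print(list(df.columns))  # ['Revenue']
```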

In [ ]:

🧪 Concept Checks: Column Math

Q1. Given df with Cost and Revenue, create a new column Profit = Revenue - Cost.

In [ ]:

Q2. Create a column Margin which is Profit / Revenue. Print the DataFrame.

In [ ]:

Q3. Drop the Cost column. Ensure you save the result back to df or a new variable.

In [ ]:

Q4. Add 100 to every value in the Revenue column using df["Revenue"] = df["Revenue"] + 100.

In [ ]:

Q5. Rename the column "Revenue" to "Total_Sales" using df.rename(columns={"Old": "New"}).

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Dictionary to DF): Create a dictionary containing data for 5 employees (Name, Department, Salary). Convert it to a Pandas DataFrame. Print the DataFrame.

In [ ]:

Task 2 (Summary Stats): Create a DataFrame with 1000 rows of random numbers (np.random.rand(1000)). Use df.describe() to find the mean and standard deviation. Print them.

In [ ]:

Task 3 (Currency Conversion): Given a DataFrame with a Price_USD column, create a new column Price_EUR assuming an exchange rate of 0.85. Drop the original USD column.

In [ ]:

Task 4 (Boolean Column): Given a DataFrame with a Score column (0-100), create a new boolean column Passed which is True if Score >= 60, and False otherwise.

In [ ]:

Task 5 (CSV Simulation): Use pathlib to write a small CSV string to data.csv. Then use pd.read_csv('data.csv') to load it into a DataFrame and print df.head().

In [ ]:

💻 Pure Coding Interview Questions

Q1.

What is the difference between a Pandas Series and a Pandas DataFrame?

In [ ]:

Q2.

Why is Pandas built on top of NumPy? What benefits does it provide?

In [ ]:

Q3.

Explain why iterating over a DataFrame with iterrows() is considered an anti-pattern.

In [ ]:

Q4.

What does df.info() show that df.describe() does not?

In [ ]:

Q5.

How do you read an Excel file in Pandas? What dependency is required? (openpyxl).

In [ ]:

Q6.

What is the inplace=True argument? Why is the Pandas team discouraging its use in newer versions?

In [ ]:

Q7.

How do you change the data type of a column from string to integer? (astype).

In [ ]:

Q8.

Write code to rename multiple columns in a DataFrame using a dictionary.

In [ ]:

Q9.

What happens if you try to add a new column using dot notation df.NewCol = 10 instead of bracket notation df['NewCol'] = 10?

In [ ]:

Q10.

How do you write a DataFrame back to a CSV file without including the index column? (index=False).

In [ ]:

Q11.

Explain the difference between df['A'] and df[['A']] in terms of the object type returned.

In [ ]:

Q12.

How do you sample a random 10% of your DataFrame? (df.sample(frac=0.1)).

In [ ]:

Q13.

What is the Pandas Index? How is it different from a regular column?

In [ ]:

Q14.

How do you set a specific column to be the index of the DataFrame? (set_index).

In [ ]:

Q15.

Write code to drop multiple columns at once.

In [ ]:

Q16.

How do you check memory usage of a DataFrame? (df.info(memory_usage='deep')).

In [ ]:

Q17.

Explain how Pandas handles missing data natively. What object represents a missing number?

In [ ]:

Q18.

Write a vectorized operation that squares the values in Column A and adds them to Column B.

In [ ]:

Q19.

How do you read data from a SQL database directly into a Pandas DataFrame? (read_sql).

In [ ]:

Q20.

What is a categorical data type in Pandas, and when should you use it to save memory?

In [ ]:

Q21.

How do you apply a custom Python function to an entire column? (.apply()).

In [ ]:

Q22.

Explain the difference between .map() and .apply() on a Series.

In [ ]:

Q23.

How do you read a JSON file into Pandas where the records are nested deeply?

In [ ]:

Q24.

Write code to extract all the column names of a DataFrame into a Python list.

In [ ]:

Q25.

What is pd.to_datetime() used for?

In [ ]:

📊 Day 23 Executive Summary

| # | Topic | Key Takeaway |
|---|---|---|
| 1 | Structures | DataFrames are collections of Series sharing an index |
| 2 | I/O | Use pd.read_csv() to load data, df.info() to check it |
| 3 | Math | df['New'] = df['A'] + df['B'] computes the whole column element-wise in one vectorized step |

✅ Instructor's End-of-Day Checklist

• [ ] I can create a DataFrame from a dictionary.

• [ ] I can check data types and nulls using df.info().

• [ ] I can create new calculated columns.