๐ Day 28 : Pandas Time Series
๐ฏ Enterprise Objective
Data is rarely clean, and it usually involves Time. Today we master Pandas Accessors (.str and .dt) to clean text and extract date features. We then dive into Time Series analysis, learning how to resample frequencies and calculate Rolling Moving Averages.
๐ Strategic Overview
| # | Topic | Concept |
|---|---|---|
| 1 | Accessors | .str.upper(), .dt.year |
| 2 | Resample | Grouping by Time ('M', 'W') |
| 3 | Rolling | Moving Averages (MA) |
1. String and Datetime Accessors : .str and .dt
Pandas provides specialized methods for specific data types using Accessors. If a column contains strings, use the .str accessor to apply string methods (like .upper()). If it contains dates, use .dt to extract the year or month.
# String Accessor
df['Name'] = df['Name'].str.upper().str.strip()
# Datetime Accessor
df['Year'] = df['Date'].dt.year
๐ผ Why Data Analysts Care
โข Text Cleaning: Removing leading/trailing spaces from a column of messy user inputs
โข Seasonality: Extracting the DayOfWeek from a Timestamp to see if weekend sales are higher
โ ๏ธ Type Errors
You cannot use .dt on a string column! You must first convert it: df['Date'] = pd.to_datetime(df['DateString']) before using the accessor.
๐งช Concept Checks: Accessors
Q1. Given df["Text"] = [" hello ", " WORLD "]. Use .str.lower().str.strip() to clean it.
Q2. Create a string column "100 USD". Use .str.replace(" USD", "") to remove the currency.
Q3. Convert df["Date"] = ["2020-01-01"] to datetime using pd.to_datetime(). Print the dtypes.
Q4. Use the .dt accessor to extract the .day from your converted datetime column.
Q5. Use df["Text"].str.contains("hel") to create a boolean mask. Print it.
2. Time Series Resampling : Rolling up Dates
When working with Time Series data (where the Index is a Datetime), you can use .resample(). It is exactly like .groupby(), but specifically for time frequencies (e.g., grouping daily data into monthly averages).
| Frequency | String | Example |
|---|---|---|
| Daily | 'D' | df.resample('D').sum() |
| Weekly | 'W' | df.resample('W').mean() |
| Monthly | 'M' | df.resample('M').max() |
| Quarterly | 'Q' | df.resample('Q').count() |
๐ผ Why Data Analysts Care
โข Financial Aggregations: Converting minute-by-tick stock data into Daily OHLC (Open, High, Low, Close) bars
โข Smoothing: Aggregating noisy daily web traffic into smooth Weekly averages
๐ง Pro Tip
.resample() ONLY works if your DataFrame's index is a DatetimeIndex! Use df.set_index('DateColumn', inplace=True) before attempting to resample.
๐งช Concept Checks: Resampling
Q1. Create dates = pd.date_range("2020-01-01", periods=30, freq="D") and a DF with it as the index.
Q2. Fill the DF with a Sales column of random integers. Print df.head().
Q3. Resample the daily data to Weekly ("W") and calculate the .sum(). Print it.
Q4. Resample the data to Monthly ("M") and calculate the .mean(). Print it.
Q5. What happens if you try to resample a DataFrame whose index is just integers (0,1,2...)? Catch the TypeError.
3. Rolling Windows : Moving Averages
A Rolling Window calculates statistics over a sliding window of time. The most common use case is the Moving Average, which smooths out short-term fluctuations in data to highlight longer-term trends.
# Calculate a 7-day moving average
df['7D_MA'] = df['Sales'].rolling(window=7).mean()
๐ผ Why Data Analysts Care
โข Stock Analysis: Calculating the 50-day and 200-day moving averages to find 'Golden Cross' buy signals
โข Trend Detection: Smoothing out weekday vs weekend sales spikes to see the true month-over-month growth
๐ง Pro Tip
The first N-1 rows of a rolling window of size N will result in NaN, because there isn't enough historical data to calculate the full window yet. You can use min_periods=1 to calculate partial windows.
๐งช Concept Checks: Rolling
Q1. Given df["Prices"] = [10, 20, 30, 40, 50]. Calculate a rolling mean with window=2.
Q2. Observe the NaN in the first row. Add min_periods=1 to the .rolling() call and print again.
Q3. Calculate a rolling .max() with window=3. Print it.
Q4. Calculate an expanding sum (cumulative sum) using df["Prices"].expanding().sum(). Print it.
Q5. Explain why moving averages are essential for visualizing highly volatile daily data.
๐ ๏ธ Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (String Pipeline): Create df = pd.DataFrame({'Code': [' id-01 ', 'id-02', ' ID-03 ']}). Write a chained pipeline using .str to: strip whitespace, convert to uppercase, and replace '-' with '_'. Result should be 'ID_01'.
Task 2 (Datetime Features): Create a single Datetime column for '2023-12-25'. Use the .dt accessor to extract the Year, Month, Day, and day_name() into 4 separate new columns.
Task 3 (Monthly Aggregation): Generate 365 days of random sales data for the year 2023. Set the date as the index. Resample to Monthly ('M') and find the Total Sales and Max Daily Sale per month using .agg(['sum', 'max']).
Task 4 (Golden Cross): Generate 100 days of stock prices. Calculate the 10-day moving average and the 30-day moving average. Add both as columns to the DataFrame.
Task 5 (Shift and Differences): Use the .shift(1) method on a Sales column to create a Previous_Day_Sales column. Then calculate the Daily_Growth by subtracting them. This is how Day-over-Day metrics are made!
๐ป Pure Coding Interview Questions
Q1.
Explain the difference between the .str accessor and the .dt accessor.
Q2.
How do you convert a string column 'Jan 15, 2023' into a Pandas datetime object?
Q3.
What happens if you use pd.to_datetime() on a column that has mixed or European date formats? (Hint: dayfirst=True or format=...).
Q4.
Write code to extract the day of the week (e.g., 'Monday') from a datetime column.
Q5.
Explain what a DatetimeIndex is. Why is it required for .resample()?
Q6.
What is the difference between .resample('M').mean() and .groupby(df.index.month).mean()?
Q7.
How do you handle the NaN values generated by a .rolling(window=30) calculation?
Q8.
Explain the difference between a Rolling window and an Expanding window.
Q9.
Write code to calculate the Exponential Moving Average (EMA) using .ewm().
Q10.
How do you calculate the day-over-day percentage change in Pandas? (Hint: .pct_change()).
Q11.
Write code using .str.contains() with a regex to filter rows that contain an email address.
Q12.
How do you split a string column 'First Last' into two separate columns 'First' and 'Last'? (.str.split(expand=True)).
Q13.
Explain .shift(). How is it used to calculate differences between consecutive rows?
Q14.
What does .diff() do? How does it relate to .shift()?
Q15.
How do you handle Time Zones in Pandas datetime objects? (.dt.tz_localize() and .tz_convert()).
Q16.
Write a .resample() operation that aggregates trade data into OHLC (Open, High, Low, Close) bars.
Q17.
Explain the pd.DateOffset object and how it's used to add exactly one month to a datetime column.
Q18.
How do you fill missing dates in a Time Series so that every single day has a row? (Hint: .asfreq('D')).
Q19.
Write code to interpolate missing values in a time series quadratically.
Q20.
What is a Timedelta? How do you calculate the number of days between two datetime columns?
Q21.
Explain how to use .str.get_dummies() for One-Hot Encoding a categorical string column.
Q22.
Write code to filter a Time Series DataFrame to only include business days (Monday-Friday).
Q23.
How do you slice a Time Series DataFrame using partial string indexing? (e.g., df.loc['2023-01']).
Q24.
What is 'Look-ahead bias' in Time Series analysis, and how do rolling windows help prevent it?
Q25.
Write code to calculate the rolling standard deviation (volatility) of a stock price over a 20-day window.
๐ Day 28 Executive Summary
| # | Topic | Key Takeaway |
|---|---|---|
| 1 | .str / .dt | Unlock hundreds of type-specific methods on entire columns |
| 2 | Resample | It's just groupby but for dates. Requires a DatetimeIndex. |
| 3 | Rolling | Calculates stats over a sliding window. First N rows will be NaN. |
โ Instructor's End-of-Day Checklist
โข [ ] I can use .str to clean text columns.
โข [ ] I can .resample() daily data into monthly data.
โข [ ] I can calculate a 7-day rolling moving average.