Day 15: File Handling

🎯 Enterprise Objective

Data analyst pipelines start by reading data and end by saving it. Today we master disk I/O, parse JSON (the universal language of the web), and modernize our code with the object-oriented pathlib module.

📋 Strategic Overview

#	Topic	Concept
1	File I/O	`open()`, `with` context
2	JSON	`loads()`, `dumps()`
3	Pathlib	Object-oriented paths

1. Reading & Writing Files : Disk I/O

🔍 What is it?

Data lives in files. Python interacts with files using the open(filename, mode) function. Modes include 'r' (read, the default), 'w' (write: creates the file or overwrites its existing contents), and 'a' (append to the end). You should ALWAYS use a context manager (the with statement) so that the file is closed properly, even if an error occurs inside the block.

with open('data.txt', 'w') as f:
    f.write('Hello World\n')
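To see how the three modes differ, here is a minimal sketch. The file name modes_demo.txt and the temporary directory are assumptions made so the demo cleans up after itself; they are not part of the lesson's files.

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "modes_demo.txt"  # hypothetical file name

    # 'w' creates the file, or empties it if it already exists.
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("first\n")

    # Opening with 'w' again discards the previous contents.
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("second\n")

    # 'a' keeps the existing contents and writes at the end.
    with open(demo_path, "a", encoding="utf-8") as f:
        f.write("third\n")

    # 'r' opens the file for reading.
    with open(demo_path, "r", encoding="utf-8") as f:
        contents = f.read()

print(repr(contents))  # 'second\nthird\n'
```

Note that "first" is gone: the second 'w' open wiped it, while 'a' preserved "second".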

💼 Why Data Analysts Care

• Log Processing: Reading multi-gigabyte log files line-by-line without running out of RAM

• Data Exports: Saving analysis results to local text files
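A rough sketch of the line-by-line pattern mentioned in the first bullet: iterating over the file object yields one line at a time, so the whole file never has to fit in RAM. The helper name count_lines_containing and the sample log contents are made up for illustration.

```python
import tempfile
from pathlib import Path


def count_lines_containing(path, keyword):
    """Count lines that contain keyword without loading the whole file."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # the file object hands back one line per iteration
            if keyword in line:
                count += 1
    return count


# Build a tiny stand-in for a large log file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "sample.log"
    log_path.write_text(
        "INFO start\nERROR disk full\nINFO retry\nERROR timeout\n",
        encoding="utf-8",
    )
    error_count = count_lines_containing(log_path, "ERROR")

print(error_count)  # 2
```

By contrast, f.read() or f.readlines() would pull the entire file into memory at once.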

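⚠️ Resource Leaks

If you do f = open('data.txt') and forget to call f.close(), the file handle stays open until the object is garbage-collected or the program exits. Buffered writes may not reach the disk, and on Windows other programs may be unable to rename or delete the file. Always use with open(...) as f:.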
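The following sketch checks that claim: an exception is raised in the middle of a with block, and the file still ends up closed with its data flushed. The file name and the simulated ValueError are assumptions for the demo.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "leak_demo.txt"  # hypothetical file name

    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write("partial data\n")
            raise ValueError("simulated failure mid-write")
    except ValueError:
        pass  # the with block has already closed the file at this point

    closed_after_error = f.closed
    saved = path.read_text(encoding="utf-8")

print(closed_after_error)  # True
print(repr(saved))         # 'partial data\n'
```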

In [ ]:

🧪 Concept Checks: `File I/O`

Q1. Write code to open a file "test.txt" in write mode ("w") and write your name to it.

In [ ]:

Q2. Open "test.txt" in read mode ("r"). Read the contents and print them.

In [ ]:

Q3. Open "test.txt" in append mode ("a"). Add a new line "Welcome to Python". Print the full file again.

In [ ]:

Q4. Explain why with open() as f: is superior to f = open(); f.read(); f.close().

In [ ]:

Q5. Write a memory-efficient for loop to read a file line-by-line. (Assume file is "test.txt").

In [ ]:

2. Parsing JSON : The Language of the Web

🔍 What is it?

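JSON (JavaScript Object Notation) is the de facto standard format for web APIs. Its objects and arrays map naturally onto Python dictionaries and lists, with a few differences: JSON object keys are always strings, and Python tuples are written out as arrays, so they come back as lists. The built-in json module provides tools to parse strings into Python objects (loads) and serialize Python objects into strings (dumps).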

Function	Purpose	Input -> Output
`json.loads(s)`	Load String	String -> Dictionary
`json.dumps(d)`	Dump String	Dictionary -> String
`json.load(f)`	Load File	File Object -> Dictionary
`json.dump(d, f)`	Dump File	Dictionary -> File Object
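The four functions in the table can be sketched as a round trip. The record dictionary and the file name record.json are invented for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample data.
record = {"name": "Alice", "scores": [90, 85], "active": True, "manager": None}

# String versions: dumps() serializes, loads() parses.
text = json.dumps(record)
parsed = json.loads(text)

# File versions: dump() writes to an open file, load() reads from one.
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "record.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
    with open(json_path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(text)
print(parsed == record, loaded == record)  # True True
```

In the printed string, Python's True and None appear as JSON's true and null. A handy way to remember the names: the trailing "s" in loads and dumps stands for "string".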

💼 Why Data Analysts Care

• API Integration: Parsing REST API responses (which are almost always JSON)

• Configuration: Loading application settings from a .json file

🧠 Pro Tip

Use json.dumps(data, indent=4) to 'pretty-print' complex dictionaries for easy debugging.
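As a quick illustration of the tip, with made-up data:

```python
import json

# Hypothetical nested data.
data = {"user": {"id": 7, "tags": ["admin", "ops"]}}

compact = json.dumps(data)
pretty = json.dumps(data, indent=4)

print(compact)
print(pretty)
```

The compact version comes out on a single line, while the indented version puts every key and list item on its own line, nested four spaces per level. Both strings parse back to the same dictionary.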

In [ ]:

🧪 Concept Checks: `JSON`

Q1. Import json. Use json.loads() to parse '{"x": 10, "y": 20}' into a dictionary.

In [ ]:

Q2. Convert the dictionary {"color": "red", "sizes": [1, 2]} to a JSON string using json.dumps(). Print it.

In [ ]:

Q3. Use json.dumps() with the indent=4 argument to pretty-print {"a": {"b": 1}}.

In [ ]:

Q4. Write code to save a dictionary d directly to a file "data.json" using with open() and json.dump().

In [ ]:

Q5. Read "data.json" back into a dictionary using json.load(). Print the type of the loaded object.

In [ ]:

3. Pathlib : Modern File Paths

🔍 What is it?

Building file paths by hand as strings (e.g., 'data\users.txt') can cause bugs across operating systems, because Windows uses \ as its separator while Mac/Linux use /. The modern Pythonic way is the pathlib module, which treats paths as objects and inserts the correct separator for you.

from pathlib import Path

# Object-oriented paths
folder = Path('data')
file_path = folder / 'users.txt'  # The / operator intelligently joins paths!
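To see the separator handling without switching machines, one option is pathlib's "pure" path classes, which model each platform's rules on any operating system. This is only a sketch for illustration; in everyday code plain Path is usually what you want.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same join expression, interpreted under each platform's rules.
posix_path = PurePosixPath("data") / "users.txt"
windows_path = PureWindowsPath("data") / "users.txt"

print(posix_path)    # data/users.txt
print(windows_path)  # data\users.txt

# The logical components are identical; only the separator differs.
print(posix_path.parts == windows_path.parts)  # True
```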

💼 Why Data Analysts Care

• Cross-Platform Code: Write code on a Mac that executes flawlessly on a Windows server

• File Operations: Easily check if a file exists, get its suffix (e.g., .csv), or read its text instantly

🧠 Pro Tip

Pathlib objects have amazing built-in methods: path.exists(), path.read_text(), and path.suffix. Use them instead of the older os.path module.
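A small sketch of those methods in action. The file report.csv and its contents are invented, and a temporary directory is used so nothing is left behind.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.csv"  # hypothetical file

    existed_before = report.exists()          # nothing written yet
    report.write_text("id,name\n1,Alice\n", encoding="utf-8")
    existed_after = report.exists()

    text = report.read_text(encoding="utf-8")  # opens, reads, and closes
    suffix = report.suffix                     # extension, including the dot
    stem = report.stem                         # file name without extension

print(existed_before, existed_after)  # False True
print(suffix, stem)                   # .csv report
```

read_text() and write_text() open and close the file internally, so no with block is needed for small files like this.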

In [ ]:

🧪 Concept Checks: `Pathlib`

Q1. Import Path from pathlib. Create a Path object for "folder" / "subfolder" / "file.csv". Print it.

In [ ]:

Q2. Create a Path object p = Path("demo.txt"). Use p.write_text("Hello") to create the file.

In [ ]:

Q3. Use p.read_text() to read the file created in Q2 and print it. Then check p.exists().

In [ ]:

Q4. Create a Path for "image.jpg". Print its .suffix and .stem (the name without extension).

In [ ]:

Q5. Use Path.cwd() to get the current working directory. Print it.

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Log Parser): Create a file server.log with 5 lines: 2 containing 'ERROR', 3 containing 'INFO'. Write a memory-efficient loop to read the file and print ONLY the 'ERROR' lines.

In [ ]:

Task 2 (JSON Config Updater): Write a function update_config(file_path, key, val). It should read a JSON file (or create {} if missing), update the key, and save the JSON back to the file.

In [ ]:

Task 3 (File Extension Counter): Create 3 files: a.txt, b.csv, c.txt in a new folder using Pathlib. Write a function that uses Path.iterdir() to iterate the folder and count how many .txt files exist.

In [ ]:

Task 4 (CSV to JSON): Write a simulated CSV string (e.g., 'id,name\n1,Alice\n2,Bob'). Parse it manually using split('\n') and split(','), convert to a list of dicts, and json.dumps() it.

In [ ]:

Task 5 (Safe File Reader): Write a function read_safe(path) that uses pathlib to check if a file exists. If so, return its text. If not, return None. Test with a valid and invalid path.

In [ ]:

💻 Pure Coding Interview Questions

Q1.

Explain the difference between `open('f.txt', 'w')` and `open('f.txt', 'a')`.

In [ ]:

Q2.

Why is it essential to use a context manager (`with` statement) when opening files?

In [ ]:

Q3.

Explain the difference between `f.read()`, `f.readline()`, and `f.readlines()`.

In [ ]:

Q4.

How do you read a 50GB file in Python without running out of RAM?

In [ ]:

Q5.

Explain the difference between `json.loads()` and `json.load()`.

In [ ]:

Q6.

Write code to parse a JSON string, extract a specific field, and handle a `json.JSONDecodeError`.

In [ ]:

Q7.

Why shouldn't you use regular expressions to parse JSON or HTML?

In [ ]:

Q8.

Compare `os.path.join` with `pathlib`'s `/` operator. Why is pathlib preferred in modern Python?

In [ ]:

Q9.

Write a script that uses `pathlib` to rename all `.txt` files in a directory to `.md`.

In [ ]:

Q10.

How do you write a list of dictionaries to a CSV file without using Pandas (using the `csv` module)?

In [ ]:

Q11.

Explain how character encodings work in Python. Why should you often use `encoding='utf-8'` in `open()`?

In [ ]:

Q12.

Write code that safely creates a nested directory structure (e.g., `a/b/c`) if it doesn't exist.

In [ ]:

Q13.

What is the Pickle module? Why is `json` generally preferred over `pickle` for data serialization?

In [ ]:

Q14.

Write a generator function that reads a file and yields chunks of 1024 bytes at a time.

In [ ]:

Q15.

Explain the security risks of using `yaml.load()` or `pickle.loads()` on untrusted data.

In [ ]:

Q16.

Write code using the `tempfile` module to create a temporary file, write data, and auto-delete it.

In [ ]:

Q17.

How do you handle file locking in Python if two processes try to write to the same file simultaneously?

In [ ]:

Q18.

Explain what the `__file__` variable is and how it's used to find relative asset paths.

In [ ]:

Q19.

Write a function that recursively finds all files larger than 1MB in a directory using `pathlib`.

In [ ]:

Q20.

How do you handle reading a file that might be locked or currently being written to by another program?

In [ ]:

Q21.

Write code using `shutil` to copy a file and preserve its metadata.

In [ ]:

Q22.

Explain the purpose of `StringIO` and `BytesIO` in the `io` module. When would you use them?

In [ ]:

Q23.

Write a script that merges 5 different JSON files into a single master JSON file.

In [ ]:

Q24.

How does Pandas `read_csv` differ fundamentally from the standard library `csv.reader`?

In [ ]:

Q25.

Write code to extract a ZIP file using the `zipfile` module or `shutil.unpack_archive`.

In [ ]:

📊 Day 15 Executive Summary

#	Topic	Key Takeaway
1	I/O	ALWAYS use context managers (`with`)
2	JSON	The bridge between Python dicts and the web
3	Pathlib	Replaces messy `os.path` strings

✅ Instructor's End-of-Day Checklist

• [ ] I can safely open, read, and close a file.

• [ ] I can parse a JSON string into a dictionary.

• [ ] I can use pathlib to construct safe file paths.

🚀 Continue to Day 16 - Object-Oriented Programming (OOP) Basics
