Day 15 : File Handling
🎯 Enterprise Objective
Data Analyst pipelines start by reading data and end by saving data. Today we master Disk I/O, parsing the universal language of the web (JSON), and modernizing our code using the powerful, object-oriented pathlib module.
Strategic Overview
| # | Topic | Concept |
|---|---|---|
| 1 | File I/O | open(), with context |
| 2 | JSON | loads(), dumps() |
| 3 | Pathlib | Object-oriented paths |
1. Reading & Writing Files : Disk I/O
Data lives in files. Python interacts with files using the open(filename, mode) function. Modes include 'r' (read), 'w' (write/overwrite), and 'a' (append). You should ALWAYS use a Context Manager (the with statement) to ensure the file is closed properly, even if an error occurs.
with open('data.txt', 'w') as f:
    f.write('Hello World\n')
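To see how the three modes behave together, here is a minimal sketch. It assumes a throwaway directory created with tempfile, and the file name notes.txt is purely illustrative.

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so the demo never touches real files.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "notes.txt"  # hypothetical file name

    # 'w' creates the file, or empties it first if it already exists.
    with open(target, "w", encoding="utf-8") as f:
        f.write("first line\n")

    # 'a' keeps the existing content and writes at the end.
    with open(target, "a", encoding="utf-8") as f:
        f.write("second line\n")

    # 'r' opens the file for reading only.
    with open(target, "r", encoding="utf-8") as f:
        contents = f.read()

print(contents)
```

Opening the same file with 'w' a second time would have discarded "first line", which is why 'a' is the usual choice for logs and incremental exports.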
💼 Why Data Analysts Care
• Log Processing: Reading multi-gigabyte log files line-by-line without running out of RAM
• Data Exports: Saving analysis results to local text files
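The line-by-line idea can be sketched as follows. The helper count_matching_lines and the sample file app.log are hypothetical names used only for illustration; the point is that iterating over the file object reads one line at a time rather than loading the whole file.

```python
import tempfile
from pathlib import Path


def count_matching_lines(path, needle):
    """Count lines containing `needle`, holding one line in memory at a time."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # the file object yields lines lazily
            if needle in line:
                count += 1
    return count


# A three-line stand-in for a much larger log file.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "app.log"
    log_path.write_text("INFO start\nERROR disk full\nINFO done\n", encoding="utf-8")
    error_count = count_matching_lines(log_path, "ERROR")

print(error_count)
```

By contrast, f.read() or f.readlines() would pull the entire file into memory at once.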
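⚠️ Resource Leaks
If you write f = open('data.txt') and forget to call f.close(), the operating system keeps the file handle open: buffered writes may never reach the disk, and on Windows other programs may be unable to modify or delete the file. Always use with open(...) as f:.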
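As a rough comparison, the sketch below shows what the manual approach needs in order to be safe (a try/finally) next to the with form, which closes the file even when the body raises. The ValueError is simulated purely for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.txt"

    # Manual version: close() must sit in `finally` so it also runs on errors.
    f = open(target, "w", encoding="utf-8")
    try:
        f.write("manual\n")
    finally:
        f.close()
    closed_manually = f.closed

    # Context-manager version: the file is closed when the block exits,
    # even though the body raises an exception.
    try:
        with open(target, "a", encoding="utf-8") as g:
            g.write("managed\n")
            raise ValueError("simulated failure")
    except ValueError:
        pass
    closed_by_with = g.closed

print(closed_manually, closed_by_with)
```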
🧪 Concept Checks: File I/O
Q1. Write code to open a file "test.txt" in write mode ("w") and write your name to it.
Q2. Open "test.txt" in read mode ("r"). Read the contents and print them.
Q3. Open "test.txt" in append mode ("a"). Add a new line "Welcome to Python". Print the full file again.
Q4. Explain why with open() as f: is superior to f = open(); f.read(); f.close().
Q5. Write a memory-efficient for loop to read a file line-by-line. (Assume file is "test.txt").
2. Parsing JSON : The Language of the Web
JSON (JavaScript Object Notation) is the universal format for web APIs. It maps closely to Python dictionaries and lists, with a few differences: JSON object keys are always strings, and there is no tuple type. The built-in json module parses JSON strings into Python objects (loads) and serializes Python objects into JSON strings (dumps).
| Function | Purpose | Input -> Output |
|---|---|---|
| json.loads(s) | Load String | String -> Dictionary |
| json.dumps(d) | Dump String | Dictionary -> String |
| json.load(f) | Load File | File Object -> Dictionary |
| json.dump(d, f) | Dump File | Dictionary -> File Object |
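The four functions in the table can be sketched as two round trips, one through a string and one through a file. The record contents and the file name record.json are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

record = {"name": "Alice", "scores": [90, 85]}  # illustrative data

# String round trip: dumps() serializes, loads() parses.
text = json.dumps(record)
from_text = json.loads(text)

# File round trip: dump() writes to an open file, load() reads from one.
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "record.json"  # hypothetical file name
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
    with open(json_path, "r", encoding="utf-8") as f:
        from_file = json.load(f)

# The result is not always a dict: a top-level JSON array becomes a list.
as_list = json.loads("[1, 2, 3]")

print(type(text).__name__, from_text == record, from_file == record, as_list)
```

A handy way to remember the names: the trailing "s" in loads and dumps stands for "string".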
💼 Why Data Analysts Care
• API Integration: Parsing REST API responses (which are almost always JSON)
• Configuration: Loading application settings from a .json file
🧠 Pro Tip
Use json.dumps(data, indent=4) to 'pretty-print' complex dictionaries for easy debugging.
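A small sketch of the difference, using made-up data: both calls produce valid JSON for the same object, and only the whitespace changes.

```python
import json

data = {"user": {"id": 7, "roles": ["admin"]}}  # illustrative data

compact = json.dumps(data)           # single line, good for storage or transport
pretty = json.dumps(data, indent=4)  # one key per line, nested levels indented

print(compact)
print(pretty)
```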
🧪 Concept Checks: JSON
Q1. Import json. Use json.loads() to parse '{"x": 10, "y": 20}' into a dictionary.
Q2. Convert the dictionary {"color": "red", "sizes": [1, 2]} to a JSON string using json.dumps(). Print it.
Q3. Use json.dumps() with the indent=4 argument to pretty-print {"a": {"b": 1}}.
Q4. Write code to save a dictionary d directly to a file "data.json" using with open() and json.dump().
Q5. Read "data.json" back into a dictionary using json.load(). Print the type of the loaded object.
3. Pathlib : Modern File Paths
Handling file paths as strings (e.g., 'data/users.txt') causes bugs across different operating systems (Windows uses \, Mac/Linux use /). The modern Pythonic way is the pathlib module, which treats paths as objects.
from pathlib import Path
# Object-oriented paths
folder = Path('data')
file_path = folder / 'users.txt' # The / operator intelligently joins paths!
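To see the separator handling on any machine, one option is to use the "pure" path classes, which apply a given platform's rules without touching the filesystem. This is only a demonstration aid; in everyday code Path picks the right flavor for the current system automatically.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same join expression, rendered under each platform's rules.
posix_path = PurePosixPath("data") / "users.txt"
windows_path = PureWindowsPath("data") / "users.txt"

print(posix_path)    # data/users.txt
print(windows_path)  # data\users.txt
```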
💼 Why Data Analysts Care
• Cross-Platform Code: Write code on a Mac that executes flawlessly on a Windows server
• File Operations: Easily check if a file exists, get its suffix (e.g., .csv), or read its text instantly
🧠 Pro Tip
Path objects come with useful built-ins: the methods path.exists() and path.read_text(), and the attribute path.suffix. Prefer them over the older os.path module.
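A brief sketch of these built-ins in action, assuming a temporary directory and a hypothetical file named report.csv:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.csv"  # hypothetical file name

    existed_before = report.exists()   # nothing has been written yet
    report.write_text("id,name\n1,Alice\n", encoding="utf-8")
    existed_after = report.exists()

    # read_text() opens, reads, and closes the file in one call.
    first_line = report.read_text(encoding="utf-8").splitlines()[0]

    extension = report.suffix  # attribute, not a method call
    base_name = report.stem    # file name without the extension

print(existed_before, existed_after, first_line, extension, base_name)
```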
🧪 Concept Checks: Pathlib
Q1. Import Path from pathlib. Create a Path object for "folder" / "subfolder" / "file.csv". Print it.
Q2. Create a Path object p = Path("demo.txt"). Use p.write_text("Hello") to create the file.
Q3. Use p.read_text() to read the file created in Q2 and print it. Then check p.exists().
Q4. Create a Path for "image.jpg". Print its .suffix and .stem (the name without extension).
Q5. Use Path.cwd() to get the current working directory. Print it.
🛠️ Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (Log Parser): Create a file server.log with 5 lines: 2 containing 'ERROR', 3 containing 'INFO'. Write a memory-efficient loop to read the file and print ONLY the 'ERROR' lines.
Task 2 (JSON Config Updater): Write a function update_config(file_path, key, val). It should read a JSON file (or create {} if missing), update the key, and save the JSON back to the file.
Task 3 (File Extension Counter): Create 3 files: a.txt, b.csv, c.txt in a new folder using Pathlib. Write a function that uses Path.iterdir() to iterate the folder and count how many .txt files exist.
Task 4 (CSV to JSON): Write a simulated CSV string (e.g., 'id,name\n1,Alice\n2,Bob'). Parse it manually using split('\n') and split(','), convert to a list of dicts, and json.dumps() it.
Task 5 (Safe File Reader): Write a function read_safe(path) that uses pathlib to check if a file exists. If so, return its text. If not, return None. Test with a valid and invalid path.
💻 Pure Coding Interview Questions
Q1. Explain the difference between open('f.txt', 'w') and open('f.txt', 'a').
Q2. Why is it essential to use a context manager (with statement) when opening files?
Q3. Explain the difference between f.read(), f.readline(), and f.readlines().
Q4. How do you read a 50GB file in Python without running out of RAM?
Q5. Explain the difference between json.loads() and json.load().
Q6. Write code to parse a JSON string, extract a specific field, and handle a json.JSONDecodeError.
Q7. Why shouldn't you use regular expressions to parse JSON or HTML?
Q8. Compare os.path.join with pathlib's / operator. Why is pathlib preferred in modern Python?
Q9. Write a script that uses pathlib to rename all .txt files in a directory to .md.
Q10. How do you write a list of dictionaries to a CSV file without using Pandas (using the csv module)?
Q11. Explain how character encodings work in Python. Why should you often use encoding='utf-8' in open()?
Q12. Write code that safely creates a nested directory structure (e.g., a/b/c) if it doesn't exist.
Q13. What is the Pickle module? Why is json generally preferred over pickle for data serialization?
Q14. Write a generator function that reads a file and yields chunks of 1024 bytes at a time.
Q15. Explain the security risks of using yaml.load() or pickle.loads() on untrusted data.
Q16. Write code using the tempfile module to create a temporary file, write data, and auto-delete it.
Q17. How do you handle file locking in Python if two processes try to write to the same file simultaneously?
Q18. Explain what the __file__ variable is and how it's used to find relative asset paths.
Q19. Write a function that recursively finds all files larger than 1MB in a directory using pathlib.
Q20. How do you handle reading a file that might be locked or currently being written to by another program?
Q21. Write code using shutil to copy a file and preserve its metadata.
Q22. Explain the purpose of StringIO and BytesIO in the io module. When would you use them?
Q23. Write a script that merges 5 different JSON files into a single master JSON file.
Q24. How does Pandas read_csv differ fundamentally from the standard library csv.reader?
Q25. Write code to extract a ZIP file using the zipfile module or shutil.unpack_archive.
Day 15 Executive Summary
| # | Topic | Key Takeaway |
|---|---|---|
| 1 | I/O | ALWAYS use context managers (with) |
| 2 | JSON | The bridge between Python dicts and the web |
| 3 | Pathlib | Replaces messy os.path strings |
✅ Instructor's End-of-Day Checklist
• [ ] I can safely open, read, and close a file.
• [ ] I can parse a JSON string into a dictionary.
• [ ] I can use pathlib to construct safe file paths.