Day 03 : Strings
🎯 Enterprise Objective
Strings are the primary vehicle for text data, one of the most common data types in real-world datasets. Today we cover string manipulation end to end, from basic creation to performance optimization. You will learn not just string methods but also the professional text processing patterns used in ETL pipelines and data cleaning.
Strategic Overview
| # | Topic | Key Methods | Core Use Case |
|---|---|---|---|
| 1 | Creation & Escaping | '', "", r'', ''' | File paths, SQL, multi-line |
| 2 | Indexing & Slicing | s[0], s[1:4], s[::-1] | Data extraction |
| 3 | Core Methods | .strip(), .replace(), .find() | Data cleaning |
| 4 | Split & Join | .split(), ','.join() | CSV/text parsing |
| 5 | Formatting | f-strings, .format() | Reports, dashboards |
| 6 | Validation | .isdigit(), .isalpha() | Input validation |
| 7 | Encoding | .encode(), .decode() | APIs, file I/O |
| 8 | Performance | join() vs += | Efficient processing |
1. String Creation & Escaping : Text Data Fundamentals
A string is an immutable sequence of Unicode characters. You can create strings with single quotes '...', double quotes "...", or triple quotes '''...''' for multi-line text. Raw strings r'...' disable escape character processing.
| Syntax | Best For | Notes |
|---|---|---|
| 'hello' | Simple text | Most common |
| "it's" | Text with apostrophes | Avoids escaping |
| """...""" | Multi-line / docstrings | SQL queries, docs |
| r'C:\path' | Raw string (no escapes) | File paths, regex |
Common Escape Characters:
| Escape | Meaning | Example |
|---|---|---|
| \n | Newline | 'Line1\nLine2' |
| \t | Tab | 'Col1\tCol2' |
| \\ | Backslash | 'C:\\Users' |
| \' | Single quote | 'it\'s' |
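The literal forms and escapes above can be tried side by side. The following is a minimal sketch; the path and the SQL text are made-up values chosen so that \n and \t would cause trouble without the r prefix.

```python
# Four ways to write string literals (all produce ordinary str objects).
single = 'hello'
double = "it's"                      # apostrophe needs no escaping
multi = """SELECT name
FROM users"""                        # the line break is part of the string
raw = r'C:\new\table.csv'            # backslashes are kept literally

# Without the r prefix, \n and \t would be read as newline and tab,
# so the same path needs doubled backslashes.
escaped = 'C:\\new\\table.csv'

print(raw == escaped)                # True
print(len('a\tb'))                   # 3: 'a', one tab character, 'b'
print(repr('Line1\nLine2'))          # repr() shows the escape instead of breaking the line
print(multi.count('\n'))             # 1
```

Printing with repr() is a quick way to see which escape characters a string really contains.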
💼 Why Data Analysts Care
• CSV/JSON parsing: Understanding escape characters is essential for parsing data files
• SQL queries: Triple-quoted strings hold multi-line SQL cleanly
• File paths: Raw strings r'C:\data\file.csv' prevent escape issues on Windows
⚠️ Immutability
Strings cannot be modified in place. Every operation like .upper() or + creates a new string object. This matters for performance in loops.
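A short sketch of what immutability means in practice: item assignment fails, and methods hand back a new object while the original stays as it was. The variable names are illustrative only.

```python
s = 'hello'

# Item assignment is rejected because str is immutable.
try:
    s[0] = 'H'
except TypeError as exc:
    error_message = str(exc)

# Methods return a new object; the original is untouched.
shouted = s.upper()

print(s, shouted)        # hello HELLO
print(error_message)
```

To "change" a string you rebind the name to the new object, for example s = s.upper().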
🧠 Pro Tip
Use triple-quoted strings for multi-line SQL queries: query = """SELECT * FROM users WHERE age > 18""". This keeps your code readable.
🧪 Concept Checks: Creation
Q1. Create strings using all 4 methods: single quotes, double quotes, triple quotes, and raw string. Print each with its type() and len().
Q2. Write a string containing: a newline, a tab, and a backslash. Print it and then print its repr() to see the escape characters.
Q3. Prove string immutability: create s = "hello", save id(s), then do s += " world". Compare the IDs. What does this prove?
Q4. Create a raw string for the Windows path C:\Users\Admin\Documents\data.csv. Then create the same path using escape characters. Verify they are equal.
Q5. Write a multi-line SQL query using triple quotes: SELECT name, age FROM users WHERE age > 18 ORDER BY name. Print it.
2. Indexing & Slicing : Precision Text Extraction
Strings support zero-based indexing and negative indexing (from the end). Slicing extracts substrings using the syntax s[start:stop:step] where stop is exclusive.
s = 'PYTHON'
| Character | P | Y | T | H | O | N |
|---|---|---|---|---|---|---|
| Positive index | 0 | 1 | 2 | 3 | 4 | 5 |
| Negative index | -6 | -5 | -4 | -3 | -2 | -1 |
Common operations on s:
| Operation | Syntax | Result |
|---|---|---|
| First char | s[0] | 'P' |
| Last char | s[-1] | 'N' |
| Slice | s[1:4] | 'YTH' |
| Reverse | s[::-1] | 'NOHTYP' |
| Every 2nd | s[::2] | 'PTO' |
💼 Why Data Analysts Care
• Column extraction: Extract substrings from fixed-width data: record[0:10] for name field
• Data cleaning: phone[-10:] takes the last 10 characters, which are the last 10 digits once separators such as hyphens are removed
• Log parsing: Slice timestamps from log lines: log_line[:19] for ISO datetime
⚠️ Off-by-One
The stop index is exclusive: 'PYTHON'[0:3] gives 'PYT' (indices 0, 1, 2), not 'PYTH'. This is a common source of bugs.
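The exclusive stop index is easiest to see on fixed-width data. The sketch below assumes a hypothetical record layout (10 characters of name, 3 of age, 2 of state code); the layout is invented for illustration.

```python
# Hypothetical fixed-width record: 10-char name, 3-char age, 2-char state.
record = 'Alice'.ljust(10) + '029' + 'NY'

name = record[0:10].strip()    # indices 0..9
age = int(record[10:13])       # indices 10..12
state = record[13:]            # index 13 to the end

print(name, age, state)        # Alice 29 NY

text = 'PYTHON'
print(text[0:3])               # PYT (index 3 is excluded)
print(text[-2:])               # ON
print(repr(text[10:]))         # '' : an out-of-range slice does not raise
```

Because each field's stop equals the next field's start, adjacent slices never overlap or leave gaps.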
🧠 Pro Tip
Reverse a string with s[::-1]. Check for palindromes: s == s[::-1].
🧪 Concept Checks: Indexing & Slicing
Q1. Given s = "Data Analytics", extract: first word, last word, every other character, and the reversed string. Print each.
Q2. Given iso_date = "2024-12-25T14:30:00", use slicing to extract: year, month, day, hour, minute, second. Print a formatted result.
Q3. Write code to check if word = "racecar" is a palindrome using slicing. Print the result with an explanation.
Q4. Given phone = "+91-98765-43210", remove the hyphens with replace(), then use slicing to extract just the 10-digit number (the last 10 characters of the cleaned string). Print it.
Q5. Write code that reverses each word in sentence = "Hello World Python" but keeps word order. Use split() and slicing.
3. Core String Methods : Search, Replace & Transform
Python strings have 40+ built-in methods. The most important ones for data work are: upper(), lower(), strip(), replace(), find(), count(), startswith(), and endswith(). None of them modifies the original string: the transforming methods (upper(), lower(), strip(), replace()) return new strings, find() and count() return integers, and startswith() and endswith() return booleans.
| Method | Purpose | Typical Use |
|---|---|---|
| .upper() / .lower() | Case conversion | Standardize text |
| .strip() | Remove whitespace | Clean user input |
| .replace(old, new) | Replace substrings | Data correction |
| .find(sub) | Find position (-1 if absent) | Safe search |
| .count(sub) | Count occurrences | Frequency analysis |
| .startswith() | Check prefix | Filter by pattern |
| .endswith() | Check suffix | File type detection |
💼 Why Data Analysts Care
• Data standardization: name.strip().title() cleans and capitalizes names
• Text cleaning: text.replace('\n', ' ').strip() flattens newlines and trims the ends
• File filtering: if filename.endswith('.csv'): filters by file type
• Search: .find() returns -1 instead of raising an error, unlike .index()
⚠️ find() vs index()
.find() returns -1 if the substring is not found; .index() raises ValueError. Prefer .find() when a missing substring is an expected case, and check for -1 before using the result as an index, because -1 is itself a valid index (the last character). When you only need a yes/no answer, use the in operator.
🧠 Pro Tip
Chain methods for clean data pipelines: name.strip().lower().replace(' ', '_') converts ' John Doe ' to 'john_doe'.
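The chaining tip and the find()/index() distinction can be seen together in a few lines. This is a sketch with an invented log-style line; the key names level=, msg= and user= are assumptions for the example.

```python
raw_name = '  John Doe  '
slug = raw_name.strip().lower().replace(' ', '_')
print(slug)                        # john_doe

line = 'level=ERROR msg=timeout'

# find() signals "not found" with -1, so test before slicing.
pos = line.find('msg=')
message = line[pos + len('msg='):] if pos != -1 else None
print(message)                     # timeout

# index() raises instead, which suits cases where absence is a bug.
try:
    line.index('user=')
except ValueError:
    print('user= not present')

# For a plain yes/no question, the in operator reads best.
print('ERROR' in line)             # True
```

Each method in the chain returns a new string, which is why the calls can be stacked left to right.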
🧪 Concept Checks: String Methods
Q1. Given name = " alice SMITH ", clean it to produce "Alice Smith" using method chaining (strip, title). Print the result.
Q2. Given csv_line = "John,25,Engineer,NYC", use .find() to locate the position of the second comma. Print the result.
Q3. Count how many times the word "data" appears in text = "Data science uses data to derive data-driven insights" (case-insensitive). Print the count.
Q4. Given a list of filenames ["report.csv", "image.png", "data.csv", "notes.txt"], use .endswith() to filter only .csv files. Print the result.
Q5. Write code that replaces all spaces in "hello world python" with underscores, then converts to uppercase. Chain the methods in one line.
4. Splitting & Joining : Text Decomposition & Assembly
split() breaks a string into a list of substrings based on a delimiter. join() does the reverse: it combines a list of strings into one string with a separator. These two methods are the backbone of text data processing.
# split: string → list
'a,b,c'.split(',') # ['a', 'b', 'c']
# join: list → string
','.join(['a', 'b', 'c']) # 'a,b,c'
💼 Why Data Analysts Care
• CSV parsing: row.split(',') for manual CSV field extraction (simple rows without quoted commas)
• Log analysis: log_line.split() splits on whitespace for field extraction
• Data export: ','.join(columns) builds CSV rows for output
• Path building: '/'.join(['home', 'user', 'data']) constructs file paths
⚠️ split() with No Arguments
With two spaces between the letters, 'a  b  c'.split() splits on any run of whitespace and drops empty strings, giving ['a', 'b', 'c']. 'a  b  c'.split(' ') splits on every single space, producing ['a', '', 'b', '', 'c'].
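The difference between the two split() calls only shows up when separators repeat, so this sketch builds the test string programmatically to make the double spaces explicit. The header line and row values are made up.

```python
spaced = (' ' * 2).join('abc')     # 'a', two spaces, 'b', two spaces, 'c'

print(spaced.split())              # ['a', 'b', 'c']
print(spaced.split(' '))           # ['a', '', 'b', '', 'c']

# maxsplit limits the number of cuts, handy for "key: value" lines.
header = 'Content-Type: text/csv; charset=utf-8'
key, value = header.split(': ', 1)
print(key, '|', value)             # Content-Type | text/csv; charset=utf-8

# join() accepts only strings, so convert other types first.
row = ['Alice', 28, 'NYC']
print(','.join(str(field) for field in row))    # Alice,28,NYC
```

Passing a list that contains non-string items straight to join() raises TypeError, which is why the last line converts each field with str().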
🧠 Pro Tip
Use splitlines() for multi-line text: it handles \n, \r\n, and \r correctly across all platforms.
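A small sketch of why splitlines() is usually the better choice for text that may come from different operating systems. The sample strings are invented.

```python
mixed = 'first\nsecond\r\nthird\rfourth'

# All three line-ending styles are recognized.
print(mixed.splitlines())          # ['first', 'second', 'third', 'fourth']

# split('\n') leaves a trailing empty item; splitlines() does not.
print('a\nb\n'.split('\n'))        # ['a', 'b', '']
print('a\nb\n'.splitlines())       # ['a', 'b']
```

Note that splitlines() also treats a few less common characters, such as form feed, as line boundaries, which can matter for unusual input.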
🧪 Concept Checks: Split & Join
Q1. Given csv = "name,age,city,salary", split into a list, then rejoin with " | " as separator. Print both results.
Q2. Given path = "/home/user/data/file.csv", split by "/" and extract just the filename. Print it.
Q3. Split text = "one two three four", written with two or more spaces between some of the words, using both .split() and .split(" "). Print both results and explain the difference.
Q4. Given words = ["SELECT", "name", "FROM", "users"], join them with spaces to build a SQL query string. Print it.
Q5. Write code that reads a multi-line string (use triple quotes with 3 lines) and splits it into individual lines using .splitlines(). Print each line with its index.
5. String Formatting : Professional Output Generation
Python offers three formatting approaches: f-strings (Python 3.6+, usually the fastest and most readable), .format(), and %-formatting (legacy). F-strings embed expressions directly inside {} braces.
| Format Spec | Meaning | Example | Result |
|---|---|---|---|
| :.2f | 2 decimal places | f'{3.14159:.2f}' | 3.14 |
| :, | Thousands separator | f'{1000000:,}' | 1,000,000 |
| :>10 | Right-align (width 10) | f'{"hi":>10}' | '        hi' (8 leading spaces) |
| :<10 | Left-align (width 10) | f'{"hi":<10}' | 'hi        ' (8 trailing spaces) |
| :^10 | Center-align (width 10) | f'{"hi":^10}' | '    hi    ' (4 spaces on each side) |
| :.2% | Percentage | f'{0.856:.2%}' | 85.60% |
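The format specs in the table can be checked directly. In this sketch the brackets around the aligned values are only there to make the padding visible, and revenue and ratio are made-up numbers.

```python
revenue = 1234567.891
ratio = 0.856

print(f'{revenue:,.2f}')           # 1,234,567.89
print(f'{ratio:.1%}')              # 85.6%
print(f'{42:05d}')                 # 00042

# Brackets make the padding visible; each field is 10 characters wide.
print(f'[{"hi":>10}]')             # right-aligned
print(f'[{"hi":<10}]')             # left-aligned
print(f'[{"hi":^10}]')             # centered

# The same spec works with str.format().
print('{:,.2f}'.format(revenue))   # 1,234,567.89
```

The part after the colon is the same mini-language in f-strings, str.format() and the built-in format() function.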
💼 Why Data Analysts Care
• Report generation: Formatted tables, aligned columns, currency values
• Logging: f'[{timestamp}] {level}: {message}' for structured log output
• Dashboard metrics: f'{revenue:,.2f}' for professional number formatting
🧠 Pro Tip
F-strings can contain almost any Python expression: f'{len(data):,} records processed in {elapsed:.1f}s'. They are evaluated at runtime, when the line executes. Before Python 3.12, the expression part cannot contain a backslash or reuse the f-string's own quote character.
🧪 Concept Checks: Formatting
Q1. Given revenue = 1234567.89, print it as currency with commas and 2 decimal places: $1,234,567.89. Use an f-string.
Q2. Create a formatted table: print 3 products with name (left-aligned, 15 chars) and price (right-aligned, 8 chars, 2 decimals). Use f-string alignment.
Q3. Given ratio = 0.8567, print it as a percentage with 1 decimal place: 85.7%. Use the % format specifier.
Q4. Print the number 42 in binary, octal, and hexadecimal using f-string format specs (:b, :o, :x). Print all three.
Q5. Write an f-string that embeds a conditional expression: print "Even" or "Odd" for n = 7 directly inside the f-string.
6. String Validation Methods : Data Quality Checks
Python strings have built-in validation methods that return True or False. These are essential for input validation and data quality checks before processing.
| Method | Returns True if... | Example |
|---|---|---|
| .isdigit() | All characters are digits | '123'.isdigit() → True |
| .isalpha() | All characters are letters | 'abc'.isalpha() → True |
| .isalnum() | Letters or digits only | 'abc123'.isalnum() → True |
| .isspace() | All whitespace | ' '.isspace() → True |
| .isupper() | All uppercase | 'ABC'.isupper() → True |
| .islower() | All lowercase | 'abc'.islower() → True |
| .istitle() | Title case | 'Hello World'.istitle() → True |
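💼 Why Data Analysts Care
• Input validation: if user_id.isdigit(): validates before int conversion
• Data cleaning: Filter rows where a column should be numeric but contains text
• ETL pipelines: Validate data quality before loading into databases
⚠️ isdigit() vs isnumeric()
isdigit() is not limited to 0-9: it also returns True for characters such as '\u00B2' (superscript 2), which int() rejects. isdecimal() is stricter and excludes superscripts, while isnumeric() is broader and also accepts characters such as '\u00BD' (the fraction one half). All three return False for signs and decimal points, so '-5' and '12.5' fail. For data work, use try/except with int() or float(), or require s.isascii() and s.isdigit() together when only ASCII digits are acceptable.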
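The three digit checks are easy to confuse, so here is a sketch that prints them side by side and wraps int() in a helper. The helper name to_int is hypothetical, and the sample list is invented to cover the edge cases.

```python
# Superscript two, Arabic-Indic three, and one-half are written as escapes.
samples = ['123', '\u00b2', '\u0663', '\u00bd', '-5', '12.5', '']

for s in samples:
    # ascii() keeps the output readable on any console encoding.
    print(ascii(s), s.isdecimal(), s.isdigit(), s.isnumeric())

def to_int(s):
    """Return int(s), or None when s is not a valid integer string."""
    try:
        return int(s)
    except (TypeError, ValueError):
        return None

print(to_int('-5'))       # -5
print(to_int('\u00b2'))   # None, although '\u00b2'.isdigit() is True
print(to_int('12.5'))     # None
```

Letting int() decide and catching the exception handles signs and surrounding whitespace as well, which none of the is*() methods accept.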
🧪 Concept Checks: Validation
Q1. Given a list inputs = ["123", "12.5", "abc", "45", ""], use .isdigit() to filter only valid integers. Print the valid ones.
Q2. Write a function validate_username(name) that returns True only if: length 3-20, alphanumeric only. Test with 5 examples.
Q3. Given data = ["Hello", "WORLD", "mixedCase", "Title Case"], classify each as upper, lower, title, or mixed. Use .isupper(), .islower(), .istitle().
Q4. Write code that checks if s = " \t\n " is all whitespace using .isspace(). Then check "" (empty string). What does empty return? Explain.
Q5. Write a safe to_float(s) function that handles strings like "3.14", "-2.5", "abc", "". Return None for invalid inputs. Test with 5 cases.
7. Encoding & Unicode : Global Text Processing
Python 3 strings are sequences of Unicode code points, and UTF-8 is the default encoding for source files and for encode() and decode(). When working with files, APIs, or databases, you must handle encoding correctly. encode() converts str → bytes, decode() converts bytes → str.
text = 'Hello'
bytes_obj = text.encode('utf-8') # b'Hello'
back = bytes_obj.decode('utf-8') # 'Hello'
💼 Why Data Analysts Care
• API responses: JSON/REST APIs often return bytes that need decoding
• File I/O: open(file, encoding='utf-8'); always specify the encoding, because the default can depend on the platform
• International data: Names, addresses, currencies in non-Latin scripts need proper Unicode handling
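As a sketch of the file I/O point, the example below writes a small UTF-8 file to a temporary directory and reads it back in text mode and in binary mode. The file name and the sample names are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'names.txt')

    # newline='' writes '\n' unchanged on every platform.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write('Zo\u00eb\n\u4e16\u754c\n')

    # Text mode: bytes are decoded to str using the stated encoding.
    with open(path, encoding='utf-8') as f:
        names = f.read().splitlines()

    # Binary mode: no decoding, you get the raw bytes.
    with open(path, 'rb') as f:
        raw = f.read()

print([len(n) for n in names])     # [3, 2] characters per name
print(len(raw))                    # 12 bytes for 7 characters
```

The character count and the byte count differ because the accented letter takes two bytes in UTF-8 and each Chinese character takes three.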
⚠️ UnicodeDecodeError
Reading a file with the wrong encoding can raise UnicodeDecodeError or, worse, silently produce garbled text. Use encoding='utf-8' when you know the file is UTF-8, or detect the encoding with a library such as chardet.
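Both failure modes can be reproduced with a single accented word. This is a sketch only; the word is arbitrary and the output is printed with ascii() so it displays the same on any console.

```python
text = 'caf\u00e9'                       # 4 characters
data = text.encode('utf-8')              # b'caf\xc3\xa9', 5 bytes
print(len(text), len(data))              # 4 5

# Wrong codec, no exception: the text is silently garbled.
garbled = data.decode('latin-1')
print(ascii(garbled))                    # 'caf\xc3\xa9' (two characters instead of one)

# Bytes that are not valid UTF-8 raise UnicodeDecodeError.
latin_bytes = text.encode('latin-1')     # b'caf\xe9'
try:
    latin_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    print('decode failed:', exc.reason)

# errors='replace' substitutes U+FFFD instead of raising.
repaired = latin_bytes.decode('utf-8', errors='replace')
print(ascii(repaired))                   # 'caf\ufffd'
```

errors='replace' keeps a pipeline running but loses information, so it is better suited to logging or previews than to data you intend to store.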
🧪 Concept Checks: Encoding
Q1. Encode text = "Python" to UTF-8 bytes. Print the bytes object and its length. Then decode it back and verify equality.
Q2. Compare the byte length of "A" vs "\u00C9" (accented E) vs a Chinese character "\u4e16" in UTF-8. Print each character and its byte count.
Q3. Use ord() to print the Unicode code point of each character in "Hello". Then use chr() to reconstruct the string from code points.
Q4. Write code that safely reads a string, trying UTF-8 first, then Latin-1 as fallback. Use try/except with .decode().
Q5. Create a string with mixed scripts: English, numbers, and symbols. Print its len() (characters) and len(s.encode()) (bytes). Explain the difference.
8. String Performance : Efficient Text Processing
Since strings are immutable, concatenation in loops can create many temporary objects. For building large strings, use list + join or io.StringIO instead of +=. The speedup depends on the interpreter and the data size (CPython optimizes some simple += loops), so time both approaches on your own data before assuming a figure such as 100x.
| Approach | Speed | Memory | Use When |
|---|---|---|---|
| += in loop | Slow | High | Never for large data |
| ''.join(list) | Fast | Low | Building strings in loops |
| f-strings | Fastest | Low | Single-line formatting |
| io.StringIO | Fast | Medium | Stream-like building |
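The three loop-building approaches from the table can be written as interchangeable functions and timed with timeit. This is a rough sketch: the function names are hypothetical, and the timings it prints will vary by machine and Python version, so no expected numbers are shown.

```python
import io
import timeit

def build_concat(n):
    """Repeated += : each step may copy the text built so far."""
    out = ''
    for i in range(n):
        out += str(i) + ','
    return out

def build_join(n):
    """Collect the pieces in a list, then join once."""
    return ''.join([str(i) + ',' for i in range(n)])

def build_stringio(n):
    """Write the pieces into an in-memory text buffer."""
    buf = io.StringIO()
    for i in range(n):
        buf.write(str(i) + ',')
    return buf.getvalue()

n = 10_000
for func in (build_concat, build_join, build_stringio):
    seconds = timeit.timeit(lambda: func(n), number=20)
    print(f'{func.__name__:<15} {seconds:.4f}s')
```

All three functions return the same string, so the choice between them is about speed and readability rather than output.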
💼 Why Data Analysts Care
• ETL pipelines: Building CSV output with join() instead of += can cut run time noticeably on large datasets
• Report generation: Use join() for assembling multi-line reports
• Memory management: Knowing string interning helps debug identity issues
🧠 Pro Tip
CPython interns small strings and identifiers. 'hello' is 'hello' may be True due to caching, but never rely on this: always use == for comparison.
🧪 Concept Checks: Performance
Q1. Build a string of numbers "0,1,2,...,999" using: (a) += in a loop, (b) ",".join(). Time both approaches and print the speedup ratio.
Q2. Demonstrate string interning: test a = "hello"; b = "hello"; print(a is b). Then test with a = "hello world". Explain the difference.
Q3. Write code that builds a CSV string from data = [("Alice",25), ("Bob",30), ("Charlie",35)] using join(). Print the result.
Q4. Use sys.getsizeof() to measure memory of: empty string, "a", "hello", "a"*1000. Print each size. What pattern do you notice?
Q5. Write a function build_report(rows) that takes a list of dicts and returns a formatted table string using join(). Test with 3 sample rows.
🛠️ Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (Data Cleaner): Write a function clean_name(name) that: strips whitespace, converts to title case, replaces multiple spaces with single space, and removes non-alphabetic characters (except spaces). Test with ' john DOE 3rd '.
Task 2 (CSV Parser): Write a function parse_csv_line(line) that splits a CSV line by commas, strips each field, and returns a list. Handle edge case: fields containing commas inside quotes. Test with 'Alice, 28, "New York, NY"'.
Task 3 (Log Analyzer): Given log = "2024-01-15 14:30:22 ERROR Database connection failed", extract: date, time, level, message using string methods only (no regex). Print each part.
Task 4 (Email Validator): Write a function validate_email(email) that checks: contains exactly one @, has text before and after @, domain has a dot, no spaces. Return True/False. Test with 5 valid and 5 invalid emails.
Task 5 (Text Statistics): Write a function text_stats(text) that returns a dict with: character count, word count, sentence count, average word length, most common word. Test with a paragraph of text.
💻 Pure Coding Interview Questions
Q1.
Write a function reverse_words(s) that reverses word order: 'hello world' → 'world hello'. Do NOT reverse individual characters.
Q2.
Write a function is_anagram(s1, s2) that checks if two strings are anagrams (case-insensitive, ignoring spaces). Test with 'listen' and 'silent'.
Q3.
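Write a function compress(s) implementing run-length encoding: 'aabcccdd' encodes to 'a2b1c3d2'. Return the encoded string only if it is shorter than the input; otherwise return the original (the 8-character example above encodes to 8 characters, so it comes back unchanged).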
Q4.
Write a function first_non_repeating(s) that finds the first non-repeating character. 'aabbc' → 'c'. Return None if all repeat.
Q5.
Write a function caesar_cipher(text, shift) that shifts each letter by shift positions. Handle wrapping (z → a) and preserve non-letters.
Q6.
Write a function longest_common_prefix(strs) that finds the longest common prefix in a list of strings. ['flower','flow','flight'] → 'fl'.
Q7.
Write a function valid_parentheses(s) that checks if brackets are balanced: '([{}])' → True, '([)]' → False.
Q8.
Write a function count_vowels(s) that returns a dict of vowel frequencies (case-insensitive). Test with a sentence.
Q9.
Write a function title_case(s) that capitalizes the first letter of each word, except articles (a, an, the). First word always capitalized.
Q10.
Write a function remove_duplicates(s) that removes duplicate characters preserving order: 'abcabc' → 'abc'.
Q11.
Write a function zigzag(s, rows) that converts text to zigzag pattern and reads row by row. 'PAYPALISHIRING' with 3 rows → 'PAHNAPLSIIGYIR'.
Q12.
Write a function word_pattern(pattern, s) that checks if string follows pattern: pattern='abba', s='dog cat cat dog' → True.
Q13.
Write a function group_anagrams(words) that groups anagrams together. ['eat','tea','tan','ate','nat','bat'] → grouped lists.
Q14.
Write a function find_overlapping(s, sub) that counts all overlapping occurrences of sub in s. E.g., find_overlapping('aaa', 'aa') returns 2.
Q15.
Write a function pad_number(n, width) that pads a number with leading zeros to the given width. E.g., pad_number(42, 5) returns '00042'.
Q16.
Write code to implement str.replace() from scratch: my_replace(text, old, new). Handle overlapping patterns.
Q17.
Write a function repeat_chars(s, n) that repeats each character n times: repeat_chars('abc', 3) returns 'aaabbbccc'.
Q18.
Write a function longest_palindrome_substring(s) that finds the longest palindromic substring in a string.
Q19.
Write a function atoi(s) that converts string to integer handling: whitespace, signs, overflow, invalid chars. Mimic int() behavior.
Q20.
Write a function justify_text(text, width) that fully justifies text to given width by distributing spaces evenly between words.
Q21.
Write a function compare_version(v1, v2) that compares version strings: '1.2.3' vs '1.2.4' → -1. Handle different lengths.
Q22.
Write a function interleave(s1, s2) that interleaves two strings: 'abc','xyz' → 'axbycz'. Handle different lengths.
Q23.
Write a function count_substrings(s, sub) that counts overlapping occurrences: 'aaa' contains 'aa' twice.
Q24.
Write a function to_snake_case(s) converting 'camelCaseString' → 'camel_case_string'. Handle consecutive capitals.
Q25.
Write a function expand_range(s) that expands: '1-5,8,11-14' → [1,2,3,4,5,8,11,12,13,14].
Day 3 Executive Summary
| # | Topic | Key Takeaway | Professional Application |
|---|---|---|---|
| 1 | Creation | 4 ways to create; strings are immutable | File paths, SQL queries |
| 2 | Indexing & Slicing | Zero-based; stop is exclusive; [::-1] reverses | Log parsing, data extraction |
| 3 | Core Methods | .strip(), .replace(), .find(); chain them | Data cleaning pipelines |
| 4 | Split & Join | split() → list; join() → string | CSV/text parsing |
| 5 | Formatting | f-strings are fastest and most readable | Reports, dashboards |
| 6 | Validation | .isdigit(), .isalpha() for quality checks | Input validation, ETL |
| 7 | Encoding | UTF-8 default; encode()/decode() for bytes | APIs, file I/O |
| 8 | Performance | join() >> += for loop concatenation | Large-scale text processing |
Instructor's End-of-Day Checklist
• [ ] I understand string immutability and its performance implications.
• [ ] I can use slicing to extract substrings efficiently.
• [ ] I know the difference between .find() (safe) and .index() (raises error).
• [ ] I can use f-strings with format specs for professional output.
• [ ] I understand encoding and can handle UTF-8/bytes conversion.
• [ ] I have completed all 5 practice tasks.
• [ ] I have reviewed all 25 interview questions.