
📊 Day 03 : Strings

🎯 Enterprise Objective

Strings are the primary vehicle for text data, one of the most common data types in real-world datasets. Today we master string manipulation end to end, from basic creation to performance optimization. You will learn not just string methods, but the professional text-processing patterns used in ETL pipelines and data cleaning.

📋 Strategic Overview

| # | Topic | Key Methods | Core Use Case |
|---|-------|-------------|---------------|
| 1 | Creation & Escaping | '', "", r'', ''' | File paths, SQL, multi-line |
| 2 | Indexing & Slicing | s[0], s[1:4], s[::-1] | Data extraction |
| 3 | Core Methods | .strip(), .replace(), .find() | Data cleaning |
| 4 | Split & Join | .split(), ','.join() | CSV/text parsing |
| 5 | Formatting | f-strings, .format() | Reports, dashboards |
| 6 | Validation | .isdigit(), .isalpha() | Input validation |
| 7 | Encoding | .encode(), .decode() | APIs, file I/O |
| 8 | Performance | join() vs += | Efficient processing |

1. String Creation & Escaping : Text Data Fundamentals

🔍 What is it?

A string is an immutable sequence of Unicode characters. You can create strings with single quotes '...', double quotes "...", or triple quotes '''...''' for multi-line text. Raw strings r'...' disable escape character processing.

| Syntax | Use Case | Example |
|--------|----------|---------|
| 'hello' | Simple text | Most common |
| "it's" | Text with apostrophes | Avoids escaping |
| """...""" | Multi-line / docstrings | SQL queries, docs |
| r'C:\path' | Raw string (no escapes) | File paths, regex |

Common Escape Characters:

| Escape | Meaning | Example |
|--------|---------|---------|
| \n | Newline | 'Line1\nLine2' |
| \t | Tab | 'Col1\tCol2' |
| \\ | Backslash | 'C:\\Users' |
| \' | Single quote | 'it\'s' |

💼 Why Data Analysts Care

• CSV/JSON parsing: Understanding escape characters is essential for parsing data files

• SQL queries: Triple-quoted strings hold multi-line SQL cleanly

• File paths: Raw strings such as r'C:\data\file.csv' prevent escape issues on Windows

⚠️ Immutability

Strings cannot be modified in place. Every operation like .upper() or + creates a new string object. This matters for performance in loops.
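
A minimal sketch of what immutability looks like in practice. The variable names are illustrative only.

```python
s = "hello"

# Item assignment is not supported on str objects.
try:
    s[0] = "H"
    mutated = True
except TypeError:
    mutated = False

# Methods return a new string and leave the original untouched.
shouted = s.upper()

print(mutated)   # False
print(s)         # hello
print(shouted)   # HELLO
```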

🧠 Pro Tip

Use triple-quoted strings for multi-line SQL queries: query = """SELECT * FROM users WHERE age > 18""". This keeps your code readable.
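
The quoting styles above can be compared side by side. This is a small sketch; the path and query are sample values, not taken from a real project.

```python
single = 'hello'
double = "it's"                      # no escaping needed for the apostrophe
raw_path = r'C:\data\file.csv'       # backslashes are kept literally
escaped_path = 'C:\\data\\file.csv'  # same text, written with escapes

# Triple quotes keep the line breaks that appear in the source.
query = """SELECT name, age
FROM users
WHERE age > 18"""

print(raw_path == escaped_path)  # True
print(len(raw_path))             # 16
print(repr('Col1\tCol2'))        # 'Col1\tCol2'
print(query.count('\n'))         # 2
```

repr() is useful here because it shows escape sequences instead of rendering them.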

In [ ]:

🧪 Concept Checks: Creation

Q1. Create strings using all 4 methods: single quotes, double quotes, triple quotes, and raw string. Print each with its type() and len().

In [ ]:

Q2. Write a string containing: a newline, a tab, and a backslash. Print it and then print its repr() to see the escape characters.

In [ ]:

Q3. Prove string immutability: create s = "hello", save id(s), then do s += " world". Compare the IDs. What does this prove?

In [ ]:

Q4. Create a raw string for the Windows path C:\Users\Admin\Documents\data.csv. Then create the same path using escape characters. Verify they are equal.

In [ ]:

Q5. Write a multi-line SQL query using triple quotes: SELECT name, age FROM users WHERE age > 18 ORDER BY name. Print it.

In [ ]:

2. Indexing & Slicing : Precision Text Extraction

🔍 What is it?

Strings support zero-based indexing and negative indexing (from the end). Slicing extracts substrings using the syntax s[start:stop:step] where stop is exclusive.

text = 'PYTHON'
#       P  Y  T  H  O  N
#       0  1  2  3  4  5   (positive index)
#      -6 -5 -4 -3 -2 -1  (negative index)

| Operation | Syntax | Result |
|-----------|--------|--------|
| First char | s[0] | 'P' |
| Last char | s[-1] | 'N' |
| Slice | s[1:4] | 'YTH' |
| Reverse | s[::-1] | 'NOHTYP' |
| Every 2nd | s[::2] | 'PTO' |

💼 Why Data Analysts Care

• Column extraction: Extract substrings from fixed-width data: record[0:10] for name field

• Data cleaning: phone[-10:] extracts the last 10 digits of a phone number once separators such as hyphens have been removed

• Log parsing: Slice timestamps from log lines: log_line[:19] for ISO datetime

⚠️ Off-by-One

The stop index is exclusive: 'PYTHON'[0:3] gives 'PYT' (indices 0, 1, 2), not 'PYTH'. This is a common source of bugs.

🧠 Pro Tip

Reverse a string with s[::-1]. Check for palindromes: s == s[::-1].
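
The indexing and slicing rules above can be checked with a short sketch. The log line is a made-up sample and is_palindrome is a hypothetical helper name.

```python
text = 'PYTHON'
first = text[0]              # 'P'
last = text[-1]              # 'N'
middle = text[1:4]           # 'YTH' (index 4 is excluded)
every_second = text[::2]     # 'PTO'
reversed_text = text[::-1]   # 'NOHTYP'

# Fixed-position fields in an ISO-style timestamp (sample log line).
log_line = '2024-01-15T14:30:22 ERROR Database connection failed'
timestamp = log_line[:19]    # '2024-01-15T14:30:22'
year = log_line[0:4]         # '2024'

def is_palindrome(s):
    """Return True when s reads the same forwards and backwards."""
    return s == s[::-1]

print(middle, every_second, reversed_text)
print(timestamp, year)
print(is_palindrome('level'), is_palindrome(text))  # True False
```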

In [ ]:

🧪 Concept Checks: Indexing & Slicing

Q1. Given s = "Data Analytics", extract: first word, last word, every other character, and the reversed string. Print each.

In [ ]:

Q2. Given iso_date = "2024-12-25T14:30:00", use slicing to extract: year, month, day, hour, minute, second. Print a formatted result.

In [ ]:

Q3. Write code to check if word = "racecar" is a palindrome using slicing. Print the result with an explanation.

In [ ]:

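Q4. Given phone = "+91-98765-43210", extract just the 10-digit number. The last 10 characters ("8765-43210") include a hyphen and drop the leading 9, so slice off the country code with phone[4:] and remove the hyphen with .replace(). Print the result.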

In [ ]:

Q5. Write code that reverses each word in sentence = "Hello World Python" but keeps word order. Use split() and slicing.

In [ ]:

3. Core String Methods : Search, Replace & Transform

🔍 What is it?

Python strings have 40+ built-in methods. The most important ones for data work are: upper(), lower(), strip(), replace(), find(), count(), startswith(), and endswith(). None of them modifies the original string (immutability): the transforming methods return new strings, find() and count() return integers, and startswith() and endswith() return booleans.

| Method | Purpose | Example |
|--------|---------|---------|
| .upper() / .lower() | Case conversion | Standardize text |
| .strip() | Remove leading/trailing whitespace | Clean user input |
| .replace(old, new) | Replace substrings | Data correction |
| .find(sub) | Find position (-1 if absent) | Safe search |
| .count(sub) | Count occurrences | Frequency analysis |
| .startswith() | Check prefix | Filter by pattern |
| .endswith() | Check suffix | File type detection |

💼 Why Data Analysts Care

• Data standardization: name.strip().title() cleans and capitalizes names

• Text cleaning: text.replace('\n', ' ').strip() replaces newlines with spaces and trims the ends

• File filtering: if filename.endswith('.csv'): filters by file type

• Search: .find() returns -1 instead of raising an error, unlike .index()

⚠️ find() vs index()

.find() returns -1 if the substring is not found; .index() raises ValueError. Prefer .find() when a missing substring is an expected case, and always check for -1 before using the result as an index, because -1 is itself a valid index (the last character).
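
A small sketch of the two search methods on a made-up record string, including what happens when a -1 result is used as an index by mistake.

```python
record = 'id=42;name=Ada'   # sample data, not from a real dataset

first_eq = record.find('=')                  # 2
second_eq = record.find('=', first_eq + 1)   # 10 (search starts after the first hit)
missing = record.find('|')                   # -1, no exception

# .index() signals absence with an exception instead.
try:
    record.index('|')
    index_raised = False
except ValueError:
    index_raised = True

# Pitfall: -1 silently selects the last character.
wrong_char = record[missing]                 # 'a'

print(first_eq, second_eq, missing, index_raised, wrong_char)
```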

🧠 Pro Tip

Chain methods for clean data pipelines: name.strip().lower().replace(' ', '_') converts ' John Doe ' to 'john_doe'.
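
The chaining pattern can be wrapped in a small helper and applied to a list. This is a sketch: to_key and the sample values are hypothetical.

```python
raw_names = ['  John Doe ', 'ALICE smith', ' bob  ']

def to_key(name):
    """Trim, lowercase, and replace spaces with underscores."""
    return name.strip().lower().replace(' ', '_')

keys = [to_key(n) for n in raw_names]
print(keys)  # ['john_doe', 'alice_smith', 'bob']

# Lowercasing first makes the suffix check case-insensitive.
filenames = ['report.csv', 'image.png', 'DATA.CSV']
csv_files = [f for f in filenames if f.lower().endswith('.csv')]
print(csv_files)  # ['report.csv', 'DATA.CSV']
```

Each call in the chain returns a new string, so the order matters: stripping before replacing avoids leading or trailing underscores.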

In [ ]:

🧪 Concept Checks: String Methods

Q1. Given name = " alice SMITH ", clean it to produce "Alice Smith" using method chaining (strip, title). Print the result.

In [ ]:

Q2. Given csv_line = "John,25,Engineer,NYC", use .find() to locate the position of the second comma. Print the result.

In [ ]:

Q3. Count how many times the word "data" appears in text = "Data science uses data to derive data-driven insights" (case-insensitive). Print the count.

In [ ]:

Q4. Given a list of filenames ["report.csv", "image.png", "data.csv", "notes.txt"], use .endswith() to filter only .csv files. Print the result.

In [ ]:

Q5. Write code that replaces all spaces in "hello world python" with underscores, then converts to uppercase. Chain the methods in one line.

In [ ]:

4. Splitting & Joining : Text Decomposition & Assembly

🔍 What is it?

split() breaks a string into a list of substrings based on a delimiter. join() does the reverse: it combines a list of strings into one string with a separator. These two methods are the backbone of text data processing.

# split: string → list
'a,b,c'.split(',')        # ['a', 'b', 'c']

# join: list → string
','.join(['a', 'b', 'c'])  # 'a,b,c'

💼 Why Data Analysts Care

• CSV parsing: row.split(',') for manual CSV field extraction

• Log analysis: log_line.split() splits on whitespace for field extraction

• Data export: ','.join(columns) builds CSV rows for output

• Path building: '/'.join(['home', 'user', 'data']) constructs simple file paths

⚠️ split() with No Arguments

'a  b  c'.split() (two spaces between the letters) splits on any run of whitespace and discards empty strings, giving ['a', 'b', 'c']. 'a  b  c'.split(' ') splits on every single space, producing ['a', '', 'b', '', 'c'].

🧠 Pro Tip

Use splitlines() for multi-line text: it handles \n, \r\n, and \r correctly across all platforms.
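
A short sketch of the split and join behaviors described above, using made-up sample strings. Note that the first string contains two spaces between each letter.

```python
spaced = 'a  b  c'   # two spaces between the letters

no_arg = spaced.split()        # ['a', 'b', 'c']
one_space = spaced.split(' ')  # ['a', '', 'b', '', 'c']

fields = 'name,age,city'.split(',')
row = ' | '.join(fields)       # 'name | age | city'

# join() accepts only strings, so convert other types first.
numbers = [1, 2, 3]
joined = ','.join(str(n) for n in numbers)  # '1,2,3'

# splitlines() recognizes \n, \r\n, and \r as line boundaries.
mixed = 'line1\nline2\r\nline3\rline4'
lines = mixed.splitlines()

print(no_arg, one_space)
print(row, joined)
print(lines)  # ['line1', 'line2', 'line3', 'line4']
```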

In [ ]:

🧪 Concept Checks: Split & Join

Q1. Given csv = "name,age,city,salary", split into a list, then rejoin with " | " as separator. Print both results.

In [ ]:

Q2. Given path = "/home/user/data/file.csv", split by "/" and extract just the filename. Print it.

In [ ]:

Q3. Split text = "one  two   three four" (note the runs of two and three spaces) using both .split() and .split(" "). Print both results and explain the difference.

In [ ]:

Q4. Given words = ["SELECT", "name", "FROM", "users"], join them with spaces to build a SQL query string. Print it.

In [ ]:

Q5. Write code that reads a multi-line string (use triple quotes with 3 lines) and splits it into individual lines using .splitlines(). Print each line with its index.

In [ ]:

5. String Formatting : Professional Output Generation

🔍 What is it?

Python offers three formatting approaches: f-strings (Python 3.6+, fastest and most readable), .format(), and %-formatting (legacy). F-strings embed expressions directly inside {} braces.

| Format Spec | Meaning | Example | Result |
|-------------|---------|---------|--------|
| :.2f | 2 decimal places | f'{3.14159:.2f}' | 3.14 |
| :, | Thousands separator | f'{1000000:,}' | 1,000,000 |
| :>10 | Right-align (width 10) | f'{"hi":>10}' | '        hi' |
| :<10 | Left-align (width 10) | f'{"hi":<10}' | 'hi        ' |
| :^10 | Center-align (width 10) | f'{"hi":^10}' | '    hi    ' |
| :.2% | Percentage | f'{0.856:.2%}' | 85.60% |

💼 Why Data Analysts Care

• Report generation: Formatted tables, aligned columns, currency values

• Logging: f'[{timestamp}] {level}: {message}' for structured log output

• Dashboard metrics: f'{revenue:,.2f}' for professional number formatting

🧠 Pro Tip

F-strings can contain almost any Python expression: f'{len(data):,} records processed in {elapsed:.1f}s'. They are evaluated at runtime. Before Python 3.12, the expression part cannot contain a backslash.
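
The format specs from the table can be tried directly. This is a sketch with sample values; the variable names are illustrative.

```python
revenue = 1234567.891
ratio = 0.856
name = 'hi'

money = f'{revenue:,.2f}'   # '1,234,567.89'
percent = f'{ratio:.2%}'    # '85.60%'
right = f'{name:>10}'       # '        hi'
left = f'{name:<10}'        # 'hi        '
center = f'{name:^10}'      # '    hi    '

# The same spec works with str.format().
same_money = '{:,.2f}'.format(revenue)

print(money, percent)
print(repr(right), repr(left), repr(center))
print(money == same_money)  # True
```

Printing with repr() makes the padding spaces visible.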

In [ ]:

🧪 Concept Checks: Formatting

Q1. Given revenue = 1234567.89, print it as currency with commas and 2 decimal places: $1,234,567.89. Use an f-string.

In [ ]:

Q2. Create a formatted table: print 3 products with name (left-aligned, 15 chars) and price (right-aligned, 8 chars, 2 decimals). Use f-string alignment.

In [ ]:

Q3. Given ratio = 0.8567, print it as a percentage with 1 decimal place: 85.7%. Use the % format specifier.

In [ ]:

Q4. Print the number 42 in binary, octal, and hexadecimal using f-string format specs (:b, :o, :x). Print all three.

In [ ]:

Q5. Write an f-string that embeds a conditional expression: print "Even" or "Odd" for n = 7 directly inside the f-string.

In [ ]:

6. String Validation Methods : Data Quality Checks

🔍 What is it?

Python strings have built-in validation methods that return True or False. These are essential for input validation and data quality checks before processing.

| Method | Returns True if... | Example |
|--------|--------------------|---------|
| .isdigit() | All characters are digits | '123'.isdigit() → True |
| .isalpha() | All characters are letters | 'abc'.isalpha() → True |
| .isalnum() | Letters or digits only | 'abc123'.isalnum() → True |
| .isspace() | All whitespace | ' '.isspace() → True |
| .isupper() | All uppercase | 'ABC'.isupper() → True |
| .islower() | All lowercase | 'abc'.islower() → True |
| .istitle() | Title case | 'Hello World'.istitle() → True |

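💼 Why Data Analysts Care

• Input validation: if user_id.isdigit(): is a quick check before int conversion

• Data cleaning: Filter rows where a column should be numeric but contains text

• ETL pipelines: Validate data quality before loading into databases

⚠️ isdigit() vs isnumeric()

isdigit() is not limited to 0-9: it also returns True for characters such as '\u00B2' (superscript 2), which int() rejects. isnumeric() is broader still and also matches characters such as '\u00BD' (the fraction one half); isdecimal() is the strictest of the three. None of them accepts a sign or a decimal point. For data work, the most reliable check is try/except around int() or float().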
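
The sketch below records how isdecimal(), isdigit(), and isnumeric() classify a few sample strings, so the behavior can be observed rather than assumed. to_int is a hypothetical helper showing the try/except approach.

```python
# '\u00b2' is superscript two, '\u00bd' is the fraction one half.
samples = ['123', '\u00b2', '\u00bd', '-5', '12.5', '']

# Map each sample to (isdecimal, isdigit, isnumeric).
checks = {s: (s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples}
for s, result in checks.items():
    print(repr(s), result)

def to_int(s):
    """Return int(s), or None when s is not a valid integer literal."""
    try:
        return int(s)
    except (TypeError, ValueError):
        return None

print(to_int('42'), to_int('-5'), to_int('\u00b2'), to_int('12.5'))
# 42 -5 None None
```

The try/except version accepts signs and surrounding whitespace, which the is*() methods reject.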
In [ ]:

🧪 Concept Checks: Validation

Q1. Given a list inputs = ["123", "12.5", "abc", "45", ""], use .isdigit() to filter only valid integers. Print the valid ones.

In [ ]:

Q2. Write a function validate_username(name) that returns True only if: length 3-20, alphanumeric only. Test with 5 examples.

In [ ]:

Q3. Given data = ["Hello", "WORLD", "mixedCase", "Title Case"], classify each as upper, lower, title, or mixed. Use .isupper(), .islower(), .istitle().

In [ ]:

Q4. Write code that checks if s = " \t\n " is all whitespace using .isspace(). Then check "" (empty string). What does empty return? Explain.

In [ ]:

Q5. Write a safe to_float(s) function that handles strings like "3.14", "-2.5", "abc", "". Return None for invalid inputs. Test with 5 cases.

In [ ]:

7. Encoding & Unicode : Global Text Processing

🔍 What is it?

Python 3 strings are sequences of Unicode code points. UTF-8 is the default encoding used by encode() and decode(), not the in-memory representation. When working with files, APIs, or databases, you must handle encoding correctly. encode() converts str → bytes, decode() converts bytes → str.

text = 'Hello'
bytes_obj = text.encode('utf-8')   # b'Hello'
back = bytes_obj.decode('utf-8')   # 'Hello'

💼 Why Data Analysts Care

• API responses: JSON/REST APIs often return bytes that need decoding

• File I/O: open(file, encoding='utf-8'); always specify the encoding

• International data: Names, addresses, currencies in non-Latin scripts need proper Unicode handling

⚠️ UnicodeDecodeError

Reading a file with the wrong encoding can raise UnicodeDecodeError or, worse, silently produce garbled text. Specify the encoding explicitly (encoding='utf-8' when the file is UTF-8), or detect it with a library such as chardet.
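
A minimal sketch of what a wrong encoding does, using a short sample string. It shows the character count versus byte count, a decode that fails, and a decode that succeeds but yields garbled text.

```python
text = 'caf\u00e9 \u20ac5'          # 'café €5'
data = text.encode('utf-8')

print(len(text), len(data))         # 7 10 (é takes 2 bytes, € takes 3)
print(data.decode('utf-8') == text) # True

# Latin-1 bytes are not always valid UTF-8: this decode fails.
latin1_bytes = '\u00e9'.encode('latin-1')   # b'\xe9'
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# UTF-8 bytes read as Latin-1 decode without error, but the text is wrong.
garbled = '\u00e9'.encode('utf-8').decode('latin-1')   # 'Ã©'

# errors='replace' substitutes U+FFFD instead of raising.
replaced = latin1_bytes.decode('utf-8', errors='replace')

print(decode_failed, garbled, repr(replaced))
```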

In [ ]:

🧪 Concept Checks: Encoding

Q1. Encode text = "Python" to UTF-8 bytes. Print the bytes object and its length. Then decode it back and verify equality.

In [ ]:

Q2. Compare the byte length of "A" vs "\u00C9" (accented E) vs a Chinese character "\u4e16" in UTF-8. Print each character and its byte count.

In [ ]:

Q3. Use ord() to print the Unicode code point of each character in "Hello". Then use chr() to reconstruct the string from code points.

In [ ]:

Q4. Write code that safely decodes a bytes object, trying UTF-8 first, then Latin-1 as fallback. Use try/except with .decode().

In [ ]:

Q5. Create a string with mixed content: English letters, numbers, and at least one non-ASCII symbol (for example a currency sign or an accented letter). Print its len() (characters) and len(s.encode()) (bytes). Explain the difference.

In [ ]:

8. String Performance : Efficient Text Processing

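🔍 What is it?

Since strings are immutable, concatenation in a loop can create a new temporary string on every iteration, and in the worst case the total work grows quadratically with the size of the output. For building large strings, collect the pieces in a list and call join(), or write to an io.StringIO, instead of using +=. CPython can sometimes optimize += in place, but that is an implementation detail and should not be relied on.

| Approach | Speed | Memory | Use When |
|----------|-------|--------|----------|
| += in loop | Slow | High | Never for large data |
| ''.join(list) | Fast | Low | Building strings in loops |
| f-strings | Fastest | Low | Single-line formatting |
| io.StringIO | Fast | Medium | Stream-like building |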

💼 Why Data Analysts Care

• ETL pipelines: Building CSV output with join() instead of += avoids repeated copying on large datasets

• Report generation: Use join() for assembling multi-line reports

• Memory management: Knowing string interning helps debug identity issues

🧠 Pro Tip

CPython interns some strings, such as short identifier-like literals. 'hello' is 'hello' may be True due to this caching, but never rely on it: always use == for comparison.
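
The three building strategies can be written side by side. This sketch checks only that they produce the same output, not how fast they are; the function names are hypothetical.

```python
import io

def build_with_join(n):
    """Collect the pieces lazily and join them once."""
    return ','.join(str(i) for i in range(n))

def build_with_stringio(n):
    """Write pieces to an in-memory text buffer."""
    buf = io.StringIO()
    for i in range(n):
        if i:
            buf.write(',')
        buf.write(str(i))
    return buf.getvalue()

def build_with_concat(n):
    """Repeated +=, shown only for comparison."""
    out = ''
    for i in range(n):
        out += (',' if i else '') + str(i)
    return out

print(build_with_join(5))      # 0,1,2,3,4
print(build_with_join(5) == build_with_stringio(5) == build_with_concat(5))  # True
```

To compare speed, each function could be timed with the standard timeit module on a larger n.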

In [ ]:

🧪 Concept Checks: Performance

Q1. Build a string of numbers "0,1,2,...,999" using: (a) += in a loop, (b) ",".join(). Time both approaches and print the speedup ratio.

In [ ]:

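Q2. Demonstrate string interning: test a = "hello"; b = "hello"; print(a is b). Then test with a = "hello world". Explain the difference. Results can vary with the Python version and with whether the code runs in a script, a notebook cell, or the REPL.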

In [ ]:

Q3. Write code that builds a CSV string from data = [("Alice",25), ("Bob",30), ("Charlie",35)] using join(). Print the result.

In [ ]:

Q4. Use sys.getsizeof() to measure memory of: empty string, "a", "hello", "a"*1000. Print each size. What pattern do you notice?

In [ ]:

Q5. Write a function build_report(rows) that takes a list of dicts and returns a formatted table string using join(). Test with 3 sample rows.

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Data Cleaner): Write a function clean_name(name) that: strips whitespace, converts to title case, replaces multiple spaces with single space, and removes non-alphabetic characters (except spaces). Test with ' john DOE 3rd '.

In [ ]:

Task 2 (CSV Parser): Write a function parse_csv_line(line) that splits a CSV line by commas, strips each field, and returns a list. Handle edge case: fields containing commas inside quotes. Test with 'Alice, 28, "New York, NY"'.

In [ ]:

Task 3 (Log Analyzer): Given log = "2024-01-15 14:30:22 ERROR Database connection failed", extract: date, time, level, message using string methods only (no regex). Print each part.

In [ ]:

Task 4 (Email Validator): Write a function validate_email(email) that checks: contains exactly one @, has text before and after @, domain has a dot, no spaces. Return True/False. Test with 5 valid and 5 invalid emails.

In [ ]:

Task 5 (Text Statistics): Write a function text_stats(text) that returns a dict with: character count, word count, sentence count, average word length, most common word. Test with a paragraph of text.

In [ ]:

💻 Pure Coding Interview Questions

Q1.

Write a function reverse_words(s) that reverses word order: 'hello world' → 'world hello'. Do NOT reverse individual characters.

In [ ]:

Q2.

Write a function is_anagram(s1, s2) that checks if two strings are anagrams (case-insensitive, ignoring spaces). Test with 'listen' and 'silent'.

In [ ]:

Q3.

Write a function compress(s) implementing run-length encoding: 'aabcccdd' → 'a2b1c3d2'. Return the encoded form only if it is no longer than the original (the example above is exactly the same length, 8 characters).

In [ ]:

Q4.

Write a function first_non_repeating(s) that finds the first non-repeating character. 'aabbc' → 'c'. Return None if all repeat.

In [ ]:

Q5.

Write a function caesar_cipher(text, shift) that shifts each letter by shift positions. Handle wrapping (z→a) and preserve non-letters.

In [ ]:

Q6.

Write a function longest_common_prefix(strs) that finds the longest common prefix in a list of strings. ['flower','flow','flight'] → 'fl'.

In [ ]:

Q7.

Write a function valid_parentheses(s) that checks if brackets are balanced: '([{}])' → True, '([)]' → False.

In [ ]:

Q8.

Write a function count_vowels(s) that returns a dict of vowel frequencies (case-insensitive). Test with a sentence.

In [ ]:

Q9.

Write a function title_case(s) that capitalizes the first letter of each word, except articles (a, an, the). First word always capitalized.

In [ ]:

Q10.

Write a function remove_duplicates(s) that removes duplicate characters preserving order: 'abcabc' → 'abc'.

In [ ]:

Q11.

Write a function zigzag(s, rows) that converts text to zigzag pattern and reads row by row. 'PAYPALISHIRING' with 3 rows → 'PAHNAPLSIIGYIR'.

In [ ]:

Q12.

Write a function word_pattern(pattern, s) that checks if string follows pattern: pattern='abba', s='dog cat cat dog' → True.

In [ ]:

Q13.

Write a function group_anagrams(words) that groups anagrams together. ['eat','tea','tan','ate','nat','bat'] → grouped lists.

In [ ]:

Q14.

Write a function find_overlapping(s, sub) that counts all overlapping occurrences of sub in s. E.g., find_overlapping('aaa', 'aa') returns 2.

In [ ]:

Q15.

Write a function pad_number(n, width) that pads a number with leading zeros to the given width. E.g., pad_number(42, 5) returns '00042'.

In [ ]:

Q16.

Write code to implement str.replace() from scratch: my_replace(text, old, new). Handle overlapping patterns.

In [ ]:

Q17.

Write a function repeat_chars(s, n) that repeats each character n times: repeat_chars('abc', 3) returns 'aaabbbccc'.

In [ ]:

Q18.

Write a function longest_palindrome_substring(s) that finds the longest palindromic substring in a string.

In [ ]:

Q19.

Write a function atoi(s) that converts string to integer handling: whitespace, signs, overflow, invalid chars. Mimic int() behavior.

In [ ]:

Q20.

Write a function justify_text(text, width) that fully justifies text to given width by distributing spaces evenly between words.

In [ ]:

Q21.

Write a function compare_version(v1, v2) that compares version strings: '1.2.3' vs '1.2.4' → -1. Handle different lengths.

In [ ]:

Q22.

Write a function interleave(s1, s2) that interleaves two strings: 'abc','xyz' → 'axbycz'. Handle different lengths.

In [ ]:

Q23.

Write a function count_substrings(s, sub) that counts overlapping occurrences: 'aaa' contains 'aa' twice.

In [ ]:

Q24.

Write a function to_snake_case(s) converting 'camelCaseString' → 'camel_case_string'. Handle consecutive capitals.

In [ ]:

Q25.

Write a function expand_range(s) that expands: '1-5,8,11-14' → [1,2,3,4,5,8,11,12,13,14].

In [ ]:

📊 Day 3 Executive Summary

| # | Topic | Key Takeaway | Professional Application |
|---|-------|--------------|--------------------------|
| 1 | Creation | 4 ways to create; strings are immutable | File paths, SQL queries |
| 2 | Indexing & Slicing | Zero-based; stop is exclusive; [::-1] reverses | Log parsing, data extraction |
| 3 | Core Methods | .strip(), .replace(), .find(); chain them | Data cleaning pipelines |
| 4 | Split & Join | split() → list; join() → string | CSV/text parsing |
| 5 | Formatting | f-strings are fastest and most readable | Reports, dashboards |
| 6 | Validation | .isdigit(), .isalpha() for quality checks | Input validation, ETL |
| 7 | Encoding | UTF-8 default; encode()/decode() for bytes | APIs, file I/O |
| 8 | Performance | Prefer join() over += for loop concatenation | Large-scale text processing |

✅ Instructor's End-of-Day Checklist

• [ ] I understand string immutability and its performance implications.

• [ ] I can use slicing to extract substrings efficiently.

• [ ] I know the difference between .find() (safe) and .index() (raises error).

• [ ] I can use f-strings with format specs for professional output.

• [ ] I understand encoding and can handle UTF-8/bytes conversion.

• [ ] I have completed all 5 practice tasks.

• [ ] I have reviewed all 25 interview questions.