๐ Day 06 : Sets
๐ฏ Enterprise Objective
Sets are the secret weapon of efficient Python code. While lists are for sequences, sets are for membership and uniqueness. Today we master set theory operations and discover how switching from a list to a set can speed up your data processing pipelines by 1000x.
๐ Strategic Overview
| # | Topic | Key Concept | Core Use Case | |
|---|---|---|---|---|
| 1 | Creation | Unordered, unique, {} or set() | Deduplication | |
| 2 | Operations | `\ | , &, -, ^` | Cohort Analysis |
| 3 | Methods | add(), discard() | Dynamic updating | |
| 4 | Relations | A <= B, issubset() | Validation | |
| 5 | Frozen Sets | frozenset(), immutable | Dict keys, sets of sets | |
| 6 | Performance | O(1) vs O(N) | Pipeline optimization |
1. Set Creation & Basics : Unordered, Unique Elements
A set is an unordered collection of unique, hashable elements. Sets are defined by curly braces {} (but without key-value pairs) or the set() constructor. Their superpower is fast O(1) membership testing and automatic deduplication.
| Creation Method | Example | Note |
|---|---|---|
| Curly braces | {1, 2, 3} | Values must be hashable |
set() constructor | set([1, 2, 2, 3]) | Automatically deduplicates |
| Empty set | set() | {} creates an empty dict! |
| Comprehension | {x**2 for x in range(5)} | Set comprehension |
๐ผ Why Data Analysts Care
โข Data Deduplication: unique_users = set(user_ids) โ instant deduplication
โข Membership Testing: Checking if an item exists is exponentially faster than lists
โข Vocabulary Building: Finding unique words in a document corpus
โ ๏ธ Empty Set Trap
{} creates an empty dictionary, not an empty set. You must use set() to create an empty set.
๐ง Pro Tip
Sets can only contain hashable (immutable) items. You cannot put a list or a dictionary inside a set, but you can put a tuple.
๐งช Concept Checks: Set Creation
Q1. Create a set of your 3 favorite colors. Then create an empty set. Print their types.
Q2. Convert the string "abracadabra" to a set. Print it. Explain why the length is much shorter.
Q3. Try to create a set containing a list: s = {1, 2, [3, 4]}. Catch the TypeError and print the error message (unhashable type).
Q4. Create a set containing a tuple: s = {1, 2, (3, 4)}. Print the set. Why does this work but lists do not?
Q5. Use a set comprehension to create a set of the lengths of these words: ["data", "science", "python", "data", "code"]. Print it.
2. Set Theory Operations : Mathematical Relational Logic
Python sets support mathematical operations like union, intersection, difference, and symmetric difference. These can be performed using operators (e.g., |, &) or methods (e.g., .union()). Operators require both operands to be sets, while methods accept any iterable.
| Operation | Operator | Method | Meaning | |
|---|---|---|---|---|
| Union | `A \ | B` | A.union(B) | Elements in A or B (or both) |
| Intersection | A & B | A.intersection(B) | Elements in both A and B | |
| Difference | A - B | A.difference(B) | Elements in A but not in B | |
| Symmetric Diff | A ^ B | A.symmetric_difference(B) | Elements in exactly one set |
๐ผ Why Data Analysts Care
โข Cohort Analysis: Find users who bought Product A AND Product B (Intersection)
โข Churn Analysis: Users active last month MINUS users active this month (Difference)
โข A/B Testing: Combine unique user IDs from two experiment arms (Union)
โ ๏ธ Order Matters in Difference
A - B is not the same as B - A. The difference operation is asymmetric.
๐งช Concept Checks: Set Operations
Q1. Given engineers = {"Alice", "Bob", "Charlie"} and managers = {"Charlie", "David", "Eve"}, find who is both an engineer and a manager.
Q2. Using the same sets, find all unique employees across both roles (Union). Print the result.
Q3. Find employees who are engineers but NOT managers. Find employees who are managers but NOT engineers. Print both.
Q4. Find employees who hold exactly ONE role (not both). Print the symmetric difference.
Q5. Demonstrate that operators require sets: try {1, 2} | [2, 3]. Then show how to fix it using the .union() method.
3. Set Methods & Modification : Dynamic Set Updates
Sets are mutable. You can add or remove elements using methods. Unlike lists, sets have no concept of index or order, so you cannot use append() or pop(i).
| Method | Action | Note |
|---|---|---|
.add(x) | Add one item | Does nothing if already exists |
.update(iter) | Add multiple items | Like .extend() for lists |
.remove(x) | Remove item x | Raises KeyError if x not found |
.discard(x) | Remove item x | Does nothing if x not found (safe) |
.pop() | Remove & return random item | Raises KeyError if empty |
.clear() | Remove all items | Leaves set() |
๐ผ Why Data Analysts Care
โข Incremental Processing: Add seen IDs to a set to track processed records: seen.add(row_id)
โข Safe Cleanup: Use .discard() to remove elements without needing if item in set: checks
โ ๏ธ remove() vs discard()
Always use .discard() unless you specifically want your program to crash when trying to remove an element that isn't there.
๐งช Concept Checks: Set Methods
Q1. Start with s = set(). Add elements 1, 2, and 2 again. Print the set.
Q2. Update the set with a list [3, 4, 5] and a string "hello". Print the result. Notice how the string is handled.
Q3. Demonstrate the difference between .remove() and .discard() by trying to remove an element that does not exist. Use a try/except block for .remove().
Q4. Use a while loop and .pop() to empty a set of 5 elements, printing each popped item. Print the final set to confirm it is empty.
Q5. Write a function add_if_missing(s, item) that adds an item and returns True if it was added, or False if it was already there.
4. Subset & Superset Relations : Comparative Logic
Sets can be compared to check if one is fully contained within another. A set is a subset if all its elements exist in the other set. It is a superset if it contains all elements of the other set. Two sets are disjoint if they share no common elements.
| Operation | Operator | Method | True if... |
|---|---|---|---|
| Subset | A <= B | A.issubset(B) | All elements of A are in B |
| Proper Subset | A < B | None | Subset AND A != B |
| Superset | A >= B | A.issuperset(B) | A contains all elements of B |
| Disjoint | None | A.isdisjoint(B) | A & B is empty |
๐ผ Why Data Analysts Care
โข Permission Checking: if user_roles.issubset(allowed_roles): โ verify access rights
โข Data Validation: if required_columns.issubset(df.columns): โ ensure data structure is correct
โข Overlap Checking: if isdisjoint โ verify two user segments are truly mutually exclusive
๐ง Pro Tip
Use <= and >= for subsets/supersets. A < B checks for a proper subset (A is a subset of B, but A is not equal to B).
๐งช Concept Checks: Subset & Superset
Q1. Given required = {"age", "name"}, check if it is a subset of columns = {"age", "name", "salary"} using both method and operator.
Q2. Check if columns is a superset of required using both method and operator. Print results.
Q3. Demonstrate the difference between a subset (<=) and a proper subset (<) using A = {1, 2} and B = {1, 2}.
Q4. Create two sets of tags for two different articles. Check if they have zero overlap using .isdisjoint(). Print the result.
Q5. Write a function has_all_permissions(user_perms, required_perms) that returns True only if the user has all required permissions. Test it.
5. Frozen Sets : Immutable Sets
A frozenset is an immutable version of a set. Once created, elements cannot be added or removed. Because it is immutable, a frozenset is hashable, meaning it can be used as a dictionary key or as an element inside another set.
| Feature | set | frozenset |
|---|---|---|
| Mutable? | Yes | No |
| Hashable? | No | Yes |
| Use as dict key? | No | Yes |
| Put inside a set? | No | Yes |
| Methods | All set methods | Only non-modifying methods (union, etc.) |
๐ผ Why Data Analysts Care
โข Compound Keys: distances = {frozenset(['NY', 'LA']): 2444} โ order-independent keys
โข Caching: Use frozenset as arguments to cached/memoized functions
โข Immutable Constraints: Prevent accidental modification of global rule sets
โ ๏ธ Sets of Sets
You cannot create a set of sets like {{1, 2}, {3, 4}} because sets are unhashable. You MUST use frozensets: {frozenset([1, 2]), frozenset([3, 4])}.
๐งช Concept Checks: Frozen Sets
Q1. Create a frozenset from a list [1, 2, 3]. Try to use .add(4) and catch the AttributeError. Print the error message.
Q2. Demonstrate that standard sets cannot be dict keys (catch TypeError), but frozensets can.
Q3. Create a dictionary where keys are frozensets of two users, and values are their relationship status. Retrieve a value using the reverse order of users.
Q4. Show that frozenset([1, 2, 3]) | frozenset([3, 4, 5]) works and returns a frozenset. Print the type of the result.
Q5. Create a set containing three frozensets. Print the set.
6. Sets vs Lists : O(1) Lookups & Performance
Sets use hash tables under the hood. This means checking if an item is in a set takes O(1) time (constant time), regardless of how large the set is. Checking if an item is in a list takes O(n) time, which gets slower as the list grows.
| Operation | List Time Complexity | Set Time Complexity | Speedup |
|---|---|---|---|
x in collection | O(n) | O(1) | Massive for large data |
| Deduplicate | O(n^2) (naive) | O(n) | Huge |
| Iteration | O(n) | O(n) | Equal |
๐ผ Why Data Analysts Care
โข Filtering Pipelines: Convert blacklist of IDs to a set before filtering a large DataFrame
โข Cross-referencing: Checking which users in Table A exist in Table B
โข Performance Optimization: Replacing if item in list: with if item in set: is the easiest performance win in Python
๐ง Pro Tip
If you need to check membership (in) inside a loop over a large dataset, ALWAYS convert the lookup collection to a set first.
๐งช Concept Checks: Performance
Q1. Recreate the performance benchmark: time looking up -1 (not in collection) in a list of 1,000,000 items vs a set. Print the times.
Q2. Given a huge text, extracting unique words using list.append (if not in list) vs set.add. Explain why the set approach is O(n) and list approach is O(n^2).
Q3. Write a function find_common(l1, l2) that finds common elements between two large lists. Convert them to sets and use intersection. Time the function.
Q4. Write a loop that filters a list of 100,000 numbers, keeping only those present in a valid_ids list of 10,000 numbers. Show how converting valid_ids to a set speeds it up.
Q5. When would you NOT use a set for lookups? (Hint: Order preservation, duplicate counting). Explain briefly.
๐ ๏ธ Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (Data Deduplicator): Write a function unique_sorted_chars(text) that takes a string, removes duplicates, removes spaces, and returns a sorted list of the unique characters. Test with 'hello world'.
Task 2 (Cohort Overlap): Given cohort_A = [1,2,3,4,5,6] and cohort_B = [4,5,6,7,8], use sets to find: IDs in both, IDs only in A, IDs only in B, and IDs in exactly one cohort. Print all 4 lists.
Task 3 (Vocabulary Analyzer): Write a function jaccard_similarity(doc1, doc2) that splits two strings into sets of words and calculates: len(intersection) / len(union). Test with two short sentences.
Task 4 (Access Controller): Given user_roles = {'read', 'write'} and admin_roles = {'read', 'write', 'delete'}, write code that uses subset/superset methods to check if the user has all admin roles, and if not, which ones are missing.
Task 5 (Fast Filter): Create a list of 100,000 random integers. Create a list of 5,000 'blacklist' integers. Filter the main list to remove blacklisted integers using a set for O(1) lookups. Print the final length.
๐ป Pure Coding Interview Questions
Q1.
Write a function has_duplicates(lst) that returns True if a list contains duplicates, using sets. Do it in one line.
Q2.
Write a function missing_numbers(lst, n) that finds all missing numbers from 1 to n in a list. Use sets.
Q3.
Write a function intersection_multiple(*sets) that returns the intersection of an arbitrary number of sets.
Q4.
Write a function symmetric_diff_multiple(*sets) that finds elements present in exactly one of the sets.
Q5.
Write a function is_pangram(text) that checks if a string contains every letter of the alphabet using sets.
Q6.
Given two lists of dictionaries, find the symmetric difference based on a specific key (e.g., 'id').
Q7.
Write a function remove_duplicates_preserve_order(lst) without using external libraries, but using a set for O(1) lookups.
Q8.
Implement the Jaccard similarity index for two lists of items. Handle empty sets gracefully.
Q9.
Write a function that takes a string and returns the length of the longest substring without repeating characters (sliding window + set).
Q10.
Write a function find_pairs_sum(lst, target) that finds all unique pairs that sum to target, using a set for O(n) time.
Q11.
Write a function find_common_chars(words) that returns a list of characters present in all words in a list.
Q12.
Write code to flatten a list of sets into a single set: [{1,2}, {2,3}, {3,4}] -> {1,2,3,4}.
Q13.
Explain the difference between s.remove(x) and s.discard(x). Write a wrapper function safe_remove(s, x) that mimics discard using remove.
Q14.
Write a function is_anagram_set_trap(s1, s2) that shows why set(s1) == set(s2) is NOT a valid way to check for anagrams.
Q15.
Create a dictionary acting as an undirected graph, where edges are represented by frozensets. frozenset([A, B]): weight.
Q16.
Given a list of lists, deduplicate the inner lists. [[1,2], [2,1], [3,4]] -> [[1,2], [3,4]]. Use frozensets.
Q17.
Write a function group_identical(lst) that groups identical dictionaries in a list using frozensets of their items.
Q18.
Write a function set_from_generator(n) that creates a set of the first n prime numbers using a generator expression inside set().
Q19.
Explain why {[]} raises a TypeError but {()} does not. Write code demonstrating this hashability concept.
Q20.
Write a function longest_consecutive_sequence(nums) that runs in O(n) time using a set.
Q21.
Write a function to find the elements that appear exactly once in an array where every other element appears twice. (Can use XOR or sets).
Q22.
Write a function word_break(s, word_dict) that checks if a string can be segmented into words from a dictionary. Convert dict to set.
Q23.
Write a function is_bipartite(graph) using two sets to color nodes and detect odd-length cycles.
Q24.
Implement a custom MySet class wrapping a dictionary to demonstrate how sets work under the hood.
Q25.
Write a function count_unique_vowels(word) using set intersection with {'a','e','i','o','u'}.
๐ Day 6 Executive Summary
| # | Topic | Key Takeaway | Professional Application | |
|---|---|---|---|---|
| 1 | Creation | set(), {1, 2}. Mutable, unordered. | Data deduplication | |
| 2 | Operations | Union ` | , Intersection &, Difference -` | Cohort and overlap analysis |
| 3 | Methods | add, discard (safe), remove (unsafe) | Incremental processing | |
| 4 | Relations | issubset <=, issuperset >= | Data validation, permissions | |
| 5 | Frozen Sets | Immutable, hashable sets | Sets of sets, Dict keys | |
| 6 | Performance | O(1) membership testing (in) | Massive optimization over lists |
โ Instructor's End-of-Day Checklist
โข [ ] I understand that sets are unordered and cannot be accessed by index.
โข [ ] I know how to deduplicate a list instantly using set().
โข [ ] I can perform union, intersection, and difference operations.
โข [ ] I understand the difference between .remove() and .discard().
โข [ ] I know when to use a frozenset (dict keys, sets of sets).
โข [ ] I understand why in set is exponentially faster than in list.
โข [ ] I have completed all 5 practice tasks.
โข [ ] I have reviewed all 25 interview questions.