Module 1: Introduction to Data Science
1. Explain Python’s role in modern data science workflows.
Python is a cornerstone of data science due to its simplicity, versatility, and rich ecosystem of libraries. It supports data ingestion (e.g., Pandas, SQLAlchemy), preprocessing (NumPy, Pandas), visualization (Matplotlib, Seaborn, Plotly), machine learning (Scikit-learn, TensorFlow, PyTorch), and deployment (Flask, FastAPI). Its community-driven development ensures compatibility with modern tools like Jupyter notebooks for interactive analysis and cloud platforms (AWS, GCP) for scalable pipelines. Python’s flexibility allows seamless integration with big data frameworks (PySpark, Dask) and LLMs (Hugging Face), making it ideal for end-to-end workflows.
2. Compare Python with R for statistical computing. What are Python’s advantages?
- Python: General-purpose, with a broader ecosystem (e.g., Pandas, NumPy, Scikit-learn). Excels in production-grade applications, web integration (Flask, Django), and scalability (PySpark, Dask). Strong for deep learning (TensorFlow, PyTorch) and cloud deployments.
- R: Specialized for statistical analysis and visualization (ggplot2, dplyr). Preferred in academia for advanced statistical modeling but less flexible for non-statistical tasks.
- Python’s Advantages: Wider applicability, better integration with production systems, stronger support for modern ML/AI frameworks, and a larger developer community. Python’s performance optimizations (e.g., NumPy’s C-based operations) and scalability make it more versatile for 2025’s data science demands.
3. Describe the lifecycle of a data science project.
- Problem Definition: Identify business objectives and translate them into data science goals (e.g., predict churn).
- Data Collection: Gather data from APIs, databases, or web scraping.
- Data Cleaning: Handle missing values, outliers, and inconsistencies (Pandas, NumPy).
- Exploratory Data Analysis (EDA): Visualize trends and correlations (Seaborn, Matplotlib).
- Feature Engineering: Create or transform features to improve model performance.
- Modeling: Train and evaluate models (Scikit-learn, XGBoost, TensorFlow).
- Validation: Use cross-validation or train-test splits to assess model robustness.
- Deployment: Integrate models into production (Flask, AWS Lambda, FastAPI).
- Monitoring: Track model performance and retrain as needed (MLOps tools like MLflow).
- Communication: Present insights to stakeholders via reports or dashboards (Plotly Dash).
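A compressed sketch of the middle stages (data prep through validation) on synthetic churn-style data; the column names and model choice here are illustrative only, not prescriptive:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Collection (here: tiny synthetic churn-like data)
df = pd.DataFrame({"tenure": [1, 24, 5, 36, 2, 48],
                   "monthly_fee": [70, 30, 80, 25, 90, 20],
                   "churned": [1, 0, 1, 0, 1, 0]})
# Cleaning and feature engineering would normally happen here
X, y = df[["tenure", "monthly_fee"]], df["churned"]
# Modeling and validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))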
Module 2: Python Basics
4. What is the difference between is and == in Python?
- ==: Checks for value equality. Compares the values of two objects.
- is: Checks for identity equality. Verifies if two objects point to the same memory location.
Example:
a = [1, 2, 3]
b = [1, 2, 3]
print(a == b) # True (same values)
print(a is b) # False (different objects)
c = a
print(a is c) # True (same memory location)
Note: Small integers (-5 to 256) and some strings may share memory due to Python’s interning, so is can be misleading for these.
5. Explain mutability with examples (e.g., lists vs. tuples).
- Mutable: Objects whose state can be changed after creation (e.g., lists, dictionaries).
- Immutable: Objects whose state cannot be changed (e.g., tuples, strings, integers).
Example:
# Mutable: List
my_list = [1, 2, 3]
my_list[0] = 10 # Modifies list
print(my_list) # [10, 2, 3]
# Immutable: Tuple
my_tuple = (1, 2, 3)
my_tuple[0] = 10 # Error: TypeError
Implication: Mutable objects are flexible but risk unintended changes; immutable objects are safer for thread-safe or hashable use (e.g., dictionary keys).
6. Write code to reverse a string using slicing.
text = "Hello, World!"
reversed_text = text[::-1]
print(reversed_text) # !dlroW ,olleH
Explanation: The slice [::-1] uses a step of -1 to traverse the string backward.
7. How does Python manage memory for immutable objects like integers?
Python uses object interning for small integers (-5 to 256) and some strings to save memory. Immutable objects cannot change, so Python reuses them when possible. When an immutable object is created, Python checks if it exists in memory; if so, it reuses the reference.
Example:
a = 42
b = 42
print(a is b) # True (interned integer)
a = 1000
b = 1000
print(a is b) # Typically False (not interned), though constant folding can make this True in a script
The garbage collector (using reference counting) deallocates objects when their reference count drops to zero.
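A small CPython-specific demonstration of reference counting (note that sys.getrefcount reports one extra reference for its own argument):
import sys
x = []
print(sys.getrefcount(x))  # 2: x plus the temporary argument reference
y = x
print(sys.getrefcount(x))  # 3: x, y, plus the argument reference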
8. When would you use a while loop instead of a for loop?
Use a while loop when:
- The number of iterations is unknown (e.g., based on a condition).
- You need to wait for an external event (e.g., user input, API response).
Example:
# While loop for unknown iterations
count = 0
while count < 5:
    print(count)
    count += 1
Use a for loop for known iterations (e.g., iterating over a list or range).
9. Write a lambda function to filter even numbers from a list.
numbers = [1, 2, 3, 4, 5, 6]
even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
print(even_numbers) # [2, 4, 6]
Explanation: The lambda x: x % 2 == 0 function returns True for even numbers, and filter() retains those elements.
10. What are the differences between positional, keyword, and arbitrary arguments?
- Positional Arguments: Passed in order, matched by position.
- Keyword Arguments: Passed with a name, order doesn’t matter.
- Arbitrary Arguments: Allow variable numbers of arguments (*args for positional, **kwargs for keyword).
Example:
def func(a, b, *args, key="default", **kwargs):
    print(a, b, args, key, kwargs)
func(1, 2, 3, 4, key="value", extra="data")
# Output: 1 2 (3, 4) value {'extra': 'data'}
- a, b: Positional.
- *args: Captures extra positional args as a tuple.
- key: Keyword arg with default value.
- **kwargs: Captures extra keyword args as a dictionary.
Module 3: Data Types & Utilities
11. How does list comprehension improve performance compared to loops?
List comprehensions are faster because they’re implemented in C and avoid Python’s interpreter overhead for loop operations. They also create the list in one step, reducing memory allocation overhead.
Example:
# Loop
squares = []
for i in range(1000):
    squares.append(i**2)
# List comprehension (faster)
squares = [i**2 for i in range(1000)]
Benchmark: Comprehensions can be 20-30% faster for large lists.
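A quick way to check the claim locally (absolute numbers vary by machine and Python version):
import timeit
loop = '''
squares = []
for i in range(1000):
    squares.append(i**2)
'''
comp = "squares = [i**2 for i in range(1000)]"
print(timeit.timeit(loop, number=10000))
print(timeit.timeit(comp, number=10000))  # Typically noticeably lower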
12. Explain backward indexing in Python with an example.
Backward indexing uses negative indices to access elements from the end of a sequence. -1 refers to the last element, -2 to the second-to-last, etc.
Example:
text = "Python"
print(text[-1]) # n
print(text[-3:]) # hon
13. Why are tuples faster than lists for certain operations?
Tuples are immutable, so Python allocates fixed memory, reducing overhead for resizing or modification. Tuples also have a simpler internal structure, making iteration and access slightly faster.
Example:
import timeit
print(timeit.timeit("for x in [1, 2, 3]: pass", number=1000000)) # ~0.07s
print(timeit.timeit("for x in (1, 2, 3): pass", number=1000000)) # ~0.05s
14. Write code to merge two dictionaries in Python (3.9+ syntax).
dict1 = {"a": 1, "b": 2}
dict2 = {"c": 3, "d": 4}
merged = dict1 | dict2 # Python 3.9+ merge operator
print(merged) # {'a': 1, 'b': 2, 'c': 3, 'd': 4}
Alternative (pre-3.9): merged = {**dict1, **dict2}, or dict1.update(dict2) to merge in place (modifies dict1).
15. How do sets ensure uniqueness of elements?
Sets use a hash table internally, where each element’s hash determines its position. Duplicate elements produce the same hash, so only one copy is stored. Elements must be immutable (e.g., numbers, strings, tuples) to be hashable.
Example:
my_set = {1, 2, 2, 3}
print(my_set) # {1, 2, 3}
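And because elements must be hashable, adding a mutable object raises an error:
my_set = {1, 2, 3}
try:
    my_set.add([4, 5])  # Lists are mutable, hence unhashable
except TypeError as e:
    print(e)  # unhashable type: 'list'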
16. When would you use a tuple instead of a list?
Use tuples when:
- Data should be immutable (e.g., dictionary keys, constants).
- Memory efficiency or speed is critical.
- You need hashable objects.
Example:
coords = (10, 20) # Immutable coordinates
my_dict = {coords: "point"} # Valid key
17. Explain shallow vs. deep copy in list operations.
- Shallow Copy: Copies the top-level structure but shares references to nested objects.
- Deep Copy: Recursively copies all objects, creating independent copies.
Example:
import copy
lst = [[1, 2], 3]
shallow = copy.copy(lst)
deep = copy.deepcopy(lst)
shallow[0][0] = 99
print(lst) # [[99, 2], 3] (nested list modified)
print(deep) # [[1, 2], 3] (independent)
Module 4: Strings & Regex
18. Convert "Hello World" to "hELLO wORLD" using string methods.
text = "Hello World"
result = text.swapcase()
print(result) # hELLO wORLD
Explanation: swapcase() converts uppercase to lowercase and vice versa.
19. Write a regex pattern to validate email addresses.
import re
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
email = "user@example.com"
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
Explanation:
- [a-zA-Z0-9._%+-]+: Username allows letters, digits, and specific symbols.
- @: Literal @.
- [a-zA-Z0-9.-]+: Domain name.
- \.[a-zA-Z]{2,}: Top-level domain (e.g., .com).
20. Compare f-strings, str.format(), and % formatting in terms of readability and performance.
- f-strings (Python 3.6+): Most readable and fastest. Syntax: f"{var}".
- str.format(): Flexible but less concise. Syntax: "{} {}".format(a, b).
- % formatting: Legacy and error-prone. Syntax: "%s" % var.
Example:
name = "Alice"
age = 30
print(f"{name} is {age}") # f-string
print("{} is {}".format(name, age)) # str.format
print("%s is %d" % (name, age)) # % formatting
Performance: f-strings are typically ~10-20% faster because the formatting is parsed at compile time rather than dispatched through a method call at runtime. Readability: f-strings are clearest, especially for complex expressions.
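A rough local benchmark of the three styles (results vary by interpreter version):
import timeit
setup = "name, age = 'Alice', 30"
print(timeit.timeit("f'{name} is {age}'", setup=setup))
print(timeit.timeit("'{} is {}'.format(name, age)", setup=setup))
print(timeit.timeit("'%s is %d' % (name, age)", setup=setup))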
Module 5: File Handling
21. How does Python handle large CSV files (10GB+) efficiently?
Use pandas.read_csv() with chunksize or dask.dataframe for out-of-memory processing. Alternatively, use csv.reader with generators for low-level control.
Example (Pandas chunks):
import pandas as pd
for chunk in pd.read_csv("large_file.csv", chunksize=100000):
    process_chunk(chunk)  # Process each chunk (user-defined function)
Dask Example:
import dask.dataframe as dd
df = dd.read_csv("large_file.csv")
result = df.groupby("column").sum().compute()
Tips: Use compression (e.g., gzip), filter columns, and optimize data types.
22. Write code to read a JSON file and extract specific keys.
import json
with open("data.json", "r") as file:
    data = json.load(file)
keys = ["name", "age"]
extracted = {k: data[k] for k in keys if k in data}
print(extracted)
Explanation: json.load() parses the JSON file into a dictionary, and a dictionary comprehension extracts the specified keys.
23. What is the difference between 'r+' and 'w+' file modes?
- r+: Opens file for reading and writing. File must exist, and writing starts at the current position (default: start).
- w+: Opens file for reading and writing. Creates a new file or truncates an existing one.
Example:
with open("file.txt", "w+") as f:
    f.write("Hello")
    f.seek(0)
    print(f.read())  # Hello (file created)
with open("file.txt", "r+") as f:
    f.write("Hi")  # Overwrites the first two characters
    f.seek(0)
    print(f.read())  # Hillo ("Hi" replaced "He")
Module 6: Exception Handling
24. Write a custom exception class for validating user passwords.
class PasswordError(Exception):
    pass
def validate_password(password):
    if len(password) < 8:
        raise PasswordError("Password must be at least 8 characters")
    if not any(c.isdigit() for c in password):
        raise PasswordError("Password must contain a digit")
try:
    validate_password("abc")
except PasswordError as e:
    print(e)
Explanation: The custom PasswordError class inherits from Exception and is raised with specific validation messages.
25. When should you use finally vs. else in a try block?
- else: Executes if no exception is raised in the try block. Useful for code that depends on successful execution.
- finally: Always executes, regardless of exceptions. Ideal for cleanup (e.g., closing files).
Example:
try:
    x = 1 / 1
except ZeroDivisionError:
    print("Division by zero")
else:
    print("No error")  # Runs if no exception
finally:
    print("Cleanup")  # Always runs
26. How would you debug a MemoryError in a data-heavy script?
- Profile Memory: Use tracemalloc or memory_profiler to identify high-memory objects.
- Optimize Data Structures: Use generators, chunked processing (e.g., Pandas chunksize), or compact types (e.g., int8 in NumPy).
- Scale Out: Move to distributed systems (Dask, PySpark) or cloud resources.
Example (tracemalloc):
import tracemalloc
tracemalloc.start()
# Your code
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
print(top_stats[:3]) # Top memory consumers
Module 7: OOP
27. Explain the __init__ method and its use cases.
The __init__ method is a constructor that initializes an instance’s attributes when an object is created. It’s called automatically during instantiation.
Example:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p = Person("Alice", 30)
print(p.name, p.age)  # Alice 30
Use Cases: Setting initial state, validating inputs, or establishing object dependencies.
28. Design a class hierarchy for a ride-sharing app (e.g., Driver, Rider).
class User:
    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name
class Driver(User):
    def __init__(self, user_id, name, vehicle):
        super().__init__(user_id, name)
        self.vehicle = vehicle
        self.is_available = True
    def accept_ride(self):
        self.is_available = False
class Rider(User):
    def __init__(self, user_id, name, payment_method):
        super().__init__(user_id, name)
        self.payment_method = payment_method
    def request_ride(self, destination):
        print(f"{self.name} requested a ride to {destination}")
driver = Driver(1, "Bob", "Toyota")
rider = Rider(2, "Alice", "Visa")
rider.request_ride("Downtown")
Explanation: User is the base class with shared attributes. Driver and Rider inherit and add specific functionality.
29. How does polymorphism work in Python with dynamic typing?
Polymorphism allows objects of different classes to be treated as instances of a common interface or superclass. Python’s dynamic typing enables this via duck typing—if an object supports the required methods, it can be used interchangeably.
Example:
class Car:
    def move(self):
        return "Driving"
class Bike:
    def move(self):
        return "Cycling"
def travel(vehicle):
    print(vehicle.move())
travel(Car())  # Driving
travel(Bike())  # Cycling
Explanation: The travel function works with any object that has a move method, regardless of its class.
30. What is the purpose of @property decorators?
The @property decorator allows methods to be accessed like attributes, enabling getter/setter logic without changing the interface.
Example:
class Circle:
    def __init__(self, radius):
        self._radius = radius
    @property
    def radius(self):
        return self._radius
    @radius.setter
    def radius(self, value):
        if value < 0:
            raise ValueError("Radius cannot be negative")
        self._radius = value
c = Circle(5)
print(c.radius)  # 5
c.radius = 10  # Calls setter
print(c.radius)  # 10
Module 8: Modules & Libraries
31. How does if __name__ == "__main__" work?
It checks whether the module is being run directly rather than imported. __name__ is set to "__main__" when a script is executed directly, allowing conditional execution of code.
Example:
# my_script.py
def func():
    print("Function")
if __name__ == "__main__":
    func()  # Runs only if script is executed directly
Use Case: Prevents code from running when the module is imported.
32. Write code to generate a random password using the secrets module.
import secrets
import string
def generate_password(length=12):
    characters = string.ascii_letters + string.digits + string.punctuation
    return "".join(secrets.choice(characters) for _ in range(length))
print(generate_password())  # e.g., "kX#9mP$2nL@q"
Explanation: secrets is cryptographically secure, unlike random, making it suitable for passwords.
33. How would you schedule a Python script to run daily using os or subprocess?
Use the schedule library for in-process scheduling, or a system scheduler (e.g., cron on Linux); subprocess can install the cron entry.
Example (cron):
import subprocess
script_path = "/path/to/script.py"
cron_job = f"0 0 * * * /usr/bin/python3 {script_path}\n"
with open("cron_jobs", "w") as f:
    f.write(cron_job)
subprocess.run(["crontab", "cron_jobs"])  # Note: replaces the user's existing crontab
Explanation: Schedules the script to run at midnight daily. Alternatively, use schedule for in-Python scheduling.
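A minimal in-Python sketch with the schedule library (assumes pip install schedule; job is a placeholder function):
import schedule
import time
def job():
    print("Running daily task")
schedule.every().day.at("00:00").do(job)
while True:
    schedule.run_pending()
    time.sleep(60)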
Module 9: Data Science Libraries
34. Optimize a Pandas DataFrame to reduce memory usage.
- Downcast Numeric Types: Use int8, float32, etc.
- Use Categorical Types: For columns with few unique values.
- Drop Unnecessary Columns: Remove unused data.
Example:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "x"]})
df["A"] = df["A"].astype("int8")
df["B"] = df["B"].astype("category")
print(df.memory_usage(deep=True))
35. How does NumPy’s broadcasting improve computation speed?
Broadcasting allows NumPy to perform operations on arrays of different shapes by automatically expanding smaller arrays to match the larger one’s shape, avoiding explicit loops.
Example:
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([10, 20])
result = a + b # b is broadcasted to [[10, 20], [10, 20]]
print(result) # [[11, 22], [13, 24]]
Speed: Broadcasting uses optimized C loops, reducing Python overhead.
36. Handle missing values in a DataFrame without dropping rows.
Use fillna() for imputation (mean, median, forward-fill) or interpolation.
Example:
import pandas as pd
df = pd.DataFrame({"A": [1, None, 3], "B": [4, 5, None]})
df["A"] = df["A"].fillna(df["A"].mean())
df["B"] = df["B"].interpolate()
print(df)
37. Create a heatmap using Seaborn to visualize correlation matrices.
import pandas as pd
import seaborn as sns
import numpy as np
df = pd.DataFrame(np.random.rand(5, 5), columns=list("ABCDE"))
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
Explanation: corr() computes the correlation matrix, and sns.heatmap() visualizes it with annotations.
38. Compare loc and iloc in Pandas.
- loc: Accesses data by label (row/column names).
- iloc: Accesses data by integer position.
Example:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])
print(df.loc["x", "A"])  # 1 (label-based)
print(df.iloc[0, 0])  # 1 (position-based)
Module 10: Advanced Python
39. Write a decorator to log function execution time.
import time
import functools
def timer(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.2f} seconds")
        return result
    return wrapper
@timer
def slow_func():
    time.sleep(1)
slow_func()  # slow_func took 1.00 seconds
40. How do generators save memory compared to lists?
Generators yield one item at a time, avoiding storing the entire sequence in memory. Lists store all elements upfront.
Example:
def gen_nums(n):
    for i in range(n):
        yield i
g = gen_nums(1000000)  # Minimal memory
lst = list(range(1000000))  # High memory
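Continuing the example, sys.getsizeof makes the gap concrete (exact numbers are platform-dependent):
import sys
print(sys.getsizeof(g))    # ~100-200 bytes: just the generator object
print(sys.getsizeof(lst))  # ~8,000,000 bytes: references to every element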
41. Explain the GIL and its impact on multithreading.
The Global Interpreter Lock (GIL) is a mutex in CPython that prevents multiple native threads from executing Python bytecodes simultaneously. It simplifies memory management but limits true parallelism in CPU-bound tasks.
Impact: Multithreading is effective for I/O-bound tasks (e.g., network calls) but not CPU-bound tasks (e.g., computations). Use multiprocessing for CPU parallelism.
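A minimal sketch to observe the GIL's effect (timings are machine-dependent; on a multi-core box the process pool should finish this CPU-bound work much faster than the thread pool):
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
def cpu_task(n):
    return sum(i * i for i in range(n))
if __name__ == "__main__":
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.time()
        with pool_cls(max_workers=4) as ex:
            list(ex.map(cpu_task, [2_000_000] * 4))
        print(pool_cls.__name__, time.time() - start)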
42. When would you use multiprocessing over multithreading?
Use multiprocessing for:
- CPU-bound tasks (e.g., data processing, ML training) to bypass the GIL.
- True parallelism across multiple CPU cores.
Use multithreading for I/O-bound tasks (e.g., file reading, API calls).
Example (multiprocessing):
from multiprocessing import Pool
def square(n):
    return n * n
if __name__ == "__main__":  # Guard required on platforms that spawn workers
    with Pool(4) as p:
        results = p.map(square, [1, 2, 3, 4])
    print(results)  # [1, 4, 9, 16]
Practical Scenarios & Trends (2025)
43. Design a REST API in Python for a real-time recommendation system.
Use FastAPI for its async support and automatic OpenAPI documentation.
from fastapi import FastAPI
import numpy as np
app = FastAPI()
@app.get("/recommend/{user_id}")
async def recommend(user_id: int):
    # Mock recommendation logic
    items = np.random.choice(["item1", "item2", "item3"], size=3)
    return {"user_id": user_id, "recommendations": list(items)}
Explanation: The API takes a user_id, generates mock recommendations, and returns JSON. Deploy with Uvicorn for production.
44. How would you deploy a machine learning model using Flask/FastAPI?
- Train model and save it (e.g., joblib).
- Create a REST API with Flask/FastAPI.
- Deploy on a cloud platform (e.g., AWS EC2, Heroku).
Example (FastAPI):
from fastapi import FastAPI
import joblib
app = FastAPI()
model = joblib.load("model.pkl")
@app.post("/predict")
async def predict(data: dict):
    features = data["features"]
    prediction = model.predict([features])[0]
    return {"prediction": prediction}
45. Optimize a Python script for parallel processing on AWS Lambda.
- Split tasks into independent chunks.
- Use AWS Lambda’s event-driven model with SQS or Step Functions.
- Use boto3 to trigger parallel Lambda invocations.
Example:
import json
import boto3
lambda_client = boto3.client("lambda")
def lambda_handler(event, context):
    tasks = event["tasks"]
    for task in tasks:
        lambda_client.invoke(
            FunctionName="ParallelTask",
            InvocationType="Event",
            Payload=json.dumps({"task": task}),
        )
    return {"status": "Tasks dispatched"}
46. Implement a context manager for database connections.
from contextlib import contextmanager
import sqlite3
@contextmanager
def db_connection(db_name):
    conn = sqlite3.connect(db_name)
    try:
        yield conn
    finally:
        conn.close()
with db_connection("example.db") as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users")  # assumes a users table exists
    results = cursor.fetchall()
Explanation: The context manager ensures the connection is closed even if an error occurs.
47. How do you secure sensitive data in environment variables?
Use the python-dotenv library to load environment variables from a .env file, and never commit the .env file to version control.
Example:
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv("API_KEY")
print(api_key)  # Securely accessed
Best Practices: Use secrets management services (e.g., AWS Secrets Manager) in production.
Problem-Solving & Algorithms
48. Reverse a linked list using Python.
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next
def reverse_list(head):
    prev = None
    current = head
    while current:
        next_node = current.next
        current.next = prev
        prev = current
        current = next_node
    return prev
49. Find the longest palindromic substring in a string.
def longest_palindrome(s):
    n = len(s)
    start = 0
    max_len = 1
    def expand_around_center(left, right):
        while left >= 0 and right < n and s[left] == s[right]:
            left -= 1
            right += 1
        return left + 1, right - left - 1
    for i in range(n):
        l1, len1 = expand_around_center(i, i)      # Odd length
        l2, len2 = expand_around_center(i, i + 1)  # Even length
        if len1 > max_len:
            start, max_len = l1, len1
        if len2 > max_len:
            start, max_len = l2, len2
    return s[start:start + max_len]
print(longest_palindrome("babad"))  # "bab" or "aba"
50. Implement a binary search algorithm.
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
arr = [2, 3, 4, 10, 40]
print(binary_search(arr, 10))  # 3
Behavioral & Best Practices
51. How do you ensure code readability in a team environment?
- Follow PEP8 style guidelines (use flake8 or black).
- Write clear docstrings and comments.
- Use meaningful variable/function names.
- Refactor complex logic into smaller functions.
- Use type hints for clarity.
Example:
def calculate_total_price(items: list[float], tax_rate: float) -> float:
    """Calculate total price with tax."""
    subtotal = sum(items)
    return subtotal * (1 + tax_rate)
52. Describe your approach to debugging a production outage.
- Reproduce: Identify the issue using logs (e.g., CloudWatch).
- Isolate: Check recent changes (code, config, data).
- Hypothesize: Use tools like pdb or logging to trace errors.
- Test Fixes: Apply in a staging environment.
- Monitor: Verify resolution and add alerts (e.g., Prometheus).
- Document: Record root cause and prevention steps.
53. What tools do you use for CI/CD in Python projects?
- CI: GitHub Actions, Jenkins, CircleCI for automated testing (pytest).
- CD: Deploy to AWS (CodePipeline), Heroku, or Docker containers.
- Linting/Type Checking: flake8, mypy.
- Example (GitHub Actions):
name: CI
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install pytest
      - run: pytest
Advanced Data Science (2025 Trends)
54. How would you use PySpark for big data processing?
PySpark processes large datasets across distributed clusters using Spark’s RDDs or DataFrames.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("BigData").getOrCreate()
df = spark.read.csv("large_data.csv", header=True, inferSchema=True)  # header/inferSchema so named columns exist
result = df.groupBy("category").agg({"value": "sum"})
result.write.csv("output")
spark.stop()
Use Case: Aggregations, joins, or ML on terabyte-scale data.
55. Explain the role of Python in MLOps pipelines.
Python is central to MLOps for:
- Model Training: Scikit-learn, TensorFlow, PyTorch.
- Orchestration: Airflow for scheduling.
- Monitoring: MLflow for tracking experiments.
- Deployment: FastAPI, AWS SageMaker.
Example: Use MLflow to log model metrics.
import mlflow
with mlflow.start_run():
    mlflow.log_param("model", "xgboost")
    mlflow.log_metric("accuracy", 0.95)
56. Integrate LLMs (e.g., GPT-4) into a Python workflow.
Use Hugging Face’s transformers or OpenAI’s API.
Example (Hugging Face):
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")
output = generator("Hello, world!", max_length=50)
print(output[0]["generated_text"])
Use Case: Text generation, summarization, or chatbots.
57. Handle real-time data streams with Python (e.g., Kafka, Apache Flink).
Use kafka-python for Kafka integration.
Example:
from kafka import KafkaConsumer
consumer = KafkaConsumer("topic", bootstrap_servers="localhost:9092")
for message in consumer:
    print(message.value.decode("utf-8"))
Flink: Use pyflink for stream processing on large-scale data.
System Design
58. Design a caching system for a high-traffic web app.
Use Redis for in-memory caching with TTL.
Components:
- Client: Sends requests to the app.
- App: Checks Redis cache; if miss, queries DB and caches result.
- Redis: Stores key-value pairs.
Example:
import redis
cache = redis.Redis(host="localhost", port=6379)
def get_data(key):
    if cached := cache.get(key):
        return cached.decode()
    data = query_db(key)  # Expensive DB call (user-defined)
    cache.setex(key, 3600, data)  # Cache for 1 hour
    return data
59. How would you architect a scalable ETL pipeline using Airflow?
- Extract: Pull data from APIs/databases (Pandas, SQLAlchemy).
- Transform: Clean and aggregate (Pandas, PySpark).
- Load: Store in a data warehouse (Snowflake, BigQuery).
- Airflow: Schedule and monitor DAGs.
Example DAG:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def extract(): pass
def transform(): pass
def load(): pass
with DAG("etl_pipeline", start_date=datetime(2025, 1, 1), schedule_interval="@daily") as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
Module 10: Advanced Python Concepts (Continued)
60. Write a thread-safe singleton class using __new__.
import threading
class Singleton:
    _instance = None
    _lock = threading.Lock()
    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance
61. How does __slots__ optimize memory in Python classes?
__slots__ defines a fixed set of attributes, preventing the creation of a __dict__ for each instance, reducing memory overhead.
Example:
class Point:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y
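Continuing the Point example, the saving is visible directly: a slotted instance has no per-instance __dict__, which is the usual source of the overhead.
class PlainPoint:  # same class without __slots__, for comparison
    def __init__(self, x, y):
        self.x = x
        self.y = y
print(PlainPoint(1, 2).__dict__)         # {'x': 1, 'y': 2}
print(hasattr(Point(1, 2), "__dict__"))  # False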
62. Implement a generator to yield Fibonacci numbers indefinitely.
def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b
fib = fibonacci()
for _ in range(5):
    print(next(fib))  # 0, 1, 1, 2, 3
63. What are context managers? Write one for handling database connections.
Context managers handle setup/teardown (e.g., opening/closing resources) using __enter__ and __exit__.
import sqlite3
class DatabaseConnection:
    def __init__(self, db_name):
        self.db_name = db_name
    def __enter__(self):
        self.conn = sqlite3.connect(self.db_name)
        return self.conn
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.conn.close()
with DatabaseConnection("example.db") as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users")  # assumes a users table exists
64. Explain the use of asyncio for asynchronous programming.
asyncio enables non-blocking I/O operations using async/await. Ideal for I/O-bound tasks (e.g., API calls).
Example:
import asyncio
async def fetch_data():
    await asyncio.sleep(1)  # Simulate I/O
    return "Data"
async def main():
    result = await fetch_data()
    print(result)
asyncio.run(main())
Module 11: Hands-on Projects
65. Design a CLI tool to scrape website data and save it to CSV.
import click
import requests
from bs4 import BeautifulSoup
import pandas as pd
@click.command()
@click.option("--url", default="https://example.com", help="Website URL")
def scrape(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    data = [{"text": p.text} for p in soup.find_all("p")]
    pd.DataFrame(data).to_csv("output.csv", index=False)
    print("Saved to output.csv")
if __name__ == "__main__":
    scrape()
66. How would you build a sentiment analysis model using Python libraries?
Use transformers for a pre-trained model.
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
text = "I love this product!"
result = classifier(text)
print(result)  # [{'label': 'POSITIVE', 'score': 0.999}]
67. Create a Flask API to predict housing prices using a trained model.
from flask import Flask, request, jsonify
import joblib
app = Flask(__name__)
model = joblib.load("housing_model.pkl")
@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["features"]
    prediction = model.predict([data])[0]
    return jsonify({"price": prediction})
if __name__ == "__main__":
    app.run()
Data Structures & Algorithms
68. Implement a stack using Python lists.
class Stack:
    def __init__(self):
        self.items = []
    def push(self, item):
        self.items.append(item)
    def pop(self):
        return self.items.pop()
    def peek(self):
        return self.items[-1]
    def is_empty(self):
        return len(self.items) == 0
69. Write code to detect cycles in a linked list.
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next
def has_cycle(head):
    slow = fast = head
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        if slow == fast:
            return True
    return False
70. Solve the “Two Sum” problem with optimal time complexity.
def two_sum(nums, target):
    seen = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
    return []
print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
Complexity: O(n) using a hash map.
71. Traverse a binary tree in pre-order without recursion.
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right
def preorder_traversal(root):
    if not root:
        return []
    result = []
    stack = [root]
    while stack:
        node = stack.pop()
        result.append(node.val)
        if node.right:
            stack.append(node.right)
        if node.left:
            stack.append(node.left)
    return result
72. Sort a list of dictionaries by a specific key.
data = [{"name": "Bob", "age": 30}, {"name": "Alice", "age": 25}]
sorted_data = sorted(data, key=lambda x: x["age"])
print(sorted_data)  # [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]
Advanced Data Science & Libraries
73. How does Pandas handle categorical data? When is it useful?
Pandas uses the category dtype to store data with few unique values, reducing memory and speeding up operations like grouping.
Example:
import pandas as pd
df = pd.DataFrame({"color": ["red", "blue", "red"]})
df["color"] = df["color"].astype("category")
Use Case: Columns with repetitive values (e.g., gender, status).
74. Reshape a NumPy array from (3, 4) to (4, 3) without copying data.
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
reshaped = arr.T # Transpose, no copy
print(reshaped.shape) # (4, 3)
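To confirm no data was copied, NumPy can check whether both arrays share one buffer:
print(np.shares_memory(arr, reshaped))  # True: the transpose is a view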
75. Handle imbalanced datasets in a classification problem.
- Resampling: Oversample minority (SMOTE) or undersample majority.
- Class Weights: Adjust weights in the model (e.g., Scikit-learn’s class_weight).
- Evaluation Metrics: Use precision, recall, or F1-score instead of accuracy.
Example (SMOTE):
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X, y)
76. Use groupby in Pandas to calculate aggregate metrics.
import pandas as pd
df = pd.DataFrame({"category": ["A", "A", "B"], "value": [10, 20, 30]})
result = df.groupby("category")["value"].mean()
print(result)
# category
# A 15.0
# B 30.0
77. Optimize a Pandas apply function with vectorization.
Replace apply with vectorized operations for speed.
Example:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3]})
# Slow: Python-level function call per row
df["B"] = df["A"].apply(lambda x: x * 2)
# Fast: vectorized operation in C
df["B"] = df["A"] * 2
OOP & Design Patterns
78. Implement the Factory pattern for a logging system.
class Logger:
    def log(self, message):
        pass
class ConsoleLogger(Logger):
    def log(self, message):
        print(f"Console: {message}")
class FileLogger(Logger):
    def log(self, message):
        with open("log.txt", "a") as f:
            f.write(f"File: {message}\n")
class LoggerFactory:
    @staticmethod
    def create_logger(logger_type):
        if logger_type == "console":
            return ConsoleLogger()
        elif logger_type == "file":
            return FileLogger()
        raise ValueError("Unknown logger")
logger = LoggerFactory.create_logger("console")
logger.log("Test")
79. What is the Observer pattern? Write a Python example.
The Observer pattern allows objects (observers) to be notified of changes in a subject.
class Subject:
    def __init__(self):
        self._observers = []
    def attach(self, observer):
        self._observers.append(observer)
    def notify(self, message):
        for observer in self._observers:
            observer.update(message)
class Observer:
    def update(self, message):
        print(f"Received: {message}")
subject = Subject()
observer = Observer()
subject.attach(observer)
subject.notify("Event occurred")
80. Design a RESTful class hierarchy using inheritance.
class Resource:
    def get(self, id):
        return {"id": id}
class UserResource(Resource):
    def get(self, id):
        return {"id": id, "type": "user"}
class ProductResource(Resource):
    def get(self, id):
        return {"id": id, "type": "product"}
Exception Handling & Debugging
81. Write a decorator to retry failed API calls 3 times.
import time
import functools
def retry(max_attempts=3, delay=1):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
@retry()
def fetch_api():
    raise ValueError("API failed")
82. How would you log exceptions to a file in production?
Use the logging module with a file handler.
import logging
logging.basicConfig(filename="app.log", level=logging.ERROR)
try:
    1 / 0
except Exception:
    logging.exception("An error occurred")
83. Debug a memory leak in a long-running Python process.
- Use tracemalloc to track allocations.
- Check for unclosed resources (files, connections).
- Monitor with psutil for process memory usage.
Example:
import tracemalloc
tracemalloc.start()
# Code under investigation
snapshot = tracemalloc.take_snapshot()
print(snapshot.statistics("lineno")[:3])
Cloud & DevOps Integration
84. Containerize a Python app using Docker.
Dockerfile:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
Build/Run:
docker build -t my-app .
docker run -p 5000:5000 my-app
85. Automate deployment to AWS EC2 using GitHub Actions.
Workflow:
name: Deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to EC2
        env:
          EC2_HOST: ${{ secrets.EC2_HOST }}
          EC2_USER: ${{ secrets.EC2_USER }}
          EC2_KEY: ${{ secrets.EC2_KEY }}
        run: |
          echo "$EC2_KEY" > key.pem
          chmod 400 key.pem
          ssh -i key.pem $EC2_USER@$EC2_HOST 'bash -s' < deploy.sh
86. Monitor a Python microservice’s performance with Prometheus.
Use prometheus_client to expose metrics.
from flask import Flask
from prometheus_client import Counter, start_http_server
app = Flask(__name__)  # Flask app shown explicitly for a self-contained example
requests_total = Counter("requests_total", "Total requests")
start_http_server(8000)  # Metrics exposed on :8000/metrics
@app.route("/endpoint")
def endpoint():
    requests_total.inc()
    return "OK"
2025 Trends & Tools
87. How would you use Python for quantum computing simulations?
Use Qiskit for quantum circuit simulations. (The snippet below uses the pre-1.0 API; Qiskit 1.0+ moved Aer into the separate qiskit-aer package and replaced execute with transpile/run.)
from qiskit import QuantumCircuit, Aer, execute
qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])
simulator = Aer.get_backend("qasm_simulator")
result = execute(qc, simulator, shots=1024).result()
print(result.get_counts())
88. Implement a real-time dashboard with Plotly Dash.
from dash import Dash, html, dcc
import plotly.express as px
app = Dash(__name__)
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length")
app.layout = html.Div([dcc.Graph(figure=fig)])
app.run_server(debug=True)  # Newer Dash releases use app.run(debug=True)
89. Fine-tune a Hugging Face transformer model for text generation.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Fine-tune with a prepared (tokenized) dataset
trainer = Trainer(model=model, train_dataset=dataset)  # dataset defined elsewhere
trainer.train()
90. Use Python to automate infrastructure provisioning with Terraform.
Use python-terraform to interact with Terraform.
from python_terraform import Terraform
tf = Terraform(working_dir=".")
tf.init()
tf.apply(auto_approve=True)
System Design & Scalability
91. Design a distributed task queue using Celery.
Components:
- Celery Workers: Process tasks.
- Broker: RabbitMQ/Redis for task queuing.
- Backend: Store results.
Example:
from celery import Celery
app = Celery("tasks", broker="redis://localhost:6379", backend="redis://localhost:6379")
@app.task
def process_data(data):
    return data * 2
92. Architect a serverless data pipeline with AWS Lambda.
- Trigger: S3 file upload.
- Lambda: Process data (e.g., Pandas).
- Output: Store in DynamoDB.
Example Lambda:
import json
import boto3
def lambda_handler(event, context):
    s3 = boto3.client("s3")
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    # Process the uploaded file here
    return {"status": "Processed"}
93. How would you shard a MySQL database for a Python app?
- Strategy: Partition data by key (e.g., user_id).
- Implementation: Use SQLAlchemy with multiple database connections.
- Example: Route queries based on shard key.
from sqlalchemy import create_engine
shards = {0: create_engine("mysql://user:pass@host1/db"), 1: create_engine("mysql://user:pass@host2/db")}
def get_shard(user_id):
    return shards[user_id % 2]
Security & Ethics
94. Prevent SQL injection in raw SQL queries with Python.
Use parameterized queries with SQLAlchemy or mysql-connector.
import sqlite3
conn = sqlite3.connect("example.db")
cursor = conn.cursor()
user_input = "malicious; DROP TABLE users;"
cursor.execute("SELECT * FROM users WHERE name = ?", (user_input,))
95. Implement rate limiting in a Flask API.
Use flask-limiter.
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["100 per day"])
@app.route("/endpoint")
@limiter.limit("5 per minute")
def endpoint():
    return "OK"
96. How do you ensure GDPR compliance in data processing scripts?
- Consent: Store user consent for data processing.
- Anonymization: Remove PII (e.g., names, emails).
- Data Deletion: Implement mechanisms to delete user data on request.
- Audit Logs: Track data access.
Example: Anonymize data with hashlib.
import hashlib
email = "user@example.com"
anonymized = hashlib.sha256(email.encode()).hexdigest()
Testing & Best Practices
97. Write unit tests for a function using pytest mocks.
# my_module.py (code under test; module name is illustrative)
def fetch_data():
    return "Real data"
def process():
    return fetch_data()

# test_my_module.py
from unittest.mock import patch
import my_module
@patch("my_module.fetch_data")  # patch where the name is looked up
def test_process(mock_fetch):
    mock_fetch.return_value = "Mocked data"
    assert my_module.process() == "Mocked data"
98. Use mypy for static type checking in a legacy codebase.
Add type hints and run mypy.
def add(a: int, b: int) -> int:
    return a + b
Command: mypy script.py
99. How do you enforce PEP8 compliance in a CI pipeline?
Use flake8 in GitHub Actions.
name: Lint
on: [push]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install flake8
      - run: flake8 .
Final Question
100. Build a Python pipeline to process 1TB of data on a budget. What tools and strategies would you use?
Strategy:
- Tools:
- Dask/PySpark: For distributed processing.
- AWS S3: Store raw data (cost: ~$0.023/GB/month).
- AWS EC2 Spot Instances: Cheap compute (~70% savings).
- Pandas: For small-scale prototyping.
- Pipeline:
- Extract: Read data from S3 in chunks (Dask).
- Transform: Filter, aggregate, or clean (e.g., remove nulls).
- Load: Save results back to S3 or a database.
- Optimization:
- Use compressed formats (Parquet).
- Partition data by key (e.g., date).
- Monitor costs with AWS Budgets.
Example (Dask):
import dask.dataframe as dd
df = dd.read_parquet("s3://bucket/data/*.parquet")
result = df.groupby("category").sum().compute()
result.to_parquet("s3://bucket/output/")
Cost Estimate: ~$50-100 for 1TB processing using spot instances and S3.