Module 1: Introduction to Data Science
1. Explain Python’s role in modern data science workflows.
Python is a cornerstone of data science due to its simplicity, versatility, and rich ecosystem of libraries. It supports data ingestion (e.g., Pandas, SQLAlchemy), preprocessing (NumPy, Pandas), visualization (Matplotlib, Seaborn, Plotly), machine learning (Scikit-learn, TensorFlow, PyTorch), and deployment (Flask, FastAPI). Its community-driven development ensures compatibility with modern tools like Jupyter notebooks for interactive analysis and cloud platforms (AWS, GCP) for scalable pipelines. Python’s flexibility allows seamless integration with big data frameworks (PySpark, Dask) and LLMs (Hugging Face), making it ideal for end-to-end workflows.
2. Compare Python with R for statistical computing. What are Python’s advantages?
- Python: General-purpose, with a broader ecosystem (e.g., Pandas, NumPy, Scikit-learn). Excels in production-grade applications, web integration (Flask, Django), and scalability (PySpark, Dask). Strong for deep learning (TensorFlow, PyTorch) and cloud deployments.
- R: Specialized for statistical analysis and visualization (ggplot2, dplyr). Preferred in academia for advanced statistical modeling but less flexible for non-statistical tasks.
- Python’s Advantages: Wider applicability, better integration with production systems, stronger support for modern ML/AI frameworks, and a larger developer community. Python’s performance optimizations (e.g., NumPy’s C-based operations) and scalability make it more versatile for 2025’s data science demands.
3. Describe the lifecycle of a data science project.
- Problem Definition: Identify business objectives and translate them into data science goals (e.g., predict churn).
- Data Collection: Gather data from APIs, databases, or web scraping.
- Data Cleaning: Handle missing values, outliers, and inconsistencies (Pandas, NumPy).
- Exploratory Data Analysis (EDA): Visualize trends and correlations (Seaborn, Matplotlib).
- Feature Engineering: Create or transform features to improve model performance.
- Modeling: Train and evaluate models (Scikit-learn, XGBoost, TensorFlow).
- Validation: Use cross-validation or train-test splits to assess model robustness.
- Deployment: Integrate models into production (Flask, AWS Lambda, FastAPI).
- Monitoring: Track model performance and retrain as needed (MLOps tools like MLflow).
- Communication: Present insights to stakeholders via reports or dashboards (Plotly Dash).
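A compressed sketch of the middle stages (data prep through validation) on synthetic churn-style data; the column names and model choice here are illustrative only, not prescriptive:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Collection (here: tiny synthetic churn-like data)
df = pd.DataFrame({"tenure": [1, 24, 5, 36, 2, 48],
                   "monthly_fee": [70, 30, 80, 25, 90, 20],
                   "churned": [1, 0, 1, 0, 1, 0]})
# Cleaning and feature engineering would normally happen here
X, y = df[["tenure", "monthly_fee"]], df["churned"]
# Modeling and validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))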
Module 2: Python Basics
4. What is the difference between is and == in Python?
- ==: Checks for value equality. Compares the values of two objects.
- is: Checks for identity equality. Verifies if two objects point to the same memory location.
Example:
a = [1, 2, 3]
b = [1, 2, 3]
print(a == b) # True (same values)
print(a is b) # False (different objects)
c = a
print(a is c) # True (same memory location)
Note: Small integers (-5 to 256) and some strings may share memory due to Python’s interning, so is can be misleading for these.
5. Explain mutability with examples (e.g., lists vs. tuples).
- Mutable: Objects whose state can be changed after creation (e.g., lists, dictionaries).
- Immutable: Objects whose state cannot be changed (e.g., tuples, strings, integers).
Example:
# Mutable: List
my_list = [1, 2, 3]
my_list[0] = 10 # Modifies list
print(my_list) # [10, 2, 3]
# Immutable: Tuple
my_tuple = (1, 2, 3)
my_tuple[0] = 10 # Error: TypeError
Implication: Mutable objects are flexible but risk unintended changes; immutable objects are safer for thread-safe or hashable use (e.g., dictionary keys).
6. Write code to reverse a string using slicing.
text = "Hello, World!"
reversed_text = text[::-1]
print(reversed_text) # !dlroW ,olleH
Explanation: The slice [::-1] uses a step of -1 to traverse the string backward.
7. How does Python manage memory for immutable objects like integers?
Python uses object interning for small integers (-5 to 256) and some strings to save memory. Immutable objects cannot change, so Python reuses them when possible. When an immutable object is created, Python checks if it exists in memory; if so, it reuses the reference.
Example:
a = 42
b = 42
print(a is b) # True (interned integer)
a = 1000
b = 1000
print(a is b) # Typically False (not interned), though constant folding can make this True in a script
The garbage collector (using reference counting) deallocates objects when their reference count drops to zero.
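A small CPython-specific demonstration of reference counting (note that sys.getrefcount reports one extra reference for its own argument):
import sys
x = []
print(sys.getrefcount(x))  # 2: x plus the temporary argument reference
y = x
print(sys.getrefcount(x))  # 3: x, y, plus the argument reference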
8. When would you use a while loop instead of a for loop?
Use a while loop when:
- The number of iterations is unknown (e.g., based on a condition).
- You need to wait for an external event (e.g., user input, API response).
Example:
# While loop for unknown iterations
count = 0
while count < 5:
    print(count)
    count += 1
Use a for loop for known iterations (e.g., iterating over a list or range).
9. Write a lambda function to filter even numbers from a list.
numbers = [1, 2, 3, 4, 5, 6]
even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
print(even_numbers) # [2, 4, 6]
Explanation: The lambda x: x % 2 == 0 function returns True for even numbers, and filter() retains those elements.
10. What are the differences between positional, keyword, and arbitrary arguments?
- Positional Arguments: Passed in order, matched by position.
- Keyword Arguments: Passed with a name, order doesn’t matter.
- Arbitrary Arguments: Allow variable numbers of arguments (*args for positional, **kwargs for keyword).
Example:
def func(a, b, *args, key="default", **kwargs):
    print(a, b, args, key, kwargs)
func(1, 2, 3, 4, key="value", extra="data")
# Output: 1 2 (3, 4) value {'extra': 'data'}
- a, b: Positional.
- *args: Captures extra positional args as a tuple.
- key: Keyword arg with default value.
- **kwargs: Captures extra keyword args as a dictionary.
Module 3: Data Types & Utilities
11. How does list comprehension improve performance compared to loops?
List comprehensions are faster because they’re implemented in C and avoid Python’s interpreter overhead for loop operations. They also create the list in one step, reducing memory allocation overhead.
Example:
# Loop
squares = []
for i in range(1000):
    squares.append(i**2)
# List comprehension (faster)
squares = [i**2 for i in range(1000)]
Benchmark: Comprehensions can be 20-30% faster for large lists.
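A quick way to check the claim locally (absolute numbers vary by machine and Python version):
import timeit
loop = '''
squares = []
for i in range(1000):
    squares.append(i**2)
'''
comp = "squares = [i**2 for i in range(1000)]"
print(timeit.timeit(loop, number=10000))
print(timeit.timeit(comp, number=10000))  # Typically noticeably lower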
12. Explain backward indexing in Python with an example.
Backward indexing uses negative indices to access elements from the end of a sequence. -1 refers to the last element, -2 to the second-to-last, etc.
Example:
text = "Python"
print(text[-1]) # n
print(text[-3:]) # hon
13. Why are tuples faster than lists for certain operations?
Tuples are immutable, so Python allocates fixed memory, reducing overhead for resizing or modification. Tuples also have a simpler internal structure, making iteration and access slightly faster.
Example:
import timeit
print(timeit.timeit("for x in [1, 2, 3]: pass", number=1000000)) # ~0.07s
print(timeit.timeit("for x in (1, 2, 3): pass", number=1000000)) # ~0.05s
14. Write code to merge two dictionaries in Python (3.9+ syntax).
dict1 = {"a": 1, "b": 2}
dict2 = {"c": 3, "d": 4}
merged = dict1 | dict2 # Python 3.9+ merge operator
print(merged) # {'a': 1, 'b': 2, 'c': 3, 'd': 4}
Alternative (pre-3.9): merged = {**dict1, **dict2}, or dict1.update(dict2) to merge in place (modifies dict1).
15. How do sets ensure uniqueness of elements?
Sets use a hash table internally, where each element’s hash determines its position. Duplicate elements produce the same hash, so only one copy is stored. Elements must be immutable (e.g., numbers, strings, tuples) to be hashable.
Example:
my_set = {1, 2, 2, 3}
print(my_set) # {1, 2, 3}
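And because elements must be hashable, adding a mutable object raises an error:
my_set = {1, 2, 3}
try:
    my_set.add([4, 5])  # Lists are mutable, hence unhashable
except TypeError as e:
    print(e)  # unhashable type: 'list'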
16. When would you use a tuple instead of a list?
Use tuples when:
- Data should be immutable (e.g., dictionary keys, constants).
- Memory efficiency or speed is critical.
- You need hashable objects.
Example:
coords = (10, 20) # Immutable coordinates
my_dict = {coords: "point"} # Valid key
17. Explain shallow vs. deep copy in list operations.
- Shallow Copy: Copies the top-level structure but shares references to nested objects.
- Deep Copy: Recursively copies all objects, creating independent copies.
Example:
import copy
lst = [[1, 2], 3]
shallow = copy.copy(lst)
deep = copy.deepcopy(lst)
shallow[0][0] = 99
print(lst) # [[99, 2], 3] (nested list modified)
print(deep) # [[1, 2], 3] (independent)
Module 4: Strings & Regex
18. Convert "Hello World" to "hELLO wORLD" using string methods.
text = "Hello World"
result = text.swapcase()
print(result) # hELLO wORLD
Explanation: swapcase() converts uppercase to lowercase and vice versa.
19. Write a regex pattern to validate email addresses.
import re
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
email = "user@example.com"
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
Explanation:
- [a-zA-Z0-9._%+-]+: Username allows letters, digits, and specific symbols.
- @: Literal @.
- [a-zA-Z0-9.-]+: Domain name.
- \.[a-zA-Z]{2,}: Top-level domain (e.g., .com).
20. Compare f-strings, str.format(), and % formatting in terms of readability and performance.
- f-strings (Python 3.6+): Most readable and fastest. Syntax: f"{var}".
- str.format(): Flexible but less concise. Syntax: "{} {}".format(a, b).
- % formatting: Legacy and error-prone. Syntax: "%s" % var.
Example:
name = "Alice"
age = 30
print(f"{name} is {age}") # f-string
print("{} is {}".format(name, age)) # str.format
print("%s is %d" % (name, age)) # % formatting
Performance: f-strings are typically ~10-20% faster because the formatting is parsed at compile time rather than dispatched through a method call at runtime. Readability: f-strings are clearest, especially for complex expressions.
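A rough local benchmark of the three styles (results vary by interpreter version):
import timeit
setup = "name, age = 'Alice', 30"
print(timeit.timeit("f'{name} is {age}'", setup=setup))
print(timeit.timeit("'{} is {}'.format(name, age)", setup=setup))
print(timeit.timeit("'%s is %d' % (name, age)", setup=setup))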
Module 5: File Handling
21. How does Python handle large CSV files (10GB+) efficiently?
Use pandas.read_csv() with chunksize or dask.dataframe for out-of-memory processing. Alternatively, use csv.reader with generators for low-level control.
Example (Pandas chunks):
import pandas as pd
for chunk in pd.read_csv("large_file.csv", chunksize=100000):
    process_chunk(chunk)  # Process each chunk (user-defined function)
Dask Example:
import dask.dataframe as dd
df = dd.read_csv("large_file.csv")
result = df.groupby("column").sum().compute()
Tips: Use compression (e.g., gzip), filter columns, and optimize data types.
22. Write code to read a JSON file and extract specific keys.
import json
with open("data.json", "r") as file:
    data = json.load(file)
keys = ["name", "age"]
extracted = {k: data[k] for k in keys if k in data}
print(extracted)
Explanation: json.load() parses the JSON file into a dictionary, and a dictionary comprehension extracts the specified keys.
23. What is the difference between 'r+' and 'w+' file modes?
- r+: Opens file for reading and writing. File must exist, and writing starts at the current position (default: start).
- w+: Opens file for reading and writing. Creates a new file or truncates an existing one.
Example:
with open("file.txt", "w+") as f:
    f.write("Hello")
    f.seek(0)
    print(f.read())  # Hello (file created)
with open("file.txt", "r+") as f:
    f.write("Hi")  # Overwrites the first two characters
    f.seek(0)
    print(f.read())  # Hillo ("Hi" replaced "He")
Module 6: Exception Handling
24. Write a custom exception class for validating user passwords.
class PasswordError(Exception):
    pass
def validate_password(password):
    if len(password) < 8:
        raise PasswordError("Password must be at least 8 characters")
    if not any(c.isdigit() for c in password):
        raise PasswordError("Password must contain a digit")
try:
    validate_password("abc")
except PasswordError as e:
    print(e)
Explanation: The custom PasswordError class inherits from Exception and is raised with specific validation messages.
25. When should you use finally vs. else in a try block?
- else: Executes if no exception is raised in the try block. Useful for code that depends on successful execution.
- finally: Always executes, regardless of exceptions. Ideal for cleanup (e.g., closing files).
Example:
try:
    x = 1 / 1
except ZeroDivisionError:
    print("Division by zero")
else:
    print("No error")  # Runs if no exception
finally:
    print("Cleanup")  # Always runs
26. How would you debug a MemoryError in a data-heavy script?
- Profile Memory: Use tracemalloc or memory_profiler to identify high-memory objects.
- Optimize Data Structures: Use generators, chunked processing (e.g., Pandas chunksize), or compact types (e.g., int8 in NumPy).
- Scale Out: Move to distributed systems (Dask, PySpark) or cloud resources.
Example (tracemalloc):
import tracemalloc
tracemalloc.start()
# Your code
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
print(top_stats[:3]) # Top memory consumers
Module 7: OOP
27. Explain the __init__ method and its use cases.
The __init__ method is a constructor that initializes an instance’s attributes when an object is created. It’s called automatically during instantiation.
Example:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p = Person("Alice", 30)
print(p.name, p.age)  # Alice 30
Use Cases: Setting initial state, validating inputs, or establishing object dependencies.
28. Design a class hierarchy for a ride-sharing app (e.g., Driver, Rider).
class User:
    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name
class Driver(User):
    def __init__(self, user_id, name, vehicle):
        super().__init__(user_id, name)
        self.vehicle = vehicle
        self.is_available = True
    def accept_ride(self):
        self.is_available = False
class Rider(User):
    def __init__(self, user_id, name, payment_method):
        super().__init__(user_id, name)
        self.payment_method = payment_method
    def request_ride(self, destination):
        print(f"{self.name} requested a ride to {destination}")
driver = Driver(1, "Bob", "Toyota")
rider = Rider(2, "Alice", "Visa")
rider.request_ride("Downtown")
Explanation: User is the base class with shared attributes. Driver and Rider inherit and add specific functionality.
29. How does polymorphism work in Python with dynamic typing?
Polymorphism allows objects of different classes to be treated as instances of a common interface or superclass. Python’s dynamic typing enables this via duck typing—if an object supports the required methods, it can be used interchangeably.
Example:
class Car:
    def move(self):
        return "Driving"
class Bike:
    def move(self):
        return "Cycling"
def travel(vehicle):
    print(vehicle.move())
travel(Car())  # Driving
travel(Bike())  # Cycling
Explanation: The travel function works with any object that has a move method, regardless of its class.
30. What is the purpose of @property decorators?
The @property decorator allows methods to be accessed like attributes, enabling getter/setter logic without changing the interface.
Example:
class Circle:
    def __init__(self, radius):
        self._radius = radius
    @property
    def radius(self):
        return self._radius
    @radius.setter
    def radius(self, value):
        if value < 0:
            raise ValueError("Radius cannot be negative")
        self._radius = value
c = Circle(5)
print(c.radius)  # 5
c.radius = 10  # Calls setter
print(c.radius)  # 10
Module 8: Modules & Libraries
31. How does if __name__ == "__main__" work?
It checks whether the module is being run directly rather than imported. __name__ is set to "__main__" when a script is executed directly, allowing conditional execution of code.
Example:
# my_script.py
def func():
    print("Function")
if __name__ == "__main__":
    func()  # Runs only if script is executed directly
Use Case: Prevents code from running when the module is imported.
32. Write code to generate a random password using the secrets module.
import secrets
import string
def generate_password(length=12):
    characters = string.ascii_letters + string.digits + string.punctuation
    return "".join(secrets.choice(characters) for _ in range(length))
print(generate_password())  # e.g., "kX#9mP$2nL@q"
Explanation: secrets is cryptographically secure, unlike random, making it suitable for passwords.
33. How would you schedule a Python script to run daily using os or subprocess?
Use the schedule library for in-process scheduling, or a system scheduler (e.g., cron on Linux); subprocess can install the cron entry.
Example (cron):
import subprocess
script_path = "/path/to/script.py"
cron_job = f"0 0 * * * /usr/bin/python3 {script_path}\n"
with open("cron_jobs", "w") as f:
    f.write(cron_job)
subprocess.run(["crontab", "cron_jobs"])  # Note: replaces the user's existing crontab
Explanation: Schedules the script to run at midnight daily. Alternatively, use schedule for in-Python scheduling.
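A minimal in-Python sketch with the schedule library (assumes pip install schedule; job is a placeholder function):
import schedule
import time
def job():
    print("Running daily task")
schedule.every().day.at("00:00").do(job)
while True:
    schedule.run_pending()
    time.sleep(60)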
Module 9: Data Science Libraries
34. Optimize a Pandas DataFrame to reduce memory usage.
- Downcast Numeric Types: Use int8, float32, etc.
- Use Categorical Types: For columns with few unique values.
- Drop Unnecessary Columns: Remove unused data.
Example:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "x"]})
df["A"] = df["A"].astype("int8")
df["B"] = df["B"].astype("category")
print(df.memory_usage(deep=True))
35. How does NumPy’s broadcasting improve computation speed?
Broadcasting allows NumPy to perform operations on arrays of different shapes by automatically expanding smaller arrays to match the larger one’s shape, avoiding explicit loops.
Example:
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([10, 20])
result = a + b # b is broadcasted to [[10, 20], [10, 20]]
print(result) # [[11, 22], [13, 24]]
Speed: Broadcasting uses optimized C loops, reducing Python overhead.
36. Handle missing values in a DataFrame without dropping rows.
Use fillna() for imputation (mean, median, forward-fill) or interpolation.
Example:
import pandas as pd
df = pd.DataFrame({"A": [1, None, 3], "B": [4, 5, None]})
df["A"] = df["A"].fillna(df["A"].mean())
df["B"] = df["B"].interpolate()
print(df)
37. Create a heatmap using Seaborn to visualize correlation matrices.
import pandas as pd
import seaborn as sns
import numpy as np
df = pd.DataFrame(np.random.rand(5, 5), columns=list("ABCDE"))
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
Explanation: corr() computes the correlation matrix, and sns.heatmap() visualizes it with annotations.
38. Compare loc and iloc in Pandas.
- loc: Accesses data by label (row/column names).
- iloc: Accesses data by integer position.
Example:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])
print(df.loc["x", "A"])  # 1 (label-based)
print(df.iloc[0, 0])  # 1 (position-based)
Module 10: Advanced Python
39. Write a decorator to log function execution time.
import time
import functools
def timer(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.2f} seconds")
        return result
    return wrapper
@timer
def slow_func():
    time.sleep(1)
slow_func()  # slow_func took 1.00 seconds
40. How do generators save memory compared to lists?
Generators yield one item at a time, avoiding storing the entire sequence in memory. Lists store all elements upfront.
Example:
def gen_nums(n):
    for i in range(n):
        yield i
g = gen_nums(1000000)  # Minimal memory
lst = list(range(1000000))  # High memory
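Continuing the example, sys.getsizeof makes the gap concrete (exact numbers are platform-dependent):
import sys
print(sys.getsizeof(g))    # ~100-200 bytes: just the generator object
print(sys.getsizeof(lst))  # ~8,000,000 bytes: references to every element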
41. Explain the GIL and its impact on multithreading.
The Global Interpreter Lock (GIL) is a mutex in CPython that prevents multiple native threads from executing Python bytecodes simultaneously. It simplifies memory management but limits true parallelism in CPU-bound tasks.
Impact: Multithreading is effective for I/O-bound tasks (e.g., network calls) but not CPU-bound tasks (e.g., computations). Use multiprocessing for CPU parallelism.
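A minimal sketch to observe the GIL's effect (timings are machine-dependent; on a multi-core box the process pool should finish this CPU-bound work much faster than the thread pool):
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
def cpu_task(n):
    return sum(i * i for i in range(n))
if __name__ == "__main__":
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.time()
        with pool_cls(max_workers=4) as ex:
            list(ex.map(cpu_task, [2_000_000] * 4))
        print(pool_cls.__name__, time.time() - start)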
42. When would you use multiprocessing over multithreading?
Use multiprocessing for:
- CPU-bound tasks (e.g., data processing, ML training) to bypass the GIL.
- True parallelism across multiple CPU cores.
Use multithreading for I/O-bound tasks (e.g., file reading, API calls).
Example (multiprocessing):
from multiprocessing import Pool
def square(n):
    return n * n
if __name__ == "__main__":  # Guard required on platforms that spawn workers
    with Pool(4) as p:
        results = p.map(square, [1, 2, 3, 4])
    print(results)  # [1, 4, 9, 16]
Practical Scenarios & Trends (2025)
43. Design a REST API in Python for a real-time recommendation system.
Use FastAPI for its async support and automatic OpenAPI documentation.
from fastapi import FastAPI
import numpy as np
app = FastAPI()
@app.get("/recommend/{user_id}")
async def recommend(user_id: int):
    # Mock recommendation logic
    items = np.random.choice(["item1", "item2", "item3"], size=3)
    return {"user_id": user_id, "recommendations": list(items)}
Explanation: The API takes a user_id, generates mock recommendations, and returns JSON. Deploy with Uvicorn for production.
44. How would you deploy a machine learning model using Flask/FastAPI?
- Train model and save it (e.g., joblib).
- Create a REST API with Flask/FastAPI.
- Deploy on a cloud platform (e.g., AWS EC2, Heroku).
Example (FastAPI):
from fastapi import FastAPI
import joblib
app = FastAPI()
model = joblib.load("model.pkl")
@app.post("/predict")
async def predict(data: dict):
    features = data["features"]
    prediction = model.predict([features])[0]
    return {"prediction": prediction}
45. Optimize a Python script for parallel processing on AWS Lambda.
- Split tasks into independent chunks.
- Use AWS Lambda’s event-driven model with SQS or Step Functions.
- Use boto3 to trigger parallel Lambda invocations.
Example:
import json
import boto3
lambda_client = boto3.client("lambda")
def lambda_handler(event, context):
    tasks = event["tasks"]
    for task in tasks:
        lambda_client.invoke(
            FunctionName="ParallelTask",
            InvocationType="Event",
            Payload=json.dumps({"task": task}),
        )
    return {"status": "Tasks dispatched"}
46. Implement a context manager for database connections.
from contextlib import contextmanager
import sqlite3
@contextmanager
def db_connection(db_name):
    conn = sqlite3.connect(db_name)
    try:
        yield conn
    finally:
        conn.close()
with db_connection("example.db") as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users")  # assumes a users table exists
    results = cursor.fetchall()
Explanation: The context manager ensures the connection is closed even if an error occurs.
47. How do you secure sensitive data in environment variables?
Use the python-dotenv library to load environment variables from a .env file, and never commit the .env file to version control.
Example:
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv("API_KEY")
print(api_key)  # Securely accessed
Best Practices: Use secrets management services (e.g., AWS Secrets Manager) in production.
Problem-Solving & Algorithms
48. Reverse a linked list using Python.
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next
def reverse_list(head):
    prev = None
    current = head
    while current:
        next_node = current.next
        current.next = prev
        prev = current
        current = next_node
    return prev
49. Find the longest palindromic substring in a string.
def longest_palindrome(s):
    n = len(s)
    start = 0
    max_len = 1
    def expand_around_center(left, right):
        while left >= 0 and right < n and s[left] == s[right]:
            left -= 1
            right += 1
        return left + 1, right - left - 1
    for i in range(n):
        l1, len1 = expand_around_center(i, i)      # Odd length
        l2, len2 = expand_around_center(i, i + 1)  # Even length
        if len1 > max_len:
            start, max_len = l1, len1
        if len2 > max_len:
            start, max_len = l2, len2
    return s[start:start + max_len]
print(longest_palindrome("babad"))  # "bab" or "aba"
50. Implement a binary search algorithm.
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
arr = [2, 3, 4, 10, 40]
print(binary_search(arr, 10))  # 3
Behavioral & Best Practices
51. How do you ensure code readability in a team environment?
- Follow PEP8 style guidelines (use flake8 or black).
- Write clear docstrings and comments.
- Use meaningful variable/function names.
- Refactor complex logic into smaller functions.
- Use type hints for clarity.
Example:
def calculate_total_price(items: list[float], tax_rate: float) -> float:
    """Calculate total price with tax."""
    subtotal = sum(items)
    return subtotal * (1 + tax_rate)
52. Describe your approach to debugging a production outage.
- Reproduce: Identify the issue using logs (e.g., CloudWatch).
- Isolate: Check recent changes (code, config, data).
- Hypothesize: Use tools like pdb or logging to trace errors.
- Test Fixes: Apply in a staging environment.
- Monitor: Verify resolution and add alerts (e.g., Prometheus).
- Document: Record root cause and prevention steps.
53. What tools do you use for CI/CD in Python projects?
- CI: GitHub Actions, Jenkins, CircleCI for automated testing (pytest).
- CD: Deploy to AWS (CodePipeline), Heroku, or Docker containers.
- Linting/Type Checking: flake8, mypy.
- Example (GitHub Actions):
name: CI
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install pytest
      - run: pytest
Advanced Data Science (2025 Trends)
54. How would you use PySpark for big data processing?
PySpark processes large datasets across distributed clusters using Spark’s RDDs or DataFrames.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("BigData").getOrCreate()
df = spark.read.csv("large_data.csv", header=True, inferSchema=True)  # header/inferSchema so named columns exist
result = df.groupBy("category").agg({"value": "sum"})
result.write.csv("output")
spark.stop()
Use Case: Aggregations, joins, or ML on terabyte-scale data.
55. Explain the role of Python in MLOps pipelines.
Python is central to MLOps for:
- Model Training: Scikit-learn, TensorFlow, PyTorch.
- Orchestration: Airflow for scheduling.
- Monitoring: MLflow for tracking experiments.
- Deployment: FastAPI, AWS SageMaker.
Example: Use MLflow to log model metrics.
import mlflow
with mlflow.start_run():
    mlflow.log_param("model", "xgboost")
    mlflow.log_metric("accuracy", 0.95)
56. Integrate LLMs (e.g., GPT-4) into a Python workflow.
Use Hugging Face’s transformers or OpenAI’s API.
Example (Hugging Face):
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")
output = generator("Hello, world!", max_length=50)
print(output[0]["generated_text"])
Use Case: Text generation, summarization, or chatbots.
57. Handle real-time data streams with Python (e.g., Kafka, Apache Flink).
Use kafka-python for Kafka integration.
Example:
from kafka import KafkaConsumer
consumer = KafkaConsumer("topic", bootstrap_servers="localhost:9092")
for message in consumer:
    print(message.value.decode("utf-8"))
Flink: Use pyflink for stream processing on large-scale data.
System Design
58. Design a caching system for a high-traffic web app.
Use Redis for in-memory caching with TTL.
Components:
- Client: Sends requests to the app.
- App: Checks Redis cache; if miss, queries DB and caches result.
- Redis: Stores key-value pairs.
Example:
import redis
cache = redis.Redis(host="localhost", port=6379)
def get_data(key):
    if cached := cache.get(key):
        return cached.decode()
    data = query_db(key)  # Expensive DB call (user-defined)
    cache.setex(key, 3600, data)  # Cache for 1 hour
    return data
59. How would you architect a scalable ETL pipeline using Airflow?
- Extract: Pull data from APIs/databases (Pandas, SQLAlchemy).
- Transform: Clean and aggregate (Pandas, PySpark).
- Load: Store in a data warehouse (Snowflake, BigQuery).
- Airflow: Schedule and monitor DAGs.
Example DAG:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def extract(): pass
def transform(): pass
def load(): pass
with DAG("etl_pipeline", start_date=datetime(2025, 1, 1), schedule_interval="@daily") as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
Module 10: Advanced Python Concepts (Continued)
60. Write a thread-safe singleton class using __new__.
import threading
class Singleton:
    _instance = None
    _lock = threading.Lock()
    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance
61. How does __slots__ optimize memory in Python classes?
__slots__ defines a fixed set of attributes, preventing the creation of a __dict__ for each instance, reducing memory overhead.
Example:
class Point:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y
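Continuing the Point example, the saving is visible directly: a slotted instance has no per-instance __dict__, which is the usual source of the overhead.
class PlainPoint:  # same class without __slots__, for comparison
    def __init__(self, x, y):
        self.x = x
        self.y = y
print(PlainPoint(1, 2).__dict__)         # {'x': 1, 'y': 2}
print(hasattr(Point(1, 2), "__dict__"))  # False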
62. Implement a generator to yield Fibonacci numbers indefinitely.
def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b
fib = fibonacci()
for _ in range(5):
    print(next(fib))  # 0, 1, 1, 2, 3
63. What are context managers? Write one for handling database connections.
Context managers handle setup/teardown (e.g., opening/closing resources) using __enter__ and __exit__.
import sqlite3
class DatabaseConnection:
    def __init__(self, db_name):
        self.db_name = db_name
    def __enter__(self):
        self.conn = sqlite3.connect(self.db_name)
        return self.conn
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.conn.close()
with DatabaseConnection("example.db") as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users")  # assumes a users table exists
64. Explain the use of asyncio for asynchronous programming.
asyncio enables non-blocking I/O operations using async/await. Ideal for I/O-bound tasks (e.g., API calls).
Example:
import asyncio
async def fetch_data():
    await asyncio.sleep(1)  # Simulate I/O
    return "Data"
async def main():
    result = await fetch_data()
    print(result)
asyncio.run(main())
Module 11: Hands-on Projects
65. Design a CLI tool to scrape website data and save it to CSV.
import click
import requests
from bs4 import BeautifulSoup
import pandas as pd
@click.command()
@click.option("--url", default="https://example.com", help="Website URL")
def scrape(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    data = [{"text": p.text} for p in soup.find_all("p")]
    pd.DataFrame(data).to_csv("output.csv", index=False)
    print("Saved to output.csv")
if __name__ == "__main__":
    scrape()
66. How would you build a sentiment analysis model using Python libraries?
Use transformers for a pre-trained model.
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
text = "I love this product!"
result = classifier(text)
print(result)  # [{'label': 'POSITIVE', 'score': 0.999}]
67. Create a Flask API to predict housing prices using a trained model.
from flask import Flask, request, jsonify
import joblib
app = Flask(__name__)
model = joblib.load("housing_model.pkl")
@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["features"]
    prediction = model.predict([data])[0]
    return jsonify({"price": prediction})
if __name__ == "__main__":
    app.run()
Data Structures & Algorithms
68. Implement a stack using Python lists.
class Stack:
    def __init__(self):
        self.items = []
    def push(self, item):
        self.items.append(item)
    def pop(self):
        return self.items.pop()
    def peek(self):
        return self.items[-1]
    def is_empty(self):
        return len(self.items) == 0
69. Write code to detect cycles in a linked list.
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next
def has_cycle(head):
    slow = fast = head
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        if slow == fast:
            return True
    return False
70. Solve the “Two Sum” problem with optimal time complexity.
def two_sum(nums, target):
    seen = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
    return []
print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
Complexity: O(n) using a hash map.
71. Traverse a binary tree in pre-order without recursion.
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right
def preorder_traversal(root):
    if not root:
        return []
    result = []
    stack = [root]
    while stack:
        node = stack.pop()
        result.append(node.val)
        if node.right:
            stack.append(node.right)
        if node.left:
            stack.append(node.left)
    return result
72. Sort a list of dictionaries by a specific key.
data = [{"name": "Bob", "age": 30}, {"name": "Alice", "age": 25}]
sorted_data = sorted(data, key=lambda x: x["age"])
print(sorted_data)  # [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]
Advanced Data Science & Libraries
73. How does Pandas handle categorical data? When is it useful?
Pandas uses the category dtype to store data with few unique values, reducing memory and speeding up operations like grouping.
Example:
import pandas as pd
df = pd.DataFrame({"color": ["red", "blue", "red"]})
df["color"] = df["color"].astype("category")
Use Case: Columns with repetitive values (e.g., gender, status).
74. Reshape a NumPy array from (3, 4) to (4, 3) without copying data.
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
reshaped = arr.T # Transpose, no copy
print(reshaped.shape) # (4, 3)
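To confirm no data was copied, NumPy can check whether both arrays share one buffer:
print(np.shares_memory(arr, reshaped))  # True: the transpose is a view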
75. Handle imbalanced datasets in a classification problem.
- Resampling: Oversample minority (SMOTE) or undersample majority.
- Class Weights: Adjust weights in the model (e.g., Scikit-learn’s class_weight).
- Evaluation Metrics: Use precision, recall, or F1-score instead of accuracy.
Example (SMOTE):
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X, y)
76. Use groupby in Pandas to calculate aggregate metrics.
import pandas as pd
df = pd.DataFrame({"category": ["A", "A", "B"], "value": [10, 20, 30]})
result = df.groupby("category")["value"].mean()
print(result)
# category
# A 15.0
# B 30.0
77. Optimize a Pandas apply function with vectorization.
Replace apply with vectorized operations for speed.
Example:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3]})
# Slow: Python-level function call per row
df["B"] = df["A"].apply(lambda x: x * 2)
# Fast: vectorized operation in C
df["B"] = df["A"] * 2
OOP & Design Patterns
78. Implement the Factory pattern for a logging system.
class Logger:
    def log(self, message):
        pass
class ConsoleLogger(Logger):
    def log(self, message):
        print(f"Console: {message}")
class FileLogger(Logger):
    def log(self, message):
        with open("log.txt", "a") as f:
            f.write(f"File: {message}\n")
class LoggerFactory:
    @staticmethod
    def create_logger(logger_type):
        if logger_type == "console":
            return ConsoleLogger()
        elif logger_type == "file":
            return FileLogger()
        raise ValueError("Unknown logger")
logger = LoggerFactory.create_logger("console")
logger.log("Test")
79. What is the Observer pattern? Write a Python example.
The Observer pattern allows objects (observers) to be notified of changes in a subject.
class Subject:
    def __init__(self):
        self._observers = []
    def attach(self, observer):
        self._observers.append(observer)
    def notify(self, message):
        for observer in self._observers:
            observer.update(message)
class Observer:
    def update(self, message):
        print(f"Received: {message}")
subject = Subject()
observer = Observer()
subject.attach(observer)
subject.notify("Event occurred")
80. Design a RESTful class hierarchy using inheritance.
class Resource:
    def get(self, id):
        return {"id": id}
class UserResource(Resource):
    def get(self, id):
        return {"id": id, "type": "user"}
class ProductResource(Resource):
    def get(self, id):
        return {"id": id, "type": "product"}
Exception Handling & Debugging
81. Write a decorator to retry failed API calls 3 times.
import time
import functools
def retry(max_attempts=3, delay=1):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
@retry()
def fetch_api():
    raise ValueError("API failed")
82. How would you log exceptions to a file in production?
Use the logging module with a file handler.
import logging
logging.basicConfig(filename="app.log", level=logging.ERROR)
try:
    1 / 0
except Exception:
    logging.exception("An error occurred")
83. Debug a memory leak in a long-running Python process.
- Use tracemalloc to track allocations.
- Check for unclosed resources (files, connections).
- Monitor with psutil for process memory usage.
Example:
import tracemalloc
tracemalloc.start()
# Code under investigation
snapshot = tracemalloc.take_snapshot()
print(snapshot.statistics("lineno")[:3])
Cloud & DevOps Integration
84. Containerize a Python app using Docker.
Dockerfile:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
Build/Run:
docker build -t my-app .
docker run -p 5000:5000 my-app
85. Automate deployment to AWS EC2 using GitHub Actions.
Workflow:
name: Deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to EC2
        env:
          EC2_HOST: ${{ secrets.EC2_HOST }}
          EC2_USER: ${{ secrets.EC2_USER }}
          EC2_KEY: ${{ secrets.EC2_KEY }}
        run: |
          echo "$EC2_KEY" > key.pem
          chmod 400 key.pem
          ssh -i key.pem $EC2_USER@$EC2_HOST 'bash -s' < deploy.sh
86. Monitor a Python microservice’s performance with Prometheus.
Use prometheus_client to expose metrics.
from flask import Flask
from prometheus_client import Counter, start_http_server
app = Flask(__name__)  # Flask app shown explicitly for a self-contained example
requests_total = Counter("requests_total", "Total requests")
start_http_server(8000)  # Metrics exposed on :8000/metrics
@app.route("/endpoint")
def endpoint():
    requests_total.inc()
    return "OK"
2025 Trends & Tools
87. How would you use Python for quantum computing simulations?
Use Qiskit for quantum circuit simulations. (The snippet below uses the pre-1.0 API; Qiskit 1.0+ moved Aer into the separate qiskit-aer package and replaced execute with transpile/run.)
from qiskit import QuantumCircuit, Aer, execute
qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])
simulator = Aer.get_backend("qasm_simulator")
result = execute(qc, simulator, shots=1024).result()
print(result.get_counts())
88. Implement a real-time dashboard with Plotly Dash.
from dash import Dash, html, dcc
import plotly.express as px
app = Dash(__name__)
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length")
app.layout = html.Div([dcc.Graph(figure=fig)])
app.run_server(debug=True)  # Newer Dash releases use app.run(debug=True)
89. Fine-tune a Hugging Face transformer model for text generation.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Fine-tune with a prepared (tokenized) dataset
trainer = Trainer(model=model, train_dataset=dataset)  # dataset defined elsewhere
trainer.train()
90. Use Python to automate infrastructure provisioning with Terraform.
Use python-terraform to interact with Terraform.
from python_terraform import Terraform
tf = Terraform(working_dir=".")
tf.init()
tf.apply(auto_approve=True)
System Design & Scalability
91. Design a distributed task queue using Celery.
Components:
- Celery Workers: Process tasks.
- Broker: RabbitMQ/Redis for task queuing.
- Backend: Store results.
Example:
from celery import Celery
app = Celery("tasks", broker="redis://localhost:6379", backend="redis://localhost:6379")
@app.task
def process_data(data):
    return data * 2
92. Architect a serverless data pipeline with AWS Lambda.
- Trigger: S3 file upload.
- Lambda: Process data (e.g., Pandas).
- Output: Store in DynamoDB.
Example Lambda:
import json
import boto3
def lambda_handler(event, context):
    s3 = boto3.client("s3")
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    # Process the uploaded file here
    return {"status": "Processed"}
93. How would you shard a MySQL database for a Python app?
- Strategy: Partition data by key (e.g., user_id).
- Implementation: Use SQLAlchemy with multiple database connections.
- Example: Route queries based on shard key.
from sqlalchemy import create_engine
shards = {0: create_engine("mysql://user:pass@host1/db"), 1: create_engine("mysql://user:pass@host2/db")}
def get_shard(user_id):
    return shards[user_id % 2]
Security & Ethics
94. Prevent SQL injection in raw SQL queries with Python.
Use parameterized queries with SQLAlchemy or mysql-connector.
import sqlite3
conn = sqlite3.connect("example.db")
cursor = conn.cursor()
user_input = "malicious; DROP TABLE users;"
cursor.execute("SELECT * FROM users WHERE name = ?", (user_input,))
95. Implement rate limiting in a Flask API.
Use flask-limiter.
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["100 per day"])
@app.route("/endpoint")
@limiter.limit("5 per minute")
def endpoint():
    return "OK"
96. How do you ensure GDPR compliance in data processing scripts?
- Consent: Store user consent for data processing.
- Anonymization: Remove PII (e.g., names, emails).
- Data Deletion: Implement mechanisms to delete user data on request.
- Audit Logs: Track data access.
Example: Anonymize data with hashlib.
import hashlib
email = "user@example.com"
anonymized = hashlib.sha256(email.encode()).hexdigest()
Testing & Best Practices
97. Write unit tests for a function using pytest mocks.
# my_module.py (code under test; module name is illustrative)
def fetch_data():
    return "Real data"
def process():
    return fetch_data()

# test_my_module.py
from unittest.mock import patch
import my_module
@patch("my_module.fetch_data")  # patch where the name is looked up
def test_process(mock_fetch):
    mock_fetch.return_value = "Mocked data"
    assert my_module.process() == "Mocked data"
98. Use mypy for static type checking in a legacy codebase.
Add type hints and run mypy.
def add(a: int, b: int) -> int:
    return a + b
Command: mypy script.py
99. How do you enforce PEP8 compliance in a CI pipeline?
Use flake8 in GitHub Actions.
name: Lint
on: [push]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install flake8
      - run: flake8 .
Final Question
100. Build a Python pipeline to process 1TB of data on a budget. What tools and strategies would you use?
Strategy:
- Tools:
- Dask/PySpark: For distributed processing.
- AWS S3: Store raw data (cost: ~$0.023/GB/month).
- AWS EC2 Spot Instances: Cheap compute (~70% savings).
- Pandas: For small-scale prototyping.
- Pipeline:
- Extract: Read data from S3 in chunks (Dask).
- Transform: Filter, aggregate, or clean (e.g., remove nulls).
- Load: Save results back to S3 or a database.
- Optimization:
- Use compressed formats (Parquet).
- Partition data by key (e.g., date).
- Monitor costs with AWS Budgets.
Example (Dask):
import dask.dataframe as dd
df = dd.read_parquet("s3://bucket/data/*.parquet")
result = df.groupby("category").sum().compute()
result.to_parquet("s3://bucket/output/")
Cost Estimate: ~$50-100 for 1TB processing using spot instances and S3.