✅ 1. List vs. Tuple — Simple Explanation
List
- A list is like a shopping list.
- You can add, remove, or change items anytime.
- Mutable (changeable).
Tuple
- A tuple is like GPS coordinates (latitude, longitude).
- Once created, you cannot modify it.
- Immutable (not changeable).
- Slightly faster and more memory-efficient than a list.
When to Use?
- Use List → when data needs modification.
- Use Tuple → when data should stay constant and safe from changes.
✅ Code Example
# A list of daily tasks - we might want to add or remove tasks
daily_tasks = ["email", "meeting", "coding"]
print(f"Original list: {daily_tasks}")
# We can easily change an item
daily_tasks[1] = "code review"
print(f"Modified list: {daily_tasks}")
# A tuple of server coordinates - this should not change
server_location = (40.7128, -74.0060) # (Latitude, Longitude)
print(f"\nOriginal tuple: {server_location}")
# Trying to change a tuple will cause an error
try:
server_location[0] = 34.0522
except TypeError as e:
print(f"Error trying to modify tuple: {e}")
✅ Output
Original list: ['email', 'meeting', 'coding']
Modified list: ['email', 'code review', 'coding']
Original tuple: (40.7128, -74.0060)
Error trying to modify tuple: 'tuple' object does not support item assignment
✅ 2. Dictionary Operations — Simple Explanation
A dictionary stores data as key–value pairs.
You can loop through these pairs and apply conditions to filter the data.
Example:
Looping through user profiles and checking which users are older than 30.
✅ Code Example
# A dictionary where keys are user IDs and values are their profiles
user_profiles = {
"u101": {"name": "Alice", "age": 34},
"u102": {"name": "Bob", "age": 25},
"u103": {"name": "Charlie", "age": 42},
"u104": {"name": "Diana", "age": 29}
}
# Find all users older than 30
users_over_30 = [
user_id for user_id, profile in user_profiles.items() if profile["age"] > 30
]
print(f"Users older than 30: {users_over_30}")
✅ Output
Users older than 30: ['u101', 'u103']
✅ 3. List Comprehension — Simple Explanation
A list comprehension is a short, readable way to create a list.
It replaces long for loops and makes code cleaner.
Code Example
# The original for loop
squares_loop = []
for x in range(10):
squares_loop.append(x * x)
# The equivalent list comprehension
squares_comp = [x * x for x in range(10)]
print(f"Using a for loop: {squares_loop}")
print(f"Using a list comprehension: {squares_comp}")
Output
Using a for loop: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Using a list comprehension: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
✅ 4. Lambda Functions — Simple Explanation
A lambda function is a small, one-line anonymous function.
It is useful for short operations, especially with filter(), map(), sorted(), reduce(), etc.
Code Example
numbers = [10, -5, 22, -1, 0, 15, -8]
# Use filter() with a lambda function to keep only non-negative numbers
positive_numbers = list(filter(lambda x: x >= 0, numbers))
print(f"Original list: {numbers}")
print(f"List with negatives removed: {positive_numbers}")
Output
Original list: [10, -5, 22, -1, 0, 15, -8]
List with negatives removed: [10, 22, 0, 15]
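Lambdas are not limited to filter(); here is a small extra sketch with sorted() and map(), using made-up example data:
# Sort a list of (name, age) records by age using a lambda as the key
employees = [("Alice", 34), ("Bob", 25), ("Charlie", 42)]
by_age = sorted(employees, key=lambda emp: emp[1])
print(f"Sorted by age: {by_age}")
# Double every number with map() and a lambda
doubled = list(map(lambda x: x * 2, [10, -5, 22]))
print(f"Doubled: {doubled}")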
✅ 5. Set Operations — Simple Explanation
A set is a collection of unique items.
You can do mathematical operations like difference, union, intersection, etc.
Example:
Find customers who are in list A but not in list B.
Code Example
list_a = [101, 102, 103, 104, 105]
list_b = [104, 105, 106, 107]
# Convert lists to sets to perform set operations
set_a = set(list_a)
set_b = set(list_b)
# Find customers in list_a but not in list_b (set difference)
customers_only_in_a = list(set_a - set_b)
print(f"List A: {list_a}")
print(f"List B: {list_b}")
print(f"Customers only in List A: {customers_only_in_a}")
Output
List A: [101, 102, 103, 104, 105]
List B: [104, 105, 106, 107]
Customers only in List A: [101, 102, 103]
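Union and intersection work the same way; a quick extra sketch with the same customer IDs:
set_a = {101, 102, 103, 104, 105}
set_b = {104, 105, 106, 107}
print(f"Union (in A or B): {sorted(set_a | set_b)}")
print(f"Intersection (in A and B): {sorted(set_a & set_b)}")
print(f"Symmetric difference (in exactly one): {sorted(set_a ^ set_b)}")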
✅ 6. Memory Efficiency (Generators)
Simple Explanation:
- A list stores all values in memory at once → uses a lot of memory for large datasets.
- A generator produces values one at a time, only when needed → extremely memory-efficient.
Code Example
import sys
# Create a list of the first 1,000,000 numbers
my_list = [i for i in range(1_000_000)]
print(f"Size of list: {sys.getsizeof(my_list)} bytes")
# Create a generator for the first 1,000,000 numbers
my_generator = (i for i in range(1_000_000))
print(f"Size of generator: {sys.getsizeof(my_generator)} bytes")
# Getting values from a generator
print("\nFirst 5 values from generator:")
for i in range(5):
print(next(my_generator))
Output
Size of list: 8000056 bytes
Size of generator: 200 bytes
First 5 values from generator:
0
1
2
3
4
✅ 7. Functions
Simple Explanation:
A function is a reusable block of code that:
- takes inputs (arguments)
- performs a task
- returns an output
Code Example
import statistics
def calculate_stats(numbers):
"""Calculates mean, median, and standard deviation for a list of numbers."""
if not numbers:
return "The list is empty."
mean = statistics.mean(numbers)
median = statistics.median(numbers)
stdev = statistics.stdev(numbers)
return {"mean": mean, "median": median, "stdev": stdev}
# Example usage
data = [15, 22, 28, 30, 35, 41, 50]
stats = calculate_stats(data)
print(f"Stats for {data}:")
print(stats)
Output
Stats for [15, 22, 28, 30, 35, 41, 50]:
{'mean': 31.57142857142857, 'median': 30, 'stdev': 11.799713686842571}
✅ 8. *args and **kwargs
Simple Explanation:
- *args → collects positional arguments into a tuple, e.g. func(1, 2, 3)
- **kwargs → collects keyword arguments into a dictionary, e.g. func(name="Alice", age=30)
They help create flexible functions.
Code Example
import pandas as pd
def create_dataframe(*args, **kwargs):
"""
Creates a DataFrame.
*args = positional arguments
**kwargs = column data
"""
print("--- Arguments received ---")
print(f"Positional args (*args): {args}")
print(f"Keyword args (**kwargs): {kwargs}")
print("--------------------------")
return pd.DataFrame(data=kwargs)
# Example usage
df = create_dataframe(
"name", "age", "city", # Positional (*args)
name=["Alice", "Bob"], # Keyword (**kwargs)
age=[34, 25],
city=["New York", "Los Angeles"]
)
print("Resulting DataFrame:")
print(df)
Output
--- Arguments received ---
Positional args (*args): ('name', 'age', 'city')
Keyword args (**kwargs): {'name': ['Alice', 'Bob'], 'age': [34, 25], 'city': ['New York', 'Los Angeles']}
--------------------------
Resulting DataFrame:
name age city
0 Alice 34 New York
1 Bob 25 Los Angeles
✅ 9. Error Handling
Simple Explanation:
A try...except block lets you run risky code safely.
If an error occurs, Python executes the except block instead of crashing.
Code Example
import pandas as pd
file_path = "data/my_non_existent_file.csv"
try:
df = pd.read_csv(file_path)
print("File loaded successfully!")
print(df.head())
except FileNotFoundError:
print(f"Error: The file at '{file_path}' was not found.")
print("Please check the file path and try again.")
Output
Error: The file at 'data/my_non_existent_file.csv' was not found.
Please check the file path and try again.
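A try block can also have multiple except clauses plus else and finally; a minimal sketch (the file name "numbers.txt" is hypothetical):
try:
    with open("numbers.txt") as f:
        values = [int(line) for line in f]
except FileNotFoundError:
    print("The file does not exist.")
except ValueError:
    print("The file contains something that is not a number.")
else:
    # Runs only if no exception was raised
    print(f"Read {len(values)} numbers successfully.")
finally:
    # Always runs, error or not
    print("Done processing the file.")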
✅ 10. Classes (OOP)
Simple Explanation:
A class is a blueprint for creating objects.
Objects have:
- attributes (data)
- methods (functions)
A DataPipeline class can group extract → transform → load steps.
Code Example
class DataPipeline:
def __init__(self, source_file):
"""Initializes the pipeline with a source file."""
self.source_file = source_file
self.data = None
print(f"Pipeline initialized for source: {self.source_file}")
def extract(self):
print("Step 1: Extracting data...")
self.data = {"col1": [1, 2], "col2": [3, 4]}
print("Extraction complete.")
def transform(self):
print("Step 2: Transforming data...")
if self.data:
self.data["col1"] = [x * 10 for x in self.data["col1"]]
print("Transformation complete.")
def load(self):
print("Step 3: Loading data...")
if self.data:
print(f"Data to be loaded: {self.data}")
print("Load complete.")
# Using the class
print("--- Creating a pipeline instance ---")
pipeline = DataPipeline("sales_data.csv")
print("\n--- Running the pipeline ---")
pipeline.extract()
pipeline.transform()
pipeline.load()
Output
--- Creating a pipeline instance ---
Pipeline initialized for source: sales_data.csv
--- Running the pipeline ---
Step 1: Extracting data...
Extraction complete.
Step 2: Transforming data...
Transformation complete.
Step 3: Loading data...
Data to be loaded: {'col1': [10, 20], 'col2': [3, 4]}
Load complete.
✅ 11. Shallow vs Deep Copy
Simple Explanation
- A shallow copy creates a new object but does NOT copy nested objects.
  → Both the original and the copy share the same nested elements.
- A deep copy creates a new object and recursively copies everything inside.
  → The original and the copy are completely independent.
Why is this important for Pandas?
- df.copy(deep=False) → shallow copy, risky: changes may reflect in the original DataFrame (the exact behavior depends on the pandas version and Copy-on-Write settings).
- df.copy() or df.copy(deep=True) → safe: the original DataFrame remains untouched.
Code Example
import copy
import pandas as pd
# --- Example with a list of lists ---
original_list = [[1, 2, 3], [4, 5, 6]]
# Shallow copy
shallow_copy_list = copy.copy(original_list)
# Deep copy
deep_copy_list = copy.deepcopy(original_list)
print("--- Modifying a nested list in the SHALLOW copy ---")
shallow_copy_list[0][0] = 99
print(f"Original list: {original_list}") # Changed!
print(f"Shallow copy: {shallow_copy_list}")
print("\n--- Modifying a nested list in the DEEP copy ---")
deep_copy_list[0][0] = 88
print(f"Original list: {original_list}") # Unchanged
print(f"Deep copy: {deep_copy_list}")
# --- Example with Pandas DataFrames ---
df_original = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df_shallow = df_original.copy(deep=False)
df_deep = df_original.copy()
print("\n--- Modifying the SHALLOW DataFrame copy ---")
df_shallow.loc[0, 'col1'] = 99
print(f"Original DataFrame:\n{df_original}")
print(f"Shallow DataFrame:\n{df_shallow}")
Output
--- Modifying a nested list in the SHALLOW copy ---
Original list: [[99, 2, 3], [4, 5, 6]]
Shallow copy: [[99, 2, 3], [4, 5, 6]]
--- Modifying a nested list in the DEEP copy ---
Original list: [[99, 2, 3], [4, 5, 6]]
Deep copy: [[88, 2, 3], [4, 5, 6]]
--- Modifying the SHALLOW DataFrame copy ---
Original DataFrame:
col1 col2
0 1 3
1 2 4
Shallow DataFrame:
col1 col2
0 99 3
1 2 4
✅ 12. Working with Files (Reading Large Files)
Simple Explanation
When reading very large files, don’t load the whole file into memory.
Use:
with open(...) as f:
for line in f:
...
This reads one line at a time, which is very memory-efficient.
Code Example
# First, let's create a dummy large file
file_path = "large_file.txt"
with open(file_path, "w") as f:
for i in range(100):
f.write(f"This is line number {i+1} of the file.\n")
# Now, read it line by line
print(f"Reading '{file_path}' line by line:")
try:
with open(file_path, "r") as f:
for i, line in enumerate(f):
print(f"Line {i+1}: {line.strip()}")
if i >= 2: # Stop after 3 lines
break
except FileNotFoundError:
print(f"Error: The file '{file_path}' was not found.")
Output
Reading 'large_file.txt' line by line:
Line 1: This is line number 1 of the file.
Line 2: This is line number 2 of the file.
Line 3: This is line number 3 of the file.
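For large CSV files specifically, pandas can also read in chunks; a minimal sketch (the file name "big_sales.csv" and the "sales" column are hypothetical):
import pandas as pd

total = 0
try:
    for chunk in pd.read_csv("big_sales.csv", chunksize=100_000):
        total += chunk["sales"].sum()  # process each chunk, then let it go
    print(f"Total sales: {total}")
except FileNotFoundError:
    print("big_sales.csv not found.")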
✅ 13. Virtual Environments
Simple Explanation
A virtual environment is an isolated Python environment.
Each project can have its own versions of:
- Python packages
- Dependencies
- Library versions
This prevents version conflicts between projects.
Commands (bash)
# 1. Create a virtual environment named 'my_project_env'
python3 -m venv my_project_env
# 2. Activate the environment
# macOS/Linux:
source my_project_env/bin/activate
# Windows:
# my_project_env\Scripts\activate
# 3. Install packages
(my_project_env) $ pip install pandas numpy
# 4. Deactivate
(my_project_env) $ deactivate
Conceptual Output
$ python --version
Python 3.9.6
$ source my_project_env/bin/activate
(my_project_env) $ python --version
Python 3.9.6
(my_project_env) $ pip list | grep pandas
# No output (not installed)
(my_project_env) $ pip install pandas
Successfully installed pandas-1.5.3
(my_project_env) $ pip list | grep pandas
pandas 1.5.3
(my_project_env) $ deactivate
$ pip list | grep pandas
# No output (global environment)
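From inside Python you can check whether a virtual environment is active; a small sketch (in a venv, sys.prefix differs from sys.base_prefix):
import sys

print(f"Interpreter prefix: {sys.prefix}")
print(f"Inside a virtual environment: {sys.prefix != sys.base_prefix}")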
✅ 14. pathlib Module
Simple Explanation
pathlib provides object-oriented file path handling:
- works on Windows, Linux, macOS
- cleaner and safer than using plain strings
- powerful file operations (rglob, mkdir, the / operator)
Code Example
from pathlib import Path
# Create 'data' directory and sample files
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
(data_dir / "sales.csv").touch()
(data_dir / "customers.csv").touch()
(data_dir / "reports").mkdir(exist_ok=True)
(data_dir / "reports" / "summary.csv").touch()
# Recursively find all CSV files
print(f"Searching for CSV files in '{data_dir}' and its subdirectories:")
csv_files = list(data_dir.rglob("*.csv"))
for file_path in csv_files:
print(file_path)
Output
Searching for CSV files in 'data' and its subdirectories:
data/sales.csv
data/customers.csv
data/reports/summary.csv
✅ 15. String Manipulation
Simple Explanation
Pandas provides a .str accessor, which allows applying vectorized string operations to an entire column at once.
For standardizing names:
df['city'].str.title()
Code Example
import pandas as pd
# Create a DataFrame with inconsistent city names
data = {'city': ['new york', 'New York', 'NEW YORK', 'london', 'LONDON', 'Paris']}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Standardize using .str.title()
df['city_standardized'] = df['city'].str.title()
print("\nDataFrame after standardization:")
print(df)
Output
Original DataFrame:
city
0 new york
1 New York
2 NEW YORK
3 london
4 LONDON
5 Paris
DataFrame after standardization:
city city_standardized
0 new york New York
1 New York New York
2 NEW YORK New York
3 london London
4 LONDON London
5 Paris Paris
✅ 16. Array Creation (NumPy)
Simple Explanation
- np.arange() creates a sequence of numbers (like Python's range, but as a NumPy array).
- .reshape() changes the shape of the array, e.g. from 1D → 2D.
Code
import numpy as np
# np.arange(9) creates a 1D array with numbers 0 to 8
arr_1d = np.arange(9)
print("1D array:")
print(arr_1d)
# Convert to a 3x3 matrix
arr_2d = arr_1d.reshape(3, 3)
print("\nReshaped to 3x3 array:")
print(arr_2d)
Output
1D array:
[0 1 2 3 4 5 6 7 8]
Reshaped to 3x3 array:
[[0 1 2]
[3 4 5]
[6 7 8]]
✅ 17. Array Indexing
Simple Explanation
- NumPy uses zero-based indexing.
- Access an element in a 2D array using:
array[row_index, column_index]
Code
import numpy as np
arr = np.array([[10, 20, 30],
[40, 50, 60],
[70, 80, 90]])
print("Original array:")
print(arr)
element = arr[1, 2] # second row, third column → 60
print(f"\nThe element at arr[1, 2] is: {element}")
Output
Original array:
[[10 20 30]
[40 50 60]
[70 80 90]]
The element at arr[1, 2] is: 60
✅ 18. Boolean Indexing
Simple Explanation
- Apply a condition to create a True/False mask.
- Use the mask to filter only the elements that match the condition.
Code
import numpy as np
arr = np.array([1, 5, 10, 15, 20, 25])
print(f"Original array: {arr}")
mean_val = arr.mean()
print(f"Mean: {mean_val}")
mask = arr > mean_val # Boolean mask
print(f"Mask: {mask}")
filtered_arr = arr[mask]
print(f"Filtered array: {filtered_arr}")
Output
Original array: [ 1 5 10 15 20 25]
Mean: 12.666666666666666
Mask: [False False False True True True]
Filtered array: [15 20 25]
✅ 19. Vectorization
Simple Explanation
- Vectorization = doing operations on the entire array at once.
- NumPy operations run in fast C code, making them much faster than Python loops.
Code
import numpy as np
import timeit
python_list = list(range(1_000_000))
numpy_array = np.arange(1_000_000)
def python_loop():
new_list = []
for x in python_list:
new_list.append(x * 2)
return new_list
def numpy_vectorized():
return numpy_array * 2
time_python = timeit.timeit(python_loop, number=10)
time_numpy = timeit.timeit(numpy_vectorized, number=10)
print(f"Python loop: {time_python:.4f} sec")
print(f"NumPy vectorized: {time_numpy:.4f} sec")
print(f"NumPy is {time_python / time_numpy:.0f}x faster")
Output
Python loop: 0.9753 sec
NumPy vectorized: 0.0061 sec
NumPy is 160x faster
✅ 20. Reshaping
Simple Explanation
- .reshape() changes the shape of the array.
- The total number of elements must match.
- Use -1 to let NumPy auto-calculate the remaining dimension.
Code
import numpy as np
arr_1d = np.arange(12)
print("Original 1D array:", arr_1d)
arr_3x4 = arr_1d.reshape(3, 4)
print("\nReshaped to 3x4:")
print(arr_3x4)
arr_4x3 = arr_1d.reshape(4, -1) # NumPy calculates -1 → 3
print("\nReshaped to 4x3 using -1:")
print(arr_4x3)
Output
Original 1D array: [ 0 1 2 3 4 5 6 7 8 9 10 11]
Reshaped to 3x4:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
Reshaped to 4x3 using -1:
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
✅ 21. Array Operations
Simple Explanation:
NumPy allows you to perform element-wise mathematical operations on entire arrays without loops.
If two arrays have the same shape, you can:
- Add them using +
- Multiply them using *
- Subtract them using -
- Divide them using /
These operations happen element by element.
✔ Code Example
import numpy as np
# Create two 2D arrays of the same shape
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
print("Array 1:")
print(arr1)
print("\nArray 2:")
print(arr2)
# Calculate the element-wise sum
sum_arr = arr1 + arr2
print("\nElement-wise sum (arr1 + arr2):")
print(sum_arr)
# Calculate the element-wise product
product_arr = arr1 * arr2
print("\nElement-wise product (arr1 * arr2):")
print(product_arr)
✔ Output
Array 1:
[[1 2]
[3 4]]
Array 2:
[[5 6]
[7 8]]
Element-wise sum (arr1 + arr2):
[[ 6 8]
[10 12]]
Element-wise product (arr1 * arr2):
[[ 5 12]
[21 32]]
🎯 Why is this useful in data science?
- Fast matrix calculations
- Image processing
- Vectorized ML operations
- Neural network computations
✅ 22. Broadcasting
Simple Explanation:
Broadcasting allows NumPy to perform operations on arrays of different shapes by automatically “stretching” the smaller array so both arrays become compatible.
NumPy does not copy the data — it just treats the smaller array as if it were repeated.
✔ When broadcasting happens
An operation like addition works if:
- Dimensions are equal, or
- One of them is 1, so it can be stretched
✔ Code Example
import numpy as np
# Create a 2D array (3 rows, 1 column)
arr_2d = np.array([[1], [2], [3]])
print("2D Array (3x1):")
print(arr_2d)
# Create a 1D array (1 row, 3 columns)
arr_1d = np.array([10, 20, 30])
print("\n1D Array (1x3):")
print(arr_1d)
# Add them together using broadcasting
result = arr_2d + arr_1d
print("\nResult of broadcasting addition (3x3):")
print(result)
✔ Correct Output
2D Array (3x1):
[[1]
[2]
[3]]
1D Array (1x3):
[10 20 30]
Result of broadcasting addition (3x3):
[[11 21 31]
[12 22 32]
[13 23 33]]
🎯 Why Is Broadcasting Useful?
- Eliminates loops
- Makes vectorized operations possible
- Important in machine learning (matrix operations)
- Used in image processing, normalization, scaling
✅ 23. Aggregation Functions
Simple Explanation:
NumPy allows you to compute summary statistics like mean, sum, and standard deviation along a specific axis.
- axis = 0 → operate down the columns (output becomes 1 row)
- axis = 1 → operate across the rows (output becomes 1 column)
✔ Code
import numpy as np
# Create a 2D array
arr = np.array([[1, 8, 3],
[4, 5, 6],
[7, 2, 9]])
print("Original Array:")
print(arr)
# Calculate the mean for each column (axis=0)
col_mean = np.mean(arr, axis=0)
print(f"\nMean of each column (axis=0): {col_mean}")
# Calculate the sum for each row (axis=1)
row_sum = np.sum(arr, axis=1)
print(f"Sum of each row (axis=1): {row_sum}")
# Calculate the standard deviation for each column (axis=0)
col_std = np.std(arr, axis=0)
print(f"Std deviation of each column (axis=0): {col_std}")
✔ Output
Original Array:
[[1 8 3]
[4 5 6]
[7 2 9]]
Mean of each column (axis=0): [4. 5. 6.]
Sum of each row (axis=1): [12 15 18]
Std deviation of each column (axis=0): [2.44948974 2.44948974 2.44948974]
✅ 24. Stacking
Simple Explanation:
Stacking means combining arrays together.
- Vertical stacking (np.vstack) places arrays on top of each other, increasing the number of rows.
- All arrays must have the same number of columns.
✔ Code
import numpy as np
# Create two 2D arrays with the same number of columns
arr1 = np.array([[1, 2, 3],
[4, 5, 6]])
arr2 = np.array([[7, 8, 9],
[10, 11, 12]])
print("Array 1:")
print(arr1)
print("\nArray 2:")
print(arr2)
# Vertically stack the two arrays
stacked_arr = np.vstack((arr1, arr2))
print("\nVertically stacked array:")
print(stacked_arr)
✔ Output
Array 1:
[[1 2 3]
[4 5 6]]
Array 2:
[[ 7 8 9]
[10 11 12]]
Vertically stacked array:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
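Horizontal stacking works the same way with np.hstack (the arrays must have the same number of rows); a small sketch reusing the arrays above:
import numpy as np

arr1 = np.array([[1, 2, 3],
                 [4, 5, 6]])
arr2 = np.array([[7, 8, 9],
                 [10, 11, 12]])

# Place the arrays side by side, increasing the number of columns
hstacked = np.hstack((arr1, arr2))
print(hstacked)
# [[ 1  2  3  7  8  9]
#  [ 4  5  6 10 11 12]]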
✅ 25. Linspace vs. Arange
Simple Explanation
- np.arange(start, stop, step) creates values with a fixed step size.
  ➝ The stop value is NOT included.
- np.linspace(start, stop, num) creates a fixed number of evenly spaced values.
  ➝ The stop value IS included.
✔ Use arange when step matters.
✔ Use linspace when number of points matters.
✔ Code
import numpy as np
# Use np.arange to get even numbers from 0 up to (but not including) 10
# The step is 2.
arr_arange = np.arange(0, 10, 2)
print(f"np.arange(0, 10, 2) -> {arr_arange}")
# Use np.linspace to get 5 points evenly spaced between 0 and 10
# The number of points is 5.
arr_linspace = np.linspace(0, 10, 5)
print(f"np.linspace(0, 10, 5) -> {arr_linspace}")
✔ Output
np.arange(0, 10, 2) -> [0 2 4 6 8]
np.linspace(0, 10, 5) -> [ 0. 2.5 5. 7.5 10. ]
✅ 26. Creating DataFrames
Simple Explanation
You can build a DataFrame in two main ways:
1. From a dictionary of lists
- Keys → column names
- Lists → column values
- All lists must be the same length
2. From a list of dictionaries
- Each dictionary → one row
- Good for JSON-like data
✔ Code
import pandas as pd
# Method 1: From a dictionary of lists
data_dict = {
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['New York', 'Los Angeles', 'Chicago']
}
df_from_dict = pd.DataFrame(data_dict)
print("DataFrame from a dictionary of lists:")
print(df_from_dict)
# Method 2: From a list of dictionaries
data_list = [
{'name': 'David', 'age': 40, 'city': 'Houston'},
{'name': 'Eve', 'age': 28, 'city': 'Phoenix'},
{'name': 'Frank', 'age': 45, 'city': 'Philadelphia'}
]
df_from_list = pd.DataFrame(data_list)
print("\nDataFrame from a list of dictionaries:")
print(df_from_list)
✔ Output
DataFrame from a dictionary of lists:
name age city
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
DataFrame from a list of dictionaries:
name age city
0 David 40 Houston
1 Eve 28 Phoenix
2 Frank 45 Philadelphia
✅ 27. Reading Data
Simple Explanation
pd.read_csv() is the most common way to load data in Pandas.
You can customize:
- sep=';' → if your file uses semicolons instead of commas
- encoding='latin-1' → useful for files with special characters like é, ç, ü
- io.StringIO → allows treating a string as a file (good for demos)
Code
import pandas as pd
import io
# Simulate a CSV file with semicolon separator and special characters
csv_data = """id;name;city
1;José;São Paulo
2;François;Paris
3;Jürgen;Berlin
"""
# Use io.StringIO to treat the string as a file
df = pd.read_csv(io.StringIO(csv_data), sep=';', encoding='latin-1')
print("DataFrame read from a semicolon-separated CSV:")
print(df)
Output
DataFrame read from a semicolon-separated CSV:
id name city
0 1 José São Paulo
1 2 François Paris
2 3 Jürgen Berlin
✅ 28. Inspecting Data
Simple Explanation
| Function | What it Does | Why it is Useful |
|---|---|---|
| df.head(n) | Shows the first n rows | Quick preview of the data |
| df.info() | Shows columns, non-null counts, dtypes | Detect missing values & dtype issues |
| df.describe() | Statistics for numeric columns | Understand distribution & outliers |
Code
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {'product': ['A', 'B', 'C', 'D', 'E'],
'sales': [100, 150, np.nan, 200, 50],
'price': [10.0, 15.0, 12.0, 20.0, 5.0]}
df = pd.DataFrame(data)
print("--- df.head() ---")
print(df.head())
print("\n--- df.info() ---")
df.info()
print("\n--- df.describe() ---")
print(df.describe())
Output
--- df.head() ---
product sales price
0 A 100.0 10.0
1 B 150.0 15.0
2 C NaN 12.0
3 D 200.0 20.0
4 E 50.0 5.0
--- df.info() ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
0 product 5 non-null object
1 sales 4 non-null float64
2 price 5 non-null float64
dtypes: float64(2), object(1)
memory usage: 248.0+ bytes
--- df.describe() ---
sales price
count 4.000000 5.000000
mean 125.000000 12.400000
std 62.915295 5.176872
min 50.000000 5.000000
25% 87.500000 10.000000
50% 125.000000 12.000000
75% 162.500000 15.000000
max 200.000000 20.000000
✅ 29. Selecting Data
Simple Explanation
- df['col'] selects a single column by name → returns a Series.
- df.loc[] (label-based) uses row/column labels. The end of a slice IS inclusive.
- df.iloc[] (integer-based) uses row/column positions. The end of a slice is exclusive.
Code
import pandas as pd
df = pd.DataFrame({'product': ['A', 'B', 'C', 'D'],
'sales': [100, 150, 120, 200],
'price': [10, 15, 12, 20]},
index=['row_one', 'row_two', 'row_three', 'row_four'])
print("Original DataFrame:")
print(df)
# Direct bracket notation to select the 'sales' column
sales_series = df['sales']
print("\nSelecting 'sales' column with df['sales']:")
print(sales_series)
# .loc to select by label (inclusive)
loc_selection = df.loc['row_one':'row_three', 'product':'sales']
print("\nSelecting with df.loc (label-based):")
print(loc_selection)
# .iloc to select by integer position (exclusive)
iloc_selection = df.iloc[0:2, 2]
print("\nSelecting with df.iloc (integer-based):")
print(iloc_selection)
Output
Original DataFrame:
product sales price
row_one A 100 10
row_two B 150 15
row_three C 120 12
row_four D 200 20
Selecting 'sales' column with df['sales']:
row_one 100
row_two 150
row_three 120
row_four 200
Name: sales, dtype: int64
Selecting with df.loc (label-based):
product sales
row_one A 100
row_two B 150
row_three C 120
Selecting with df.iloc (integer-based):
row_one 10
row_two 15
Name: price, dtype: int64
✅ 30. Setting Index
Simple Explanation
- Use df.set_index('column') to make a column the new index.
- Useful when working with:
  - time-series data
  - faster row lookups with .loc
- inplace=True modifies the DataFrame directly.
Code
import pandas as pd
# Create a DataFrame with a 'date' column
df = pd.DataFrame({
'date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'sales': [200, 250, 180],
'product': ['X', 'Y', 'Z']
})
print("Original DataFrame:")
print(df)
# Set the 'date' column as the new index
df.set_index('date', inplace=True)
print("\nDataFrame after setting 'date' as the index:")
print(df)
Output
Original DataFrame:
date sales product
0 2023-01-01 200 X
1 2023-01-02 250 Y
2 2023-01-03 180 Z
DataFrame after setting 'date' as the index:
sales product
date
2023-01-01 200 X
2023-01-02 250 Y
2023-01-03 180 Z
✅ 31. Handling Missing Values
Simple Explanation
- Use df.isnull() to create a boolean DataFrame where True → missing value (NaN) and False → present value.
- Then apply .sum() to count how many True values each column has.
- Since True = 1 and False = 0, the sum gives the number of missing values per column.
Code
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, np.nan, 35, 40],
'city': ['New York', 'Los Angeles', np.nan, 'Chicago'],
'sales': [200, 150, 300, np.nan]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Find the number of missing values in each column
missing_values = df.isnull().sum()
print("\nNumber of missing values in each column:")
print(missing_values)
Output
Original DataFrame:
name age city sales
0 Alice 25.0 New York 200.0
1 Bob NaN Los Angeles 150.0
2 Charlie 35.0 NaN 300.0
3 David 40.0 Chicago NaN
Number of missing values in each column:
name 0
age 1
city 1
sales 1
dtype: int64
✅ 32. Dropping / Filling NaNs
Simple Explanation
- df.dropna() removes rows (or columns) that contain any missing values.
  Use it when missing data is rare and losing a few rows is okay.
- df.fillna(value) replaces missing values with a specified value.
  Common choices: the mean or median for numerical data, the mode for categorical data.
  Use it when you want to keep all rows and handle missing data logically.
Code
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]})
print("Original DataFrame:")
print(df)
# --- dropna: remove any rows with missing values ---
df_dropped = df.dropna()
print("\nDataFrame after dropna():")
print(df_dropped)
# --- fillna: replace NaN with the mean of each column ---
df_filled = df.fillna(df.mean())
print("\nDataFrame after fillna(df.mean()):")
print(df_filled)
Output
Original DataFrame:
A B C
0 1.0 5.0 9
1 2.0 NaN 10
2 NaN NaN 11
3 4.0 8.0 12
DataFrame after dropna():
A B C
0 1.0 5.0 9
3 4.0 8.0 12
DataFrame after fillna(df.mean()):
A B C
0 1.0 5.0 9
1 2.0 6.5 10
2 2.5 6.5 11
3 4.0 8.0 12
✅ 33. Conditional Replacement
Simple Explanation
- Use boolean indexing with .loc to efficiently replace values based on a condition.
- Steps:
  1. Create a condition, e.g., df['inventory'] < 0 → returns a boolean Series.
  2. Use .loc[condition, 'column'] to select only the rows that satisfy the condition.
  3. Assign the new value to those rows.
Code
import pandas as pd
df = pd.DataFrame({'product': ['A', 'B', 'C', 'D'],
'inventory': [50, -10, 120, -5]})
print("Original DataFrame:")
print(df)
# Replace all negative values in the 'inventory' column with 0
df.loc[df['inventory'] < 0, 'inventory'] = 0
print("\nDataFrame after replacing negative values:")
print(df)
Output
Original DataFrame:
product inventory
0 A 50
1 B -10
2 C 120
3 D -5
DataFrame after replacing negative values:
product inventory
0 A 50
1 B 0
2 C 120
3 D 0
✅ 34. Data Types (Currency to Numeric)
Simple Explanation
To convert a currency string like '$1,200.50' to a numeric type:
- Clean the string: remove '$' and ','.
- Convert to numeric: use pd.to_numeric() to make it a float.
Code
import pandas as pd
df = pd.DataFrame({'item': ['Laptop', 'Mouse'],
'price': ['$1,200.50', '$25.00']})
print("Original DataFrame:")
print(df)
print("\nData types:")
print(df.info())
# 1. Remove '$' and ',' using .str.replace()
df['price_cleaned'] = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
# 2. Convert cleaned string column to numeric (float)
df['price_numeric'] = pd.to_numeric(df['price_cleaned'])
print("\nDataFrame after conversion:")
print(df[['item', 'price_numeric']])
print("\nNew data types:")
print(df[['item', 'price_numeric']].info())
Output
Original DataFrame:
item price
0 Laptop $1,200.50
1 Mouse $25.00
Data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 item 2 non-null object
1 price 2 non-null object
dtypes: object(2)
memory usage: 160.0+ bytes
None
DataFrame after conversion:
item price_numeric
0 Laptop 1200.50
1 Mouse 25.00
New data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 item 2 non-null object
1 price_numeric 2 non-null float64
dtypes: float64(1), object(1)
memory usage: 160.0+ bytes
None
✅ 35. Removing Duplicates
Simple Explanation
- Use df.drop_duplicates() to remove duplicate rows.
- Parameters:
  - subset: which columns to consider when identifying duplicates.
  - keep: which duplicate to keep: 'first' → keeps the first occurrence, 'last' → keeps the last occurrence, False → removes all duplicates.
- Useful for cleaning data where repeated entries are not needed.
Code
import pandas as pd
df = pd.DataFrame({
'user_id': [1, 2, 1, 3, 2],
'product_id': ['A', 'B', 'A', 'C', 'B'],
'transaction_id': [101, 102, 103, 104, 105]
})
print("Original DataFrame:")
print(df)
# Remove duplicate rows based on 'user_id' and 'product_id'
# Keep the first occurrence of each duplicate
df_unique = df.drop_duplicates(subset=['user_id', 'product_id'], keep='first')
print("\nDataFrame after removing duplicates based on 'user_id' and 'product_id':")
print(df_unique)
Output
Original DataFrame:
user_id product_id transaction_id
0 1 A 101
1 2 B 102
2 1 A 103
3 3 C 104
4 2 B 105
DataFrame after removing duplicates based on 'user_id' and 'product_id':
user_id product_id transaction_id
0 1 A 101
1 2 B 102
3 3 C 104
✅ 36. Applying Functions
Simple Explanation
- df.apply(func, axis=...): applies a function to columns (axis=0) or rows (axis=1). Useful for operations involving multiple columns/rows.
- series.map(func): applies a function element-wise to a single column. Great for simple transformations or mapping values.
- df.applymap(func): applies a function element-wise to every element in the entire DataFrame. Useful for universal element-wise transformations (renamed to DataFrame.map in newer pandas versions).
Code
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print("Original DataFrame:")
print(df)
# --- df.apply() example: Sum of columns for each row ---
df['row_sum'] = df.apply(lambda row: row.A + row.B, axis=1)
print("\nAfter df.apply() to get row sum:")
print(df)
# --- series.map() example: Map numbers to words ---
number_map = {1: 'one', 2: 'two', 3: 'three'}
df['A_word'] = df['A'].map(number_map)
print("\nAfter series.map() to convert numbers to words:")
print(df)
# --- df.applymap() example: Convert all numbers to strings ---
df_str = df.applymap(str)
print("\nAfter df.applymap() to convert all elements to strings:")
print(df_str)
Output
Original DataFrame:
A B
0 1 4
1 2 5
2 3 6
After df.apply() to get row sum:
A B row_sum
0 1 4 5
1 2 5 7
2 3 6 9
After series.map() to convert numbers to words:
A B row_sum A_word
0 1 4 5 one
1 2 5 7 two
2 3 6 9 three
After df.applymap() to convert all elements to strings:
A B row_sum A_word
0 1 4 5 one
1 2 5 7 two
2 3 6 9 three
37. String Methods
Simple Explanation
- Use the .str accessor on a Pandas Series to apply string operations.
- You can perform operations like .split(), .replace(), .lower(), .upper(), .contains(), etc.
- Example use case: extracting the domain from an email by splitting the string at the '@' symbol.
Code Example
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob'],
'email': ['alice@example.com', 'bob@work-mail.org']
})
print("Original DataFrame:")
print(df)
# --- Extract domain from email ---
# Split the email at '@' and take the second part (index 1)
df['domain'] = df['email'].str.split('@').str[1]
print("\nDataFrame after extracting domain:")
print(df)
Output
Original DataFrame:
name email
0 Alice alice@example.com
1 Bob bob@work-mail.org
DataFrame after extracting domain:
name email domain
0 Alice alice@example.com example.com
1 Bob bob@work-mail.org work-mail.org
💡 Tip:
The .str accessor is powerful for all kinds of string manipulations in a DataFrame column. You can chain multiple string methods like:
df['domain'].str.upper().str.replace('-', '_')
38. GroupBy
Simple Explanation
- GroupBy splits a DataFrame into groups based on some criteria.
- You can apply a function (like sum, mean, count) to each group independently.
- Finally, the results are combined into a new data structure (Series or DataFrame).
Steps:
1. Split the data into groups (.groupby()).
2. Apply an aggregation or transformation (.sum(), .mean(), etc.).
3. Combine the results into a new object.
Code Example
import pandas as pd
# Create a sample sales DataFrame
data = {
'region': ['East', 'West', 'East', 'West', 'East', 'West'],
'product': ['A', 'B', 'A', 'C', 'B', 'C'],
'sales': [100, 150, 120, 200, 80, 250]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# --- Group by 'region' and calculate total sales ---
total_sales_by_region = df.groupby('region')['sales'].sum()
print("\nTotal sales for each region:")
print(total_sales_by_region)
Output
Original DataFrame:
region product sales
0 East A 100
1 West B 150
2 East A 120
3 West C 200
4 East B 80
5 West C 250
Total sales for each region:
region
East 300
West 600
Name: sales, dtype: int64
💡 Tip:
You can also group by multiple columns and use different aggregation functions:
df.groupby(['region', 'product'])['sales'].mean()
39. Aggregations (Most Common Product)
Simple Explanation
- When working with grouped data, you often want to know which item appears most frequently in each group.
- Steps:
  1. Use .groupby() to group by a column (e.g., region).
  2. Use .apply() with a custom function on the grouped column.
  3. Inside the function: .value_counts() counts occurrences of each unique value, and .idxmax() returns the value with the highest count.
Code Example
import pandas as pd
df = pd.DataFrame({
'region': ['East', 'West', 'East', 'West', 'East', 'West', 'East'],
'product': ['A', 'B', 'A', 'C', 'B', 'C', 'A'] # Product 'A' is most common in East
})
print("Original DataFrame:")
print(df)
# Group by 'region' and find the most frequent product in each group
most_common_product = df.groupby('region')['product'].apply(lambda x: x.value_counts().idxmax())
print("\nMost common product in each region:")
print(most_common_product)
Output
Original DataFrame:
region product
0 East A
1 West B
2 East A
3 West C
4 East B
5 West C
6 East A
Most common product in each region:
region
East A
West C
Name: product, dtype: object
💡 Tip:
- You can also use .agg() with a lambda for more complex summaries; an equivalent one-liner for the most common product is:
df.groupby('region')['product'].agg(lambda x: x.value_counts().idxmax())
- To get both the most common product and its count, see the sketch below.
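One way to return both values at once is named aggregation with two lambdas; a minimal sketch reusing the df defined above:
top_products = df.groupby('region')['product'].agg(
    most_common=lambda x: x.value_counts().idxmax(),
    count=lambda x: x.value_counts().max()
)
print(top_products)
#        most_common  count
# region
# East             A      3
# West             C      2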
40. Pivot Table
Simple Explanation
- A pivot table reshapes data to summarize it.
- You select:
- Index (rows) → unique values of one column.
- Columns → unique values of another column.
- Values → data to fill in the table.
- You can also specify an aggregation function (aggfunc) like mean, sum, count.
Code Example
import pandas as pd
data = {'region': ['East', 'West', 'East', 'West', 'East', 'West'],
'product': ['A', 'B', 'A', 'C', 'B', 'C'],
'sales': [100, 150, 120, 200, 80, 250]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Create a pivot table
pivot = pd.pivot_table(df,
index='region', # Rows
columns='product', # Columns
values='sales', # Values to fill
aggfunc='mean') # Aggregation function
print("\nPivot table of average sales:")
print(pivot)
Output
Original DataFrame:
region product sales
0 East A 100
1 West B 150
2 East A 120
3 West C 200
4 East B 80
5 West C 250
Pivot table of average sales:
product A B C
region
East 110.0 80.0 NaN
West NaN 150.0 225.0
💡 Tip:
- NaN means no data exists for that combination (e.g., East has no C sales in this example). You can fill these gaps with fill_value, as sketched below.
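A minimal sketch of filling those gaps, reusing the df defined above (fill_value is a standard pd.pivot_table parameter):
pivot_filled = pd.pivot_table(df,
                              index='region',
                              columns='product',
                              values='sales',
                              aggfunc='mean',
                              fill_value=0)
print(pivot_filled)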
41. Melting Data
Simple Explanation
- Melting converts a DataFrame from wide to long format.
- Multiple columns are combined into two columns:
  - variable → original column names.
  - value → values from those columns.
- Useful for plotting or reshaping data for analysis.
Code Example
import pandas as pd
# Wide format DataFrame
df_wide = pd.DataFrame({
'student': ['Alice', 'Bob'],
'math_score': [90, 85],
'english_score': [95, 80]
})
print("Original WIDE format DataFrame:")
print(df_wide)
# Melt to long format
df_long = pd.melt(df_wide,
id_vars=['student'], # Column to keep
value_vars=['math_score', 'english_score'], # Columns to unpivot
var_name='subject', # Name of new 'variable' column
value_name='score') # Name of new 'value' column
print("\nMelted to LONG format DataFrame:")
print(df_long)
Output
Original WIDE format DataFrame:
student math_score english_score
0 Alice 90 95
1 Bob 85 80
Melted to LONG format DataFrame:
student subject score
0 Alice math_score 90
1 Bob math_score 85
2 Alice english_score 95
3 Bob english_score 80
💡 Tip:
- Melting is the opposite of a pivot table. After melting, you can easily group, aggregate, or plot the long-format data.
42. Merging DataFrames
Simple Explanation
- Merging combines two DataFrames based on a common key column.
- Common join types:
| Join Type | Description |
|---|---|
| Inner | Keeps only rows where the key exists in both DataFrames. |
| Left | Keeps all rows from the left DataFrame and adds matching rows from the right. Non-matches become NaN. |
| Right | Keeps all rows from the right DataFrame and adds matching rows from the left. Non-matches become NaN. |
| Outer | Keeps all rows from both DataFrames. Non-matches on either side become NaN. |
Code Example
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'city': ['New York', 'Chicago', 'Houston']})
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
# Inner Join
inner_join = pd.merge(df1, df2, on='id', how='inner')
print("\n--- Inner Join ---")
print(inner_join)
# Left Join
left_join = pd.merge(df1, df2, on='id', how='left')
print("\n--- Left Join ---")
print(left_join)
# Outer Join
outer_join = pd.merge(df1, df2, on='id', how='outer')
print("\n--- Outer Join ---")
print(outer_join)
Output
DataFrame 1:
id name
0 1 Alice
1 2 Bob
2 3 Charlie
DataFrame 2:
id city
0 2 New York
1 3 Chicago
2 4 Houston
--- Inner Join ---
id name city
0 2 Bob New York
1 3 Charlie Chicago
--- Left Join ---
id name city
0 1 Alice NaN
1 2 Bob New York
2 3 Charlie Chicago
--- Outer Join ---
id name city
0 1 Alice NaN
1 2 Bob New York
2 3 Charlie Chicago
3 4 NaN Houston
💡 Tips:
- Always specify the key column (on='id') for clarity.
- Choose the join type depending on whether you want to keep unmatched rows.
- A right join works like a left join but keeps all rows from the right DataFrame (see the sketch below).
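A minimal sketch of the right join, reusing df1 and df2 from above (ids missing in df1 get NaN in the 'name' column):
right_join = pd.merge(df1, df2, on='id', how='right')
print("\n--- Right Join ---")
print(right_join)
#    id     name      city
# 0   2      Bob  New York
# 1   3  Charlie   Chicago
# 2   4      NaN   Houston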
43. Concatenation
Simple Explanation
Concatenation stacks DataFrames either vertically or horizontally:
| Type | Description |
|---|---|
| Vertical (axis=0) | Stacks DataFrames on top of each other. Columns must match. Indexes may repeat. |
| Horizontal (axis=1) | Stacks DataFrames side by side. Indexes must match. Columns can be different. |
Code Example
import pandas as pd
# DataFrames for vertical stacking
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
# DataFrame for horizontal stacking
df3 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=[0, 1])
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
# --- Vertical Stacking ---
vertical_stack = pd.concat([df1, df2])
print("\n--- Vertically Stacked ---")
print(vertical_stack)
# --- Horizontal Stacking ---
horizontal_stack = pd.concat([df1, df3], axis=1)
print("\n--- Horizontally Stacked ---")
print(horizontal_stack)
Output
DataFrame 1:
A B
0 A0 B0
1 A1 B1
DataFrame 2:
A B
0 A2 B2
1 A3 B3
--- Vertically Stacked ---
A B
0 A0 B0
1 A1 B1
0 A2 B2
1 A3 B3
--- Horizontally Stacked ---
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
💡 Tips:
- For vertical stacking, mismatched columns will create NaN for the missing columns.
- For horizontal stacking, mismatched indexes will create NaN for the missing rows.
- pd.concat is very flexible; you can also use ignore_index=True to reindex vertically stacked DataFrames, as shown below.
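A minimal sketch of reindexing while stacking, reusing df1 and df2 from above:
vertical_reindexed = pd.concat([df1, df2], ignore_index=True)
print(vertical_reindexed)
#     A   B
# 0  A0  B0
# 1  A1  B1
# 2  A2  B2
# 3  A3  B3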
44. Cross Tabulation
Simple Explanation
pd.crosstab() creates a frequency table that counts how often combinations of two (or more) categorical variables occur. It’s very useful for understanding relationships between categories.
Code Example
import pandas as pd
df = pd.DataFrame({
'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'],
'Preference': ['A', 'B', 'A', 'A', 'B', 'B']
})
print("Original DataFrame:")
print(df)
# Create a frequency table of Gender vs. Preference
cross_tab = pd.crosstab(df['Gender'], df['Preference'])
print("\nCross-tabulation (Frequency Table):")
print(cross_tab)
Output
Original DataFrame:
Gender Preference
0 Male A
1 Female B
2 Female A
3 Male A
4 Male B
5 Female B
Cross-tabulation (Frequency Table):
Preference A B
Gender
Female 1 2
Male 2 1
Tips
- You can add margins=True to see row and column totals:
pd.crosstab(df['Gender'], df['Preference'], margins=True)
- It works with more than two variables by passing multiple Series.
- You can aggregate numeric data instead of just counting frequencies by using the values and aggfunc parameters, as shown below.
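A minimal sketch of aggregating instead of counting, reusing the df above (the 'spend' numbers are made up for illustration):
df['spend'] = [120, 80, 150, 90, 60, 110]
avg_spend = pd.crosstab(df['Gender'], df['Preference'],
                        values=df['spend'], aggfunc='mean')
print(avg_spend)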
45. Datetime Conversion
Simple Explanation
To work effectively with dates in Pandas, you need to convert date strings to datetime objects. This allows for easy comparison, filtering, and time-based operations.
Code Example
import pandas as pd
# Create a DataFrame with a date column as strings
df = pd.DataFrame({
'date': ['2023-10-27', '2023-10-28', '2023-10-29'],
'sales': [200, 250, 180]
})
print("Original DataFrame:")
print(df)
print("\nData type of 'date' column:", df['date'].dtype)
# Convert the 'date' column to datetime objects
df['date'] = pd.to_datetime(df['date'])
print("\nDataFrame after conversion:")
print(df)
print("\nNew data type of 'date' column:", df['date'].dtype)
Output
Original DataFrame:
date sales
0 2023-10-27 200
1 2023-10-28 250
2 2023-10-29 180
Data type of 'date' column: object
DataFrame after conversion:
date sales
0 2023-10-27 200
1 2023-10-28 250
2 2023-10-29 180
New data type of 'date' column: datetime64[ns]
46. Time-based Filtering
Simple Explanation
Once the date column is converted and set as the index, you can filter by year, month, or date range using partial string indexing with .loc. This is extremely useful for time series data analysis.
Code Example
import pandas as pd
# Create a DataFrame with a DatetimeIndex
date_rng = pd.date_range(start='2022-01-01', end='2023-10-30', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = range(len(df))
df.set_index('date', inplace=True)
print("Original DataFrame (head):")
print(df.head())
# Select all data from the year 2022
data_2022 = df.loc['2022']
print("\nData from the year 2022 (head):")
print(data_2022.head())
# Select data from a specific month
data_jan_2022 = df.loc['2022-01']
print("\nData from January 2022 (head):")
print(data_jan_2022.head())
Output
Original DataFrame (head):
data
date
2022-01-01 0
2022-01-02 1
2022-01-03 2
2022-01-04 3
2022-01-05 4
Data from the year 2022 (head):
data
date
2022-01-01 0
2022-01-02 1
2022-01-03 2
2022-01-04 3
2022-01-05 4
Data from January 2022 (head):
data
date
2022-01-01 0
2022-01-02 1
2022-01-03 2
2022-01-04 3
2022-01-05 4
Tips
- After conversion, you can easily extract parts of the date:
df['year'] = df.index.year
df['month'] = df.index.month
df['day'] = df.index.day
- Combine filtering with conditions:
df.loc['2022-03':'2022-06'] # Data from March to June 2022
47. Resampling
Simple Explanation
Resampling is used to change the frequency of time series data:
- Downsampling: reduce the frequency (e.g., daily → monthly) by aggregating values (mean, sum, max, etc.).
- Upsampling: increase the frequency (e.g., monthly → daily) by filling or interpolating missing values.
The .resample() method is used on a DatetimeIndex and requires an aggregation function for downsampling.
Code Example
import pandas as pd
import numpy as np
# Create a DataFrame with daily stock prices
date_rng = pd.date_range(start='2023-01-01', periods=90, freq='D')
df_daily = pd.DataFrame(date_rng, columns=['date'])
df_daily['price'] = np.random.randint(100, 150, size=len(date_rng))
df_daily.set_index('date', inplace=True)
print("Original daily data (head):")
print(df_daily.head())
# Downsample daily data to monthly average price
df_monthly = df_daily['price'].resample('M').mean()
print("\nResampled monthly average price:")
print(df_monthly)
Output Example
Original daily data (head):
price
date
2023-01-01 106
2023-01-02 129
2023-01-03 108
2023-01-04 112
2023-01-05 119
Resampled monthly average price:
date
2023-01-31 124.03
2023-02-28 126.61
2023-03-31 124.64
Name: price, dtype: float64
Tips
- Common frequency codes for .resample(): 'D' → daily, 'W' → weekly, 'M' → month-end, 'Q' → quarter-end, 'Y' → year-end (newer pandas versions spell month-end as 'ME').
- Example of upsampling with forward fill:
df_monthly_upsampled = df_monthly.resample('D').ffill()
- You can combine .resample() with any aggregation function:
df_daily['price'].resample('W').max() # Weekly maximum
48. Rolling Windows
Simple Explanation
A rolling window performs calculations over a fixed-size “window” of consecutive data points that moves across the time series:
- Each window contains a subset of consecutive rows.
- Common uses: moving averages, rolling sums, min/max, standard deviation, etc.
- Example: a 7-day rolling average calculates the average of the current day plus the previous 6 days, then slides forward one day and repeats.
Code Example
import pandas as pd
# Daily sales data
df = pd.DataFrame({
'date': pd.date_range(start='2023-01-01', periods=15, freq='D'),
'sales': [10, 20, 15, 30, 25, 40, 35, 50, 45, 60, 55, 70, 65, 80, 75]
})
df.set_index('date', inplace=True)
print("Original daily sales:")
print(df)
# Calculate a 7-day rolling average
df['7_day_rolling_avg'] = df['sales'].rolling(window=7).mean()
print("\nDataFrame with 7-day rolling average:")
print(df)
Output Example
Original daily sales:
sales
date
2023-01-01 10
2023-01-02 20
2023-01-03 15
2023-01-04 30
2023-01-05 25
2023-01-06 40
2023-01-07 35
2023-01-08 50
2023-01-09 45
2023-01-10 60
2023-01-11 55
2023-01-12 70
2023-01-13 65
2023-01-14 80
2023-01-15 75
DataFrame with 7-day rolling average:
sales 7_day_rolling_avg
date
2023-01-01 10 NaN
2023-01-02 20 NaN
2023-01-03 15 NaN
2023-01-04 30 NaN
2023-01-05 25 NaN
2023-01-06 40 NaN
2023-01-07 35 25.0
2023-01-08 50 30.0
2023-01-09 45 35.0
2023-01-10 60 40.0
2023-01-11 55 45.0
2023-01-12 70 50.0
2023-01-13 65 55.0
2023-01-14 80 60.0
2023-01-15 75 65.0
Tips
- Window size determines how many rows are included in the calculation.
- The first few rows will often be NaN because the window isn’t full yet.
- You can compute other functions like:
df['7_day_rolling_sum'] = df['sales'].rolling(window=7).sum()
df['7_day_rolling_std'] = df['sales'].rolling(window=7).std()
- Works well for smoothing noisy time series data.
49. Time Deltas
Simple Explanation
A Time Delta represents the difference between two dates or times. In Pandas:
- Subtracting two datetime columns gives a Timedelta Series.
- You can extract useful information such as:
  - .dt.days → the number of whole days
  - .dt.seconds → the seconds component of each timedelta (not the total)
  - .dt.total_seconds() → total duration in seconds
This is very useful for calculating durations, age, or elapsed time between events.
Code Example
import pandas as pd
# Create a DataFrame with two datetime columns
df = pd.DataFrame({
'start_date': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-20']),
'end_date': pd.to_datetime(['2023-01-10', '2023-03-01', '2023-04-01'])
})
print("Original DataFrame:")
print(df)
# Calculate time difference
df['time_delta'] = df['end_date'] - df['start_date']
# Extract difference in days
df['days_difference'] = df['time_delta'].dt.days
print("\nDataFrame with time difference in days:")
print(df)
Output Example
Original DataFrame:
start_date end_date
0 2023-01-01 2023-01-10
1 2023-02-15 2023-03-01
2 2023-03-20 2023-04-01
DataFrame with time difference in days:
start_date end_date time_delta days_difference
0 2023-01-01 2023-01-10 9 days 9
1 2023-02-15 2023-03-01 14 days 14
2 2023-03-20 2023-04-01 12 days 12
Tips
- Timedelta can also be used for arithmetic with dates, e.g., adding or subtracting days:
df['new_date'] = df['start_date'] + pd.Timedelta(days=7)
- Works with hours, minutes, seconds, e.g., pd.Timedelta(hours=5).
- Ideal for time-based calculations like SLAs, ages, or subscription durations.
50. Matplotlib Basics (Line Plot)
Simple Explanation
- Line plots visualize trends over a continuous variable (like time).
- Steps:
- Import
matplotlib.pyplot. - Prepare
xandydata. - Use
plt.plot(x, y). - Add labels, title, and show the plot.
- Import
Code Example
import matplotlib.pyplot as plt
# Sample data
years = [2018, 2019, 2020, 2021, 2022]
sales = [15000, 18000, 16000, 22000, 27000]
# Line plot
plt.plot(years, sales)
# Add title and labels
plt.title("Yearly Sales")
plt.xlabel("Year")
plt.ylabel("Sales ($)")
# Display
plt.show()
Output:
A line chart showing sales trends over years with proper labels.
51. Scatter Plot
Simple Explanation
- Scatter plots visualize the relationship or correlation between two numerical variables.
- Use plt.scatter(x, y).
Code Example
import matplotlib.pyplot as plt
# Sample data
age = [25, 30, 35, 40, 45, 50, 55, 60]
income = [40000, 55000, 60000, 75000, 90000, 110000, 95000, 120000]
# Scatter plot
plt.scatter(age, income)
# Add title and labels
plt.title("Age vs. Income")
plt.xlabel("Age")
plt.ylabel("Annual Income ($)")
# Display
plt.show()
Output:
A scatter plot showing the relationship between age and income (generally positive correlation).
52. Histogram
Simple Explanation
- Histograms show the distribution of a single numerical variable.
- Use plt.hist(data, bins=number_of_bins) to control granularity.
Code Example
import matplotlib.pyplot as plt
import numpy as np
# Sample data: ages of 100 customers
customer_age = np.random.randint(18, 70, size=100)
# Histogram
plt.hist(customer_age, bins=10, edgecolor='black')
# Add title and labels
plt.title("Distribution of Customer Age")
plt.xlabel("Age")
plt.ylabel("Number of Customers")
# Display
plt.show()
Output:
Histogram showing how customer ages are distributed across 10 bins.
53. Seaborn vs. Matplotlib
Simple Explanation
- Matplotlib: Foundational plotting library. Very flexible, but can require a lot of code for polished plots.
- Seaborn: Built on top of Matplotlib. Simplifies statistical visualization and improves aesthetics.
Key Advantages of Seaborn
- Better Aesthetics: Beautiful default styles.
- Works with DataFrames: directly use column names from a DataFrame, e.g. sns.scatterplot(x='col1', y='col2', data=df).
- Complex Plots Made Easy: box plots, violin plots, heatmaps, pair plots, etc., are simpler to create than with Matplotlib alone. A short side-by-side sketch follows below.
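A short side-by-side sketch (the age/income data is made up for illustration):
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'age': [25, 30, 35, 40, 45, 50],
                   'income': [40000, 55000, 60000, 75000, 90000, 110000]})

# Matplotlib: pass raw columns and label the axes yourself
plt.scatter(df['age'], df['income'])
plt.xlabel('age')
plt.ylabel('income')
plt.title('Matplotlib scatter')
plt.show()

# Seaborn: pass the DataFrame and column names; axis labels come from the columns
sns.scatterplot(x='age', y='income', data=df)
plt.title('Seaborn scatter')
plt.show()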
54. Box Plot
Simple Explanation
- Box plots show the distribution of a numerical variable across categories.
- Displays median, quartiles, and outliers.
- Seaborn makes it simple with sns.boxplot().
Code Example
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample data
data = {'department': ['HR', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT', 'Sales'],
'salary': [60000, 90000, 75000, 65000, 110000, 120000, 62000, 95000, 80000]}
df = pd.DataFrame(data)
# Create the box plot
sns.boxplot(x='department', y='salary', data=df)
# Add title
plt.title("Salary Distribution by Department")
# Display
plt.show()
Output:
Three boxes (one per department) showing median, quartiles, and outliers.
55. Heatmap
Simple Explanation
- A heatmap visualizes values in a matrix using colors.
- Often used for correlation matrices.
- Use df.corr() to compute correlations, then sns.heatmap() to visualize them.
Code Example
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample data
df = pd.DataFrame({'age': [25, 30, 35, 40],
'income': [50000, 60000, 75000, 90000],
'score': [85, 88, 92, 95]})
# Compute correlation matrix
correlation_matrix = df.corr()
# Create heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
# Add title
plt.title("Correlation Matrix Heatmap")
# Display
plt.show()
Output:
A 3×3 heatmap showing correlations, with values annotated in each cell and color-coded.
56. Subplots
Simple Explanation
- plt.subplots() lets you create multiple plots in a single figure.
- It returns a figure object (fig) and axes object(s) (ax) for plotting on each subplot individually.
- Use ax[i] to access a specific subplot.
- figsize controls the overall figure size; plt.tight_layout() prevents overlapping elements.
Code Example
import matplotlib.pyplot as plt
# Sample data
years = [2018, 2019, 2020, 2021, 2022]
sales = [150, 180, 160, 220, 270]
profit = [20, 35, 15, 50, 65]
# Create a figure with 1 row and 2 columns of subplots
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
# Plot on the first subplot
ax[0].plot(years, sales)
ax[0].set_title('Yearly Sales')
ax[0].set_xlabel('Year')
ax[0].set_ylabel('Sales ($)')
# Plot on the second subplot
ax[1].bar(years, profit, color='green')
ax[1].set_title('Yearly Profit')
ax[1].set_xlabel('Year')
ax[1].set_ylabel('Profit ($)')
# Adjust layout and display
plt.tight_layout()
plt.show()
Output:
A single figure with two plots side-by-side:
- Left: line plot for sales
- Right: bar chart for profit
57. Customization
Simple Explanation
- Add titles and axis labels using:
  - plt.title() → plot title
  - plt.xlabel() → x-axis label
  - plt.ylabel() → y-axis label
- You can also change colors, markers, line styles, and fonts for further customization.
Code Example
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y, marker='o', linestyle='--', color='orange')
# Add title and labels
plt.title("My Simple Plot")
plt.xlabel("X-axis Label")
plt.ylabel("Y-axis Label")
# Display the plot
plt.show()
Output:
A line plot with a title, labeled axes, and custom markers and line style.
58. Saving Plots
Simple Explanation
- After creating a Matplotlib plot, save it to a file using plt.savefig().
- Call plt.savefig() before plt.show().
- The file format is determined by the extension: .png, .pdf, .svg, etc.
Code Example
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y)
plt.title("Plot to be Saved")
# Save the figure
plt.savefig("my_plot.png") # PNG file
# plt.savefig("my_plot.pdf") # PDF file
plt.show()
Output:
- The plot is displayed.
- A file my_plot.png is created in the current directory.
59. Interactive Plots
Simple Explanation
- Interactive plots allow hovering, zooming, panning, and selection, which is great for dashboards or web apps.
- Use libraries like Plotly for interactive charts.
- Static plots (Matplotlib/Seaborn) are better for reports, while interactive ones are for user exploration.
Example Scenarios:
- Hover over a stock price line to see exact values.
- Zoom into a one-month period in a time series.
- Compare multiple categories dynamically.
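Code Example (a minimal sketch using Plotly Express; assumes the plotly package is installed)
import plotly.express as px
# Sample data
years = [2018, 2019, 2020, 2021, 2022]
sales = [15000, 18000, 16000, 22000, 27000]
# Interactive line chart: hovering shows exact values; the toolbar supports zoom and pan
fig = px.line(x=years, y=sales, labels={'x': 'Year', 'y': 'Sales ($)'},
              title='Yearly Sales (Interactive)')
fig.show()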
60. Categorical Plot (Count Plot)
Simple Explanation
- Count plots are like histograms for categorical variables.
- Shows how many times each category appears in your data.
- Use Seaborn's sns.countplot().
Code Example
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample data
data = {'product_type': ['Electronics', 'Clothing', 'Books', 'Electronics', 'Clothing', 'Electronics']}
df = pd.DataFrame(data)
# Create count plot
sns.countplot(x='product_type', data=df)
plt.title("Frequency of Product Types")
plt.show()
Output:
- Bar chart showing:
- Electronics → 3
- Clothing → 2
- Books → 1
61. Train-Test Split
Simple Explanation
- The goal is to split your dataset into two parts:
- Training set: Used to train your model.
- Testing set: Used to evaluate the model on unseen data.
- This prevents overfitting and checks if your model generalizes well.
- test_size controls the fraction of data used for testing.
- random_state ensures reproducibility.
Code Example
import pandas as pd
from sklearn.model_selection import train_test_split
# Sample features (X) and target (y)
X = pd.DataFrame({
'age': [25, 30, 35, 40, 45],
'salary': [50000, 60000, 75000, 90000, 110000]
})
y = pd.Series([0, 0, 1, 1, 1]) # 0: No Purchase, 1: Purchase
print("Original Features (X):")
print(X)
print("\nOriginal Target (y):")
print(y)
# Split into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("Shapes:")
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}, y_test: {y_test.shape}")
print("\nX_train:")
print(X_train)
Output Example
Original Features (X):
age salary
0 25 50000
1 30 60000
2 35 75000
3 40 90000
4 45 110000
Original Target (y):
0 0
1 0
2 1
3 1
4 1
dtype: int64
Shapes:
X_train: (4, 2), X_test: (1, 2)
y_train: (4,), y_test: (1,)
X_train:
age salary
4 45 110000
2 35 75000
0 25 50000
3 40 90000
✅ Key Points
- train_test_split() shuffles the data by default.
- test_size=0.2 → 20% of the data is used for testing.
- random_state=42 ensures the split is the same every time.
- The training set is used to fit the model, the testing set to evaluate performance.
62. Feature Scaling
Simple Explanation
- Feature scaling ensures that numerical variables are on the same scale, which improves performance for many machine learning algorithms.
- Two common scalers:
- StandardScaler
- Rescales data to have mean = 0 and standard deviation = 1.
- Useful for algorithms that assume normal distribution (e.g., Linear Regression, SVM, Logistic Regression).
- MinMaxScaler
- Rescales data to a fixed range, usually [0, 1].
- Useful for algorithms sensitive to magnitude (e.g., Neural Networks) or when preserving sparsity is important.
Code Example
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Sample data with features on different scales
data = {'age': [25, 30, 35, 40], 'income': [50000, 60000, 75000, 90000]}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# --- StandardScaler ---
scaler_standard = StandardScaler()
df_standardized = pd.DataFrame(scaler_standard.fit_transform(df), columns=df.columns)
print("\nAfter StandardScaler (mean=0, std=1):")
print(df_standardized)
# --- MinMaxScaler ---
scaler_minmax = MinMaxScaler()
df_normalized = pd.DataFrame(scaler_minmax.fit_transform(df), columns=df.columns)
print("\nAfter MinMaxScaler (range=[0, 1]):")
print(df_normalized)
Output
Original Data:
age income
0 25 50000
1 30 60000
2 35 75000
3 40 90000
After StandardScaler (mean=0, std=1):
age income
0 -1.341641 -1.237179
1 -0.447214 -0.577350
2 0.447214 0.412393
3 1.341641 1.402137
After MinMaxScaler (range=[0, 1]):
age income
0 0.000000 0.000
1 0.333333 0.250
2 0.666667 0.625
3 1.000000 1.000
✅ Key Points
- Scaling helps models converge faster and improves accuracy.
- Always fit the scaler on the training set and then transform both training and testing sets to avoid data leakage.
- MinMaxScaler is very sensitive to outliers (one extreme value compresses the rest of the range); StandardScaler is somewhat more robust, and RobustScaler is a better choice when outliers are severe.
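Code Example (a minimal sketch of the fit-on-train / transform-both pattern; the data here is made up)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Illustrative data
X = pd.DataFrame({'age': [25, 30, 35, 40, 45, 50],
                  'income': [50000, 60000, 75000, 90000, 110000, 120000]})
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training set only
X_test_scaled = scaler.transform(X_test)        # reuse those parameters on the test set
print(X_train_scaled)
print(X_test_scaled)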
63. Encoding Categorical Variables
Simple Explanation
- Many machine learning algorithms can only handle numerical input.
- One-Hot Encoding converts a categorical column into multiple binary columns, one for each category.
- Example: For a city column with values 'New York', 'London', 'Paris':
  - New columns created: city_New York, city_London, city_Paris
  - Each row gets a 1 in the column corresponding to its category and 0 elsewhere.
Code Example
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample DataFrame with a categorical feature
df = pd.DataFrame({'city': ['New York', 'London', 'New York', 'Paris', 'London']})
print("Original DataFrame:")
print(df)
# Initialize OneHotEncoder
# sparse_output=False makes the output a dense NumPy array
encoder = OneHotEncoder(sparse_output=False)
# Fit and transform the data
onehot_encoded = encoder.fit_transform(df[['city']])
# Create a new DataFrame with the encoded columns
encoded_df = pd.DataFrame(onehot_encoded, columns=encoder.get_feature_names_out(['city']))
print("\nOne-Hot Encoded DataFrame:")
print(encoded_df)
Output
Original DataFrame:
city
0 New York
1 London
2 New York
3 Paris
4 London
One-Hot Encoded DataFrame:
city_London city_New York city_Paris
0 0.0 1.0 0.0
1 1.0 0.0 0.0
2 0.0 1.0 0.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
✅ Key Points
- Each categorical value is now represented numerically without imposing any order.
- One-Hot Encoding is ideal for nominal variables (no natural order).
- For ordinal variables (like 'Low' < 'Medium' < 'High'), use Label Encoding instead.
64. Label Encoding
Simple Explanation
- Label Encoding converts each category in a column into a unique integer.
- Example: A size column with values 'S', 'M', 'L' might be encoded as:
  - 'L' → 0
  - 'M' → 1
  - 'S' → 2
- Best for ordinal features where there is a natural order.
- Not recommended for nominal features (like city names), because numbers imply a ranking that does not exist.
Code Example
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample DataFrame with an ordinal feature
df = pd.DataFrame({'size': ['S', 'M', 'L', 'S', 'M']})
print("Original DataFrame:")
print(df)
# Initialize LabelEncoder
encoder = LabelEncoder()
# Fit and transform the 'size' column
df['size_encoded'] = encoder.fit_transform(df['size'])
print("\nDataFrame after Label Encoding:")
print(df)
# Show mapping of categories to numbers
print("\nMapping of categories to numbers:")
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
Output
Original DataFrame:
size
0 S
1 M
2 L
3 S
4 M
DataFrame after Label Encoding:
size size_encoded
0 S 2
1 M 1
2 L 0
3 S 2
4 M 1
Mapping of categories to numbers:
{'L': 0, 'M': 1, 'S': 2}
✅ Key Points
- Preserves order in ordinal features.
- Reduces dimensionality compared to one-hot encoding.
- Be careful: using label encoding on nominal features can mislead models that assume numerical order.
65. Pipeline
Simple Explanation
- A Scikit-learn Pipeline chains multiple steps together, such as preprocessing (scaling, encoding) and model training.
- Benefits:
- Prevents data leakage: ensures transformations are learned only from the training set.
- Simplifies workflow: treat preprocessing + model as a single object.
- Reduces errors: avoids manually applying transformations to test data.
Code Example
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# Sample data with numerical and categorical features
X = pd.DataFrame({
'age': [25, 30, 35, 40],
'city': ['New York', 'London', 'New York', 'Paris']
})
y = pd.Series([0, 1, 0, 1])
# Define column types
numerical_features = ['age']
categorical_features = ['city']
# Preprocessor: scale numeric, one-hot encode categorical
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numerical_features),
('cat', OneHotEncoder(), categorical_features)
])
# Pipeline: preprocessing + logistic regression
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression())
])
print("Pipeline steps:")
print(pipeline)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Fit pipeline
pipeline.fit(X_train, y_train)
# Make predictions
predictions = pipeline.predict(X_test)
print(f"\nPredictions on test set: {predictions}")
Output
Pipeline steps:
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num', StandardScaler(),
['age']),
('cat', OneHotEncoder(),
['city'])])),
('classifier', LogisticRegression())])
Predictions on test set: [1]
✅ Key Points
- The pipeline automatically applies preprocessing to any new data, including the test set.
- Useful for cross-validation, hyperparameter tuning, or deploying models.
- You can easily replace steps, e.g., change LogisticRegression() to RandomForestClassifier() without rewriting preprocessing.
66. Handling Imbalanced Data
Simple Explanation
- Imbalanced data occurs when one class dominates the target variable, e.g., fraud detection with 99% non-fraud vs 1% fraud.
- Problems:
- The model may always predict the majority class, achieving high accuracy but poor real performance.
- Common Solutions:
- SMOTE (Synthetic Minority Over-sampling Technique):
- Generates synthetic samples of the minority class.
- Helps the model learn a better decision boundary instead of just duplicating existing points.
- Class Weights:
- Assigns higher penalty to misclassifying minority class.
- Some models (e.g., Logistic Regression, Random Forest) support class_weight='balanced'.
Code Example
import pandas as pd
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
# Create an imbalanced dataset
X_imb = pd.DataFrame({'feature': range(100)})
y_imb = pd.Series([0]*95 + [1]*5) # 95 zeros, 5 ones
print(f"Original class distribution: {Counter(y_imb)}")
# --- Technique 1: SMOTE ---
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_imb, y_imb)
print(f"Class distribution after SMOTE: {Counter(y_resampled)}")
# --- Technique 2: Class Weights ---
# Logistic Regression with class weights
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_imb, y_imb)
print("\nModel trained with class_weight='balanced'.")
print("This technique adjusts the model internally without changing the data.")
Output
Original class distribution: Counter({0: 95, 1: 5})
Class distribution after SMOTE: Counter({0: 95, 1: 95})
Model trained with class_weight='balanced'.
This technique adjusts the model internally without changing the data.
✅ Key Points
- SMOTE changes the dataset by adding synthetic minority samples.
- Class weights keep the original dataset but modify the learning algorithm.
- Both methods are widely used in fraud detection, disease prediction, and anomaly detection tasks.
67. Model Instantiation
Simple Explanation
Training a machine learning model usually involves two main steps:
- Instantiate the model
  - You create an object of the model class, e.g., LogisticRegression(), RandomForestClassifier(), or LinearRegression().
  - You can specify hyperparameters here (like max_depth, C, n_estimators).
- Fit the model to your training data
  - Use the .fit(X_train, y_train) method.
  - The model learns patterns from the input features (X_train) and the target labels (y_train).
Code Example
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Sample dataset
X = pd.DataFrame({
'age': [25, 30, 35, 40, 45],
'salary': [50000, 60000, 75000, 90000, 110000]
})
y = pd.Series([0, 0, 1, 1, 1])
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# --- 1. Instantiate the model ---
model = LogisticRegression(random_state=42)
# --- 2. Train the model ---
model.fit(X_train, y_train)
print("LogisticRegression model has been trained successfully!")
Output
LogisticRegression model has been trained successfully!
✅ Key Points
- Instantiation sets up the model structure and hyperparameters.
- Fitting trains the model on training data.
- After training, the model can be used for predictions on new or test data using .predict().
68. Making Predictions
Simple Explanation
After a model is trained:
- Use .predict() on unseen data (e.g., X_test) to get the predicted labels.
- Compare these predictions with the actual labels (y_test) to check performance.
Code Example
# Assuming 'model' is already trained from previous steps
# Make predictions on the test set
predictions = model.predict(X_test)
print("Predicted values on the test set:")
print(predictions)
print("\nActual values in the test set:")
print(y_test.values)
Output Example:
Predicted values on the test set:
[1]
Actual values in the test set:
[1]
✅ Key Point:
Even if the dataset is small, the predicted values show how the model performs on unseen data.
69. Classification Metrics
Simple Explanation
To evaluate a classification model, common metrics include:
| Metric | Description |
|---|---|
| Accuracy | Fraction of correctly predicted labels overall |
| Precision | Out of all predicted positives, how many are actually positive (low false positives) |
| Recall | Out of all actual positives, how many were correctly predicted (low false negatives) |
| F1-Score | Harmonic mean of precision and recall; balances both metrics |
Scikit-learn’s classification_report summarizes all these metrics at once.
Code Example
from sklearn.metrics import classification_report
# Example: test labels and predictions
y_test_example = [1, 0, 1, 1, 0]
predictions_example = [1, 0, 0, 1, 1]
# Generate the classification report
report = classification_report(y_test_example, predictions_example)
print("Classification Report:")
print(report)
Output Example:
Classification Report:
precision recall f1-score support
0 0.50 0.50 0.50 2
1 0.67 0.67 0.67 3
accuracy 0.60 5
macro avg 0.58 0.58 0.58 5
weighted avg 0.60 0.60 0.60 5
✅ Key Points:
- support shows the number of true instances for each class.
- accuracy is overall correctness.
- macro avg averages metrics equally across classes.
- weighted avg averages metrics based on class support, useful for imbalanced datasets.
70. Confusion Matrix
Confusion Matrix Explained
A confusion matrix is a table that shows how well your classification model performs by comparing the actual labels with the predicted labels. It breaks down predictions into four categories:
| Actual \ Predicted | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | TN | FP |
| Actual 1 | FN | TP |
(Rows are the actual labels and columns are the predicted labels, matching the layout of scikit-learn's confusion_matrix.)
- True Positive (TP): Correctly predicted positive (model said 1, actual is 1).
- True Negative (TN): Correctly predicted negative (model said 0, actual is 0).
- False Positive (FP): Incorrectly predicted positive (model said 1, actual is 0) – Type I Error.
- False Negative (FN): Incorrectly predicted negative (model said 0, actual is 1) – Type II Error.
A heatmap is a convenient way to visualize the confusion matrix.
Python Code
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
# Example data
y_test_example = [1, 0, 1, 1, 0]
predictions_example = [1, 0, 0, 1, 1]
# Generate the confusion matrix
cm = confusion_matrix(y_test_example, predictions_example)
# Visualize using a heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Output Interpretation
Heatmap cells:
| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 1 (TN) | 1 (FP) |
| Actual 1 | 1 (FN) | 2 (TP) |
- True Negative (TN): 1
- False Positive (FP): 1
- False Negative (FN): 1
- True Positive (TP): 2
✅ Key Takeaways:
- Shows where the model makes mistakes.
- Useful for imbalanced datasets where accuracy alone is misleading.
- Works well with metrics like Precision, Recall, and F1-score for deeper evaluation.
71. Regression Metrics
Regression Metrics Explained
When predicting continuous values, we evaluate model performance differently than in classification. Common metrics:
| Metric | Description |
|---|---|
| MAE (Mean Absolute Error) | Average absolute difference between predicted and actual values |
| MSE (Mean Squared Error) | Average squared difference; penalizes large errors more heavily |
| R² (R-squared) | Proportion of variance in the target explained by the model (1.0 = perfect fit) |
Python Code
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Sample true and predicted values
y_true = np.array([10, 20, 30, 40, 50])
y_pred = np.array([12, 18, 33, 38, 51])
# Calculate metrics
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")
Output
Mean Absolute Error (MAE): 2.00
Mean Squared Error (MSE): 4.40
R-squared (R²): 0.98
✅ Key Takeaways:
- MAE → “Average error magnitude”
- MSE → Penalizes large mistakes more
- R² → How well the model explains the variation in data
72. Cross-Validation (K-Fold Cross-Validation)
Concept Recap
Goal: Get a more reliable estimate of a model’s performance on unseen data.
How it works:
- Split the dataset into K folds (e.g., 5 folds).
- Train the model K times:
- Each time, use K-1 folds for training.
- Use the remaining fold for testing.
- Collect the scores from each fold.
- Average the scores → gives a more robust performance metric.
Why it’s better than a single train-test split:
Reduces bias from relying on a single split, especially important for small datasets.
Example Code
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import pandas as pd
# Sample data
X = pd.DataFrame({'age': [25, 30, 35, 40, 45, 50, 55, 60]})
y = pd.Series([0, 0, 1, 1, 0, 1, 0, 1])
model = LogisticRegression(random_state=42)
# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)
print("Cross-validation scores for each fold:", cv_scores)
print(f"Average CV score: {cv_scores.mean():.2f}")
Step-by-step Explanation
- Data: 8 samples, age as the feature, binary y labels.
- Model: Logistic Regression.
- CV Process:
- Split 8 samples into 5 folds → each fold has 1 or 2 samples.
- Train 5 times, each time testing on a different fold.
- Output scores: [0.5, 0.5, 0.5, 1.0, 0.0]
  - Shows accuracy on each fold.
  - Some folds perform perfectly (1.0), some poorly (0.0), others in between (0.5).
- Average CV score: 0.50 → overall estimated model performance.
Key Points
- Individual fold scores may vary a lot on small datasets.
- Cross-validation gives a more realistic performance estimate than a single train/test split.
- Can use different scoring metrics: accuracy, F1-score, ROC-AUC, etc.
Output
Cross-validation scores for each fold: [0.5 0.5 0.5 1. 0. ]
Average CV score: 0.50
73. Hyperparameter Tuning
Simple Explanation:
Hyperparameters are settings you configure before training a model (e.g., C in LogisticRegression). Tuning means finding the best combination to improve model performance.
Two popular methods:
| Method | Description |
|---|---|
| GridSearchCV | Exhaustively tests all possible combinations of hyperparameters you provide. Very thorough but can be slow. |
| RandomizedSearchCV | Tests a fixed number of random combinations from the hyperparameter space. Faster, often finds a near-optimal solution. |
Code Example (Conceptual Setup):
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
# Hyperparameter grid
param_grid = {
'C': [0.1, 1, 10, 100],
'penalty': ['l1', 'l2']
}
model = LogisticRegression(solver='liblinear', random_state=42)
# --- GridSearchCV ---
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, verbose=1)
# grid_search.fit(X_train, y_train)
# print(f"Best parameters (GridSearch): {grid_search.best_params_}")
# --- RandomizedSearchCV ---
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
n_iter=5, cv=3, verbose=1, random_state=42)
# random_search.fit(X_train, y_train)
# print(f"Best parameters (RandomizedSearch): {random_search.best_params_}")
Conceptual Output:
Best parameters (GridSearch): {'C': 10, 'penalty': 'l2'}
Best parameters (RandomizedSearch): {'C': 1, 'penalty': 'l2'}
Key Points:
- GridSearchCV: thorough, slower.
- RandomizedSearchCV: faster, good for large search spaces.
- Always use cross-validation (cv) to avoid overfitting during tuning.
74. Feature Importance
Simple Explanation:
Tree-based models (like RandomForestClassifier) can calculate a feature importance score for each feature. This tells you how much each feature contributes to the model’s decisions. Higher → more important.
Code Example:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
# Sample data
X = pd.DataFrame({
'age': [25, 30, 35, 40],
'salary': [50000, 60000, 75000, 90000],
'gender': [0, 1, 0, 1]
})
y = pd.Series([0, 1, 0, 1])
# Train RandomForest
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
# Feature importance
importances = model.feature_importances_
feature_names = X.columns
# Create a DataFrame
feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)
# Plot
plt.figure(figsize=(8, 6))
plt.barh(feature_importance_df['feature'], feature_importance_df['importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance')
plt.gca().invert_yaxis() # Most important on top
plt.show()
Output:
- A horizontal bar chart showing the relative importance of features.
- In this example, 'salary' is likely the most important feature, followed by 'age' and 'gender'.
Key Points:
- Useful for feature selection and model interpretation.
- Only works natively with tree-based models (Random Forest, XGBoost, Decision Tree, etc.).
- Can guide removing unimportant features to simplify the model.
75. Saving/Loading Models
Simple Explanation:
After training a model, you can save it to a file so you don’t need to retrain it every time.
- joblib is recommended for Scikit-learn models because it handles large arrays efficiently.
- Later, you can load the saved model and make predictions immediately.
Code Example
import joblib
from sklearn.linear_model import LogisticRegression
import pandas as pd
# --- 1. Train and Save the Model ---
X_train = pd.DataFrame({'age': [25, 30, 35], 'salary': [50000, 60000, 75000]})
y_train = pd.Series([0, 1, 0])
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Save the model
joblib.dump(model, 'my_trained_model.pkl')
print("Model saved to my_trained_model.pkl")
# --- 2. Load the Model and Make a Prediction ---
loaded_model = joblib.load('my_trained_model.pkl')
print("\nModel loaded successfully.")
# New data for prediction
X_new = pd.DataFrame({'age': [40], 'salary': [90000]})
prediction = loaded_model.predict(X_new)
print(f"\nPrediction for new data {X_new.values[0]}: {prediction[0]}")
Step-by-step Explanation
- Train the model:
  - Using LogisticRegression with sample training data (age and salary).
- Save the model:
  - joblib.dump(model, 'my_trained_model.pkl') creates a file my_trained_model.pkl on disk.
- Load the model:
  - joblib.load('my_trained_model.pkl') restores the trained model.
- Make predictions:
  - You can immediately predict on new data without retraining.
Output
Model saved to my_trained_model.pkl
Model loaded successfully.
Prediction for new data [40 90000]: 1
Key Points:
- Use joblib for Scikit-learn models; pickle also works but is slower for large arrays.
- Saving/loading models is essential for deploying models in production or sharing them.
- The loaded model behaves exactly the same as the original trained model.
76. SQL Integration
Simple Explanation:
You can use pandas.read_sql() to execute a SQL query and load the results directly into a Pandas DataFrame. You need:
- A SQL query string.
- A connection object to the database.
Code Example
import pandas as pd
import sqlite3
# --- 1. Setup a dummy in-memory SQL database ---
conn = sqlite3.connect(':memory:') # Temporary database in RAM
cursor = conn.cursor()
# Create table and insert data
cursor.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER);")
cursor.execute("INSERT INTO users VALUES (1, 'Alice', 25);")
cursor.execute("INSERT INTO users VALUES (2, 'Bob', 30);")
cursor.execute("INSERT INTO users VALUES (3, 'Charlie', 35);")
conn.commit()
# --- 2. Load data using pandas.read_sql ---
sql_query = "SELECT * FROM users WHERE age > 25;"
df = pd.read_sql(sql_query, conn)
print("DataFrame loaded from SQL database:")
print(df)
# Close the connection
conn.close()
Output
DataFrame loaded from SQL database:
id name age
0 2 Bob 30
1 3 Charlie 35
Key Points:
- pandas.read_sql() works with SQL databases supported by Python (SQLite, MySQL, PostgreSQL, etc.).
- Great for directly importing query results into Pandas for analysis.
- Always close the connection after use.
77. Web Scraping
Simple Explanation:
You can use BeautifulSoup to parse HTML and extract information, such as headlines or links.
- requests fetches HTML content from a URL.
- BeautifulSoup parses the HTML so you can find specific tags or elements.
Code Example
from bs4 import BeautifulSoup
# Sample HTML content (normally fetched using requests.get(url).text)
html_doc = """
<html><head><title>The Daily Scrape</title></head>
<body>
<p class="story">Once upon a time there were three little sisters.</p>
<h2 class="headline">First Headline</h2>
<h2 class="headline">Second Headline</h2>
<p>And they lived at the bottom of a well.</p>
</body></html>
"""
# Parse the HTML
soup = BeautifulSoup(html_doc, 'html.parser')
# Find all h2 tags with class 'headline'
headlines = soup.find_all('h2', class_='headline')
print("Found headlines:")
for headline in headlines:
print(headline.text)
Output
Found headlines:
First Headline
Second Headline
Key Points:
- BeautifulSoup is ideal for parsing HTML/XML.
- Use .find_all() or .find() to extract specific elements.
- Combine with requests to scrape real web pages.
- Be mindful of website scraping policies (robots.txt, Terms of Service).
78. API Requests
Simple Explanation:
You can use the requests library to send HTTP requests to an API endpoint. Most APIs return data in JSON format, which can be easily converted into a Python dictionary.
Code Example
import requests
import json
# A public API endpoint (no authentication required)
api_url = "https://jsonplaceholder.typicode.com/posts/1"
try:
# Send a GET request to the API
response = requests.get(api_url)
# Raise an exception if the request failed
response.raise_for_status()
# Parse JSON response into a Python dictionary
data = response.json()
# Pretty print the response
print("Successfully fetched data:")
print(json.dumps(data, indent=4))
except requests.exceptions.HTTPError as err:
print(f"HTTP Error: {err}")
except Exception as err:
print(f"An error occurred: {err}")
Output
Successfully fetched data:
{
"userId": 1,
"id": 1,
"title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
"body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}
Key Points:
- Use requests.get() for GET requests, requests.post() for POST requests, etc.
- response.json() converts JSON data into a Python dictionary.
- Always handle exceptions (HTTPError, connection errors, etc.) to make code robust.
79. Regular Expressions (Regex)
Simple Explanation:
A regular expression is a pattern used to search or match text. For example, you can use regex to extract email addresses from a block of text.
Code Example
import re
# Text to search
text_block = """
Please contact us at support@example.com for more information.
You can also reach out to sales@my-company.org.
Invalid emails like user@.com or just text should be ignored.
Another valid one is contact123@sub.domain.co.uk.
"""
# Regex pattern for emails
email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
# Find all matches in the text
found_emails = re.findall(email_pattern, text_block)
print("Found email addresses:")
print(found_emails)
Output
Found email addresses:
['support@example.com', 'sales@my-company.org', 'contact123@sub.domain.co.uk']
Key Points:
- re.findall(pattern, text) returns all matches of the pattern in the text.
- Regex is very flexible and can be used for emails, phone numbers, URLs, dates, and more.
- Always test your regex on sample data to ensure it captures valid cases and avoids invalid ones.
80. Jupyter Notebook
Simple Explanation:
Jupyter Notebooks are interactive, web-based documents widely used in data science and research. They allow you to combine code, visualizations, text, and equations in a single document, making your analysis clear, reproducible, and shareable.
Key Features
| Feature | Description |
|---|---|
| Live Code | Write and execute Python (or other languages) code in cells. Results are displayed immediately below each cell. |
| Visualizations | Display plots, charts, or images directly below the code that generates them. Works with libraries like matplotlib, seaborn, plotly. |
| Narrative Text | Use Markdown to explain your analysis, document your process, or add headings and lists. |
| Equations | Include LaTeX formulas to document mathematical expressions. |
Why It’s Useful
- Creates a step-by-step record of your workflow.
- Excellent for data exploration and experimentation.
- Makes sharing results with others easy—others can run the notebook themselves.
- Supports integration with many libraries for machine learning, visualization, and data manipulation.
81. Data Cleaning Task
Simple Explanation:
Data cleaning ensures your dataset is consistent, complete, and usable for analysis or modeling. Typical steps include:
- Initial Inspection
  - Understand data structure, types, and potential issues using df.info(), df.head(), df.describe().
- Handle Missing Values
  - Find missing values: df.isnull().sum()
  - Decide to drop or fill them (dropna() or fillna() with mean/median/value).
- Correct Data Types
  - Convert columns to appropriate types (numeric, datetime, etc.) using pd.to_numeric(df['col'], errors='coerce') or pd.to_datetime(df['col']).
- Standardize Text Data
  - Ensure consistency in categorical columns, e.g., df['State'] = df['State'].str.title()
- Handle Duplicates
  - Check for duplicates: df.duplicated().sum()
  - Remove them: df.drop_duplicates()
- Filter Irrelevant Data
  - Drop columns or rows not relevant to the analysis.
Code Example
import pandas as pd
import numpy as np
# Sample data
df = pd.DataFrame({
'age': ['25', '30', np.nan, 'forty'],
'salary': ['$50000', '$60000', '$75000', '$90000']
})
# 1. Initial Inspection
print("--- 1. Initial Inspection ---")
print(df.info())
# 2 & 3. Correct Data Types & Handle Missing Values
# Convert 'age' to numeric, coercing errors to NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')
# Fill missing 'age' with median
df['age'] = df['age'].fillna(df['age'].median())
# Clean and convert 'salary' to numeric
df['salary'] = pd.to_numeric(df['salary'].str.replace('$', '', regex=False))
print("\n--- 2. & 3. After Type Correction & Filling NaNs ---")
print(df.info())
print(df)
Output
--- 1. Initial Inspection ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 3 non-null object
1 salary 4 non-null object
dtypes: object(2)
memory usage: 192.0+ bytes
None
--- 2. & 3. After Type Correction & Filling NaNs ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 4 non-null float64
1 salary 4 non-null int64
dtypes: float64(1), int64(1)
memory usage: 192.0 bytes
age salary
0 25.0 50000
1 30.0 60000
2 27.5 75000 <-- Age was NaN, now filled with the median (27.5)
3 27.5 90000 <-- Age was 'forty', coerced to NaN and filled with the median
Key Points:
- Converting types ensures numeric operations and visualizations work correctly.
- Filling missing values with median or mean is common for numeric columns.
- Cleaning text (e.g., removing $ signs) avoids type errors.
- Always inspect the data before and after cleaning to confirm changes.
82. A/B Test Analysis
Simple Explanation:
To check if the difference between two groups (A/B test) is statistically significant, we can use a Chi-squared test:
- Summarize Data
- Create a contingency table showing conversions and non-conversions for both control and test groups.
- Run Statistical Test
  - Use scipy.stats.chi2_contingency on the table.
  - It returns a Chi-squared statistic and a p-value, indicating the probability of observing the difference if there were no real effect.
- Interpret p-value
  - If p-value < 0.05 (common significance level), the difference is statistically significant.
  - Otherwise, it is not significant.
Code Example
import pandas as pd
from scipy.stats import chi2_contingency
# 1. Summarize data
# Example results:
# Control: 1000 users, 100 conversions
# Test: 1000 users, 130 conversions
data = {
'group': ['control', 'test'],
'converted': [100, 130],
'not_converted': [900, 870]
}
df = pd.DataFrame(data)
# Create the contingency table
contingency_table = df[['converted', 'not_converted']]
print("--- Contingency Table ---")
print(contingency_table)
# 2. Run the Chi-squared test
chi2, p_value, _, _ = chi2_contingency(contingency_table)
# 3. Interpret the result
print(f"\nChi-squared Statistic: {chi2:.2f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
print("The result is statistically significant. The test group performed differently.")
else:
print("The result is not statistically significant.")
Output
--- Contingency Table ---
converted not_converted
0 100 900
1 130 870
Chi-squared Statistic: 4.13
P-value: 0.0421
The result is statistically significant. The test group performed differently.
Key Points:
- Contingency tables summarize outcomes for control vs. test.
- Chi-squared test checks if observed differences are likely due to chance.
- A p-value < 0.05 indicates the difference is statistically significant.
- Useful in website optimization, marketing campaigns, and product testing.
83. Outlier Detection
Simple Explanation:
Outliers are data points that deviate significantly from the majority of the data. Detecting them is important because they can skew analysis or model performance.
Common Methods
Method 1: IQR (Interquartile Range) Rule
- Calculate Q1 (25th percentile) and Q3 (75th percentile).
- Compute IQR: IQR = Q3 - Q1.
- Define bounds:
  - Lower = Q1 - 1.5 * IQR
  - Upper = Q3 + 1.5 * IQR
- Any value outside these bounds is an outlier.
Method 2: Z-Score
- Compute the Z-score for each value: Z = (X - mean)/std.
- Typically, values with Z > 3 or Z < -3 are considered outliers.
Code Example
import pandas as pd
import numpy as np
from scipy import stats
# Sample data with outliers
data = {'values': [10, 12, 12, 13, 12, 11, 14, 13, 15, 80, 90]} # 80 and 90 are outliers
df = pd.DataFrame(data)
# --- Method 1: IQR ---
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = df[(df['values'] < lower_bound) | (df['values'] > upper_bound)]
print(f"--- Outliers using IQR method ---\n{outliers_iqr}\n")
# --- Method 2: Z-Score ---
df['z_score'] = np.abs(stats.zscore(df['values']))
outliers_zscore = df[df['z_score'] > 3]
print(f"--- Outliers using Z-Score method ---\n{outliers_zscore}")
Output
--- Outliers using IQR method ---
values
9 80
10 90
--- Outliers using Z-Score method ---
Empty DataFrame
Columns: [values, z_score]
Index: []
Key Points
- IQR method is simple and robust, especially for skewed data.
- Z-Score method assumes normally distributed data and uses standard deviation to detect outliers.
- Outliers can be removed, capped, or analyzed separately depending on the context.
- The two methods can disagree: on this small sample the Z-score rule flags nothing, because the extreme values inflate the mean and standard deviation, while the more robust IQR rule still catches 80 and 90.
84. Performance Optimization
Simple Explanation:
Large datasets can make Pandas operations slow. Optimizing code can significantly reduce runtime and memory usage. Common strategies include:
- Use Vectorization
  - Avoid row-by-row loops (for or iterrows).
  - Use built-in Pandas/NumPy operations, which are highly optimized in C.
- Use Efficient Data Types
  - Convert columns with few unique strings to the category type.
  - Use smaller numeric types if possible (e.g., int32 instead of int64).
- Avoid apply() if possible
  - Vectorized operations are faster than df.apply(func, axis=1).
- Process in Chunks
  - For very large files, read/process in chunks: pd.read_csv('large_file.csv', chunksize=100_000) (see the sketch at the end of this section).
- Use Dask or Modin
  - Libraries that parallelize Pandas operations for multi-core or distributed processing.
Code Example: Vectorization vs. Loop
import pandas as pd
import numpy as np
import time
# Create a large DataFrame
df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, 2)), columns=['A', 'B'])
# --- Slow Method: Using a for loop with iterrows ---
start_time = time.time()
result_loop = []
for index, row in df.iterrows():
result_loop.append(row['A'] + row['B'])
loop_time = time.time() - start_time
# --- Fast Method: Using Vectorization ---
start_time = time.time()
result_vectorized = df['A'] + df['B']
vectorized_time = time.time() - start_time
print(f"Time taken with for loop: {loop_time:.4f} seconds")
print(f"Time taken with vectorization: {vectorized_time:.4f} seconds")
print(f"Vectorization was roughly {loop_time / vectorized_time:.0f} times faster.")
Output
Time taken with for loop: 5.4321 seconds
Time taken with vectorization: 0.0025 seconds
Vectorization was roughly 2172 times faster.
Key Points
- Vectorization is the single most effective optimization for Pandas.
- Efficient data types reduce memory usage and speed up operations.
- For massive datasets, consider chunk processing or libraries like Dask/Modin.
- Avoid for loops and apply() when a vectorized alternative exists.
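Code Example (a minimal sketch of chunked processing; the file name and the 'amount' column are hypothetical)
import pandas as pd
# Process a large CSV 100,000 rows at a time instead of loading it all into RAM
total = 0
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    total += chunk['amount'].sum()  # 'amount' is an assumed column name
print(f"Total amount: {total}")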
85. Feature Engineering Idea: Time-based Features
Simple Explanation:
From a timestamp column, you can extract time-based features to capture patterns that might influence your prediction:
- Time of Day
- Extract the hour (0-23).
- Extract the hour (
- Day of Week
- Extract the day (0-6, where 0 = Monday).
- Extract day (
Code Example
import pandas as pd
# Sample DataFrame with a timestamp column
df = pd.DataFrame({
'timestamp': pd.to_datetime([
'2023-11-20 09:00:00',
'2023-11-20 14:30:00',
'2023-11-25 18:00:00'
]),
'event': ['click', 'purchase', 'click']
})
# Feature 1: Extract hour of the day
df['hour_of_day'] = df['timestamp'].dt.hour
# Feature 2: Extract day of the week (Monday=0, Sunday=6)
df['day_of_week'] = df['timestamp'].dt.dayofweek
print(df)
Output
timestamp event hour_of_day day_of_week
0 2023-11-20 09:00:00 click 9 0
1 2023-11-20 14:30:00 purchase 14 0
2 2023-11-25 18:00:00 click 18 5
Key Points
- Time-based features can capture behavioral patterns (e.g., morning vs. evening activity).
- Other timestamp-derived features you could create: month, week_of_year, is_weekend, quarter, day_of_month.
- Useful in predictive modeling for e-commerce, traffic, sales forecasting, and clickstream analysis.
86. Model Selection (Customer Churn)
Simple Explanation:
Predicting customer churn requires careful model selection to maximize the capture of churners (often a minority class):
Step-by-Step Process:
- Define Success Metric
- Focus on Recall (identify as many churners as possible) and F1-Score (balance between Precision and Recall), rather than accuracy.
- Start with a Baseline
- Use a simple, interpretable model like Logistic Regression.
- Provides a baseline for performance comparison.
- Try More Powerful Models
- Explore Random Forest or Gradient Boosting models (XGBoost, LightGBM) for better predictive power.
- Compare and Tune
- Use K-Fold Cross-Validation to get robust performance estimates.
- Apply GridSearchCV or RandomizedSearchCV for hyperparameter tuning.
- Select the model with the best cross-validated F1-Score.
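Code Example (a minimal sketch of steps 2-4 above; make_classification is a synthetic stand-in for a real churn dataset)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Synthetic, imbalanced stand-in for churn data (about 10% churners)
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42)
models = {
    'LogisticRegression (baseline)': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'RandomForest': RandomForestClassifier(random_state=42, class_weight='balanced')
}
# Compare models on cross-validated F1-score
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='f1')
    print(f"{name}: mean F1 = {scores.mean():.3f}")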
87. Reproducibility
Simple Explanation:
Reproducibility ensures that anyone can run your code and get the same results.
Best Practices:
- Document the Environment
- Use venv or conda.
pip freeze > requirements.txt # or conda env export > environment.yml
- Use
- Set Random Seeds
- Ensures consistent results for any operation involving randomness (data splitting, model initialization).
- Use Version Control
- Track code changes using Git for transparency and collaboration.
- Make Data Available
- Include sample datasets or scripts to download the exact dataset used.
Code Example: Setting Random Seeds
import numpy as np
import pandas as pd
import random
# Set a constant random state for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)
# Example: reproducible train-test split
from sklearn.model_selection import train_test_split
# Sample data
X = pd.DataFrame({'feature1': range(10)})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=RANDOM_STATE, stratify=y
)
print("Training set:\n", X_train)
print("Test set:\n", X_test)
Key Points:
- Model Selection: Focus on metrics that matter (e.g., Recall/F1 for churn), start simple, and then explore more complex models.
- Reproducibility: Always fix random seeds, document your environment, version your code, and make data accessible.
- This ensures that experiments are transparent, reliable, and replicable.
88. End-to-End Project (House Prices Prediction)
Simple Explanation:
The goal is to predict house prices (a continuous variable), which makes this a regression problem. The workflow includes data acquisition, preprocessing, modeling, evaluation, and deployment.
Step-by-Step Conceptual Workflow
- Problem Definition
- Predict house prices based on features like square footage, number of bedrooms, neighborhood, etc.
- Success Metric: Root Mean Squared Error (RMSE) or Mean Absolute Error (MAE).
- Data Acquisition
- Obtain data from a CSV file, SQL database, or public datasets like Kaggle’s House Prices dataset.
- Exploratory Data Analysis (EDA)
- Use Pandas for summary statistics (df.describe(), df.info()).
- Use Matplotlib / Seaborn for visualizations:
  - Distribution plots (price, square footage)
  - Correlation heatmaps
  - Scatter plots (e.g., TotalSF vs SalePrice)
- Identify outliers, missing values, and feature distributions.
- Data Preprocessing & Feature Engineering
- Handle Missing Values: Fill missing numerical columns with median or mean.
- Encode Categorical Features: One-hot encoding or ordinal encoding (e.g., Neighborhood).
- Create New Features:
  - TotalSF = 1stFlrSF + 2ndFlrSF + BasementSF
  - Interaction features if needed.
- Feature Scaling: StandardScaler or MinMaxScaler for numeric columns.
- Model Training
- Split dataset into training and test sets.
- Train multiple regression models:
- Linear Regression
- Ridge / Lasso Regression
- Random Forest Regressor
- Gradient Boosting (XGBoost / LightGBM)
- Model Evaluation
- Evaluate using RMSE, MAE, R² on the test set.
- Compare models to select the best-performing one.
- Deployment (Conceptual)
- Save the model:
  import joblib
  joblib.dump(best_model, 'house_price_model.pkl')
- Create a web app (see the sketch below): Use Flask or FastAPI to:
  - Take house features as input
  - Load the saved model
  - Return the predicted price
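A minimal Flask sketch of this deployment idea (the route, the input fields, and the model file name are illustrative assumptions, not a fixed API):
import joblib
import pandas as pd
from flask import Flask, request, jsonify
app = Flask(__name__)
model = joblib.load('house_price_model.pkl')  # saved earlier with joblib.dump
@app.route('/predict', methods=['POST'])
def predict():
    # Expect JSON like {"TotalSF": 2100, "Bedrooms": 3} with the model's feature names
    features = pd.DataFrame([request.get_json()])
    price = model.predict(features)[0]
    return jsonify({'predicted_price': float(price)})
if __name__ == '__main__':
    app.run(debug=True)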
Key Points
- Feature engineering is critical: derived features often improve model performance.
- Model comparison using cross-validation ensures robustness.
- Deployment makes the model usable in real-world scenarios.
- The entire workflow demonstrates real-world data science skills, from data acquisition to prediction delivery.
89. Debugging a ValueError
Simple Explanation:
A ValueError often occurs due to mismatched shapes or incompatible data types in operations.
Step-by-Step Approach:
- Read the Error Message
- The message usually gives the exact problem.
- Example: "ValueError: operands could not be broadcast together with shapes (5,2) (5,3)".
- Isolate the Problem
- Comment out code sections to identify which line is causing the error.
- Inspect Data Shapes and Types
- Print .shape and .dtypes for all arrays or DataFrames in the problematic line.
- Print
- Verify Assumptions
- Check if columns have expected types (
numeric,datetime, etc.). - Make sure previous steps didn’t produce empty DataFrames or unexpected results.
Key Tip: Use print statements or assertions to validate shapes/types before operations.
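Code Example (a small illustration of the shape check; the arrays are hypothetical)
import numpy as np
a = np.ones((5, 2))
b = np.ones((5, 3))
# Inspect shapes before combining
print(a.shape, b.shape)  # (5, 2) (5, 3)
# Validate assumptions early with an assertion
assert a.shape == b.shape, f"Shape mismatch: {a.shape} vs {b.shape}"
# The next line would raise the broadcasting ValueError if the assert were removed
result = a + b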
90. Data Aggregation Challenge
Task: Find the top 10 most active users in the last 7 days.
Step-by-Step Approach:
- Load Data: Read logs into a DataFrame with timestamp, user_id, action.
- Convert Timestamps: Ensure timestamp is a datetime object using pd.to_datetime().
- Filter Recent Activity: Keep rows where the timestamp is within the last 7 days.
- Count Actions per User: Use groupby('user_id').size() to count actions.
- Find Top 10: Sort by action count and take .head(10).
Code Example
import pandas as pd
# Sample log data
data = {
'timestamp': pd.to_datetime([
'2023-11-15 10:00',
'2023-11-18 12:00',
'2023-11-19 09:00',
'2023-11-20 14:00',
'2023-11-20 15:00'
]),
'user_id': [101, 102, 101, 103, 101],
'action': ['login', 'click', 'purchase', 'login', 'click']
}
df = pd.DataFrame(data)
# Assume "today" is 2023-11-21
today = pd.to_datetime('2023-11-21')
seven_days_ago = today - pd.Timedelta(days=7)
# Filter for recent activity
recent_activity = df[df['timestamp'] >= seven_days_ago]
# Count actions per user
user_activity_counts = recent_activity.groupby('user_id').size().reset_index(name='action_count')
# Top 10 most active users
top_10_users = user_activity_counts.sort_values(by='action_count', ascending=False).head(10)
print("Top 10 most active users in the last 7 days:")
print(top_10_users)
Output
Top 10 most active users in the last 7 days:
user_id action_count
0 101 3
1 102 1
2 103 1
Key Points:
- Debugging ValueError: Focus on shapes and data types, isolate lines, and validate assumptions.
- Data Aggregation: Use datetime filtering, groupby, and sorting to get top users efficiently.
91. Dask
Simple Explanation:
You use Dask when your dataset is too large to fit into RAM.
- Pandas loads the entire dataset into memory → fast but limited by RAM.
- Dask splits the dataset into small partitions (chunks).
- It processes these chunks in parallel using multiple CPU cores.
- Dask has a Pandas-like API, so it’s easy to switch.
- It uses lazy evaluation: operations are not executed until you call .compute().
This makes Dask useful for:
- Big CSV files (10–500GB+)
- Distributed computing
- Parallel data processing
Code:
import pandas as pd
import dask.dataframe as dd
# Create a sample Pandas DataFrame
pdf = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]})
# Convert it to a Dask DataFrame
# In real usage, you would do: dd.read_csv("large_file.csv")
ddf = dd.from_pandas(pdf, npartitions=2)
print("--- Dask DataFrame ---")
print(ddf)
# Shows structure, not the full data (lazy evaluation)
# Perform an operation (still lazy)
result_dask = ddf.x + ddf.y
print("\n--- Dask Operation (not yet computed) ---")
print(result_dask)
# Actually run the computation
result_computed = result_dask.compute()
print("\n--- Computed Result (now a Pandas Series) ---")
print(result_computed)
Output (Example):
--- Dask DataFrame ---
Dask DataFrame Structure:
x y
npartitions=2
0 int int
3 ... ...
5 ... ...
Dask Name: from_pandas, 2 tasks
--- Dask Operation (not yet computed) ---
Dask Series Structure:
npartitions=2
0 int64
3 ...
5 ...
Dask Name: add, 4 tasks
--- Computed Result (now a Pandas Series) ---
0 7
1 7
2 7
3 7
4 7
5 7
dtype: int64
92. GeoPandas
Simple Explanation:
GeoPandas extends Pandas to work with geospatial (location-based) data.
It adds a special column called geometry, which can store shapes like:
- Points (e.g., city coordinates)
- Lines (e.g., roads)
- Polygons (e.g., country borders)
You can use GeoPandas to:
- Plot data on a map
- Perform spatial joins (e.g., which points lie inside which country?)
- Compute distances between locations
- Load shapefiles, geojson, world map datasets
It works just like Pandas, but with powerful GIS capabilities.
Code:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
import matplotlib.pyplot as plt
# Create a regular DataFrame with latitude and longitude
data = {'City': ['New York', 'London', 'Tokyo'],
'Latitude': [40.7128, 51.5074, 35.6895],
'Longitude': [-74.0060, -0.1278, 139.6917]}
df = pd.DataFrame(data)
# Convert the DataFrame to a GeoDataFrame
geometry = [Point(xy) for xy in zip(df['Longitude'], df['Latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)
# Set the coordinate reference system (CRS) to WGS84 (lat/lon)
gdf.set_crs("EPSG:4326", inplace=True)
print("--- GeoDataFrame ---")
print(gdf)
# Plot points on a world map
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Plot base map
ax = world.plot(color='lightgray', edgecolor='black', figsize=(10, 5))
# Plot city points
gdf.plot(ax=ax, color='red', markersize=50)
plt.title("Major World Cities")
plt.show()
Output (Example):
--- GeoDataFrame ---
City Latitude Longitude geometry
0 New York 40.7128 -74.0060 POINT (-74.00600 40.71280)
1 London 51.5074 -0.1278 POINT (-0.12780 51.50740)
2 Tokyo 35.6895 139.6917 POINT (139.69170 35.68950)
And the plot will show:
- A world map
- Red dots on the locations of New York, London, Tokyo
✅ 93. Natural Language Processing (NLP)
Simple Explanation
NLP helps computers understand human language.
Common steps:
- Tokenization: Break text into words or sentences
- Stopword Removal: Remove words that don’t add meaning (the, is, a, etc.)
We use SpaCy, a popular NLP library.
✅ Code (SpaCy)
# First install the SpaCy English model (run this in terminal once):
# python -m spacy download en_core_web_sm
import spacy
# Load English model
nlp = spacy.load("en_core_web_sm")
text = "This is a simple sentence for processing natural language."
# Process text
doc = nlp(text)
# --- Tokenization ---
tokens = [token.text for token in doc]
print(f"Original Text: {text}")
print(f"Tokens: {tokens}")
# --- Stopword & punctuation removal ---
filtered_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(f"Tokens after removing stop words and punctuation: {filtered_tokens}")
📌 Output
Original Text: This is a simple sentence for processing natural language.
Tokens: ['This', 'is', 'a', 'simple', 'sentence', 'for', 'processing', 'natural', 'language', '.']
Tokens after removing stop words and punctuation: ['simple', 'sentence', 'processing', 'natural', 'language']
✅ 94. Image Data (Computer Vision Basics)
Simple Explanation
For image processing:
- Load each image → convert to a NumPy array
- Stack all images → shape becomes
(num_images, height, width, channels)
e.g., (3, 64, 64, 3) for 3 RGB images
✅ Code
import numpy as np
from PIL import Image
import os
# --- Create dummy images (simulating real images) ---
if not os.path.exists('dummy_images'):
os.makedirs('dummy_images')
for i in range(3):
img_array = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
img = Image.fromarray(img_array, 'RGB')
img.save(f'dummy_images/image_{i}.png')
# --- Load all images into a NumPy array ---
image_folder = 'dummy_images'
image_files = [f for f in os.listdir(image_folder) if f.endswith('.png')]
first_image = Image.open(os.path.join(image_folder, image_files[0]))
height, width, channels = np.array(first_image).shape
image_data = np.empty((len(image_files), height, width, channels), dtype=np.uint8)
for i, file in enumerate(image_files):
img = Image.open(os.path.join(image_folder, file))
image_data[i] = np.array(img)
print(f"Loaded {len(image_files)} images.")
print(f"Shape of the final NumPy array: {image_data.shape}")
print("This array is ready for a computer vision model.")
📌 Output
Loaded 3 images.
Shape of the final NumPy array: (3, 64, 64, 3)
This array is ready for a computer vision model.
✅ 95. Decorators
Simple Explanation
A decorator is a function that:
- Accepts another function as input
- Adds extra functionality
- Returns the modified function
You use decorators when you want to extend a function’s behavior without modifying its original code.
A very common example: measuring execution time.
✅ Code
import time
import functools
def timer_decorator(func):
"""A decorator that times the execution of a function."""
@functools.wraps(func) # Keeps original function name & docstring
def wrapper_timer(*args, **kwargs):
start_time = time.perf_counter()
value = func(*args, **kwargs)
end_time = time.perf_counter()
run_time = end_time - start_time
print(f"Finished {func.__name__!r} in {run_time:.4f} secs")
return value
return wrapper_timer
@timer_decorator
def slow_function():
"""A function that does nothing for 2 seconds."""
time.sleep(2)
print("Function finished its work.")
# Call the decorated function
slow_function()
📌 Output
Function finished its work.
Finished 'slow_function' in 2.0012 secs
✅ 96. Parallel Processing
Simple Explanation
Python normally runs one thread at a time due to the GIL (Global Interpreter Lock).
But for CPU-heavy tasks (math, loops, number crunching), you can use:
👉 multiprocessing
- Creates separate Python processes
- Each process runs on its own CPU core
- Bypasses the GIL
- Big speed improvement for CPU-bound tasks
The example below checks the primality of the same large number eight times using:
- A sequential method
- A parallel method using all CPU cores
✅ Code
import multiprocessing
import time
import math
def is_prime(n):
"""A simple, CPU-bound function to check for primality."""
if n <= 1:
return False
if n == 2:
return True
if n % 2 == 0:
return False
for i in range(3, int(math.sqrt(n)) + 1, 2):
if n % i == 0:
return False
return True
if __name__ == '__main__':
numbers_to_check = [112272535095293] * 8 # Large prime number repeated
# --- Sequential (single-core) ---
start_time = time.time()
results_sequential = [is_prime(n) for n in numbers_to_check]
end_time = time.time()
print(f"Sequential approach took: {end_time - start_time:.4f} seconds")
# --- Parallel (multi-core) ---
with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
start_time = time.time()
results_parallel = pool.map(is_prime, numbers_to_check)
end_time = time.time()
print(f"Parallel approach took: {end_time - start_time:.4f} seconds")
📌 Output
Sequential approach took: 2.4561 seconds
Parallel approach took: 0.6312 seconds
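The standard library's concurrent.futures module offers a higher-level interface over the same process-based parallelism. A rough equivalent of the pool.map() call above (a sketch, assuming the same is_prime and numbers_to_check as in the code block):
from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:  # defaults to one worker process per CPU core
        results_parallel = list(executor.map(is_prime, numbers_to_check))
    print(results_parallel)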
✅ 97. Type Hinting
Simple Explanation
Type Hinting means adding information about expected data types in your code.
Example:
- x: int → x should be an integer
- -> str → the function should return a string
✔ Key Points
- Type hints do not affect how the code runs.
- They improve readability, especially in large projects.
- Tools like mypy, IDEs, and auto-completion engines can:
- Catch type errors early
- Suggest better auto-completions
- Help maintain clean code
Example: If you pass "hello" to a function expecting an int, mypy will warn you before running the program.
✅ Code
# Without type hints
def add(a, b):
    return a + b

# With type hints
def typed_add(a: int, b: int) -> int:
    """Adds two integers and returns an integer."""
    return a + b

# Type hints with Pandas
import pandas as pd
def process_data(df: pd.DataFrame) -> float:
    """Calculates the mean of a specific column in a DataFrame (a single float)."""
    return df['some_column'].mean()
# --- How type hinting helps ---
# A static type checker like mypy will catch issues such as:
# result = typed_add(5, "hello") # ❌ mypy error: incompatible types
📌 Output
# This code does not produce runtime output.
# Type hints help improve code clarity and enable static analysis tools.
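Type hints can also describe container contents and optional values. A small sketch using the built-in generics available in Python 3.9+ (the functions below are hypothetical examples, not from the original):
from typing import Optional

def average(values: list[float]) -> float:
    """Mean of a list of floats."""
    return sum(values) / len(values)

def find_age(ages: dict[str, int], name: str) -> Optional[int]:
    """Returns the stored age, or None if the name is missing."""
    return ages.get(name)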
✅ 98. SQLAlchemy
Simple Explanation
SQLAlchemy is an ORM (Object-Relational Mapper).
It allows you to interact with a database using Python classes and objects instead of writing SQL directly.
✔ Why use an ORM?
- Avoid manually writing SQL like SELECT * FROM users WHERE name='Alice';
- Work with classes and objects → more Pythonic
- Database-agnostic (SQLite → MySQL → PostgreSQL with minimal changes)
- Cleaner, more maintainable code
✔ How SQLAlchemy works
- Define a Model → a Python class = a database table
- Create Objects → objects = rows in the table
- Query using Python → methods like .filter_by() generate SQL automatically
✅ Code Example
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm import declarative_base  # moved here from sqlalchemy.ext.declarative in SQLAlchemy 1.4+
# --- 1. Setup ---
# Create an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')
Base = declarative_base()
# --- 2. Define a Model (maps to a table) ---
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    age = Column(Integer)
# Create the table in the database
Base.metadata.create_all(engine)
# --- 3. Interact with the Database ---
# Create a session to manage transactions
Session = sessionmaker(bind=engine)
session = Session()
# Create a new user object
new_user = User(name='Alice', age=30)
# Add the user to the session and commit to the DB
session.add(new_user)
session.commit()
# Query the database
retrieved_user = session.query(User).filter_by(name='Alice').first()
print(f"Retrieved user from database: ID={retrieved_user.id}, Name={retrieved_user.name}, Age={retrieved_user.age}")
session.close()
📌 Output
Retrieved user from database: ID=1, Name=Alice, Age=30
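The same session can also update or delete rows through ordinary attribute access; SQLAlchemy generates the UPDATE and DELETE statements for you. A brief sketch continuing the example above (run before session.close()):
# Update: modify the mapped object, then commit
retrieved_user.age = 31
session.commit()

# Delete: remove the object, then commit
session.delete(retrieved_user)
session.commit()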
✅ 99. Testing
Simple Explanation
Unit testing ensures that small, isolated pieces of your code work correctly.
Python’s built-in module unittest provides everything you need:
- Create a test class that inherits from unittest.TestCase
- Write test methods whose names start with test_
- Use assertions like assertEqual() or Pandas’ assert_series_equal() to check results
- Run the tests → you immediately know if your code is correct or broken
This helps catch bugs early and keeps your project stable as it grows.
✅ Code Example
import unittest
import pandas as pd
# The function we want to test
def clean_price_column(price_series: pd.Series) -> pd.Series:
    """Removes '$' and ',' and converts to float."""
    cleaned = price_series.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
    return pd.to_numeric(cleaned).astype(float)  # astype(float) keeps the dtype consistent even for whole numbers
# The test class
class TestCleanPrice(unittest.TestCase):
    def test_clean_standard_prices(self):
        """Test with standard dollar amounts."""
        input_data = pd.Series(['$1,200', '$50', '$9,999.99'])
        expected_output = pd.Series([1200.00, 50.00, 9999.99])
        pd.testing.assert_series_equal(clean_price_column(input_data), expected_output)

    def test_clean_no_dollar_sign(self):
        """Test with numbers that don't have a dollar sign."""
        input_data = pd.Series(['100', '250'])
        expected_output = pd.Series([100.00, 250.00])
        pd.testing.assert_series_equal(clean_price_column(input_data), expected_output)

# This allows the test to be run from the command line
if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)
📌 Output
..
----------------------------------------------------------------------
Ran 2 tests in 0.005s
OK
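The same pattern works for plain Python code with the assertEqual() assertion mentioned above. A minimal sketch with a hypothetical add() function:
import unittest

def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def test_add_integers(self):
        self.assertEqual(add(2, 3), 5)  # passes only if the result is exactly 5

if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)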
✅ 100. Interpreting a Complex Model
Simple Explanation
For black-box models like XGBoost, understanding why a model made a prediction is difficult.
SHAP (SHapley Additive exPlanations) solves this by showing the contribution of each feature toward the prediction.
- Positive SHAP value (red) → pushes prediction higher
- Negative SHAP value (blue) → pushes prediction lower
- The force plot visual shows exactly how each feature influenced one specific prediction.
This allows you to explain model decisions to business stakeholders in a clear, visual way.
✅ Code
import shap
import xgboost
import pandas as pd
from sklearn.model_selection import train_test_split
# --- 1. Train a simple model ---
# Load a sample dataset
X, y = shap.datasets.adult()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train an XGBoost classifier
model = xgboost.XGBClassifier(objective='binary:logistic', eval_metric='logloss')
model.fit(X_train, y_train)
# --- 2. Explain a single prediction ---
# Create a SHAP explainer
explainer = shap.Explainer(model)
shap_values = explainer(X_test)
# --- 3. Visualize the explanation for the first person in the test set ---
print("Explaining the prediction for the first instance in the test set...")
# Interactive SHAP visualizations
shap.initjs()
# Force plot for one prediction
shap.force_plot(shap_values[0])
Output (Explanation)
When you run this in Jupyter/Colab, it displays an interactive SHAP force plot.
What you will see:
- A horizontal line with:
- Red arrows → features that increase probability of income > $50K
- Blue arrows → features that decrease probability
- Each arrow shows magnitude + direction of feature influence.
- Final predicted probability is shown at the right end of the plot.
- The baseline (average model prediction) is at the left.
Interpretation example (what the force plot means):
- If age, education-num, and hours-per-week are red → they pushed the prediction up.
- If capital-loss or relationship is blue → they pushed it down.
This gives a full, human-readable explanation for a single model prediction.
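If the interactive JavaScript plot is not available (for example, in a plain script rather than a notebook), a static waterfall plot conveys the same single-prediction explanation. A brief sketch reusing shap_values from the code above (requires matplotlib):
shap.plots.waterfall(shap_values[0])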
