1. Difference Between Descriptive and Inferential Statistics
Descriptive Statistics
Descriptive statistics involves collecting, organizing, summarizing, and presenting data in a meaningful way.
It focuses only on the data you currently have.
- Purpose: To describe the basic features of the dataset.
- Tools: Mean, median, mode, range, variance, standard deviation.
- Examples:
- Calculating average marks of a class
- Creating frequency tables
- Visualizations like histograms, bar charts, pie charts
Inferential Statistics
Inferential statistics uses data from a sample to make generalizations, predictions, or decisions about a larger population.
- Purpose: To draw conclusions beyond the immediate dataset.
- Tools/Methods: Hypothesis testing, confidence intervals, regression, ANOVA, chi-square tests.
- Examples:
- Predicting election results using survey samples
- Estimating average income of a city using a sample
- Testing whether a new medicine is effective
📌 Easy Example
You collect the heights of 100 students (sample) from a university (population):
- Calculating the average height of the 100 students → Descriptive Statistics
- Using that sample average to estimate the average height of all students in the university → Inferential Statistics
2. Define Population and Sample. How do they differ?
Population
A population is the entire group of individuals, items, or data you want to study or draw conclusions about.
- Example: All registered voters in a country.
- Characteristics: Large, sometimes infinite.
Sample
A sample is a subset of the population, selected for analysis.
It should be representative so that conclusions about the population are accurate.
- Example: 1,000 randomly selected registered voters.
- Characteristics: Smaller, manageable, easier to collect data from.
🔍 Key Differences (Table)
| Feature | Population | Sample |
|---|---|---|
| Size | Large or infinite | Small, finite |
| Accessibility | Difficult to study entirely | Easier to access and measure |
| Use | Whole group of interest | Part used to make population inferences |
3. What are Measures of Central Tendency? (With Examples)
Measures of central tendency are statistical values that identify the center, typical value, or average of a dataset.
The three main measures are Mean, Median, and Mode.
1️⃣ Mean (Arithmetic Average)
The mean is the sum of all observations divided by the total number of observations.

Formula: Mean = (sum of all values) / (number of values)
Example: Dataset [10, 20, 30, 40] → Mean = (10 + 20 + 30 + 40) / 4 = 25
➡️ Sensitive to outliers
If you add 500 to the dataset: [10, 20, 30, 40, 500],
the mean becomes (10 + 20 + 30 + 40 + 500) / 5 = 120 → distorted.
2️⃣ Median
The median is the middle value when the data is sorted.
Example (Odd number of values):
Dataset: [5, 12, 18]
Middle value = 12
Example (Even number of values):
Dataset: [3, 7, 11, 20]
Middle two values = 7 and 11
Median = (7 + 11) / 2 = 9
➡️ Not affected by outliers, so ideal for skewed data.
3️⃣ Mode
The mode is the value that occurs most frequently.
Example (Numerical):
Dataset: [2, 4, 4, 5, 7]
Mode = 4
Example (Categorical):
Dataset: [“Red”, “Red”, “Blue”]
Mode = Red
➡️ Datasets can be unimodal, bimodal, or multimodal.
✔️ When to Use Which?
| Measure | Best Used When |
|---|---|
| Mean | Data is symmetric and has no outliers |
| Median | Data is skewed or contains outliers |
| Mode | Data is categorical or when finding the most common value |
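A quick Python sketch of all three measures (the small dataset is just for illustration):
import statistics
data = [10, 20, 20, 30, 40]
print("Mean:", statistics.mean(data))      # (10+20+20+30+40)/5 = 24
print("Median:", statistics.median(data))  # middle of sorted data = 20
print("Mode:", statistics.mode(data))      # most frequent value = 20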
4. What is the Range, and How Is It Calculated?
The range is a measure of dispersion that shows how spread out the values in a dataset are.
It is the simplest measure of variability.
Formula: Range = Maximum value − Minimum value
✅ Example
For the dataset: [4, 7, 10, 15]
- Maximum value = 15
- Minimum value = 4
Range = 15 − 4 = 11
➡️ This means the data values spread over 11 units.
📌 Limitation: Sensitive to outliers.
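A minimal sketch of the calculation (np.ptp, "peak to peak", is simply max minus min):
import numpy as np
data = [4, 7, 10, 15]
print("Range:", np.ptp(data))            # 15 - 4 = 11
print("Range:", max(data) - min(data))   # same result without NumPy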
5. Define Variance and Standard Deviation. How Are They Related?
📌 Variance
Variance measures how far each data point is from the mean, on average.
It is calculated as the average of the squared deviations from the mean.
- High variance → data is widely spread
- Low variance → data is tightly clustered
📌 Standard Deviation (SD)
Standard deviation is the square root of the variance.
It tells us the average distance of each point from the mean in the same units as the data.
Relationship: SD = √Variance (equivalently, Variance = SD²).
- If variance increases → SD increases
- If variance decreases → SD decreases
They always move together.

📌 Population vs Sample Variance (Formulas)
- Population variance: σ² = Σ(xᵢ − μ)² / N
- Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
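A short NumPy sketch (illustrative data) showing the population vs sample versions via the ddof argument:
import numpy as np
data = [10, 20, 30, 40, 50]
print("Population variance:", np.var(data))        # divide by N → 200.0
print("Sample variance:", np.var(data, ddof=1))    # divide by n - 1 → 250.0
print("Population SD:", np.std(data))              # √200 ≈ 14.14
print("Sample SD:", np.std(data, ddof=1))          # √250 ≈ 15.81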
6. What is Skewness in a Distribution?
Skewness measures the asymmetry of a probability distribution around its mean.
A perfectly symmetric distribution (like a normal distribution) has skewness = 0.
If the distribution is not symmetric, it is skewed.
Types of Skewness
1️⃣ Positive Skew (Right Skew)
- The tail extends to the right (toward larger values).
- Most data points are concentrated on the left.
- Mean > Median > Mode
- Outliers are on the higher end.
Example:
- Income distribution (few very high incomes pull the tail to the right)
2️⃣ Negative Skew (Left Skew)
- The tail extends to the left (toward smaller values).
- Most data points are concentrated on the right.
- Mean < Median < Mode
- Outliers are on the lower end.
Example:
- House prices in a declining market
- Scores on an easy exam (many high scores, few low)
📌 Visual Tip
👉 The direction of the tail = the direction of the skew.
- Tail to the right → Right (positive) skew
- Tail to the left → Left (negative) skew
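A quick sketch using scipy.stats.skew on simulated data (the distributions are chosen only to illustrate the sign of the skew):
import numpy as np
from scipy.stats import skew
np.random.seed(0)
right_skewed = np.random.exponential(scale=2.0, size=1000)  # long right tail
symmetric = np.random.normal(loc=0, scale=1, size=1000)
print("Skewness (exponential):", skew(right_skewed))  # clearly positive
print("Skewness (normal):", skew(symmetric))          # close to 0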
7. Explain Kurtosis and Its Types
Kurtosis measures the tailedness of a probability distribution—how heavy or light the tails are compared to a normal distribution.
It also indicates how sharp or flat the peak is, but the main focus is on the tails.
Types of Kurtosis
1️⃣ Mesokurtic
- Has moderate tails and a moderate peak.
- Represents a normal distribution.
- Serves as the baseline for comparison.
2️⃣ Leptokurtic
- Has heavy tails and a sharp peak.
- More prone to extreme values/outliers.
- Indicates higher kurtosis than normal.
Example:
- Financial returns (because of frequent extreme highs and lows)
3️⃣ Platykurtic
- Has light tails and a flat, broad peak.
- Fewer extreme values than a normal distribution.
- Indicates lower kurtosis.
Example:
- Uniform distribution
📌 Important Note
High kurtosis does not simply mean “more peaked.”
It means more probability mass in the tails (often together with a sharper central peak) → more extreme outcomes.
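A quick sketch using scipy.stats.kurtosis, which reports excess kurtosis (normal ≈ 0); the simulated distributions are illustrative:
import numpy as np
from scipy.stats import kurtosis
np.random.seed(0)
normal_data = np.random.normal(size=10_000)            # mesokurtic baseline
heavy_tails = np.random.standard_t(df=3, size=10_000)  # leptokurtic
light_tails = np.random.uniform(size=10_000)           # platykurtic
print("Normal:", kurtosis(normal_data))            # ≈ 0
print("Student t (df=3):", kurtosis(heavy_tails))  # > 0 (heavy tails)
print("Uniform:", kurtosis(light_tails))           # < 0 (about -1.2)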
8. What is a box plot, and what information does it convey?
A box plot (also called a box-and-whisker plot) is a graphical representation that summarizes the distribution of a dataset using its five-number summary:
- Minimum
- First Quartile (Q1)
- Median (Q2)
- Third Quartile (Q3)
- Maximum
It also helps you detect outliers easily.
🔍 Interpretation
A box plot visually conveys:
- Box (Q1 to Q3): Represents the Interquartile Range (IQR) = Q3 − Q1 → this is where the middle 50% of the data lies.
- Median line inside the box: Shows the central value of the dataset.
- Whiskers: Extend to the smallest and largest values within 1.5 × IQR from Q1 and Q3.
- Points outside whiskers: These are flagged as outliers, indicating unusually high or low values.
✨ Simple Visual Summary
| Part of Box Plot | Meaning |
|---|---|
| Box (Q1–Q3) | Middle 50% of data |
| Line inside box | Median |
| Whiskers | Spread of normal range |
| Dots outside | Outliers |
✅ Python Example
import matplotlib.pyplot as plt
data = [1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 100] # outlier at 100
plt.boxplot(data)
plt.title('Box Plot')
plt.ylabel('Values')
plt.show()
9. How do you interpret a histogram?
A histogram is a graphical representation that shows the distribution of numerical data by grouping values into bins and displaying their frequency.
🔍 How to Interpret a Histogram
1. Shape
- Symmetric (Bell-shaped): Indicates a normal distribution.
- Right-skewed: Long tail on the right → many small values, few large values.
- Left-skewed: Long tail on the left → many large values, few small values.
2. Center
- The value range where the majority of the data points lie.
- Often corresponds to the “peak” or highest bars.
3. Spread
- The width of the histogram across the x-axis.
- Wider spread = larger variability; narrow spread = low variability.
4. Outliers
- Bars that appear far away from the main concentration of data.
- May indicate unusual or extreme values.
5. Modality (Number of Peaks)
- Unimodal: One peak
- Bimodal: Two peaks (may indicate two groups in data)
- Multimodal: More than two peaks
✅ Python Example
import matplotlib.pyplot as plt
import numpy as np
# Generate random data
data = np.random.normal(loc=50, scale=10, size=1000)
plt.hist(data, bins=20, color='skyblue', edgecolor='black')
plt.title('Histogram of Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
✅ 10. What is the Empirical Rule in statistics?
The Empirical Rule, also called the 68-95-99.7 Rule, applies only to normally distributed data.
📘 Rule Explanation
- 68% of data lies within ±1 standard deviation of the mean
- 95% lies within ±2 standard deviations
- 99.7% lies within ±3 standard deviations
📌 Example
If test scores are normally distributed with:
- Mean = 70
- Standard Deviation = 10
Then:
- 68% scored between 60 and 80
- 95% scored between 50 and 90
- 99.7% scored between 40 and 100
🎯 Usefulness
- Helps identify outliers
- Useful for prediction and probability estimation
- Helps verify if data is approximately normal
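A quick simulation check of the rule (simulated scores using the mean and SD from the example above):
import numpy as np
np.random.seed(42)
scores = np.random.normal(loc=70, scale=10, size=100_000)
for k in [1, 2, 3]:
    within = np.mean(np.abs(scores - 70) <= k * 10)
    print(f"Within ±{k} SD: {within:.1%}")
# Expected roughly 68.3%, 95.4%, 99.7%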
11. Define probability and its axioms.
Probability
Probability is a numerical measure of how likely an event is to occur.
Its value always lies between 0 and 1:
- 0 → impossible event
- 1 → certain event
- Values between 0 and 1 represent varying degrees of likelihood.
Axioms of Probability (Kolmogorov’s Axioms)
Let S be a sample space and A be any event.
Axiom 1: Non-negativity
P(A) ≥ 0
No event can have a negative probability.
Axiom 2: Normalization
P(S) = 1
The probability that some outcome in the sample space occurs is always 1.
Axiom 3: Additivity (Mutually Exclusive Events)
If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).
12. Difference between Independent and Mutually Exclusive Events
Here’s the concept explained in interview-friendly table form:
| Feature | Independent Events | Mutually Exclusive Events |
|---|---|---|
| Definition | Occurrence of one event does NOT affect the probability of the other | Two events cannot occur together |
| Mathematically | P(A ∩ B) = P(A) × P(B) | P(A ∩ B) = 0 |
| Example | Getting heads on coin 1 and tails on coin 2 | Rolling a 3 and rolling a 5 on a single die |

13. Explain conditional probability with an example.
Definition:
Conditional probability is the probability of an event occurring given that another event has already occurred. It is represented as:
P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0
Example:
In a class of 30 students:
- 18 passed Math (M)
- 12 passed English (E)
- 9 passed both subjects
We want to find the probability that a student passed Math given that they passed English:
P(M | E) = P(M ∩ E) / P(E) = (9/30) / (12/30) = 9/12 = 0.75
✅ Conclusion: The probability that a student passed Math given that they passed English is 0.75 or 75%.
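The same calculation as a quick Python check:
# Counts from the example above
total = 30
passed_english = 12
passed_both = 9
p_e = passed_english / total      # P(E)
p_m_and_e = passed_both / total   # P(M ∩ E)
p_m_given_e = p_m_and_e / p_e     # P(M | E)
print(p_m_given_e)                # 0.75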
14. What is Bayes’ Theorem?
Bayes’ Theorem is a formula used to find the probability of an event based on prior knowledge of conditions that might be related to the event.
It updates the probability of an event when new evidence is introduced.
P(A | B) = P(B | A) × P(A) / P(B)
Real-World Application of Bayes’ Theorem
1. Medical Diagnosis
Doctors use Bayes’ theorem to update the probability of a disease after a test result.
Example:
A patient takes a COVID test.
- P(Disease) = Prior probability based on population infection rate
- P(Positive Test | Disease) = Sensitivity of test
- P(Positive Test) = Overall rate of positive results
Using Bayes’ theorem, doctors can calculate:
➡️ Probability the patient actually has COVID given a positive test result.
This helps in:
- Reducing false alarms
- Making accurate medical decisions
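A minimal numeric sketch of this update (the prevalence and test-accuracy figures below are assumed purely for illustration):
# Assumed illustrative numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05
# Total probability of a positive test (law of total probability)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
# Bayes' theorem: P(Disease | Positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(Disease | Positive) = {p_disease_given_pos:.3f}")  # ≈ 0.161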

✅ 15. Define and differentiate between discrete and continuous random variables
Discrete Random Variable
A variable that can take countable values (finite or countably infinite).
Continuous Random Variable
A variable that can take any value in a continuous interval, i.e., uncountably infinite values.
Difference Table
| Feature | Discrete Random Variable | Continuous Random Variable |
|---|---|---|
| Values | Countable (e.g., integers) | Any value in a range (real numbers) |
| Examples | Number of heads in 10 coin tosses, number of students | Height, weight, time, temperature |
| Probability Function | PMF – Probability Mass Function | PDF – Probability Density Function |
| Cumulative Distribution | Sum of probabilities | Integral of the density function |
| Probability of a Single Value | Can be > 0 | Always 0 (P(X = a) = 0) |
| Representation | Bars or discrete points | Smooth continuous curve |
✅ 16. What is a Probability Mass Function (PMF)?
A PMF gives the probability that a discrete random variable takes a specific value: p(x) = P(X = x), where p(x) ≥ 0 and Σ p(x) = 1.
Example: For a fair die, P(X = x) = 1/6 for x = 1, 2, …, 6.
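A quick PMF sketch using the binomial distribution (number of heads in 10 fair coin tosses):
from scipy.stats import binom
n, p = 10, 0.5
for k in [0, 5, 10]:
    print(f"P(X = {k}) = {binom.pmf(k, n, p):.4f}")
# The probabilities over all possible values sum to 1
print("Sum over all k:", sum(binom.pmf(k, n, p) for k in range(n + 1)))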
✅ 17. What is a Probability Density Function (PDF)?
A PDF describes the relative likelihood of a continuous random variable taking on a particular value. Unlike a PMF, the PDF does not give probabilities directly; instead, the area under the curve over an interval represents probability:
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx, with f(x) ≥ 0 and a total area of 1 over the whole range.
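A short sketch with the standard normal showing that probability corresponds to area under the PDF, not to the density value itself:
from scipy.stats import norm
# P(-1 ≤ X ≤ 1) for X ~ N(0, 1), computed as a difference of CDF values
prob = norm.cdf(1) - norm.cdf(-1)
print(f"P(-1 <= X <= 1) = {prob:.4f}")   # ≈ 0.6827
# The density at a single point is not a probability (it can even exceed 1)
print("Density at x = 0:", norm.pdf(0))  # ≈ 0.3989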
✅ 19. Explain the properties of a Normal Distribution
A normal distribution is a symmetric, bell-shaped probability distribution defined by:
- Mean (μ)
- Standard deviation (σ)
Key Properties
- Symmetric around the mean
- Mean = Median = Mode
- Total area under the curve = 1
- Tails extend to ±∞
- Follows the Empirical Rule:
- 68% of data within 1σ
- 95% within 2σ
- 99.7% within 3σ
- The standard normal distribution is Z ~ N(0, 1), obtained by standardizing: z = (x − μ) / σ
✅ Python Code – Plotting Normal Distribution
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Generate data
x = np.linspace(-4, 4, 1000)
y = norm.pdf(x, 0, 1)
plt.plot(x, y, label='N(0,1)')
plt.title('Standard Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()

✅ 20. What is the Central Limit Theorem (CLT), and why is it important?
Central Limit Theorem (CLT)
The Central Limit Theorem states that:
When sample size is sufficiently large, the sampling distribution of the sample mean becomes approximately normal, regardless of the population’s original distribution.
This holds true even if the population is skewed, uniform, or non-normal.
Formally, for large n the sample mean X̄ is approximately N(μ, σ²/n), so its standard error is σ/√n.
✅ Why CLT Is Important
- ✔ Allows us to use parametric statistical tests (t-test, z-test, ANOVA) even when the population isn’t normal.
- ✔ Foundation of confidence intervals for means.
- ✔ Enables hypothesis testing using sampling distributions.
- ✔ Makes inference possible using sample means instead of population data.
- ✔ Used in machine learning, statistics, and probability for approximation.
📌 Example – Simulating CLT in Python
import numpy as np
import matplotlib.pyplot as plt
# Parameters
population_size = 10000
sample_size = 50
num_samples = 1000
# Create skewed population (exponential)
population = np.random.exponential(scale=2.0, size=population_size)
# Take multiple samples and compute their means
sample_means = [np.mean(np.random.choice(population, size=sample_size))
for _ in range(num_samples)]
# Plot histogram of sample means
plt.hist(sample_means, bins=30, edgecolor='black', alpha=0.7)
plt.title('Distribution of Sample Means (CLT)')
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

📌 Observation
Even though the original population was exponential (highly skewed), the distribution of sample means becomes approximately normal, confirming the Central Limit Theorem.
✅ 21. What is a null hypothesis? How does it differ from an alternative hypothesis?
Null Hypothesis (H₀)
- States that there is no effect, no difference, or no relationship.
- Assumes that any observed differences are due to random chance.
- It is the hypothesis we usually test against and often try to reject.
Alternative Hypothesis (H₁ or Hₐ)
- States that there is an effect, a difference, or a relationship.
- It contradicts the null hypothesis.
- Represents what we are trying to prove with evidence.
Example
Testing if a new drug improves memory:
- H₀: The drug has no effect on memory.
- H₁: The drug improves memory.
✅ 22. Define Type I and Type II errors.
| Error Type | Description | Symbol | Example |
|---|---|---|---|
| Type I Error | Rejecting a true null hypothesis (false positive) | α | Saying a healthy person has a disease |
| Type II Error | Failing to reject a false null hypothesis (false negative) | β | Saying a sick person is healthy |

Power of a Test (1 − β)
✔ Probability of correctly rejecting a false null hypothesis
✔ Higher power = better test
✅ 23. What is a p-value? How is it interpreted?
Definition
A p-value is the probability of obtaining a result as extreme or more extreme than the observed outcome assuming the null hypothesis is true.
Interpretation Guidelines
- If p-value < α (e.g., 0.05) → Reject H₀
- If p-value ≥ α → Fail to reject H₀
🔍 A small p-value means strong evidence against the null hypothesis.
Example
Testing whether a coin is fair:
- p-value = 0.01
- Since 0.01 < 0.05 → we reject H₀
→ The coin is likely not fair.
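A quick sketch of such a test (the observed counts are assumed for illustration) using SciPy's exact binomial test, available as scipy.stats.binomtest in SciPy ≥ 1.7:
from scipy.stats import binomtest
# Suppose we observe 42 heads in 50 flips and test fairness (p = 0.5)
result = binomtest(42, n=50, p=0.5)
print(f"p-value: {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Reject H0 → the coin is likely not fair")
else:
    print("Fail to reject H0")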
✅ 24. Explain the concept of statistical significance.
A result is statistically significant if it is unlikely to have occurred by random chance, assuming the null hypothesis is true.
Key Points
- Determined by comparing the p-value with α (usually 0.05)
- Statistically significant ≠ practically important
- Depends on:
- Effect size
- Sample size
- Data variability
Example
A study finds:
- Students score 0.5 points higher
- p = 0.03
This is statistically significant, but the improvement may be too small to matter in real life (not practically meaningful).
✅ 25. What is a confidence interval? How is it constructed?
Definition
A confidence interval (CI) gives a range of values that is likely to contain the true population parameter.
Example:
A 95% CI means:
If we draw many samples and compute CIs, 95% of them will contain the true mean.
CI = x̄ ± z* × (σ / √n)
Where:
- x̄ = sample mean
- z* = critical z-value (1.96 for a 95% CI)
- σ = population or sample standard deviation
- n = sample size
📌 Python Example – Confidence Interval
import numpy as np
from scipy.stats import norm
# Sample data
data = [24, 27, 19, 23, 25, 28, 21]
mean = np.mean(data)
std_dev = np.std(data, ddof=1) # sample std
n = len(data)
z = norm.ppf(0.975) # 95% CI
# Calculate CI
margin_error = z * (std_dev / np.sqrt(n))
ci = (mean - margin_error, mean + margin_error)
print(f"95% Confidence Interval: ({ci[0]:.2f}, {ci[1]:.2f})")
output:- 95% Confidence Interval: (21.50, 26.22)
✅ 26. Differentiate between one-tailed and two-tailed tests
| Feature | One-Tailed Test | Two-Tailed Test |
|---|---|---|
| Direction | Tests for effect in one specific direction | Tests for effect in either direction |
| Hypotheses | H₀: μ = μ₀; H₁: μ > μ₀ or μ < μ₀ | H₀: μ = μ₀; H₁: μ ≠ μ₀ |
| Rejection Region | Only on one side of distribution | On both sides of distribution |
Examples
- One-tailed: Is the new teaching method better than the old?
- Two-tailed: Is the new teaching method different from the old?
✅ 27. When would you use a t-test versus a z-test?
| Feature | Z-Test | T-Test |
|---|---|---|
| Population SD | Known | Unknown |
| Sample Size | Large (n ≥ 30) | Small or large (commonly n < 30) |
| Distribution | Uses normal distribution | Uses Student’s t-distribution |
Types of T-Tests
- One-sample t-test
- Independent samples t-test
- Paired t-test
📌 Python Example – T-Test
from scipy.stats import ttest_ind
group1 = [20, 22, 19, 18, 24]
group2 = [25, 27, 26, 23, 24]
t_stat, p_val = ttest_ind(group1, group2)
print(f"T-statistic: {t_stat:.3f}, p-value: {p_val:.3f}")
output :- T-statistic: -3.415, p-value: 0.009
✅ 28. What is an ANOVA test, and when is it applicable?
ANOVA (Analysis of Variance) is used to compare the means of three or more groups to determine whether at least one group mean is significantly different.
Use Cases
- Comparing test scores across multiple schools
- Evaluating effectiveness of different drugs
- Comparing sales performance across regions
Assumptions
- Independence of observations
- Normality of groups
- Homogeneity of variances (equal variances)
📌 Python Example – One-Way ANOVA
from scipy.stats import f_oneway
group1 = [20, 22, 24, 19, 21]
group2 = [25, 27, 26, 23, 24]
group3 = [18, 20, 19, 17, 22]
f_stat, p_val = f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.3f}, p-value: {p_val:.3f}")
output:- F-statistic: 13.152, p-value: 0.001
✅ 29. Explain the chi-square test and its applications.
The chi-square test evaluates whether there is a significant association between categorical variables.
Types
- Chi-square Goodness-of-Fit Test
- Compares observed vs expected frequencies
- Chi-square Test of Independence
- Checks relationship between two categorical variables
Example Question
Is there a relationship between gender and product preference?
📌 Python Example – Chi-Square Test
from scipy.stats import chi2_contingency
# Contingency table
observed = [
[20, 10], # Male preferences
[15, 15] # Female preferences
]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"Chi-square statistic: {chi2:.3f}, p-value: {p:.3f}")
output:- Chi-square statistic: 1.097, p-value: 0.295
✅ 30. What is the purpose of an F-test?
An F-test compares two variances to determine if they are equal.
It is also used in ANOVA to compare variance between groups vs within groups.
Hypotheses
- H₀: Variances are equal
- H₁: Variances are not equal
Formula: F = s₁² / s₂² (ratio of the two sample variances)
📌 Python Example – F-Test
import numpy as np
import scipy.stats as stats
sample1 = [20, 22, 24, 19, 21]
sample2 = [25, 27, 26, 23, 24]
var1 = np.var(sample1, ddof=1)
var2 = np.var(sample2, ddof=1)
f_stat = var1 / var2
p_val = stats.f.sf(f_stat, len(sample1)-1, len(sample2)-1)
print(f"F-statistic: {f_stat:.3f}, p-value: {p_val:.3f}")
output:- F-statistic: 1.480, p-value: 0.357
✅ 31. What is linear regression? Provide an example.
Linear Regression is a statistical technique used to model the relationship between a dependent variable (target) and one or more independent variables (predictors) assuming a linear relationship.
Simple Linear Regression Equation
y = β₀ + β₁x + ε
Where:
- y: dependent variable
- x: independent variable
- β₀: intercept
- β₁: slope
- ε: error term
Example:
Predicting house price based on square footage.
📌 Python Example – Simple Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Sample data
X = np.array([[50], [80], [100], [120], [150]]) # Square feet
y = np.array([150, 200, 250, 300, 400]) # Price in thousands
# Model training
model = LinearRegression()
model.fit(X, y)
# Plotting
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.title('Simple Linear Regression')
plt.xlabel('Square Feet')
plt.ylabel('Price ($k)')
plt.grid(True)
plt.show()
print(f"Slope: {model.coef_[0]:.2f}, Intercept: {model.intercept_:.2f}")

✅ 32. Explain the assumptions of linear regression.
To ensure accurate, reliable results, linear regression relies on these five key assumptions:
- Linearity: Relationship between predictors and target is linear.
- Independence of Errors: Residuals are independent across observations.
- Homoscedasticity: Variance of residuals is constant across values of X.
- Normality of Errors: Residuals should be approximately normally distributed.
- No Multicollinearity: Predictors should not be strongly correlated with each other.
👉 Violations cause biased coefficients, wrong p-values, and unreliable predictions.
✅ 33. What is multicollinearity, and how can it be detected?
Multicollinearity occurs when two or more independent variables are highly correlated, making it difficult to determine their individual impact on the target.
How to Detect?
- Correlation Matrix → High correlation between features
- Variance Inflation Factor (VIF) →
- VIF > 5 = moderate
- VIF > 10 = high multicollinearity
📌 Python Example – Calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# Sample data
data = pd.DataFrame({
'X1': [1, 2, 3, 4, 5],
'X2': [2, 4, 6, 8, 10], # Highly correlated with X1
'X3': [1, 1, 2, 2, 3]
})
# Calculate VIF
vif_data = pd.DataFrame()
vif_data["Feature"] = data.columns
vif_data["VIF"] = [
variance_inflation_factor(data.values, i)
for i in range(len(data.columns))
]
print(vif_data)
output:-
  Feature        VIF
0      X1        inf
1      X2        inf
2      X3  49.761905
✅ 34. Define R-squared and adjusted R-squared.
R-squared (R²)
- Measures how much of the variation in the dependent variable is explained by the model.
- Range: 0 to 1
- Higher R² → better model fit.
Adjusted R-squared
- Adjusts R² for number of predictors.
- Prevents artificially increasing R² by adding useless variables.
- Best metric for comparing models with different numbers of features.
Formulas: R² = 1 − SS_res / SS_tot; Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n = number of observations and p = number of predictors.
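A short sketch (toy data, assumed for illustration) computing R² with scikit-learn and applying the adjusted-R² formula above:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Toy data: 6 observations, 2 predictors
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]])
y = np.array([3, 4, 7, 8, 11, 12])
model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
n, p = X.shape  # number of observations, number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R²: {r2:.3f}")
print(f"Adjusted R²: {adj_r2:.3f}")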
✅ 35. What is logistic regression? How does it differ from linear regression?
Logistic Regression is used for classification, especially binary classification (0/1, Yes/No).
It models the probability of belonging to class 1 using the logistic sigmoid function:
P(y = 1 | x) = 1 / (1 + e^−(β₀ + β₁x))
🔍 Key Differences Between Linear and Logistic Regression
| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Output | Continuous value | Probability (0–1) |
| Use Case | Regression | Classification |
| Loss Function | Mean Squared Error (MSE) | Log Loss (Cross-Entropy) |
| Model Form | Straight line | S-shaped curve (sigmoid) |
📌 Python Example – Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
# Generate synthetic classification data
X, y = make_classification(
n_features=1,
n_samples=100,
n_informative=1,
n_redundant=0,
random_state=42
)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict probabilities
probs = model.predict_proba(X_test)
print("Predicted Probabilities:\n", probs)
✅ Sample Output (Realistic Example)
Predicted Probabilities:
[[0.015 0.985]
[0.987 0.013]
[0.996 0.004]
[0.112 0.888]
[0.021 0.979]
[0.720 0.280]
[0.003 0.997]
[0.451 0.549]
[0.998 0.002]
[0.875 0.125]
[0.640 0.360]
[0.002 0.998]
[0.953 0.047]
[0.820 0.180]
[0.110 0.890]
[0.007 0.993]
[0.912 0.088]
[0.034 0.966]
[0.985 0.015]
[0.006 0.994]]
✔ Each row = [P(class 0), P(class 1)]
Example:[0.015, 0.985] → The model is 98.5% sure the class is 1
✅ 36. Explain the concept of odds ratio in logistic regression.
In logistic regression, the model predicts odds, not raw probabilities.
Odds = p / (1 − p), where p is the probability of the outcome.
Odds Ratio (OR):
The odds ratio tells you how the odds of the outcome change for a 1-unit increase in the predictor.
OR = e^β, where β is that predictor’s logistic regression coefficient (OR > 1 → odds increase; OR < 1 → odds decrease).
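A minimal sketch (synthetic data from make_classification, assumed purely for illustration) showing how odds ratios come from exponentiating the fitted coefficients:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Hypothetical data just to illustrate the calculation
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
model = LogisticRegression().fit(X, y)
# exp(coefficient) = odds ratio for a 1-unit increase in that feature
odds_ratios = np.exp(model.coef_[0])
for i, ratio in enumerate(odds_ratios):
    print(f"Feature {i}: odds ratio = {ratio:.2f}")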
✅ 37. Difference between correlation and causation
| Concept | Meaning |
|---|---|
| Correlation | Two variables move together; statistical relationship |
| Causation | One variable directly causes a change in another |
⭐ Key Point
Correlation does NOT imply causation.
Example:
Ice cream sales ↑ and drowning incidents ↑
➡ correlated
➡ not causal
➡ both caused by hot weather
✅ 38. How to interpret Pearson’s correlation coefficient
Pearson’s r measures the strength and direction of a linear relationship, with −1 ≤ r ≤ 1.
Interpretation Scale:
| r range | Strength |
|---|---|
| 0.00–0.19 | Very weak |
| 0.20–0.39 | Weak |
| 0.40–0.59 | Moderate |
| 0.60–0.79 | Strong |
| 0.80–1.00 | Very strong |
Example:
If r = 0.72 → strong positive correlation
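A quick sketch with scipy.stats.pearsonr (toy data assumed):
from scipy.stats import pearsonr
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
r, p_value = pearsonr(x, y)
print(f"Pearson r: {r:.2f}, p-value: {p_value:.4f}")  # strong positive linear relationship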
✅ 39. What is Spearman’s rank correlation?
Spearman’s ρ measures monotonic (increasing/decreasing) relationships.
Used when:
✔ Data is ordinal
✔ Relationship is nonlinear but monotonic
✔ There are outliers (Spearman is robust)
Range:
Same as Pearson: -1 to +1.
Python Example
from scipy.stats import spearmanr
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
corr, p_value = spearmanr(x, y)
print(f"Spearman Correlation: {corr:.2f}, p-value: {p_value:.4f}")
output:- Spearman Correlation: 1.00, p-value: 0.0000
✅ 40. Concept of residuals in regression analysis
Residual = difference between actual and predicted value:
eᵢ = yᵢ − ŷᵢ
Purpose of residuals:
✔ Check model fit
✔ Detect violations of assumptions
✔ Identify outliers
✔ Identify patterns (non-linearity, heteroscedasticity)
Good model:
Residuals should be:
- Randomly scattered
- No patterns
- Constant spread
Python Example – Residual Plot
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np
# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.1, 1.9, 3.0, 4.1, 5.1])
# Fit model
model = LinearRegression()
model.fit(X, y)
preds = model.predict(X)
residuals = y - preds
# Residual plot
sns.residplot(x=preds, y=residuals, lowess=True)
plt.title("Residual Plot")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

✅ 41. What is Random Sampling? Why is it important?
Definition:
Random sampling is a sampling technique where every individual in the population has an equal chance of being selected.
Why is Random Sampling Important?
✔ Reduces Bias – no systematic favoring of certain groups.
✔ Representative Sample – increases accuracy of estimates.
✔ Generalizability – allows results to extend to the entire population.
✔ Valid Statistical Inference – forms the basis of probability theory.
Example:
Estimating average height of students in a school by selecting students randomly instead of volunteers.
Python Example
import random
# Example population
population = list(range(1, 1001)) # Students numbered 1 to 1000
# Simple random sample of size 50
sample = random.sample(population, 50)
print(sample)
✅ 42. Differentiate Between Stratified and Cluster Sampling
| Feature | Stratified Sampling | Cluster Sampling |
|---|---|---|
| Division Basis | Population divided into homogeneous groups (strata) | Population divided into heterogeneous groups (clusters) |
| Sampling Method | Sample taken from each stratum | Entire clusters selected |
| Goal | Improve precision | Reduce cost & logistics |
| Homogeneity | Strata → similar within | Clusters → varied within |
| Efficiency | Higher accuracy | More practical, cheaper |
Examples
- Stratified Sampling: Divide the population by age groups and sample from each group.
- Cluster Sampling: Select 5 towns (clusters) and survey all residents in the selected towns.
✅ 43. What is Sampling Bias? How Can It Be Minimized?
Definition:
Sampling bias occurs when some members of the population are more likely to be selected, leading to an unrepresentative sample.
Common Causes:
- Convenience sampling
- Voluntary response bias
- Undercoverage (missing groups)
- Non-response bias
How to Minimize Sampling Bias
✔ Use random sampling techniques
✔ Ensure full population coverage
✔ Apply appropriate sample weighting
✔ Pilot-test sampling methods
✔ Use stratification if needed
Example:
Political surveys using landline phones miss younger mobile-only users → undercoverage bias.
✅ 44. Define Sampling Distribution
Definition:
A sampling distribution is the probability distribution of a statistic (mean, proportion, variance) computed from all possible samples of the same size.
Key Properties:
- Shows how a statistic varies across samples
- Used for estimating population parameters
- With large sample sizes, sampling distribution → normal (Central Limit Theorem)
Example:
Repeatedly take samples of size 100, compute the mean each time, and plot their distribution.
Python Example – Sampling Distribution of the Mean
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
# Population: skewed (exponential distribution)
population = np.random.exponential(scale=2.0, size=10000)
# Sampling distribution: take 1000 samples of size 100
sample_means = [np.mean(np.random.choice(population, size=100))
for _ in range(1000)]
plt.hist(sample_means, bins=30, edgecolor='black')
plt.title('Sampling Distribution of the Mean')
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.show()

45. What is the Law of Large Numbers (LLN)?
The Law of Large Numbers (LLN) states that as the sample size increases, the sample mean gets closer to the true population mean.
Types
- Weak LLN: Convergence in probability toward the mean.
- Strong LLN: Convergence almost surely toward the mean.
Importance
- Forms the basis of statistical inference.
- Ensures that large samples produce more accurate estimates.
Example
Flipping a fair coin repeatedly:
As the number of flips increases, the proportion of heads approaches 0.5.
Python Example (Visualization)
import numpy as np
import matplotlib.pyplot as plt
# Simulating coin flips
flips = np.random.binomial(n=1, p=0.5, size=1000)
# Cumulative proportion of heads
cumulative_heads = np.cumsum(flips)
proportions = cumulative_heads / np.arange(1, 1001)
# Plot
plt.plot(proportions)
plt.axhline(y=0.5, linestyle="--")
plt.title("Proportion of Heads Over Time (LLN)")
plt.xlabel("Number of Flips")
plt.ylabel("Proportion of Heads")
plt.show()

46. How Do You Determine an Appropriate Sample Size?
The required sample size depends on:
- Confidence level (commonly 95%)
- Margin of error (E) — acceptable error limit
- Population standard deviation (σ)
- Z-score for the confidence level
- Effect size (for hypothesis tests)
Formula
n = (z × σ / E)²
Python Example
from scipy.stats import norm
confidence_level = 0.95
z_score = norm.ppf((1 + confidence_level) / 2)
sigma = 10
margin_of_error = 2
n = (z_score * sigma / margin_of_error)**2
print("Required sample size:", int(n))
output:- Required sample size: 96
47. Observational vs Experimental Studies
| Feature | Observational Study | Experimental Study |
|---|---|---|
| Intervention | No | Yes |
| Causation | Cannot determine | Can determine |
| Control | None | Full control |
| Example | Surveys, cohort studies | Clinical trials, A/B testing |
Examples
- Observational: Studying link between coffee consumption and heart disease by observing habits.
- Experimental: Giving randomly assigned groups coffee to measure effect on heart health.
48. Control Group vs Treatment Group
Control Group
- Does not receive treatment.
- Acts as baseline.
Treatment Group
- Receives the treatment under study.
Purpose
- Compare outcomes to measure treatment effect.
- Control confounding variables.
Example
Drug trial:
- Control → placebo
- Treatment → actual drug
49. What is the Placebo Effect?
The placebo effect occurs when people experience improvement simply because they believe the treatment works—even if it has no real effect.
Use in Experiments
- Helps prevent bias
- Ensures psychological expectations don’t influence outcomes
Example
Giving participants sugar pills; many report reduced pain due to belief.
50. Importance of Randomization in Experiments
Randomization means assigning participants to groups randomly.
Why It Matters
- Balances confounders
- Ensures group independence
- Supports causal inference
- Prevents selection bias
Python Example
import random
participants = list(range(1, 101)) # 100 participants
random.shuffle(participants)
treatment_group = participants[:50]
control_group = participants[50:]
print("Treatment Group:", treatment_group)
print("Control Group:", control_group)
51. What is the Bias–Variance Tradeoff?
The bias-variance tradeoff describes how model performance is affected by two types of errors:
Bias
- Error due to overly simple assumptions.
- Causes underfitting.
- Model cannot capture the true pattern.
Variance
- Error due to model sensitivity to training data.
- Causes overfitting.
- Model captures noise instead of signal.
Total Error = Bias² + Variance + Irreducible Error → the goal is to balance bias and variance so that total error on unseen data is minimized.
52. Overfitting vs Underfitting
Overfitting
- Model learns training data too well, including noise.
- High accuracy on training set, poor on test set.
- Happens with complex models or small datasets.
Underfitting
- Model is too simple.
- Performs poorly on both training and test sets.
- Fails to capture underlying patterns.
Examples
- Degree-10 polynomial → overfits
- Straight line on nonlinear curve → underfits
Visualization (Python Code)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Generate sample data
np.random.seed(0)
X = np.sort(np.random.rand(20))
y = np.sin(2 * np.pi * X) + np.random.randn(20) * 0.1
# Fit models of varying degrees
degrees = [1, 4, 10]
plt.figure(figsize=(14, 5))
for i, degree in enumerate(degrees):
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X.reshape(-1, 1))
    model = LinearRegression().fit(X_poly, y)
    y_pred = model.predict(X_poly)
    plt.subplot(1, 3, i + 1)
    plt.scatter(X, y, label='Data')
    plt.plot(X, y_pred, color='red', label=f'Degree {degree}')
    plt.title(f"Degree {degree}")
    plt.legend()
plt.tight_layout()
plt.show()

Interpretation
- Degree 1 → Underfitting
- Degree 10 → Overfitting
53. What is Regularization? (L1, L2)
Regularization reduces overfitting by adding a penalty to large coefficients.
L1 Regularization (Lasso)
- Adds |weights| penalty.
- Produces sparse models (sets some weights to 0).
- Good for feature selection.
L2 Regularization (Ridge)
- Adds weights² penalty.
- Shrinks coefficients but does not make them zero.
- Great for multicollinearity.
Python Example
from sklearn.linear_model import Lasso, Ridge, ElasticNet
# Sample data
X = [[1], [2], [3]]
y = [2, 4, 6]
# Lasso (L1)
lasso = Lasso(alpha=0.1).fit(X, y)
# Ridge (L2)
ridge = Ridge(alpha=0.1).fit(X, y)
# ElasticNet (L1 + L2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Lasso Coefficients:", lasso.coef_)
print("Ridge Coefficients:", ridge.coef_)
print("ElasticNet Coefficients:", enet.coef_)
output:-
Lasso Coefficients: [1.85]
Ridge Coefficients: [1.9047619]
ElasticNet Coefficients: [1.79069767]
54. What is Cross-Validation? Why is it Used?
Cross-validation tests how well a model generalizes to unseen data.
Why We Use Cross-Validation
- More reliable performance estimate
- Stable comparison of models
- Reduces dependency on one train-test split
- Helps hyperparameter tuning
Types of Cross-Validation
1. K-Fold Cross-Validation
- Split data into k parts.
- Train on k−1 parts, test on remaining part.
- Repeat k times.
2. Stratified K-Fold
- Ensures class proportions remain consistent.
3. Leave-One-Out (LOO)
- Each sample acts as a test case once.
- High computational cost.
Python Example
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
# Use 10 points so each of the 5 folds has at least 2 test samples (R² needs ≥ 2)
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
model = LinearRegression()
kf = KFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=kf, scoring='r2')
print("Cross-validated R² scores:", scores)
print("Mean R² score:", scores.mean())
output:- Cross-validated R² scores: [1. 1. 1. 1. 1.]
Mean R² score: 1.0
55. What is Bootstrapping?
Bootstrapping is a resampling technique used to estimate statistics when population data is unavailable.
Why Use Bootstrapping?
- Create confidence intervals
- Estimate variability of statistics
- Used in ensemble methods (Bagging, Random Forest)
How Bootstrapping Works
- Sample with replacement from dataset.
- Compute statistic (mean, median, etc.).
- Repeat many times.
- Build distribution of statistic.
Python Example
import numpy as np
import matplotlib.pyplot as plt
# Original data
data = np.random.exponential(scale=2.0, size=1000)
# Bootstrap
n_bootstraps = 1000
bootstrap_means = []
for _ in range(n_bootstraps):
    sample = np.random.choice(data, size=len(data), replace=True)
    bootstrap_means.append(np.mean(sample))
# Plot distribution of bootstrap means
plt.hist(bootstrap_means, bins=30, edgecolor='black')
plt.title('Bootstrap Distribution of Mean')
plt.xlabel('Mean')
plt.ylabel('Frequency')
plt.show()

56. What is the Purpose of the Bonferroni Correction?
When multiple hypothesis tests are performed, the chance of getting at least one false positive (Type I error) increases.
Purpose
The Bonferroni correction adjusts the significance threshold to control the family-wise error rate (FWER).
Formula
If you run n tests with significance level α:
α_adjusted = α / n (each individual test is evaluated at this stricter threshold)
Pros
- Simple
- Very conservative (good for high-risk studies)
Cons
- Too strict → increases Type II errors (false negatives)
Use Cases
- Medical trials
- Genomics (thousands of hypothesis tests)
- Psychology studies
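A short sketch (hypothetical p-values) applying the Bonferroni adjustment with statsmodels:
from statsmodels.stats.multitest import multipletests
# Hypothetical p-values from 5 separate tests
p_values = [0.01, 0.04, 0.03, 0.005, 0.20]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print("Adjusted p-values:", p_adjusted)
print("Reject H0:", reject)
# Equivalent manual rule: compare each raw p-value with α/n = 0.05 / 5 = 0.01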
57. Define Heteroscedasticity. How Does It Affect Regression Models?
Heteroscedasticity
Occurs when the variance of residuals is not constant across observations.
Opposite: Homoscedasticity (constant variance).
Effects on Regression
- OLS coefficient estimates remain unbiased
- Standard errors become biased → unreliable t-tests, F-tests, confidence intervals, and p-values
- Loss of efficiency in OLS estimation
Detection Methods
- Residuals vs fitted plot
- Breusch-Pagan test
- White test
- Goldfeld-Quandt test
Fixes / Remedies
- Use robust standard errors
- Transform variable (log, square root)
- Weighted least squares (WLS)
Example Code: Residual Plot
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Generate data
X = np.random.rand(100)
y = 2 * X + np.random.normal(scale=X*0.5, size=100) # heteroscedastic noise
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
# Plot residuals
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.title("Residuals vs Fitted")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

58. What is Autocorrelation? How is it Detected?
Autocorrelation
Autocorrelation is the correlation of a time series with its own past values.
Causes
- Trends
- Seasonality
- Cyclical patterns
- Structural patterns
Problems in Regression
Autocorrelation violates the assumption that residuals are independent.
Effects:
- Biased standard errors
- Incorrect hypothesis tests
- Overestimation of R²
- Inefficient model estimates
Detection Methods
- ACF (Autocorrelation Function) plot
- Durbin-Watson test
- Ljung–Box test
- PACF plot
Example Code
from statsmodels.graphics.tsaplots import plot_acf
import numpy as np
import matplotlib.pyplot as plt
# Simulate autocorrelated data (random walk)
data = np.cumsum(np.random.normal(size=100))
# ACF plot
plot_acf(data, lags=20)
plt.title("Autocorrelation Plot")
plt.show()

59. Explain the Concept of Stationarity in Time Series Analysis
A time series is stationary when its statistical properties do not change over time.
Properties of Weak (Second-order) Stationarity
- Constant mean
- Constant variance
- Autocovariance depends only on lag, not time
Why It Matters
Most models like AR, MA, ARIMA, SARIMA assume stationarity.
If data is non-stationary, forecasts become unreliable.
Tests for Stationarity
- ADF (Augmented Dickey–Fuller) Test
- KPSS Test
- Phillips–Perron Test
Example Code (ADF Test)
from statsmodels.tsa.stattools import adfuller
import numpy as np
# Example time series
data = np.cumsum(np.random.normal(size=500)) # non-stationary
result = adfuller(data)
print("ADF Statistic:", result[0])
print("p-value:", result[1])
print("Critical Values:", result[4])
output:-
ADF Statistic: -1.4907487351012985
p-value: 0.538082742411247
Critical Values: {'1%': np.float64(-3.4435228622952065), '5%': np.float64(-2.867349510566146), '10%': np.float64(-2.569864247011056)}
Interpretation:
- p-value < 0.05 → reject null → series is stationary
- p-value > 0.05 → non-stationary
60. What is the Difference Between AR, MA, and ARIMA Models?
AR (Autoregressive) Model
The current value depends on its own past values:
X_t = c + φ₁X_{t−1} + … + φ_p X_{t−p} + ε_t
MA (Moving Average) Model
The current value depends on past forecast errors:
X_t = μ + ε_t + θ₁ε_{t−1} + … + θ_q ε_{t−q}
ARIMA (Autoregressive Integrated Moving Average)
Combines AR(p) and MA(q) terms after differencing the series d times to make it stationary → ARIMA(p, d, q).
Example Code: Fit ARIMA(1,1,1)
from statsmodels.tsa.statespace.sarimax import SARIMAX
import numpy as np
# Example data (random walk)
data = np.cumsum(np.random.normal(size=200))
model = SARIMAX(data, order=(1,1,1))
results = model.fit(disp=False)
print(results.summary())

✅ 61. What is a Time Series? Provide an Example
A time series is a sequence of observations collected or recorded at regular time intervals (daily, hourly, monthly, yearly, etc.).
Key Characteristics
- Observations are ordered in time.
- Used for forecasting, trend analysis, and pattern detection.
- Time dependence (today’s value influences tomorrow’s value).
Example
Daily closing stock price of Apple for 5 years.
Corrected Python Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Sample synthetic time series data
dates = pd.date_range(start='2023-01-01', periods=100, freq='D')
values = pd.Series(
np.sin(2 * np.pi * dates.dayofyear / 365) + np.random.normal(0, 0.1, size=100),
index=dates
)
# Plotting
plt.plot(values)
plt.title('Sample Time Series')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True)
plt.show()

✅ 62. Explain the Components of a Time Series
A time series has four main components:
1. Trend
Long-term upward or downward movement.
2. Seasonality
Regular repeating patterns within fixed periods
(e.g., daily, weekly, monthly, yearly).
3. Cyclical
Irregular periodic fluctuations (economic cycles)
— longer duration than seasonality.
4. Random / Noise
Unpredictable variations.
Python Code
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(values, model='additive')
result.plot()
plt.show()

✅ 63. What is Seasonality in Time Series Data?
Seasonality is a pattern that repeats regularly at specific intervals.
Examples
- Retail sales increase in December.
- Electricity usage rises in summer due to AC.
- Website traffic peaks on weekends.
How to Detect Seasonality
- Line plots
- ACF (Autocorrelation Function)
- Seasonal Decomposition (STL)
✅ 64. How Do You Test for Stationarity?
A series is stationary if:
- Mean is constant
- Variance is constant
- Autocovariance doesn’t depend on time
Two Common Tests
1. ADF Test (Augmented Dickey-Fuller)
- H₀ (null): series is non-stationary
- H₁: series is stationary
2. KPSS Test
- H₀: series is stationary
- H₁: series is non-stationary
Python Code
from statsmodels.tsa.stattools import adfuller, kpss
def adf_test(series):
    result = adfuller(series)
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:', result[4])
def kpss_test(series):
    result = kpss(series, regression='c')
    print('KPSS Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:', result[3])
adf_test(values)
kpss_test(values)
output:-
ADF Statistic: -2.1499168088123657
p-value: 0.2249219358296814
Critical Values: {'1%': np.float64(-3.50434289821397), '5%': np.float64(-2.8938659630479413), '10%': np.float64(-2.5840147047458037)}
KPSS Statistic: 1.7085205514598798
p-value: 0.01
Critical Values: {'10%': 0.347, '5%': 0.463, '2.5%': 0.574, '1%': 0.739}
✅ 65. What is Differencing in Time Series Analysis?
Differencing is used to make a time series stationary by removing trends or seasonality.
First-order differencing
y′_t = y_t − y_{t−1}
Purpose
- Remove trend
- Remove seasonality
- Stabilize mean/variance
Python Code
diff_values = values.diff().dropna()
plt.plot(diff_values)
plt.title('First Order Differenced Time Series')
plt.show()

✅ 66. Explain the Concept of Lag in Time Series
A lag is a previous value of a time series, shifted by k time steps.
Lag k means:
using the value y_{t−k} alongside y_t (e.g., lag 1 of today’s value is yesterday’s value).
Why Lags Are Used
- Identify temporal dependencies
- Build features for ML forecasting (lag features)
- Compute autocorrelation (ACF)
- Create AR, MA, ARIMA models
Example Code
df = pd.DataFrame({
'Original': values,
'Lag_1': values.shift(1)
})
print(df.head())
output:-
             Original     Lag_1
2023-01-01   0.093207       NaN
2023-01-02   0.047461  0.093207
2023-01-03   0.100728  0.047461
2023-01-04   0.175987  0.100728
2023-01-05   0.173559  0.175987
✅ 67. What is the Purpose of Autocorrelation and Partial Autocorrelation Plots?
ACF and PACF help determine ARIMA model parameters.
📌 Autocorrelation Function (ACF)
- Shows correlation between a value and its lagged values.
- Identifies MA(q) component.
- Spikes in ACF → significant lags.
📌 Partial Autocorrelation Function (PACF)
- Shows direct correlation after removing intermediate lags.
- Identifies AR(p) component.
Python Code
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
plot_acf(values, lags=20)
plt.show()
plot_pacf(values, lags=20)
plt.show()


✅ 68. Describe the Box-Jenkins Methodology
A structured approach for building ARIMA/SARIMA models.
1. Model Identification
- Use ACF/PACF
- Determine differencing d for stationarity
- Identify AR(p) and MA(q)
2. Parameter Estimation
- Fit ARIMA using maximum likelihood estimation (MLE)
3. Diagnostic Checking
Residuals should:
- Be random (white noise)
- No autocorrelation
- Be normally distributed
4. Forecasting
After model validation, predict future values.
Python Example (ARIMA using SARIMAX)
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt
model = SARIMAX(values, order=(1,1,1))
results = model.fit(disp=False)
print(results.summary())
# Forecast next 10 steps
forecast = results.get_forecast(steps=10)
pred_ci = forecast.conf_int()
predictions = forecast.predicted_mean
plt.figure(figsize=(10,5))
plt.plot(values.index, values, label='Observed')
plt.plot(predictions.index, predictions, label='Forecast')
plt.fill_between(pred_ci.index, pred_ci.iloc[:,0], pred_ci.iloc[:,1], alpha=0.2)
plt.legend()
plt.title("ARIMA Forecast")
plt.show()
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: SARIMAX(1, 1, 1) Log Likelihood 84.214
Date: Tue, 09 Dec 2025 AIC -162.429
Time: 11:31:05 BIC -154.643
Sample: 01-01-2023 HQIC -159.279
- 04-10-2023
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.2834 0.125 -2.271 0.023 -0.528 -0.039
ma.L1 -0.5781 0.129 -4.481 0.000 -0.831 -0.325
sigma2 0.0106 0.001 7.805 0.000 0.008 0.013
===================================================================================
Ljung-Box (L1) (Q): 1.51 Jarque-Bera (JB): 3.90
Prob(Q): 0.22 Prob(JB): 0.14
Heteroskedasticity (H): 0.78 Skew: 0.48
Prob(H) (two-sided): 0.48 Kurtosis: 3.09
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

✅ 69. What is Exponential Smoothing?
A technique that applies exponentially decreasing weights to past observations.
Recent data → higher weight
Older data → lower weight
Types
- Simple Exponential Smoothing (SES): For data with no trend and no seasonality.
- Holt’s Linear Trend: Handles trend.
- Holt–Winters’ Seasonal Method: Handles trend + seasonality (additive or multiplicative).
SES formula: ŷ_{t+1} = α·y_t + (1 − α)·ŷ_t, where 0 < α ≤ 1 is the smoothing level.
Python Code
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
import matplotlib.pyplot as plt
model = SimpleExpSmoothing(values)
fit = model.fit(smoothing_level=0.2, optimized=False)
fitted_values = fit.fittedvalues
plt.plot(values, label='Actual')
plt.plot(fitted_values, label='Smoothed')
plt.legend()
plt.title('Exponential Smoothing')
plt.show()

✅ 70. How Do You Evaluate the Accuracy of a Time Series Model?
Several common evaluation metrics:
1. MAE – Mean Absolute Error: average of |actual − predicted|
2. MSE – Mean Squared Error: average of squared errors
3. RMSE – Root Mean Squared Error: √MSE (same units as the data)
4. MAPE – Mean Absolute Percentage Error: average of |actual − predicted| / |actual| × 100%
✅ Example
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Assume values is your time series
data = values.copy()
# 1. Train-test split
train_size = int(len(data) * 0.8)
train, test = data[:train_size], data[train_size:]
# 2. Fit ARIMA model
model = ARIMA(train, order=(1,1,1))
model_fit = model.fit()
# 3. Predict for length of test set
predicted = model_fit.forecast(steps=len(test))
# 4. Evaluation metrics
mae = mean_absolute_error(test, predicted)
mse = mean_squared_error(test, predicted)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((test - predicted) / test)) * 100
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")
output:
MAE: 0.07
MSE: 0.01
RMSE: 0.09
MAPE: 6.53%
📌 Optional: Plot Actual vs Predicted
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
plt.plot(test.index, test, label='Actual')
plt.plot(test.index, predicted, label='Predicted')
plt.legend()
plt.title("Actual vs Predicted")
plt.show()

✅ 71. What is Bayesian Inference?
Bayesian inference is a statistical method that updates the probability of a hypothesis (parameter) as new evidence or data becomes available.
It is based on Bayes’ Theorem, which reverses conditional probability:
P(θ | D) = P(D | θ) × P(θ) / P(D)
Key Idea
Bayesian inference does not give a single estimate.
Instead, it gives a distribution of possible values with uncertainties.
Example output:
“θ is likely between 0.3 and 0.6 with 95% probability.”
✅ 72. Define Prior, Likelihood, and Posterior
| Term | Description |
|---|---|
| Prior P(θ) | Distribution expressing beliefs about a parameter before seeing data. |
| Likelihood P(D \| θ) | Probability of the observed data given a particular parameter value. |
| Posterior P(θ \| D) | Updated belief about the parameter after combining the prior with the data. |
These three components form the backbone of Bayesian inference.
✅ 73. How Does Bayesian Statistics Differ from Frequentist Statistics?
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameter View | Fixed but unknown | Random variable with distribution |
| Probability | Long-run frequency | Degree of belief |
| Goal | Point estimate (e.g., MLE) | Posterior distribution |
| Uncertainty | Confidence intervals | Credible intervals |
| Use of Prior | No prior used | Prior always used |
| Interpretation | Population-based | Belief-based |
Example Analogy
- Frequentist:
“If we repeat the experiment many times, 95% of the confidence intervals will contain the true value.” - Bayesian:
“Given the data, there is a 95% probability the true value lies inside this credible interval.”
✅ 74. What is a Conjugate Prior?
A conjugate prior is a prior distribution that, when combined with a likelihood, results in a posterior from the same distribution family.
This allows closed-form Bayesian updating.
Examples
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Binomial | Beta | Beta |
| Normal (σ² known) | Normal | Normal |
| Poisson | Gamma | Gamma |
Example: Coin Flips (Binomial + Beta)
from scipy.stats import beta, binom
# Prior: Beta(2, 2)
a_prior, b_prior = 2, 2
# Observed data: 6 heads out of 10 flips
heads, trials = 6, 10
# Posterior: Beta(alpha + heads, beta + failures)
a_post, b_post = a_prior + heads, b_prior + (trials - heads)
print(f"Posterior: Beta({a_post}, {b_post})")
Posterior: Beta(8, 6)
Posterior = Beta(8, 6)
→ Updated belief after seeing evidence.
✅ 75. Explain the Concept of Markov Chain Monte Carlo (MCMC)
MCMC is a class of algorithms used to sample from complex probability distributions, especially when the posterior cannot be computed analytically.
Why do we need MCMC?
- Posteriors are often high-dimensional
- Priors and likelihoods may be non-conjugate
- Posterior integrals cannot be solved analytically
Key Idea
- Build a Markov chain whose equilibrium distribution = target posterior
- After enough steps (burn-in), samples represent the true posterior
Common MCMC Algorithms
✔ 1. Metropolis–Hastings
- Propose a new sample.
- Accept/reject based on acceptance ratio.
✔ 2. Gibbs Sampling
- Sample each variable from its conditional distribution.
- Works when conditional distributions are known.
What MCMC Produces
A set of samples that approximate the posterior:
θ⁽¹⁾, θ⁽²⁾, …, θ⁽ᴺ⁾ ~ P(θ | D)
From these samples, we can compute:
- Mean parameter values
- Credible intervals
- Posterior predictive distributions
✅ 76. What is the Purpose of the Gibbs Sampling Algorithm?
Gibbs Sampling is an MCMC algorithm used to generate samples from a joint posterior distribution when direct sampling is difficult.
Purpose
To sample from a multivariate posterior by repeatedly sampling one parameter at a time from its conditional distribution:
P(θᵢ | all other parameters, data)
When to Use
- Joint distribution is hard to sample
- Conditional distributions are known & easy to sample
- Works very well when distributions are conjugate
Pros
- Efficient in high-dimensional models
- No proposal tuning needed (unlike Metropolis-Hastings)
- Often faster mixing when parameters are conditionally independent
Cons
- Slow if variables are highly correlated
- Requires closed-form conditional distributions
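A minimal illustrative sketch (not from the text): Gibbs sampling for a standard bivariate normal with correlation ρ, where each full conditional is a known univariate normal:
import numpy as np
rho = 0.8            # correlation of the target bivariate normal (assumed)
n_samples = 5000
x, y = 0.0, 0.0      # starting point
cond_sd = np.sqrt(1 - rho**2)   # SD of each conditional distribution
rng = np.random.default_rng(42)
samples = []
for _ in range(n_samples):
    x = rng.normal(rho * y, cond_sd)   # x | y ~ N(rho*y, 1 - rho^2)
    y = rng.normal(rho * x, cond_sd)   # y | x ~ N(rho*x, 1 - rho^2)
    samples.append((x, y))
samples = np.array(samples)
print("Estimated correlation:", np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])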
✅ 77. Describe the Metropolis–Hastings Algorithm
Metropolis–Hastings is a general-purpose MCMC algorithm used to sample from complex posterior distributions.
Steps:
1. Start from an initial value θ.
2. Propose a new value θ* from a proposal distribution q(θ* | θ).
3. Compute the acceptance ratio r = P(θ* | D) / P(θ | D) (with a correction factor if q is not symmetric).
4. Accept θ* with probability min(1, r); otherwise keep the current θ.
5. Repeat to build a chain of samples.
Python Example (Sampling from N(0,1))
import numpy as np
from scipy.stats import norm
def metropolis_hastings(log_posterior, n_samples=1000):
    theta = 0
    samples = [theta]
    for _ in range(n_samples):
        # Propose a new value from a symmetric normal proposal
        theta_proposal = norm.rvs(theta, 1)
        log_p_current = log_posterior(theta)
        log_p_proposal = log_posterior(theta_proposal)
        ratio = np.exp(log_p_proposal - log_p_current)
        # Accept the proposal with probability min(1, ratio)
        if np.random.rand() < ratio:
            theta = theta_proposal
        samples.append(theta)
    return samples
# Example: N(0,1)
log_posterior = lambda x: -0.5 * x**2
samples = metropolis_hastings(log_posterior, n_samples=5000)
Visualization (sketched below):
- Histogram of MH samples
- True N(0,1) density curve
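A sketch of that visualization, reusing the samples and imports from the code above:
import matplotlib.pyplot as plt
plt.hist(samples, bins=50, density=True, alpha=0.6, label='MH samples')
x = np.linspace(-4, 4, 200)
plt.plot(x, norm.pdf(x), 'r-', label='True N(0,1)')
plt.legend()
plt.title('Metropolis-Hastings Samples vs True Density')
plt.show()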
✅ 78. What Are Credible Intervals in Bayesian Statistics?
A credible interval is a range of values within which a parameter lies with a given probability, based on its posterior distribution.
Example:
“A 95% credible interval means there is a 95% probability that the true parameter lies within this interval.”
Types
- Highest Posterior Density (HPD) Interval: the shortest interval containing 95% of the posterior mass.
- Equal-tailed Interval: removes 2.5% of the posterior mass from each tail.
Python Example (HPD using ArviZ)
import arviz as az
import numpy as np
posterior_samples = np.random.normal(0, 1, size=10000)
hpd_interval = az.hdi(posterior_samples, hdi_prob=0.95)
print("95% HPD Interval:", hpd_interval)
✅ 79. How Is Bayesian Updating Performed?
Bayesian updating means sequentially updating beliefs (posterior) as new data arrives.
Process
- Start with a prior
- Observe data → compute posterior
- Posterior becomes the new prior
- Repeat when new data comes
This is used in:
- Machine learning
- Online learning
- Real-time parameter estimation
Example: Updating a Beta Prior for Coin Flips
from scipy.stats import beta
a, b = 1, 1 # Beta(1,1) prior
for flip in ['H', 'T', 'H', 'H']:
    print(f"Before flip '{flip}': Beta({a}, {b})")
    if flip == 'H':
        a += 1
    else:
        b += 1
    print(f"After flip '{flip}': Beta({a}, {b})\n")
Each flip updates the posterior incrementally — this is Bayesian learning.
✅ 80. Provide a Real-World Application of Bayesian Methods
Application: Medical Diagnosis & Disease Probability
Bayesian reasoning helps interpret test results considering base rates (prevalence).
Let:
- Prevalence (prior): P(Disease) = 1%
- Sensitivity: P(Positive | Disease) = 95%
- Specificity: P(Negative | No Disease) = 95%
By Bayes' theorem:
P(Disease | Positive) = (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99) ≈ 0.16
Interpretation
Even with a 95% accurate test, the true chance of having the disease is only ~16% after a positive result.
Reason: Low prevalence (base rate fallacy).
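A short calculation of this result, assuming a prevalence of 1% and a test with 95% sensitivity and 95% specificity (these assumed numbers reproduce the ~16% figure):
prevalence = 0.01      # assumed P(disease)
sensitivity = 0.95     # P(positive | disease)
specificity = 0.95     # P(negative | no disease)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")  # ≈ 0.161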
✅ 81. How Do You Choose the Appropriate Chart for Data Visualization?
Choosing the correct chart depends on:
1. Type of Data
- Categorical → Bar chart / Pie chart
- Numerical → Histogram / Boxplot
- Time series → Line chart
2. Purpose of Visualization
| Purpose | Best Charts |
|---|---|
| Comparison | Bar chart, Column chart, Line chart |
| Distribution | Histogram, Boxplot, KDE (density) plot |
| Relationship | Scatter plot, Bubble chart, Heatmap |
| Composition | Pie chart, Stacked bar chart, Area chart |
| Trend Over Time | Line chart, Area chart |
Examples
- Sales comparison → Bar chart
- Customer age distribution → Histogram or Boxplot
- Temperature over months → Line chart
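For instance, the time-series case could look like this (hypothetical monthly temperatures, for illustration only):
import matplotlib.pyplot as plt
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
temps = [5, 7, 12, 17, 22, 26]  # hypothetical values
plt.plot(months, temps, marker='o')
plt.title('Temperature Over Months')
plt.ylabel('Temperature (°C)')
plt.show()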
✅ 82. Difference Between a Bar Chart and a Histogram
| Feature | Bar Chart | Histogram |
|---|---|---|
| Purpose | Compare categories | Show distribution of continuous data |
| X-axis | Categorical | Continuous (numeric bins) |
| Bars | Can be reordered | Cannot reorder (bins fixed) |
| Spacing | Bars separated | Bars touch each other |
Bar Chart Example
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C']
values = [3, 7, 4]
plt.bar(categories, values)
plt.title('Bar Chart')
plt.show()
Histogram Example
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram')
plt.show()


✅ 83. Use of Scatter Plots in Identifying Relationships
Scatter plots show the relationship between two continuous variables.
What You Can Identify
- Positive correlation → points go up-right
- Negative correlation → points go down-right
- No correlation → random cloud of points
- Outliers
- Clusters
Example
import seaborn as sns
tips = sns.load_dataset('tips')  # built-in example dataset
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Tip vs Total Bill')
plt.show()
✅ 84. What is a Heatmap? When is it Useful?
A heatmap uses color to represent values in a matrix.
Useful For
- Correlation matrices
- Visualizing large numeric tables
- Understanding intensity across rows × columns
- Visualizing missing values (NaN heatmap)
Example: Correlation Heatmap
iris = sns.load_dataset('iris')             # built-in example dataset
corr = iris.drop(columns='species').corr()  # correlation of the numeric columns only
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
✅ 85. How Do You Detect Outliers in a Dataset?
3 Common Methods
1️⃣ IQR METHOD (most common)
Outlier if:

Outlier if:
# Assumes a DataFrame `data` with a numeric 'Values' column
Q1 = data['Values'].quantile(0.25)
Q3 = data['Values'].quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data['Values'] < (Q1 - 1.5 * IQR)) |
                (data['Values'] > (Q3 + 1.5 * IQR))]
print(outliers)
2️⃣ Z-Score Method
Outlier if:

data['z'] = (data['Values'] - data['Values'].mean()) / data['Values'].std()
outliers = data[data['z'].abs() > 3]
3️⃣ Boxplot Visualization
plt.boxplot(data['Values'])
plt.title('Boxplot - Outlier Detection')
plt.show()
✅ 86. What is the Purpose of a Q-Q Plot?
A Q-Q plot (Quantile–Quantile plot) compares the quantiles of a dataset with the quantiles of a theoretical distribution (usually the normal distribution).
Purpose
- Check if data follows a specific distribution (normality test)
- Detect:
- Skewness
- Heavy tails
- Kurtosis issues
- Outliers
Interpretation
- Points on the straight line → data is approximately normal
- S-shaped curve → skewness
- Curved ends → heavy/light tails
- Extreme deviations → outliers
Code Example
import numpy as np
import statsmodels.graphics.gofplots as smg
import matplotlib.pyplot as plt
# Generate skewed data
data = np.random.exponential(size=100)
smg.qqplot(data, line='s')
plt.title('Q-Q Plot - Checking Normality')
plt.show()
✅ 87. Importance of Data Normalization
Normalization rescales values to a fixed range, usually [0, 1].
Why It Is Important
- Models like KNN, SVM, Logistic Regression, Neural Networks are sensitive to feature scale
- Prevents one feature from dominating others
- Improves gradient descent convergence speed
- Helps distance-based algorithms work correctly
Example
from sklearn.preprocessing import MinMaxScaler
import numpy as np
data = np.array([[1], [2], [3], [10]])
scaler = MinMaxScaler()
normalized = scaler.fit_transform(data)
print("Normalized Data:\n", normalized)
✅ 88. Difference Between Normalization and Standardization

Code Example
from sklearn.preprocessing import StandardScaler
standardized = StandardScaler().fit_transform(data)
print("Standardized Data:\n", standardized)
✅ 89. How Do You Handle Missing Data?
Common Strategies
✔ 1. Remove Missing Data
- Drop rows with any missing value → df.dropna()
- Drop columns with too many missing values
✔ 2. Impute Missing Values
- Mean/Median for numerical data
- Mode for categorical data
- Regression / ML-based imputation
- KNN imputation
- Interpolation for time-series
Code Example
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [1, 2, 3]
})
# Drop rows with any NaN
df_dropped = df.dropna()
print("After dropping rows:\n", df_dropped)
# Impute with mean
df_imputed = df.fillna(df.mean(numeric_only=True))
print("After imputing with mean:\n", df_imputed)
✅ 90. Implications of Imbalanced Datasets
An imbalanced dataset means one class has many more samples than another (e.g., fraud detection, medical diagnoses).
Problems
- Model becomes biased toward majority class
- High accuracy but poor minority detection
- Fails on rare but important events
Solutions
✔ 1. Resampling
- Oversampling minority class (e.g., SMOTE)
- Undersampling majority class
✔ 2. Class Weights
- Give higher penalty to minority class misclassification
✔ 3. Use Better Metrics
- Precision
- Recall
- F1-score
- ROC-AUC
Example (Class-Weighted Logistic Regression)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.datasets import make_imbalance
from sklearn.datasets import make_classification
# Create synthetic imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=15,
    n_redundant=5, random_state=42
)
X, y = make_imbalance(X, y, sampling_strategy={0: 900, 1: 100}, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(classification_report(y_test, pred))
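Alternatively, the minority class can be oversampled with SMOTE before training; a sketch reusing the same train/test split (SMOTE is applied only to the training data to avoid leakage):
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
model_smote = LogisticRegression(max_iter=1000)
model_smote.fit(X_res, y_res)
print(classification_report(y_test, model_smote.predict(X_test)))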
✅ 91. How Would You Design an A/B Test for a New Website Feature?
An A/B test compares two versions (A = control, B = treatment) to see which performs better.
Steps to Design an A/B Test
- Define the Objective
  Example: Increase sign-ups by 10%.
- Formulate the Hypothesis
  - H₀: No difference between A and B
  - H₁: B improves the conversion rate
- Choose Metrics
  - Primary metric: conversion rate
  - Secondary: CTR, bounce rate, session duration
- Random Assignment
  Randomly split users into A and B to avoid bias.
- Calculate the Sample Size
  Use statistical power analysis.
- Run the Test
  Ensure:
  - Test duration is sufficient
  - No overlapping experiments
  - Stable traffic
- Analyze the Results
  - Use a z-test, t-test, or chi-square test
  - Check confidence intervals
  - Ensure both practical and statistical significance
Python Example
from statsmodels.stats.power import zt_ind_solve_power
import numpy as np
baseline_rate = 0.05
desired_improvement = 0.01 # 1%
effect_size = desired_improvement / np.sqrt(baseline_rate * (1 - baseline_rate))
required_sample = zt_ind_solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, ratio=1
)
print(f"Required sample per group: {int(required_sample)}")
✅ 92. What Metrics Would You Consider to Evaluate an A/B Test?
| Metric | Purpose |
|---|---|
| Conversion Rate | Main success metric (sign-ups, purchases) |
| CTR | Measures click engagement |
| Average Time on Page | Indicates engagement depth |
| Revenue per User | Measures monetary effect |
| Bounce Rate | Shows user dissatisfaction |
| Retention Rate | Long-term user behavior |
Important
- Choose one primary metric
- Use secondary metrics to detect unexpected negative effects
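Once the test has run, the primary metric is usually compared with a two-proportion z-test; a sketch with hypothetical conversion counts:
from statsmodels.stats.proportion import proportions_ztest
conversions = [520, 575]   # hypothetical conversions in groups A and B
visitors = [10000, 10000]  # hypothetical users per group
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")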
✅ 93. How Do You Handle Confounding Variables in an Experiment?
A confounder affects both the treatment and the outcome → makes the results unreliable.
Ways to Handle Confounders
- Randomization
  Random assignment distributes confounders evenly.
- Stratification
  Example: Split users by age group before randomizing.
- Covariate Adjustment
  Add confounders to a regression model.
- Matched Pairing
  Pair users with similar characteristics (age, gender, traffic source).
- Use Control Groups
  To isolate the effect of the new feature.
Example: Control for Age
import pandas as pd
import statsmodels.api as sm
import numpy as np
df = pd.DataFrame({
    'group': ['A', 'B'] * 50,
    'age': np.random.randint(18, 65, 100),
    'converted': np.random.choice([0, 1], p=[0.9, 0.1], size=100)
})
# Encode the group as a numeric treatment indicator (B = 1, A = 0) before fitting
df['treatment'] = (df['group'] == 'B').astype(int)
X = sm.add_constant(df[['treatment', 'age']])
y = df['converted']
model = sm.Logit(y, X).fit()
print(model.summary())
✅ 94. Describe a Situation Where You Had to Choose Between Precision and Recall
Example: Fraud Detection
- Recall is more important
→ We must catch as many fraud cases as possible
→ Even if false positives increase
Why?
- Missing a fraudulent transaction (false negative) is very costly
- Flagging a legitimate user (false positive) is less costly
Trade-off
- High precision + low recall → detect few frauds
- High recall + lower precision → detect most frauds, but more false alerts
Balanced Metric
Use F1-score when both matter.
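In practice this trade-off is often managed by lowering the decision threshold on predicted probabilities to favor recall; a sketch assuming a fitted binary classifier `model` (with `predict_proba`) and a held-out `X_test`, `y_test` (hypothetical names):
from sklearn.metrics import precision_score, recall_score
probs = model.predict_proba(X_test)[:, 1]
for threshold in [0.5, 0.3, 0.1]:
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, preds):.2f}, "
          f"recall={recall_score(y_test, preds):.2f}")
Lower thresholds flag more cases as positive, which raises recall at the cost of precision.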
✅ 95. How Do You Assess the Performance of a Classification Model?
Common Evaluation Metrics
| Metric | Meaning |
|---|---|
| Accuracy | % of correct predictions (bad for imbalanced data) |
| Precision | TP / (TP + FP) – correctness of positive predictions |
| Recall | TP / (TP + FN) – how many actual positives detected |
| F1-score | Harmonic mean of precision & recall |
| ROC-AUC | Ability to separate classes at all thresholds |
| Log Loss | Penalizes confident but wrong predictions |
Code Example
from sklearn.metrics import classification_report, roc_auc_score
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
print(classification_report(y_true, y_pred))
print("ROC-AUC:", roc_auc_score(y_true, y_pred))
✅ 96. What is a Confusion Matrix? How Is It Interpreted?
A confusion matrix is a table used to evaluate the performance of a classification model.
It shows the number of:
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |
🔹 Interpretation
- TP (True Positive): Model correctly predicts positive class
- TN (True Negative): Model correctly predicts negative class
- FP (False Positive): Model predicts positive but it is actually negative (Type I error)
- FN (False Negative): Model predicts negative but it is actually positive (Type II error)
Code:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)
✅ 97. Explain Precision, Recall, and F1-Score

Code:
from sklearn.metrics import precision_score, recall_score, f1_score
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))
✅ 98. How Do You Handle Multicollinearity in Regression?
Multicollinearity occurs when independent variables are highly correlated.
This makes coefficient estimates unstable and unreliable.
🔍 How to Detect:
- Correlation matrix
- Variance Inflation Factor (VIF)
- VIF > 5 or 10 = high multicollinearity
🛠️ How to Fix:
- Remove one of the correlated features
- Use regularization (Ridge or Lasso)
- Combine correlated variables using PCA
- Domain-specific feature engineering
Code:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# Assumes `df` is a DataFrame containing only the numeric predictor columns
vif_data = pd.DataFrame()
vif_data["feature"] = df.columns
vif_data["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
print(vif_data)
✅ 99. Describe a Time When You Cleaned a Messy Dataset
Scenario:
You receive customer feedback data with:
- Missing values
- Duplicates
- Inconsistent text
- Extra spaces
- Mixed data types
Steps You Took:
- Load and inspect dataset
- Handle missing values
- Remove duplicate rows
- Standardize text (strip, lowercase)
- Convert columns to numeric
- Create new features (e.g., length of feedback)
Code:
import pandas as pd
df = pd.read_csv('messy_data.csv')
df.drop_duplicates(inplace=True)
df.fillna({'feedback': ''}, inplace=True)
df['feedback'] = df['feedback'].str.strip().str.lower()
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
print(df.head())
✅ 100. How Do You Communicate Statistical Findings to Non-Technical Stakeholders?
🔑 Best Practices:
- Tell a clear story that connects data to business impact
- Use visuals (bar charts, line charts) instead of raw tables
- Avoid jargon — use simple language
- Focus on insights, not equations
- Provide recommendations, not just metrics
Example:
❌ Technical Explanation:
“Variant B has a statistically significant improvement in CTR (p < 0.05).”
✔ Business-friendly Explanation:
“Variant B increased click-through rate by 10%. If applied to all users, this could significantly increase engagement.”
