General Data Analysis: Answers to 20 Questions
1. What is the role of a data analyst in a business?
Answer: A data analyst helps a business make better decisions by studying data. They collect, clean, and analyze data to find patterns or trends, then share these insights with the business team. For example, if a store wants to know which products sell best, a data analyst looks at sales data and tells the store manager, “Your top-selling product is sneakers, especially in winter.” They use tools like Excel, SQL, or Tableau to do this.
Example: At a coffee shop, a data analyst might analyze customer orders and find that most people buy lattes on weekends. They suggest offering a weekend latte discount to boost sales.
2. Explain the difference between data mining and data profiling.
Answer:
- Data Mining: This is like digging through data to find hidden patterns or trends. It uses advanced techniques (like machine learning) to discover new insights. For example, a supermarket might use data mining to find that people who buy bread also buy butter.
- Data Profiling: This is about checking the data to understand its quality and structure. It’s like taking a quick look at the data to see if it’s clean or has errors, like missing values or duplicates. For example, profiling might show that 10% of customer names in a database are blank.
Example:
- Data Mining: A movie streaming service finds that users who watch comedies also watch action movies.
- Data Profiling: The same service checks its database and finds some movie titles are misspelled or missing release dates.
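A small pandas sketch of the profiling side (the file name and column names are assumptions for illustration, not from a real dataset):
import pandas as pd

# Load the movie catalog and check basic quality indicators.
movies = pd.read_csv("movies.csv")          # expected columns: title, release_date
print(movies.isna().mean() * 100)           # percentage of missing values per column
print(movies["title"].duplicated().sum())   # number of duplicate titles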
3. What is data wrangling, and why is it important?
Answer: Data wrangling is the process of cleaning and organizing messy data so it’s ready for analysis. It’s like tidying up a messy room before you can use it. This includes fixing errors, filling in missing values, or combining data from different sources. It’s important because clean data leads to accurate results. If the data is messy, your analysis might be wrong.
Example: A company has sales data from two stores, but one store’s data has dates written as “Jan 2024” and the other as “01/2024.” A data analyst wrangles the data by making all dates the same format so they can compare sales easily.
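Here is a minimal pandas sketch of that date-format fix (the frame and column names are made up for illustration):
import pandas as pd

# Two stores report the same months in different formats.
store_a = pd.DataFrame({"month": ["Jan 2024", "Feb 2024"], "sales": [1000, 1200]})
store_b = pd.DataFrame({"month": ["01/2024", "02/2024"], "sales": [800, 950]})

# Parse each format into a real date, then combine for comparison.
store_a["month"] = pd.to_datetime(store_a["month"], format="%b %Y")
store_b["month"] = pd.to_datetime(store_b["month"], format="%m/%Y")
combined = pd.concat([store_a, store_b], ignore_index=True)
print(combined)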
4. Describe the steps in exploratory data analysis (EDA).
Answer: EDA is like exploring a new city to understand it before planning a trip. It’s the process of looking at data to understand its patterns and characteristics. The steps are:
- Understand the Data: Look at the data’s structure (e.g., columns, rows) and what each part means.
- Clean the Data: Fix errors like missing values or duplicates.
- Summarize the Data: Calculate basics like averages, maximums, or minimums.
- Visualize the Data: Create charts (like bar graphs or scatter plots) to spot trends.
- Find Patterns: Look for relationships, like “Do sales increase in winter?”
Example: For a restaurant’s sales data, you might:
- Check that the data has columns like “date,” “item,” and “price.”
- Remove rows with missing prices.
- Find the average daily sales.
- Plot a graph showing sales by day.
- Notice that pizza sells more on Fridays.
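In pandas, that walkthrough might look roughly like this (the file and column names are assumptions matching the example above):
import pandas as pd

df = pd.read_csv("restaurant_sales.csv")    # expected columns: date, item, price

df.info()                                   # 1. understand: columns, types, row count
df = df.dropna(subset=["price"])            # 2. clean: drop rows with missing prices
print(df["price"].describe())               # 3. summarize: mean, min, max, etc.
df["date"] = pd.to_datetime(df["date"])
daily = df.groupby(df["date"].dt.day_name())["price"].sum()
daily.plot(kind="bar")                      # 4. visualize: sales by weekday
print(daily.sort_values(ascending=False))   # 5. find patterns, e.g. a Friday peak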
5. What is the difference between descriptive and predictive analytics?
Answer:
- Descriptive Analytics: This looks at what happened in the past. It summarizes data to show trends, like “How many cars were sold last year?”
- Predictive Analytics: This tries to guess what will happen in the future based on past data. It uses models to predict, like “How many cars will we sell next year?”
Example:
- Descriptive: A store finds that 500 laptops were sold in 2023.
- Predictive: The store uses 2023 data to predict they’ll sell 600 laptops in 2024 because sales are growing.
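A toy Python sketch of the difference (the numbers and the naive straight-line forecast are illustrative, not a real forecasting model):
# Descriptive: summarize what already happened.
yearly_sales = {2021: 350, 2022: 420, 2023: 500}   # laptops sold per year
print("2023 sales:", yearly_sales[2023])

# Predictive: extend the recent trend to guess next year.
avg_growth = (yearly_sales[2023] - yearly_sales[2021]) / 2
forecast_2024 = yearly_sales[2023] + avg_growth
print("2024 forecast:", round(forecast_2024))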
6. How do you ensure data quality in your analysis?
Answer: Data quality means the data is accurate, complete, and reliable. To ensure this:
- Check for Errors: Look for typos, duplicates, or wrong formats (e.g., a date like “32/13/2024”).
- Handle Missing Data: Fill in missing values or remove incomplete rows.
- Validate Data: Compare data with a trusted source (e.g., check if sales match receipts).
- Use Consistent Formats: Make sure all data follows the same rules (e.g., all prices in dollars).
- Document Changes: Keep track of what you fixed so others can trust your work.
Example: If analyzing customer data, you notice some ages are listed as “999.” You replace these with the average age or remove those rows to keep the data reliable.
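A short pandas sketch of that age cleanup (the column names and the 0-120 validity range are assumptions):
import pandas as pd

df = pd.DataFrame({"customer": ["A", "B", "C"], "age": [34, 999, 28]})

# Keep only plausible ages; anything outside 0-120 becomes missing (NaN).
df["age"] = df["age"].where(df["age"].between(0, 120))
# Fill the gaps with the average of the remaining valid ages.
df["age"] = df["age"].fillna(df["age"].mean())
print(df)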
7. What is data normalization, and when is it used?
Answer: Data normalization is like putting data on the same scale so it’s easier to compare. For example, if one column has values from 0 to 100 and another from 0 to 1,000,000, normalization adjusts them to a common range (like 0 to 1). It’s used when analyzing data with different scales, especially in machine learning or comparisons.
Example: A dataset has “salary” (from $20,000 to $200,000) and “hours worked” (from 10 to 50). Normalizing both to a 0-1 scale makes it easier to compare their impact on employee performance.
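A minimal pandas sketch of min-max normalization for that example (column names assumed):
import pandas as pd

df = pd.DataFrame({"salary": [20000, 90000, 200000], "hours_worked": [10, 35, 50]})

# Rescale each column to the 0-1 range: (value - min) / (max - min).
for col in ["salary", "hours_worked"]:
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
print(df)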
8. Explain the concept of data aggregation.
Answer: Data aggregation is combining data to summarize it. It’s like adding up all your weekly expenses to see your total spending. You group data (e.g., by month or region) and calculate things like sums, averages, or counts.
Example: A store has daily sales data: $100 on Monday, $150 on Tuesday, etc. Aggregating by week might show total sales of $700 for the week, making it easier to see trends.
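A quick pandas sketch of rolling daily sales up to weekly totals (the dates and amounts are illustrative):
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "sales": [100, 150, 90, 110, 120, 80, 50] * 2,
})

# Group the daily rows into calendar weeks and sum each group.
weekly = daily.groupby(pd.Grouper(key="date", freq="W"))["sales"].sum()
print(weekly)   # each week totals 700 in this toy data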
9. What are the key differences between structured and unstructured data?
Answer:
- Structured Data: Organized in a clear format, like a table with rows and columns (e.g., a spreadsheet of customer names and purchases). It’s easy to search and analyze.
- Unstructured Data: Not organized in a specific format, like emails, videos, or social media posts. It’s harder to analyze because it doesn’t fit neatly into tables.
Example:
- Structured: A database with columns for “Customer ID,” “Name,” and “Order Date.”
- Unstructured: A folder of customer review videos or text comments on a website.
10. How do you handle large datasets that don’t fit into memory?
Answer: If a dataset is too big for your computer’s memory, you can:
- Use Smaller Chunks: Load the data in parts (e.g., 10,000 rows at a time) using tools like pandas in Python.
- Use Databases: Store data in a database (like SQL) and query only what you need.
- Sample the Data: Work with a smaller, representative subset of the data.
- Use Cloud Tools: Tools like Google BigQuery or AWS can handle huge datasets online.
Example: If analyzing 1 billion sales records, you might load 100,000 records at a time in Python or use a SQL database to process only the sales from 2024.
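A hedged pandas sketch of the chunked approach (the file name, column names, and chunk size are assumptions):
import pandas as pd

# Read 100,000 rows at a time so the full file never has to fit in memory.
total_2024 = 0
for chunk in pd.read_csv("sales_records.csv", chunksize=100_000, parse_dates=["order_date"]):
    chunk_2024 = chunk[chunk["order_date"].dt.year == 2024]
    total_2024 += chunk_2024["amount"].sum()
print(total_2024)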
11. What is the difference between a data lake and a data warehouse?
Answer:
- Data Lake: A storage system that holds raw, unprocessed data in any format (like images, text, or tables). It’s like a big storage room for all kinds of data.
- Data Warehouse: A storage system for organized, processed data, usually in tables, ready for analysis. It’s like a neat library with sorted books.
Example:
- Data Lake: Stores raw customer tweets, sales logs, and website clicks.
- Data Warehouse: Stores cleaned sales data in tables with columns like “Date,” “Product,” and “Revenue.”
12. How do you validate the results of your data analysis?
Answer: Validating means making sure your results are correct. You can:
- Check Calculations: Recalculate key numbers manually or with a different tool.
- Compare with Known Data: Match your results to a trusted source (e.g., company reports).
- Use Visualizations: Look at charts to spot weird patterns (e.g., a sudden sales spike).
- Get Feedback: Ask a colleague to review your work.
- Test Assumptions: If you assumed something (e.g., no missing data), check if it’s true.
Example: If you calculate average monthly sales as $10,000, you compare it to last year’s report ($9,800) and plot a graph to ensure there’s no sudden jump that looks wrong.
13. What is data deduplication, and how is it performed?
Answer: Data deduplication is removing duplicate records to keep only one copy of each unique entry. It’s like cleaning out a contact list so you don’t have the same person listed twice. You can:
- Identify Duplicates: Use tools like Excel or Python to find identical rows.
- Remove Duplicates: Keep one record and delete the rest.
- Check Key Columns: Focus on unique identifiers (e.g., customer ID) to spot duplicates.
Example: A customer list has two entries for “John Smith, ID 123.” In Excel, you use the “Remove Duplicates” tool on the “ID” column to keep only one record.
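The same idea in pandas, as a small sketch (column names assumed):
import pandas as pd

customers = pd.DataFrame({
    "id": [123, 123, 456],
    "name": ["John Smith", "John Smith", "Ana Lopez"],
})

# Keep the first row for each ID and drop the rest.
deduped = customers.drop_duplicates(subset=["id"], keep="first")
print(deduped)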
14. Explain the importance of data governance in analytics.
Answer: Data governance is a set of rules to manage data properly, ensuring it’s accurate, secure, and used correctly. It’s like having school rules to keep things organized. It’s important because:
- It keeps data trustworthy for analysis.
- It protects sensitive data (like customer info).
- It ensures everyone follows the same data standards.
Example: A company’s data governance policy might require all customer data to be anonymized before analysis to protect privacy.
15. What are the challenges of working with real-time data?
Answer: Real-time data is data that updates instantly, like stock prices. Challenges include:
- Speed: You need fast tools to process data as it arrives.
- Errors: Real-time data can have glitches (e.g., a wrong price).
- Volume: Lots of data comes quickly, which can overwhelm systems.
- Accuracy: Ensuring the data is correct in real-time is hard.
Example: A website tracking live user clicks might crash if 10,000 users click at once, or it might record some clicks incorrectly during a server lag.
16. How do you choose the right visualization for a dataset?
Answer: Choosing a visualization depends on what you want to show:
- Compare Values: Use bar charts (e.g., sales by product).
- Show Trends: Use line charts (e.g., sales over time).
- Show Proportions: Use pie charts (e.g., market share).
- Show Relationships: Use scatter plots (e.g., price vs. sales).
- Show Distributions: Use histograms (e.g., customer age groups).
Example: To show how a store’s sales changed from January to December, a line chart is best because it highlights the trend over time.
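A minimal matplotlib sketch of that monthly-sales line chart (the numbers are invented for illustration):
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = [1200, 1100, 1300, 1250, 1400, 1500,
         1450, 1600, 1550, 1700, 1800, 2100]

# A line chart makes the upward trend across the year easy to see.
plt.plot(months, sales, marker="o")
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Sales ($)")
plt.show()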
17. What is the difference between batch processing and stream processing?
Answer:
- Batch Processing: Analyzing data in large groups at once, like processing all sales from last month. It’s slower but good for big, stable datasets.
- Stream Processing: Analyzing data as it arrives, like processing live website clicks. It’s faster but harder to manage.
Example:
- Batch: Calculating total sales for 2023 at year-end.
- Stream: Tracking live user activity on a website during a sale.
18. How do you prioritize tasks when analyzing multiple datasets?
Answer: To prioritize:
- Understand Goals: Focus on datasets that answer the most important business question.
- Check Deadlines: Work on tasks with urgent deadlines first.
- Assess Impact: Prioritize datasets that affect big decisions (e.g., sales vs. website clicks).
- Evaluate Effort: Start with quicker tasks to build momentum.
- Communicate: Ask your manager if you’re unsure what’s most important.
Example: If a company needs sales data for a board meeting tomorrow but also wants website data next week, you analyze the sales data first.
19. What is data skew, and how does it affect analysis?
Answer: Data skew is when data is unevenly distributed, like having way more sales in one month than others. It can mess up analysis by making results misleading. For example, an average might look high because of one extreme month.
Example: If a store’s sales are $1,000 most months but $50,000 in December, the average sales will seem high, hiding the normal trend. You might fix this by analyzing months separately.
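A tiny Python sketch showing how the December spike distorts the mean while the median stays at the typical value (numbers taken from the example above):
# Eleven normal months plus one extreme December.
monthly_sales = [1000] * 11 + [50000]

mean = sum(monthly_sales) / len(monthly_sales)
median = sorted(monthly_sales)[len(monthly_sales) // 2]
print(f"mean = {mean:.0f}, median = {median}")   # mean is about 5083, median is 1000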
20. Explain the concept of data lineage and its importance.
Answer: Data lineage is like a map showing where data comes from and how it’s changed. It tracks the data’s journey from its source (e.g., a sales system) to your final report. It’s important because:
- It helps you trust the data by showing its origin.
- It makes it easier to find and fix errors.
- It ensures compliance with rules (e.g., for privacy).
Example: If a report shows wrong sales numbers, data lineage shows the data came from a faulty sales database, so you can fix the source.
SQL: Answers to 40 Questions
21. What is the difference between WHERE and HAVING clauses?
Why It’s Important: As a data analyst, you’ll often filter data to focus on specific insights. Knowing when to use WHERE vs. HAVING ensures accurate results, especially when summarizing data with GROUP BY.
Detailed Answer:
- WHERE: Filters individual rows before any grouping happens. It’s like picking specific items from a grocery list based on a rule (e.g., only apples costing less than $2).
- HAVING: Filters groups after GROUP BY is applied. It’s like looking at a summary of purchases by category and only keeping categories with total sales over $100.
Step-by-Step:
- WHERE is applied first to raw data, reducing the number of rows.
- GROUP BY groups the filtered rows.
- HAVING checks the grouped results and filters them.
Example 1:
- Table: employees (name, department, salary)
- John, Sales, 60000
- Jane, Sales, 70000
- Bob, HR, 50000
- Query: SELECT department, COUNT(*) FROM employees WHERE salary > 55000 GROUP BY department;
- Filters rows with salary > 55000 (John, Jane), then groups by department.
- Result: Sales, 2
- Query: SELECT department, COUNT(*) FROM employees GROUP BY department HAVING COUNT(*) > 1;
- Groups all rows, then keeps departments with more than 1 employee.
- Result: Sales, 2
Example 2:
- Query: SELECT department, AVG(salary) FROM employees WHERE salary > 50000 GROUP BY department HAVING AVG(salary) > 65000;
- WHERE filters employees with salary > 50000.
- GROUP BY calculates average salary per department.
- HAVING keeps departments where the average salary is > 65000.
Interview Follow-Up:
- “Can you use WHERE and HAVING together?” (Yes, as shown above.)
- “What happens if you use HAVING without GROUP BY?” (The whole result is treated as a single group, so HAVING filters on aggregates computed over all rows.)
22. Write a query to find the second-highest salary in a table.
Why It’s Important: This tests your ability to solve ranking problems, which are common in data analysis when identifying top performers or outliers.
Detailed Answer: To find the second-highest salary, you need to exclude the highest salary and then find the maximum of the remaining values.
Query:
SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);
Step-by-Step:
- Inner subquery (SELECT MAX(salary) FROM employees) finds the highest salary.
- WHERE salary < … excludes the highest salary.
- Outer MAX(salary) finds the highest remaining salary.
Example 1:
- Table: employees (name, salary)
- John, 60000
- Jane, 80000
- Bob, 70000
- Highest salary: 80000.
- Query finds max salary < 80000, which is 70000.
- Result: 70000
Example 2 (Using LIMIT/OFFSET):
- Alternative Query:
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
- Sorts salaries in descending order (80000, 70000, 60000), skips the first (OFFSET 1), and takes the next (LIMIT 1).
- Result: 70000
Interview Follow-Up:
- “What if there are duplicate salaries?” (Use DISTINCT to avoid duplicates.)
- “How would you find the nth highest salary?” (Generalize with LIMIT/OFFSET or a ranking function.)
23. How do you handle NULL values in SQL queries?
Why It’s Important: NULLs are common in real-world data, and mishandling them can lead to incorrect analysis. Data analysts must know how to detect, filter, or replace NULLs.
Detailed Answer: NULL represents missing or unknown data. It’s not zero or an empty string. To handle NULLs:
- Check for NULL: Use IS NULL to find NULLs or IS NOT NULL to exclude them.
- Replace NULL: Use COALESCE(value1, value2, …) to return the first non-NULL value or IFNULL(value, default) in MySQL.
- Aggregate Functions: Functions like SUM or AVG ignore NULLs, but COUNT(*) includes them.
Step-by-Step:
- Identify columns with NULLs using SELECT column WHERE column IS NULL.
- Decide whether to exclude (filter), replace (COALESCE), or keep NULLs.
- Test the query to ensure NULLs don’t skew results.
Example 1:
- Table: orders (order_id, customer_name)
- 1, John
- 2, NULL
- Query: SELECT order_id, COALESCE(customer_name, 'Unknown') FROM orders;
- Replaces NULL with “Unknown.”
- Result:
- 1, John
- 2, Unknown
Example 2:
- Query: SELECT COUNT(customer_name) AS named, COUNT(*) AS total FROM orders;
- COUNT(customer_name) counts non-NULL names (1).
- COUNT(*) counts all rows (2).
- Result: named = 1, total = 2
Interview Follow-Up:
- “What’s the difference between NULL and an empty string?” (NULL is no value; an empty string is a value.)
- “How do NULLs affect JOINs?” (They don’t match in joins unless explicitly handled.)
24. Explain the difference between INNER JOIN and LEFT JOIN.
Why It’s Important: Joins are critical for combining data from multiple tables, a core task for data analysts working with relational databases.
Detailed Answer:
- INNER JOIN: Returns only rows where there’s a match in both tables. It’s like finding friends who are in both Club A and Club B.
- LEFT JOIN: Returns all rows from the left table, with matching rows from the right table (NULL if no match). It’s like listing all Club A members and their Club B status (NULL if not in Club B).
Step-by-Step:
- Identify the tables and the joining condition (e.g., ON customers.id = orders.customer_id).
- Use INNER JOIN for matched data only, or LEFT JOIN to include all left table rows.
- Check results for NULLs in LEFT JOIN to understand unmatched rows.
Example 1:
- Tables:
- customers (id, name): (1, John), (2, Jane)
- orders (order_id, customer_id): (101, 1)
- INNER JOIN Query: SELECT customers.name, orders.order_id FROM customers INNER JOIN orders ON customers.id = orders.customer_id;
- Result: John, 101 (only matched rows)
- LEFT JOIN Query: SELECT customers.name, orders.order_id FROM customers LEFT JOIN orders ON customers.id = orders.customer_id;
- Result:
- John, 101
- Jane, NULL
Example 2:
- Query: SELECT customers.name, COUNT(orders.order_id) FROM customers LEFT JOIN orders ON customers.id = orders.customer_id GROUP BY customers.name;
- Counts orders per customer, including those with zero orders (Jane gets 0).
Interview Follow-Up:
- “What’s a RIGHT JOIN?” (Like LEFT JOIN but keeps all right table rows.)
- “When would you use INNER JOIN over LEFT JOIN?” (When you only want matched data, e.g., customers with orders.)
25. Write a query to calculate the running total of sales by month.
Why It’s Important: Running totals help analysts track cumulative metrics over time, like sales growth, which is common in business reporting.
Detailed Answer: A running total adds up values as you move through rows, ordered by something like date. Use the SUM window function with OVER to calculate it.
Query:
SELECT
MONTH(order_date) AS month,
SUM(sales_amount) AS monthly_sales,
SUM(SUM(sales_amount)) OVER (ORDER BY MONTH(order_date)) AS running_total
FROM sales
GROUP BY MONTH(order_date);
Step-by-Step:
- GROUP BY MONTH(order_date) groups sales by month.
- SUM(sales_amount) calculates total sales per month.
- SUM(SUM(sales_amount)) OVER (ORDER BY MONTH(order_date)) adds monthly sales cumulatively.
Example 1:
- Table: sales (order_date, sales_amount)
- 2024-01-01, 100
- 2024-01-02, 200
- 2024-02-01, 300
- Result:
- Month 1, 300, 300
- Month 2, 300, 600
Example 2:
- Query with Year:
SELECT
YEAR(order_date), MONTH(order_date),
SUM(sales_amount) AS monthly_sales,
SUM(SUM(sales_amount)) OVER (ORDER BY YEAR(order_date), MONTH(order_date)) AS running_total
FROM sales
GROUP BY YEAR(order_date), MONTH(order_date);
- Handles multi-year data, e.g., 2023-12 to 2024-01.
Interview Follow-Up:
- “How would you calculate a running total by customer?” (Add PARTITION BY customer_id to the OVER clause.)
- “What if you want a running total for all rows without grouping?” (Remove GROUP BY and use SUM(sales_amount) OVER (ORDER BY order_date).)
26. What is a self-join, and when would you use it?
Why It’s Important: Self-joins are useful for analyzing relationships within a single table, like organizational hierarchies or paired data, which analysts encounter in HR or network data.
Detailed Answer: A self-join joins a table to itself, treating it as two separate tables. It’s used when you need to compare rows within the same table, like finding employees and their managers.
Step-by-Step:
- Alias the table twice (e.g., e1 and e2) to treat it as two tables.
- Join on a condition relating columns (e.g., e1.manager_id = e2.id).
- Use LEFT JOIN to include rows without matches (e.g., employees without managers).
Example 1:
- Table: employees (id, name, manager_id)
- 1, John, NULL
- 2, Jane, 1
- 3, Bob, 1
- Query:
SELECT e1.name AS employee, e2.name AS manager
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.id;
- Result:
- Jane, John
- Bob, John
- John, NULL
Example 2:
- Use case: Find pairs of employees in the same department.
- Query:
SELECT e1.name, e2.name, e1.department
FROM employees e1
JOIN employees e2
ON e1.department = e2.department AND e1.id < e2.id;
- Avoids duplicate pairs by using e1.id < e2.id.
Interview Follow-Up:
- “How do you avoid duplicate pairs in a self-join?” (Use a condition like e1.id < e2.id.)
- “Can you use a self-join for time-based comparisons?” (Yes, e.g., compare an employee’s salary changes over time.)
27. How do you use the CASE statement in SQL?
Why It’s Important: The CASE statement allows analysts to add custom logic to queries, like categorizing data or creating flags, which is essential for flexible reporting.
Detailed Answer: The CASE statement works like “if-then-else” logic. It checks conditions and returns values based on them, useful for transforming or categorizing data.
Syntax:
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
ELSE result3
END
Step-by-Step:
- Identify the column or condition to evaluate.
- Write conditions in WHEN clauses.
- Specify results for each condition and an ELSE for unmatched cases.
- Use the CASE result as a new column.
Example 1:
- Table: employees (name, salary)
- John, 80000
- Jane, 60000
- Bob, 40000
- Query:
SELECT name, salary,
CASE
WHEN salary > 70000 THEN 'High'
WHEN salary > 50000 THEN 'Medium'
ELSE 'Low'
END AS salary_level
FROM employees;
- Result:
- John, 80000, High
- Jane, 60000, Medium
- Bob, 40000, Low
Example 2:
- Query: SELECT name, CASE WHEN salary IS NULL THEN 'Unknown' ELSE 'Known' END AS salary_status FROM employees;
- Flags rows with missing salaries.
Interview Follow-Up:
- “Can you use CASE in a WHERE clause?” (Yes, e.g., WHERE CASE WHEN salary > 50000 THEN 1 ELSE 0 END = 1.)
- “How does CASE compare to IF in MySQL?” (CASE is standard SQL; IF is MySQL-specific.)
28. Write a query to find duplicate records in a table.
Why It’s Important: Duplicate data can skew analysis, so identifying duplicates is a key data-cleaning skill for analysts.
Detailed Answer: To find duplicates, group by the column(s) you suspect have duplicates and use HAVING to find groups with more than one row.
Query:
SELECT email, COUNT(*)
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
Step-by-Step:
- Choose the column(s) to check for duplicates (e.g., email).
- Use GROUP BY to group rows with the same value.
- Use HAVING COUNT(*) > 1 to show only groups with duplicates.
- Optionally, select the full rows to inspect duplicates.
Example 1:
- Table: customers (id, email)
- 1, john@email.com
- 2, jane@email.com
- 3, john@email.com
- Result:
- john@email.com, 2
Example 2 (Full Rows):
- Query:
SELECT id, email
FROM customers
WHERE email IN (
SELECT email
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
)
ORDER BY email;
- Result:
- 1, john@email.com
- 3, john@email.com
Interview Follow-Up:
- “How would you delete duplicates?” (Use a CTE or ROW_NUMBER to keep one row per duplicate.)
- “What if duplicates are across multiple columns?” (Group by all relevant columns, e.g., GROUP BY first_name, last_name.)
29. What is the difference between UNION and UNION ALL?
Why It’s Important: Combining datasets is common in analysis, and choosing between UNION and UNION ALL affects performance and results.
Detailed Answer:
- UNION: Combines results from two queries and removes duplicate rows. It’s slower because it sorts and deduplicates.
- UNION ALL: Combines results without removing duplicates. It’s faster because it doesn’t sort.
Step-by-Step:
- Write two compatible queries (same number and type of columns).
- Use UNION if you need unique rows, or UNION ALL for all rows.
- Check performance for large datasets (UNION ALL is usually better).
Example 1:
- Tables:
- table1 (name): John, Jane
- table2 (name): Jane, Bob
- UNION Query: SELECT name FROM table1 UNION SELECT name FROM table2;
- Result: John, Jane, Bob
- UNION ALL Query: SELECT name FROM table1 UNION ALL SELECT name FROM table2;
- Result: John, Jane, Jane, Bob
Example 2:
- Query: SELECT customer_id FROM orders_2023 UNION ALL SELECT customer_id FROM orders_2024;
- Combines customer IDs from two years, keeping duplicates for analysis.
Interview Follow-Up:
- “When would you prefer UNION ALL?” (For large datasets or when duplicates are okay.)
- “How do you ensure UNION queries are compatible?” (Match column count and data types.)
30. How do you optimize a slow-running SQL query?
Why It’s Important: Query performance is critical for analysts working with large datasets, as slow queries can delay insights.
Detailed Answer: Optimizing a query means making it run faster. Common techniques:
- Add Indexes: Indexes speed up searches on columns in WHERE, JOIN, or ORDER BY.
- Select Specific Columns: Use SELECT name, salary instead of SELECT *.
- Filter Early: Apply WHERE conditions to reduce rows before joining.
- Simplify Joins: Use the smallest table first or rewrite subqueries as JOINs.
- Use Query Plans: Run EXPLAIN to see how the database executes the query.
- Avoid Functions in WHERE: E.g., WHERE YEAR(date) = 2024 is slower than WHERE date >= '2024-01-01'.
Step-by-Step:
- Run EXPLAIN to identify bottlenecks (e.g., full table scans).
- Check for missing indexes on frequently filtered columns.
- Rewrite the query to reduce complexity (e.g., replace subqueries).
- Test performance with a small dataset first.
Example 1:
- Slow Query: SELECT * FROM orders WHERE customer_id = 5;
- Optimization:
- Add index: CREATE INDEX idx_customer_id ON orders(customer_id);
- Change to: SELECT order_id, order_date FROM orders WHERE customer_id = 5;
- Result: Faster because it uses the index and fetches fewer columns.
Example 2:
- Slow Query: SELECT name FROM employees WHERE UPPER(name) = 'JOHN';
- Optimization: SELECT name FROM employees WHERE name = 'John'; (Avoids a function on the column.)
Interview Follow-Up:
- “How do indexes affect INSERT performance?” (They slow down INSERT/UPDATE because the index must be updated.)
- “What’s a covering index?” (An index that includes all columns needed by a query, avoiding table access.)
31. Write a query to pivot a table from rows to columns.
Why It’s Important: Pivoting is a common task for analysts creating reports, like summarizing sales by month or region in a spreadsheet-like format.
Detailed Answer: Pivoting transforms rows into columns, e.g., turning monthly sales into columns for each month. Use conditional aggregation or PIVOT (if supported).
Query:
SELECT
product,
SUM(CASE WHEN month = 'Jan' THEN sales ELSE 0 END) AS Jan,
SUM(CASE WHEN month = 'Feb' THEN sales ELSE 0 END) AS Feb
FROM sales
GROUP BY product;
Step-by-Step:
- Identify the row values to become columns (e.g., months).
- Use CASE to assign values to each new column.
- Aggregate (e.g., SUM) to combine values per group.
- Group by the non-pivoted column (e.g., product).
Example 1:
- Table: sales (product, month, sales)
- Phone, Jan, 100
- Phone, Feb, 200
- Laptop, Jan, 300
- Result:
- Phone, 100, 200
- Laptop, 300, 0
Example 2 (Using PIVOT in SQL Server):
SELECT product, [Jan], [Feb]
FROM sales
PIVOT (SUM(sales) FOR month IN ([Jan], [Feb])) AS pvt;
Interview Follow-Up:
- “How would you unpivot data?” (Use UNPIVOT or multiple UNIONs to turn columns into rows.)
- “What if the months are dynamic?” (Use dynamic SQL to generate columns based on data.)
32. Explain the use of window functions in SQL.
Why It’s Important: Window functions are powerful for analysts, enabling calculations like rankings, running totals, or comparisons without collapsing rows, unlike GROUP BY.
Detailed Answer: Window functions perform calculations across a “window” of rows defined by OVER. They keep all rows in the result, unlike GROUP BY, which summarizes.
Common Window Functions:
- ROW_NUMBER(): Assigns unique numbers.
- RANK(): Assigns ranks with ties.
- SUM() OVER: Calculates running or grouped totals.
- LAG(): Accesses previous row values.
Step-by-Step:
- Choose a window function (e.g., SUM).
- Define the window with OVER (e.g., PARTITION BY department ORDER BY salary).
- Include in SELECT to add the calculated column.
Example 1:
- Table: sales (order_date, amount)
- 2024-01-01, 100
- 2024-01-02, 200
- Query:
SELECT order_date, amount,
SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM sales;
- Result:
- 2024-01-01, 100, 100
- 2024-01-02, 200, 300
Example 2:
- Query: SELECT name, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank FROM employees;
- Ranks employees within each department.
Interview Follow-Up:
- “What’s the difference between window functions and GROUP BY?” (Window functions keep all rows; GROUP BY summarizes.)
- “How do you limit the window?” (Use ROWS BETWEEN to define a range, e.g., ROWS BETWEEN 1 PRECEDING AND CURRENT ROW.)
33. Write a query to find the top 3 customers by total purchases.
Why It’s Important: Ranking customers by purchases is a common analytical task for identifying key clients or segments.
Detailed Answer: Sum purchases per customer, sort by total, and limit to the top 3.
Query:
SELECT customer_id, SUM(amount) AS total_purchases
FROM orders
GROUP BY customer_id
ORDER BY total_purchases DESC
LIMIT 3;
Step-by-Step:
- GROUP BY customer_id groups orders by customer.
- SUM(amount) calculates total purchases.
- ORDER BY total_purchases DESC sorts from highest to lowest.
- LIMIT 3 takes the top 3.
Example 1:
- Table: orders (customer_id, amount)
- 1, 500
- 1, 300
- 2, 400
- 3, 1000
- Result:
- 3, 1000
- 1, 800
- 2, 400
Example 2 (Using Window Function):
- Query:
SELECT customer_id, total_purchases
FROM (
SELECT customer_id, SUM(amount) AS total_purchases,
RANK() OVER (ORDER BY SUM(amount) DESC) AS rnk
FROM orders
GROUP BY customer_id
) t
WHERE rnk <= 3;
- Handles ties better (e.g., two customers tied for third).
Interview Follow-Up:
- “What if there’s a tie for third place?” (Use RANK() or DENSE_RANK() to include ties.)
- “How would you include customer names?” (Join with customers table on customer_id.)
34. What is a subquery, and when is it useful?
Why It’s Important: Subqueries break down complex problems into manageable steps, a key skill for analysts solving multi-step analytical tasks.
Detailed Answer: A subquery is a query inside another query, enclosed in parentheses. It’s useful for:
- Filtering based on another result (e.g., employees above average salary).
- Calculating intermediate values (e.g., max salary per department).
- Simplifying complex logic.
Types:
- Single-row: Returns one value (e.g., SELECT MAX(salary)).
- Multi-row: Returns multiple values (e.g., SELECT id WHERE department IN (…)).
Step-by-Step:
- Write the inner query to get a specific result.
- Use it in the outer query’s WHERE, SELECT, or FROM clause.
- Ensure the subquery returns the expected number of rows.
Example 1:
- Query:
SELECT name
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
- Subquery calculates average salary; outer query finds employees above it.
Example 2:
- Query:
SELECT department, MAX(salary)
FROM employees
WHERE department IN (SELECT department FROM departments WHERE active = 1)
GROUP BY department;
- Subquery finds active departments; outer query finds max salary per active department.
Interview Follow-Up:
- “What’s a correlated subquery?” (A subquery that references the outer query, covered in Q54.)
- “When should you avoid subqueries?” (Use JOINs for better performance in some cases.)
35. How do you use the GROUP BY clause in SQL?
Why It’s Important: GROUP BY is essential for summarizing data, like calculating totals or averages, which analysts use for reports and dashboards.
Detailed Answer: GROUP BY groups rows with the same value in specified columns and applies aggregate functions (e.g., SUM, COUNT, AVG) to each group.
Step-by-Step:
- Select columns to group by and aggregate functions.
- Use GROUP BY with the non-aggregated columns.
- Optionally, filter groups with HAVING.
Example 1:
- Table: orders (region, amount)
- North, 500
- North, 300
- South, 400
- Query:
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region;
- Result:
- North, 800
- South, 400
Example 2:
- Query:
SELECT region, COUNT(*) AS order_count
FROM orders
GROUP BY region
HAVING COUNT(*) > 1;
- Only shows regions with multiple orders (e.g., North, 2).
Interview Follow-Up:
- “What happens if you omit GROUP BY with aggregates?” (If you select only aggregates, you get a single row for the whole table; mixing in non-aggregated columns causes an error in most databases.)
- “Can you group by multiple columns?” (Yes, e.g., GROUP BY region, year.)
36. Write a query to calculate the average order value by region.
Why It’s Important: Calculating averages by group is a common task for analysts studying performance across segments, like regions or products.
Detailed Answer: Use AVG with GROUP BY to compute the average order amount per region.
Query:
SELECT region, AVG(amount) AS avg_order_value
FROM orders
GROUP BY region;
Step-by-Step:
- GROUP BY region groups orders by region.
- AVG(amount) calculates the average order amount per region.
- Return the region and average as columns.
Example 1:
- Table: orders (region, amount)
- North, 500
- North, 300
- South, 400
- Result:
- North, 400
- South, 400
Example 2:
- Query:
SELECT region, ROUND(AVG(amount), 2) AS avg_order_value
FROM orders
GROUP BY region
HAVING AVG(amount) > 300;
- Rounds to 2 decimals and only shows regions with average > 300.
Interview Follow-Up:
- “How do NULLs affect AVG?” (They’re ignored by AVG.)
- “How would you include regions with no orders?” (Use LEFT JOIN with a regions table.)
37. What is the difference between a primary key and a foreign key?
Why It’s Important: Understanding keys is crucial for analysts working with relational databases, as they ensure data integrity and enable joins.
Detailed Answer:
- Primary Key: A unique identifier for each row in a table, like a student ID. It must be unique and non-NULL.
- Foreign Key: A column in one table that links to a primary key in another table, like a student ID in a grades table linking to the students table.
Step-by-Step:
- Define a primary key when creating a table (e.g., id PRIMARY KEY).
- Use foreign keys in related tables to reference the primary key.
- Ensure foreign keys match the primary key’s data type.
Example 1:
- Tables:
- students (student_id, name): student_id is primary key.
- grades (grade_id, student_id, score): student_id is foreign key linking to students.student_id.
- Query:
SELECT s.name, g.score
FROM students s
JOIN grades g ON s.student_id = g.student_id;
Example 2:
- Composite Primary Key:
CREATE TABLE enrollments (
student_id INT,
course_id INT,
PRIMARY KEY (student_id, course_id)
);
- student_id and course_id together uniquely identify each enrollment.
Interview Follow-Up:
- “Can a table have multiple primary keys?” (No, but it can have a composite primary key.)
- “What happens if a foreign key references a deleted primary key?” (Depends on constraints, e.g., ON DELETE CASCADE deletes the foreign key row.)
38. How do you create a temporary table in SQL?
Why It’s Important: Temporary tables help analysts store intermediate results for complex queries, improving readability and performance.
Detailed Answer: A temporary table exists only during your database session and is automatically deleted when you disconnect. It’s like a scratchpad for calculations.
Query:
CREATE TEMPORARY TABLE temp_orders AS
SELECT * FROM orders WHERE order_date = '2024-01-01';
Step-by-Step:
- Use CREATE TEMPORARY TABLE followed by the table name.
- Define the table structure or use AS with a query to populate it.
- Use the temporary table in subsequent queries.
- No need to drop it—it’s gone when the session ends.
Example 1:
- Query:
CREATE TEMPORARY TABLE temp_high_value AS
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 1000;
SELECT * FROM temp_high_value;
- Creates a temporary table of customers with total orders > 1000.
Example 2:
- Manual Structure:
CREATE TEMPORARY TABLE temp_data (
id INT,
name VARCHAR(50)
);
INSERT INTO temp_data VALUES (1, 'John');
Interview Follow-Up:
- “When would you use a temporary table vs. a CTE?” (Temporary tables for multiple uses; CTEs for single-query readability.)
- “Can temporary tables be indexed?” (Yes, to improve performance.)
39. Write a query to find the most recent order for each customer.
Why It’s Important: Finding the latest records is common in customer analysis, like tracking recent purchases or activity.
Detailed Answer: Use MAX with GROUP BY to find the latest order date per customer.
Query:
SELECT customer_id, MAX(order_date) AS latest_order
FROM orders
GROUP BY customer_id;
Step-by-Step:
- GROUP BY customer_id groups orders by customer.
- MAX(order_date) finds the latest order date for each group.
- Return customer_id and the latest date.
Example 1:
- Table: orders (customer_id, order_date)
- 1, 2024-01-01
- 1, 2024-02-01
- 2, 2024-01-15
- Result:
- 1, 2024-02-01
- 2, 2024-01-15
Example 2 (Include Order Details):
- Query:
SELECT o.customer_id, o.order_date, o.order_id
FROM orders o
JOIN (
SELECT customer_id, MAX(order_date) AS max_date
FROM orders
GROUP BY customer_id
) t ON o.customer_id = t.customer_id AND o.order_date = t.max_date;
- Returns full order details for the latest order.
Interview Follow-Up:
- “What if multiple orders occur on the same date?” (Use RANK() or ROW_NUMBER() to pick one.)
- “How would you find the second-most-recent order?” (Use RANK() with ORDER BY order_date DESC.)
40. Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER().
Why It’s Important: Ranking functions are key for analysts sorting or prioritizing data, like top customers or products.
Detailed Answer:
- RANK(): Assigns ranks, with ties getting the same rank, but skips numbers after ties (e.g., 1, 1, 3).
- DENSE_RANK(): Like RANK(), but doesn’t skip numbers (e.g., 1, 1, 2).
- ROW_NUMBER(): Assigns a unique number to each row, even with ties (e.g., 1, 2, 3).
Step-by-Step:
- Use in a window function with OVER (e.g., OVER (ORDER BY score DESC)).
- Choose RANK() for gapped ranks, DENSE_RANK() for continuous ranks, or ROW_NUMBER() for unique numbers.
- Use PARTITION BY to rank within groups.
Example 1:
- Table: scores (name, score)
- John, 90
- Jane, 90
- Bob, 80
- Query:
SELECT name, score,
RANK() OVER (ORDER BY score DESC) AS rank,
DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank,
ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
FROM scores;
- Result:
- John, 90, 1, 1, 1
- Jane, 90, 1, 1, 2
- Bob, 80, 3, 2, 3
Example 2:
- Query:
SELECT name, department, salary,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
- Ranks employees within each department.
Interview Follow-Up:
- “How do you handle ties in RANK()?” (Use DENSE_RANK() or ROW_NUMBER() based on needs.)
- “What’s NTILE()?” (Another window function that divides rows into buckets, e.g., quartiles.)
41. How do you handle missing data in a SQL query?
Why It’s Important: Missing data (NULLs) can skew analysis, so analysts must handle them carefully to ensure accurate insights.
Detailed Answer: NULLs represent missing values. Strategies:
- Filter: Use WHERE column IS NULL or IS NOT NULL.
- Replace: Use COALESCE(value1, value2, …) or IFNULL(value, default) to provide defaults.
- Aggregate Behavior: SUM, AVG, etc., ignore NULLs; COUNT(column) skips NULLs.
- Join Considerations: NULLs don’t match in joins unless handled explicitly.
Step-by-Step:
- Identify NULLs with SELECT column WHERE column IS NULL.
- Decide to filter, replace, or keep NULLs based on analysis needs.
- Use COALESCE for defaults or CASE for custom logic.
- Verify results to ensure NULL handling doesn’t distort insights.
Example 1:
- Table: customers (name, phone)
- John, 123-456
- Jane, NULL
- Query:
SELECT name, COALESCE(phone, 'No Phone') AS phone
FROM customers;
- Result:
- John, 123-456
- Jane, No Phone
Example 2:
- Query:
SELECT AVG(COALESCE(salary, 0)) AS avg_salary
FROM employees;
- Replaces NULL salaries with 0 before averaging.
Interview Follow-Up:
- “How do you find rows with all NULLs?” (Use WHERE column1 IS NULL AND column2 IS NULL.)
- “What’s the impact of NULLs in aggregations?” (Most functions ignore NULLs, but replacing them changes results.)
42. Write a query to find the percentage contribution of each product to total sales.
Why It’s Important: Calculating proportions is a common task for analysts, like understanding product performance in a portfolio.
Detailed Answer: To find each product’s percentage of total sales, divide its sales by the overall total and multiply by 100.
Query:
SELECT
product,
SUM(sales) AS product_sales,
(SUM(sales) / (SELECT SUM(sales) FROM sales) * 100) AS percentage
FROM sales
GROUP BY product;
Step-by-Step:
- GROUP BY product groups sales by product.
- SUM(sales) calculates total sales per product.
- Subquery (SELECT SUM(sales) FROM sales) gets the overall total.
- Divide product sales by total and multiply by 100 for percentage.
Example 1:
- Table: sales (product, sales)
- Phone, 1000
- Laptop, 2000
- Total sales: 3000
- Result:
- Phone, 1000, 33.33
- Laptop, 2000, 66.67
Example 2 (Using Window Function):
- Query:
SELECT
product,
SUM(sales) AS product_sales,
(SUM(sales) / SUM(SUM(sales)) OVER () * 100) AS percentage
FROM sales
GROUP BY product;
- Avoids subquery by using window function.
Interview Follow-Up:
- “How would you round the percentage?” (Use ROUND(…, 2).)
- “What if some products have zero sales?” (Use LEFT JOIN with a products table.)
43. What is an index, and how does it improve query performance?
Why It’s Important: Indexes are critical for optimizing queries, especially for analysts working with large datasets where speed matters.
Detailed Answer: An index is a database structure that speeds up searches by organizing column values, like a book’s index helping you find pages quickly. It improves performance for:
- WHERE clauses
- JOIN conditions
- ORDER BY clauses
Trade-offs:
- Faster reads, but slower writes (INSERT/UPDATE) because the index must be updated.
- Uses extra storage space.
Step-by-Step:
- Identify frequently filtered or joined columns (e.g., customer_id).
- Create an index: CREATE INDEX idx_name ON table(column);.
- Use EXPLAIN to confirm the index is used.
Example 1:
- Table: orders (order_id, customer_id, amount)
- Query: SELECT * FROM orders WHERE customer_id = 5;
- Without index: Scans all rows (slow).
- With index: CREATE INDEX idx_customer_id ON orders(customer_id); (fast lookup).
Example 2:
- Composite Index:
CREATE INDEX idx_date_amount ON orders(order_date, amount);
- Speeds up SELECT * FROM orders WHERE order_date = '2024-01-01' ORDER BY amount;
Interview Follow-Up:
- “What’s a clustered vs. non-clustered index?” (Clustered stores data rows; non-clustered is a separate structure.)
- “When should you avoid indexes?” (On small tables or columns with frequent updates.)
44. How do you use the COALESCE function in SQL?
Why It’s Important: COALESCE helps analysts handle missing data cleanly, ensuring reports are complete and user-friendly.
Detailed Answer: COALESCE(value1, value2, …) returns the first non-NULL value in the list. It’s like saying, “Try this, then this, until you find something.”
Step-by-Step:
- Identify columns with potential NULLs.
- List values or defaults in COALESCE.
- Use in SELECT or WHERE to handle NULLs.
Example 1:
- Table: customers (name, phone, email)
- John, 123-456, NULL
- Jane, NULL, jane@example.com
- Query:
SELECT name, COALESCE(phone, email, 'No Contact') AS contact
FROM customers;
- Result:
- John, 123-456
- Jane, jane@example.com
Example 2:
- Query:
SELECT product, COALESCE(stock, 0) AS stock_level
FROM inventory;
- Replaces NULL stock with 0.
Interview Follow-Up:
- “What’s the difference between COALESCE and IFNULL?” (COALESCE takes multiple arguments; IFNULL takes two, MySQL-specific.)
- “Can COALESCE be used in aggregations?” (Yes, e.g., SUM(COALESCE(sales, 0)).)
45. Write a query to find customers who haven’t placed orders in the last 6 months.
Why It’s Important: Identifying inactive customers helps analysts target marketing or retention strategies.
Detailed Answer: Use a LEFT JOIN to find customers with no recent orders, filtering orders to the last 6 months.
Query:
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
AND o.order_date >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
WHERE o.order_id IS NULL;
Step-by-Step:
- LEFT JOIN customers with orders, restricting orders to last 6 months.
- WHERE o.order_id IS NULL finds customers with no matching orders.
- Select customer_id and name for the result.
Example 1:
- Tables:
- customers (customer_id, name): (1, John), (2, Jane)
- orders (order_id, customer_id, order_date): (101, 1, 2024-12-01)
- If today is 2025-04-25, 6 months ago is 2024-10-25.
- Result:
- 2, Jane (no orders since 2024-10-25)
Example 2:
- Query:
SELECT c.customer_id, c.name
FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
);
- Alternative using NOT EXISTS.
Interview Follow-Up:
- “How would you find customers with no orders ever?” (Remove the date condition.)
- “What if you want the last order date for active customers?” (Use a subquery or JOIN to get MAX(order_date).)
46. What is a CTE (Common Table Expression), and when is it used?
Why It’s Important: CTEs improve query readability and reusability, helping analysts write cleaner, more maintainable SQL for complex analyses.
Detailed Answer: A CTE is a named temporary result set defined with WITH that you can use within a query. It’s like saving a piece of your query to reuse or make it easier to read.
Syntax:
WITH cte_name AS (
SELECT ...
)
SELECT ... FROM cte_name;
Step-by-Step:
- Define the CTE with WITH and a query.
- Reference the CTE in the main query like a table.
- Use for breaking down complex logic or reusing results.
Example 1:
- Query:
WITH high_salary AS (
SELECT name, salary
FROM employees
WHERE salary > 50000
)
SELECT name
FROM high_salary
WHERE salary < 70000;
- CTE filters high salaries; main query narrows to a range.
Example 2:
- Query:
WITH sales_by_region AS (
SELECT region, SUM(sales) AS total
FROM sales
GROUP BY region
)
SELECT region, total, total / (SELECT SUM(total) FROM sales_by_region) * 100 AS percentage
FROM sales_by_region;
- Calculates regional sales percentages.
Interview Follow-Up:
- “What’s the difference between a CTE and a subquery?” (CTEs are named and reusable; subqueries are inline.)
- “Can you have multiple CTEs?” (Yes, separate with commas in the WITH clause.)
47. How do you perform a full outer join in SQL?
Why It’s Important: FULL OUTER JOINs are useful for analysts comparing datasets, like finding mismatches between two tables.
Detailed Answer: A FULL OUTER JOIN returns all rows from both tables, with NULLs where there’s no match. It’s like combining two lists and keeping everything, even unpaired items.
Step-by-Step:
- Use FULL OUTER JOIN with an ON condition.
- Check for NULLs in the result to identify unmatched rows.
- Note: Some databases (e.g., MySQL) don’t support FULL OUTER JOIN natively; use UNION of LEFT and RIGHT JOINs.
Example 1:
- Tables:
- customers (id, name): (1, John), (2, Jane)
- orders (order_id, customer_id): (101, 1), (102, 3)
- Query:
SELECT customers.name, orders.order_id
FROM customers
FULL OUTER JOIN orders ON customers.id = orders.customer_id;
- Result:
- John, 101
- Jane, NULL
- NULL, 102
Example 2 (MySQL Workaround):
- Query:
SELECT customers.name, orders.order_id
FROM customers
LEFT JOIN orders ON customers.id = orders.customer_id
UNION
SELECT customers.name, orders.order_id
FROM customers
RIGHT JOIN orders ON customers.id = orders.customer_id
WHERE customers.id IS NULL;
Interview Follow-Up:
- “When would you use a FULL OUTER JOIN?” (For data reconciliation, e.g., comparing two datasets.)
- “How does it differ from UNION?” (FULL OUTER JOIN matches rows; UNION stacks results.)
48. Write a query to calculate the year-over-year growth of sales.
Why It’s Important: Year-over-year growth is a key metric for analysts tracking business performance over time.
Detailed Answer: Use LAG to access the previous year’s sales and calculate the percentage change.
Query:
SELECT
YEAR(order_date) AS year,
SUM(sales) AS total_sales,
((SUM(sales) - LAG(SUM(sales)) OVER (ORDER BY YEAR(order_date))) /
LAG(SUM(sales)) OVER (ORDER BY YEAR(order_date)) * 100) AS yoy_growth
FROM sales
GROUP BY YEAR(order_date);
Step-by-Step:
- GROUP BY YEAR(order_date) groups sales by year.
- SUM(sales) calculates total sales per year.
- LAG(SUM(sales)) OVER (ORDER BY YEAR(order_date)) gets the previous year’s sales.
- Calculate growth: (current – previous) / previous * 100.
Example 1:
- Table: sales (order_date, sales)
- 2023-01-01, 1000
- 2024-01-01, 1200
- Result:
- 2023, 1000, NULL
- 2024, 1200, 20%
Example 2:
- Query:
SELECT
YEAR(order_date),
SUM(sales) AS total_sales,
ROUND(((SUM(sales) - LAG(SUM(sales)) OVER (ORDER BY YEAR(order_date))) /
LAG(SUM(sales)) OVER (ORDER BY YEAR(order_date)) * 100), 2) AS yoy_growth
FROM sales
GROUP BY YEAR(order_date);
- Rounds to 2 decimals.
Interview Follow-Up:
- “How would you handle missing years?” (Use COALESCE or generate missing years with a calendar table.)
- “What if you want growth by region?” (Add PARTITION BY region to LAG.)
49. What is the difference between a view and a materialized view?
Why It’s Important: Views and materialized views help analysts simplify queries or store results, improving efficiency in reporting.
Detailed Answer:
- View: A virtual table based on a query, always reflecting the latest data. It’s like a saved query that runs every time you use it.
- Materialized View: A physical table storing query results, updated periodically. It’s like saving the query output and refreshing it manually or on a schedule.
Step-by-Step:
- Create a view: CREATE VIEW view_name AS SELECT ….
- Create a materialized view (database-specific): CREATE MATERIALIZED VIEW ….
- Use views for dynamic data; materialized views for performance with static data.
Example 1:
- View:
CREATE VIEW active_customers AS
SELECT * FROM customers WHERE active = 1;
- Query: SELECT * FROM active_customers; always shows current active customers.
Example 2:
- Materialized View (PostgreSQL):
CREATE MATERIALIZED VIEW sales_summary AS
SELECT region, SUM(sales) FROM sales GROUP BY region;
- Needs REFRESH MATERIALIZED VIEW sales_summary; to update.
Interview Follow-Up:
- “When would you use a materialized view?” (For large, stable datasets where performance matters.)
- “Can views be indexed?” (Regular views can’t; materialized views can in some databases.)
50. How do you use the LIMIT clause to paginate query results?
Why It’s Important: Pagination is key for analysts building reports or dashboards that display data in chunks, like search results.
Detailed Answer: LIMIT restricts the number of rows returned, and OFFSET skips rows, enabling pagination (e.g., showing 10 rows per page).
Query:
SELECT * FROM orders
ORDER BY order_date
LIMIT 10 OFFSET 20;
Step-by-Step:
- Determine rows per page (e.g., 10).
- Calculate OFFSET = (page_number – 1) * rows_per_page.
- Use LIMIT for rows per page and ORDER BY for consistent sorting.
Example 1:
- Table: orders (order_id, order_date)
- Query for page 3 (rows 21-30):
SELECT * FROM orders
ORDER BY order_date
LIMIT 10 OFFSET 20;
Example 2:
- Query for page 2, 5 rows per page:
SELECT * FROM orders
ORDER BY order_date
LIMIT 5 OFFSET 5;
Interview Follow-Up:
- “How do you ensure consistent pagination?” (Use a stable ORDER BY, like order_id.)
- “What’s the performance impact of large OFFSETs?” (Slows down as OFFSET increases; consider keyset pagination.)
51. Write a query to find the longest streak of consecutive orders by a customer.
Why It’s Important: Streak analysis helps analysts study customer behavior, like loyalty or engagement patterns.
Detailed Answer: To find the longest streak of daily consecutive orders, use window functions to group orders by “streak periods.”
Query:
WITH with_prev AS (
  SELECT
    customer_id,
    order_date,
    LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_date
  FROM orders
),
order_groups AS (
  SELECT
    customer_id,
    order_date,
    SUM(CASE WHEN DATEDIFF(order_date, prev_date) = 1 THEN 0 ELSE 1 END)
      OVER (PARTITION BY customer_id ORDER BY order_date) AS streak_group
  FROM with_prev
),
streaks AS (
  SELECT customer_id, streak_group, COUNT(*) AS streak_length
  FROM order_groups
  GROUP BY customer_id, streak_group
)
SELECT customer_id, MAX(streak_length) AS longest_streak
FROM streaks
GROUP BY customer_id
HAVING MAX(streak_length) > 1;
Step-by-Step:
- LAG(order_date) gets the previous order date per customer.
- DATEDIFF checks whether each order is exactly 1 day after the previous one.
- The running SUM(CASE …) starts a new streak_group whenever the gap is not 1 day (window functions can’t be nested, so the LAG result is computed in its own CTE first).
- Grouping by customer_id and streak_group counts each streak’s length.
- MAX(streak_length) returns the longest streak per customer, keeping only streaks of at least 2 days.
Example 1:
- Table: orders (customer_id, order_date)
- 1, 2024-01-01
- 1, 2024-01-02
- 1, 2024-01-03
- 1, 2024-01-05
- Result:
- 1, 3 (Jan 1-3 is the longest streak)
Example 2:
- Table: orders (customer_id, order_date)
- 2, 2024-01-01
- 2, 2024-01-03
- Result: No streak (no consecutive days).
Interview Follow-Up:
- “What if orders are on the same day?” (Use DISTINCT order_date or adjust logic.)
- “How would you define a streak with a 2-day gap?” (Change day_diff = 1 to day_diff <= 2.)
52. How do you handle date and time functions in SQL?
Why It’s Important: Date and time functions are essential for analysts working with time-series data, like sales or user activity.
Detailed Answer: SQL provides functions to manipulate dates and times, varying by database (MySQL, PostgreSQL, SQL Server).
Common Functions:
- NOW() or CURRENT_DATE: Current date/time.
- DATEADD(unit, value, date): Adds units (e.g., days, months).
- DATEDIFF(unit, date1, date2): Difference between dates.
- MONTH(date), YEAR(date): Extracts parts of a date.
Step-by-Step:
- Identify the date operation (e.g., extract month, calculate difference).
- Choose the appropriate function for your database.
- Test with sample dates to ensure correct output.
Example 1:
- Query:
SELECT order_date, DATEDIFF(NOW(), order_date) AS days_since_order
FROM orders;
- Shows days since each order.
Example 2:
- Query:
SELECT order_date,
DATEADD(MONTH, 1, order_date) AS next_month,
MONTH(order_date) AS order_month
FROM orders;
- Adds a month and extracts the month number.
Interview Follow-Up:
- “How do you handle time zones?” (Use database-specific functions like AT TIME ZONE in PostgreSQL.)
- “What’s the difference between DATEADD and DATEDIFF?” (DATEADD modifies dates; DATEDIFF calculates differences.)
53. Write a query to find the median value in a numeric column.
Why It’s Important: The median is a robust measure of central tendency, useful for analysts studying skewed data, like salaries or sales.
Detailed Answer: The median is the middle value in a sorted list. SQL doesn’t have a direct MEDIAN function in most databases, so use a query to find it.
Query (MySQL 8+):
SELECT AVG(salary) AS median
FROM (
  SELECT salary,
         ROW_NUMBER() OVER (ORDER BY salary) AS rn,
         COUNT(*) OVER () AS cnt
  FROM employees
) ranked
WHERE rn IN (FLOOR((cnt + 1) / 2), CEIL((cnt + 1) / 2));
Step-by-Step:
- ROW_NUMBER() OVER (ORDER BY salary) numbers the rows in sorted order.
- COUNT(*) OVER () returns the total row count alongside each row.
- FLOOR((cnt + 1) / 2) and CEIL((cnt + 1) / 2) pick the middle row for an odd count, or the two middle rows for an even count.
- AVG averages the selected value(s) to produce the median.
Example 1:
- Table: employees (salary)
- 30000
- 40000
- 50000
- Sorted: 30000, 40000, 50000
- Median: 40000
- Result: 40000
Example 2 (PostgreSQL):
- Query:
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median
FROM employees;
- Uses built-in median function.
Interview Follow-Up:
- “How does the median differ from the mean?” (Median is less affected by outliers.)
- “What if there are duplicate values?” (They don’t affect the median calculation.)
54. What is a correlated subquery, and how does it work?
Why It’s Important: Correlated subqueries are advanced but useful for analysts comparing rows against related data, like department averages.
Detailed Answer: A correlated subquery is a subquery that references values from the outer query, running for each row of the outer query, like a loop.
Step-by-Step:
- Write the outer query to select rows.
- Write the subquery to compute a value using outer query columns.
- Ensure the subquery returns one value per outer row.
- Note: Correlated subqueries can be slow; JOINs may be faster.
Example 1:
- Query:
SELECT name, salary
FROM employees e
WHERE salary > (
SELECT AVG(salary)
FROM employees
WHERE department = e.department
);
- For each employee, the subquery calculates the average salary in their department.
Example 2:
- Query:
SELECT order_id, customer_id
FROM orders o
WHERE order_date = (
SELECT MAX(order_date)
FROM orders
WHERE customer_id = o.customer_id
);
- Finds each customer’s latest order.
Interview Follow-Up:
- “Why are correlated subqueries slow?” (They run for each outer row, unlike regular subqueries.)
- “Can you rewrite a correlated subquery as a JOIN?” (Often yes, e.g., use a derived table or JOIN.)
55. How do you use the PARTITION BY clause in window functions?
Why It’s Important: PARTITION BY enables analysts to perform group-specific calculations, like ranking within categories, without grouping rows.
Detailed Answer: PARTITION BY divides rows into groups (partitions) for window functions, like GROUP BY but without collapsing rows.
Step-by-Step:
- Choose a window function (e.g., RANK, SUM).
- Use PARTITION BY in the OVER clause to define groups.
- Add ORDER BY in OVER if needed (e.g., for rankings).
Example 1:
- Table: employees (name, department, salary)
- John, Sales, 60000
- Jane, Sales, 70000
- Bob, HR, 50000
- Query:
SELECT name, department, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
- Result:
- John, Sales, 60000, 2
- Jane, Sales, 70000, 1
- Bob, HR, 50000, 1
Example 2:
- Query:
SELECT order_date, customer_id, amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_running_total
FROM orders;
- Running total per customer.
Interview Follow-Up:
- “What’s the difference between PARTITION BY and GROUP BY?” (PARTITION BY keeps all rows; GROUP BY summarizes.)
- “Can you use multiple columns in PARTITION BY?” (Yes, e.g., PARTITION BY region, year.)
56. Write a query to find the top-selling product in each category.
Why It’s Important: Identifying top performers by category is a common task for analysts studying product or market performance.
Detailed Answer: Use RANK() with PARTITION BY to rank products within each category by sales.
Query:
WITH ranked_products AS (
SELECT
category,
product,
SUM(sales) AS total_sales,
RANK() OVER (PARTITION BY category ORDER BY SUM(sales) DESC) AS sales_rank
FROM sales
GROUP BY category, product
)
SELECT category, product, total_sales
FROM ranked_products
WHERE sales_rank = 1;
Step-by-Step:
- GROUP BY category, product calculates total sales.
- RANK() OVER (PARTITION BY category ORDER BY SUM(sales) DESC) ranks products within each category.
- Filter for sales_rank = 1 to get top products.
Example 1:
- Table: sales (category, product, sales)
- Electronics, Phone, 1000
- Electronics, Laptop, 2000
- Clothing, Shirt, 500
- Result:
- Electronics, Laptop, 2000
- Clothing, Shirt, 500
Example 2:
- Query (Handle Ties):
WITH ranked_products AS (
SELECT category, product, SUM(sales),
DENSE_RANK() OVER (PARTITION BY category ORDER BY SUM(sales) DESC) AS sales_rank
FROM sales
GROUP BY category, product
)
SELECT category, product FROM ranked_products WHERE sales_rank = 1;
- DENSE_RANK() assigns consecutive ranks without gaps after ties, which matters if you later filter for the top N instead of just the top 1.
Interview Follow-Up:
- “What if multiple products tie for first?” (Use RANK() or DENSE_RANK(); both give every tied product rank 1, so all tied top sellers are returned.)
- “How would you find the top 3 products per category?” (Change WHERE sales_rank = 1 to sales_rank <= 3.)
57. What is the difference between TRUNCATE and DELETE in SQL?
Why It’s Important: Analysts must understand data modification commands to clean or reset datasets safely.
Detailed Answer:
- DELETE: Removes specific rows (with a WHERE clause) and can be rolled back if in a transaction. It’s like erasing certain lines from a notebook.
- TRUNCATE: Removes all rows and resets the table (e.g., auto-increment counters), is faster, and often can’t be rolled back. It’s like throwing out the whole notebook.
Step-by-Step:
- Use DELETE with a condition to remove specific data.
- Use TRUNCATE to empty a table completely.
- Check database permissions and transaction settings before running.
Example 1:
- DELETE:
DELETE FROM orders WHERE order_date < '2023-01-01';
- Removes old orders.
- TRUNCATE:
TRUNCATE TABLE orders;
- Empties the entire table.
Example 2:
- Table with auto-increment: After TRUNCATE, the next order_id resets to 1; DELETE keeps the counter.
Interview Follow-Up:
- “Can you recover data after TRUNCATE?” (Depends on the database; often no without backups.)
- “What’s the difference between TRUNCATE and DROP?” (DROP removes the table structure; TRUNCATE keeps it.)
58. How do you use the DISTINCT keyword in SQL queries?
Why It’s Important: DISTINCT helps analysts eliminate duplicates, ensuring accurate counts or lists in reports.
Detailed Answer: DISTINCT removes duplicate rows from the result, keeping only unique combinations of selected columns.
Step-by-Step:
- Add DISTINCT after SELECT to apply to all selected columns.
- Use on specific columns (e.g., SELECT DISTINCT column1, column2) for unique pairs.
- Note: DISTINCT can slow queries on large datasets.
Example 1:
- Table: customers (name)
- John
- John
- Jane
- Query:
SELECT DISTINCT name
FROM customers;
- Result:
- John
- Jane
Example 2:
- Query:
SELECT DISTINCT region, product
FROM sales;
- Returns unique region-product pairs.
Interview Follow-Up:
- “What’s the performance impact of DISTINCT?” (Can be slow due to sorting; indexes help.)
- “Can DISTINCT be used with aggregates?” (Yes, e.g., COUNT(DISTINCT name).)
59. Write a query to find overlapping date ranges in a table.
Why It’s Important: Overlapping date ranges are common in scheduling or event analysis, and analysts must detect conflicts.
Detailed Answer: To find overlapping date ranges, compare each pair of ranges where one starts before the other ends.
Query:
SELECT a.event_id, b.event_id
FROM events a
JOIN events b
ON a.event_id < b.event_id
AND a.start_date <= b.end_date
AND a.end_date >= b.start_date;
Step-by-Step:
- Join the table to itself, using a.event_id < b.event_id to avoid duplicate pairs.
- Check overlap: a.start_date <= b.end_date AND a.end_date >= b.start_date.
- Return the IDs of overlapping events.
Example 1:
- Table: events (event_id, start_date, end_date)
- 1, 2024-01-01, 2024-01-10
- 2, 2024-01-05, 2024-01-15
- 3, 2024-01-20, 2024-01-25
- Result:
- 1, 2 (Events 1 and 2 overlap)
Example 2:
- Query:
SELECT a.event_id, a.start_date, a.end_date, b.event_id, b.start_date, b.end_date
FROM events a
JOIN events b
ON a.event_id < b.event_id AND a.start_date <= b.end_date AND a.end_date >= b.start_date;
- Shows full details of overlapping events.
Interview Follow-Up:
- “How would you find non-overlapping ranges?” (Modify to NOT (a.start_date <= b.end_date AND …).)
- “What if events are for different users?” (Add WHERE a.user_id = b.user_id.)
60. How do you export query results to a CSV file in SQL?
Why It’s Important: Exporting data to CSV is a common task for analysts sharing results with stakeholders or tools like Excel.
Detailed Answer: Exporting to CSV depends on the database:
- MySQL: Use INTO OUTFILE.
- SQL Server: Use bcp or SSMS export.
- PostgreSQL: Use COPY.
- General: Run the query in a tool (e.g., Python, DBeaver) and save as CSV.
Query (MySQL):
SELECT * FROM orders
INTO OUTFILE '/path/to/orders.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Step-by-Step:
- Write the query to select data.
- Use the database’s export command or tool.
- Specify the file path and format (e.g., comma-separated).
- Check permissions for writing files.
Example 1:
- MySQL Query:
SELECT order_id, order_date, amount
FROM orders
INTO OUTFILE '/tmp/orders.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
- Creates a CSV with quoted values.
Example 2 (PostgreSQL):
- Query:
COPY (SELECT * FROM orders) TO '/path/to/orders.csv' WITH CSV HEADER;
- Includes column headers.
Interview Follow-Up:
- “How do you handle special characters in CSV?” (Use ENCLOSED BY '"' or escape characters.)
- “What if you can’t write to the server?” (Use a client tool like SQL Workbench to export.)
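If you can't write files on the database server, a common client-side alternative is to run the query from a script and save the result locally. Here is a minimal sketch using pandas and SQLAlchemy; the connection string, driver, and file path are placeholders, not part of the examples above.
import pandas as pd
from sqlalchemy import create_engine
# Hypothetical connection string; adjust the driver, credentials, host, and database.
engine = create_engine('mysql+pymysql://user:password@localhost/shop')
# Run the query on the client and write the result to a local CSV file.
df = pd.read_sql('SELECT order_id, order_date, amount FROM orders', engine)
df.to_csv('orders.csv', index=False)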
Python (30 Questions)
61. How do you read a CSV file into a pandas DataFrame?
Explanation: A CSV (Comma-Separated Values) file stores data like a table. The pandas library in Python makes it easy to read a CSV file into a DataFrame, which is like a table with rows and columns.
Steps:
- Import the pandas library.
- Use the pd.read_csv() function to read the CSV file.
- Store the result in a variable (the DataFrame).
Example:
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('data.csv')
# Display the first few rows
print(df.head())
Details:
- Replace ‘data.csv’ with the path to your CSV file.
- If the CSV file is not in the same folder as your script, provide the full path (e.g., ‘C:/folder/data.csv’).
- Use df.head() to preview the first 5 rows.
Sample CSV (data.csv):
name,age,city
Alice,25,New York
Bob,30,London
Charlie,35,Paris
Output:
name age city
0 Alice 25 New York
1 Bob 30 London
2 Charlie 35 Paris
62. Write a function to handle missing values in a DataFrame.
Explanation: Missing values (like NaN or None) can cause problems in data analysis. You can handle them by filling them with a value (e.g., mean, median) or removing rows/columns with missing data.
Function:
import pandas as pd
def handle_missing_values(df, method='mean'):
"""
Handle missing values in a DataFrame.
- method: 'mean' (fill with column mean), 'median' (fill with median), 'drop' (remove rows)
"""
if method == 'drop':
# Remove rows with missing values
df_cleaned = df.dropna()
elif method == 'mean':
# Fill missing values with the mean of each numeric column
df_cleaned = df.fillna(df.mean(numeric_only=True))
elif method == 'median':
# Fill missing values with the median of each numeric column
df_cleaned = df.fillna(df.median(numeric_only=True))
else:
raise ValueError("Method must be 'mean', 'median', or 'drop'")
return df_cleaned
# Example usage
data = {'name': ['Alice', 'Bob', None], 'age': [25, None, 35], 'salary': [50000, 60000, None]}
df = pd.DataFrame(data)
# Fill missing values with mean
df_mean = handle_missing_values(df, method='mean')
print("Filled with mean:\n", df_mean)
# Drop rows with missing values
df_drop = handle_missing_values(df, method='drop')
print("\nDropped rows:\n", df_drop)
Output:
Filled with mean:
name age salary
0 Alice 25.0 50000.0
1 Bob 30.0 60000.0
2 None 35.0 55000.0
Dropped rows:
name age salary
0 Alice 25.0 50000.0
Details:
- The function checks the method parameter to decide how to handle missing values.
- df.fillna() replaces missing values; numeric_only=True ensures only numeric columns are filled.
- df.dropna() removes rows with any missing values.
- For non-numeric columns (e.g., ‘name’), you might fill with a specific value like ‘Unknown’ using df[‘name’].fillna(‘Unknown’).
63. How do you merge two DataFrames in pandas?
Explanation: Merging combines two DataFrames based on a common column (like a key). It’s similar to joining tables in a database. Pandas provides the merge() function for this.
Steps:
- Identify the common column(s) in both DataFrames.
- Use pd.merge() or df.merge() to combine them.
- Specify the type of merge: ‘inner’, ‘left’, ‘right’, or ‘outer’.
Example:
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 40]})
# Merge DataFrames on 'id' (inner merge)
merged_df = pd.merge(df1, df2, on='id', how='inner')
print(merged_df)
Output:
id name age
0 1 Alice 25
1 2 Bob 30
Details:
- on=’id’: Specifies the common column.
- how=’inner’: Keeps only rows where the ‘id’ exists in both DataFrames.
- Other merge types:
- ‘left’: Keeps all rows from df1.
- ‘right’: Keeps all rows from df2.
- ‘outer’: Keeps all rows from both, filling missing matches with NaN.
- If column names differ, use left_on and right_on (e.g., left_on='id1', right_on='id2'), as sketched below.
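For instance, here is a minimal sketch of a merge where the key columns have different names; the customers/orders DataFrames and their column names are made up for illustration.
import pandas as pd
customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Alice', 'Bob']})
orders = pd.DataFrame({'cust_id': [1, 1, 2], 'amount': [100, 50, 75]})
# The key columns have different names, so specify left_on and right_on.
merged = pd.merge(customers, orders, left_on='customer_id', right_on='cust_id', how='left')
print(merged)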
64. Write a Python script to calculate the moving average of a time series.
Explanation: A moving average smooths out fluctuations in a time series by averaging a fixed number of previous data points (called the window).
Script:
import pandas as pd
import matplotlib.pyplot as plt
# Sample time series data
data = {'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'],
'value': [10, 12, 15, 11, 14]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
# Calculate moving average (window of 3 days)
df['moving_avg'] = df['value'].rolling(window=3).mean()
# Plot the data
plt.plot(df['date'], df['value'], label='Original', marker='o')
plt.plot(df['date'], df['moving_avg'], label='Moving Average (3-day)', marker='x')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Time Series with Moving Average')
plt.legend()
plt.show()
print(df)
Output:
date value moving_avg
0 2023-01-01 10 NaN
1 2023-01-02 12 NaN
2 2023-01-03 15 12.333333
3 2023-01-04 11 12.666667
4 2023-01-05 14 13.333333
Details:
- rolling(window=3) creates a sliding window of 3 data points.
- .mean() calculates the average for each window.
- The first two rows have NaN because there aren’t enough prior points for a 3-day average.
- The plot visualizes both the original data and the smoothed moving average.
65. What is the difference between .loc and .iloc in pandas?
Explanation:
- .loc: Selects rows and columns using labels (names of rows/columns).
- .iloc: Selects rows and columns using integer positions (like array indices).
Example:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
# Using .loc (label-based)
print("Using .loc:")
print(df.loc['x', 'A']) # Value at row 'x', column 'A'
print(df.loc['x':'y', 'B']) # Rows 'x' to 'y', column 'B'
# Using .iloc (index-based)
print("\nUsing .iloc:")
print(df.iloc[0, 0]) # Value at row 0, column 0
print(df.iloc[0:2, 1]) # Rows 0 to 1, column 1
Output:
Using .loc:
1
x 4
y 5
Name: B, dtype: int64
Using .iloc:
1
x 4
y 5
Name: B, dtype: int64
Details:
- .loc uses names (e.g., ‘x’, ‘A’), so it’s great when you know column names or index labels.
- .iloc uses numbers (e.g., 0, 1), like accessing a matrix.
- Use .loc for flexibility with named indices; use .iloc for position-based slicing.
66. How do you create a pivot table in pandas?
Explanation: A pivot table summarizes data by grouping and aggregating it, like in Excel. In pandas, use pivot_table() to create one.
Example:
import pandas as pd
# Sample data
data = {'category': ['A', 'A', 'B', 'B'], 'value': [10, 20, 15, 25], 'group': ['X', 'Y', 'X', 'Y']}
df = pd.DataFrame(data)
# Create pivot table
pivot = pd.pivot_table(df, values='value', index='category', columns='group', aggfunc='mean')
print(pivot)
Output:
group X Y
category
A 10.0 20.0
B 15.0 25.0
Details:
- values=’value’: The column to aggregate.
- index=’category’: Rows of the pivot table.
- columns=’group’: Columns of the pivot table.
- aggfunc=’mean’: How to aggregate (e.g., ‘sum’, ‘count’, ‘max’).
- Use fill_value=0 to replace NaN with 0 if needed.
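For example, a short sketch of fill_value in action, using a variant of the sample data above in which category B has no row for group Y:
import pandas as pd
data = {'category': ['A', 'A', 'B'], 'value': [10, 20, 15], 'group': ['X', 'Y', 'X']}
df = pd.DataFrame(data)
# Without fill_value the missing B/Y combination would show as NaN.
pivot = pd.pivot_table(df, values='value', index='category', columns='group',
                       aggfunc='sum', fill_value=0)
print(pivot)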
67. Write a function to remove duplicate rows from a DataFrame.
Explanation: Duplicate rows can skew analysis. The drop_duplicates() function removes them.
Function:
import pandas as pd
def remove_duplicates(df, keep='first'):
"""
Remove duplicate rows from a DataFrame.
- keep: 'first' (keep first occurrence), 'last' (keep last), False (drop all duplicates)
"""
df_cleaned = df.drop_duplicates(keep=keep)
return df_cleaned
# Example usage
data = {'name': ['Alice', 'Bob', 'Alice'], 'age': [25, 30, 25]}
df = pd.DataFrame(data)
# Remove duplicates
df_no_duplicates = remove_duplicates(df)
print(df_no_duplicates)
Output:
name age
0 Alice 25
1 Bob 30
Details:
- drop_duplicates() checks for identical rows.
- keep=’first’ retains the first occurrence of duplicates.
- Use subset=[‘column’] to consider only specific columns for duplicates (e.g., df.drop_duplicates(subset=[‘name’])).
68. How do you handle categorical data in Python?
Explanation: Categorical data (e.g., ‘male’/’female’, ‘low’/’high’) needs to be encoded for machine learning. Common methods:
- Label Encoding: Assign numbers to categories (e.g., ‘male’=0, ‘female’=1).
- One-Hot Encoding: Create binary columns for each category.
Example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample data
data = {'gender': ['male', 'female', 'female', 'male'], 'size': ['small', 'large', 'medium', 'small']}
df = pd.DataFrame(data)
# Label Encoding
le = LabelEncoder()
df['gender_encoded'] = le.fit_transform(df['gender'])
# One-Hot Encoding
df_onehot = pd.get_dummies(df, columns=['size'])
print("Label Encoded:\n", df)
print("\nOne-Hot Encoded:\n", df_onehot)
Output:
Label Encoded:
gender size gender_encoded
0 male small 1
1 female large 0
2 female medium 0
3 male small 1
One-Hot Encoded:
gender gender_encoded size_large size_medium size_small
0 male 1 0 0 1
1 female 0 1 0 0
2 female 0 0 1 0
3 male 1 0 0 1
Details:
- Use LabelEncoder from sklearn for label encoding.
- Use pd.get_dummies() for one-hot encoding in pandas.
- One-hot encoding creates new columns, which can increase DataFrame size for many categories.
69. Write a script to plot a histogram using Matplotlib.
Explanation: A histogram shows the distribution of numeric data by grouping values into bins.
Script:
import matplotlib.pyplot as plt
import numpy as np
# Sample data
data = [22, 25, 28, 22, 30, 25, 27, 29, 23, 26]
# Create histogram
plt.hist(data, bins=5, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of Data')
plt.show()
Output: A plot showing the frequency of values in 5 bins.
Details:
- bins=5: Number of bins to group data.
- edgecolor=’black’: Adds borders to bars for clarity.
- Use plt.grid(True) to add a grid if needed.
70. What is the difference between a Series and a DataFrame in pandas?
Explanation:
- Series: A 1-dimensional array-like structure (like a list or column) with an index.
- DataFrame: A 2-dimensional table with rows and columns, where each column is a Series.
Example:
import pandas as pd
# Series
series = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print("Series:\n", series)
# DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
print("\nDataFrame:\n", df)
Output:
Series:
a 1
b 2
c 3
dtype: int64
DataFrame:
A B
a 1 4
b 2 5
c 3 6
Details:
- A Series has one column of data; a DataFrame has multiple columns.
- Access a DataFrame column as a Series: df[‘A’].
- Series is useful for single-column operations; DataFrame for tabular data.
71. How do you apply a function to every element in a DataFrame column?
Explanation: Use apply() to apply a function to each element in a column or use vectorized operations for efficiency.
Example:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
# Define a function
def double(x):
return x * 2
# Apply function to column 'A'
df['A_doubled'] = df['A'].apply(double)
# Vectorized operation (faster)
df['B_doubled'] = df['B'] * 2
print(df)
Output:
A B A_doubled B_doubled
0 1 10 2 20
1 2 20 4 40
2 3 30 6 60
Details:
- apply() is flexible for custom functions but slower for simple operations.
- Use vectorized operations (e.g., df[‘B’] * 2) for better performance.
72. Write a function to extract the top 5 rows based on a column value.
Explanation: Use nlargest() or sort_values() to get the top rows based on a column.
Function:
import pandas as pd
def get_top_5(df, column):
"""
Extract top 5 rows based on a column value.
"""
return df.nlargest(5, column)
# Example usage
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'score': [85, 92, 78, 95, 88, 90]}
df = pd.DataFrame(data)
top_5 = get_top_5(df, 'score')
print(top_5)
Output:
name score
3 David 95
1 Bob 92
5 Frank 90
4 Eve 88
0 Alice 85
Details:
- nlargest(5, column) is simple and efficient.
- Alternatively, use df.sort_values(column, ascending=False).head(5).
- If there are ties at the cutoff, the default keep='first' keeps just the first 5 rows; pass keep='all' to include every tied row (see the sketch below).
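A quick sketch of the tie behaviour on made-up data:
import pandas as pd
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'score': [90, 85, 85, 80]})
# Default keep='first': only the first of the tied 85s is returned (2 rows).
print(df.nlargest(2, 'score'))
# keep='all': every row tied with the last kept value is returned (3 rows here).
print(df.nlargest(2, 'score', keep='all'))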
73. How do you use groupby() in pandas to aggregate data?
Explanation: groupby() groups data by one or more columns and applies an aggregation function (e.g., mean, sum) to each group.
Example:
import pandas as pd
# Sample data
data = {'category': ['A', 'A', 'B', 'B'], 'value': [10, 20, 15, 25]}
df = pd.DataFrame(data)
# Group by 'category' and calculate mean and sum
grouped = df.groupby('category').agg({'value': ['mean', 'sum']})
print(grouped)
Output:
value
mean sum
category
A 15.0 30
B 20.0 40
Details:
- groupby(‘category’): Groups rows by unique values in ‘category’.
- agg() applies functions like ‘mean’, ‘sum’, ‘count’.
- Use multiple columns: df.groupby([‘col1’, ‘col2’]).
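A minimal sketch of grouping by two columns; the region column is added here purely for illustration:
import pandas as pd
data = {'region': ['East', 'East', 'West', 'West'],
        'category': ['A', 'B', 'A', 'B'],
        'value': [10, 20, 15, 25]}
df = pd.DataFrame(data)
# One group per unique (region, category) pair.
grouped = df.groupby(['region', 'category'])['value'].sum()
print(grouped)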
74. Write a script to create a scatter plot with a trend line.
Explanation: A scatter plot shows points for two variables, and a trend line (linear regression) shows their relationship.
Script:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import linregress
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
# Scatter plot
plt.scatter(x, y, label='Data Points')
# Calculate trend line
slope, intercept, _, _, _ = linregress(x, y)
trend_line = [slope * xi + intercept for xi in x]
plt.plot(x, trend_line, color='red', label='Trend Line')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot with Trend Line')
plt.legend()
plt.show()
Output: A scatter plot with points and a red trend line.
Details:
- linregress from scipy.stats calculates the slope and intercept.
- Alternatively, use np.polyfit(x, y, 1) for a simpler trend line.
- The trend line shows the general direction of the data.
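For reference, a short sketch of the np.polyfit alternative using the same x and y values:
import numpy as np
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
# Fit a degree-1 polynomial (a straight line) and evaluate it at each x.
slope, intercept = np.polyfit(x, y, 1)
trend_line = np.polyval([slope, intercept], x)
print(slope, intercept)
print(trend_line)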
75. How do you handle outliers in a dataset using Python?
Explanation: Outliers are extreme values that can distort analysis. Common methods:
- Z-score: Identify values far from the mean (e.g., >3 standard deviations).
- IQR (Interquartile Range): Remove values outside 1.5 * IQR from Q1/Q3.
Example (IQR Method):
import pandas as pd
# Sample data
data = {'value': [1, 2, 2, 3, 100, 4, 5, 200]}
df = pd.DataFrame(data)
# Calculate IQR
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
df_no_outliers = df[(df['value'] >= lower_bound) & (df['value'] <= upper_bound)]
print("Original Data:\n", df)
print("\nNo Outliers:\n", df_no_outliers)
Output:
Original Data:
value
0 1
1 2
2 2
3 3
4 100
5 4
6 5
7 200
No Outliers:
value
0 1
1 2
2 2
3 3
5 4
6 5
Details:
- IQR method is robust for non-normal data.
- Z-score method: z = (x – mean) / std; remove if abs(z) > 3.
- Visualize outliers using box plots (plt.boxplot(df[‘value’])).
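For comparison, a minimal sketch of the z-score approach on the same data. A cutoff of 2 is used here because the sample is tiny (3 is the usual threshold), and even then only the value 200 is flagged, which is one reason the IQR method is often preferred for small or skewed samples:
import pandas as pd
data = {'value': [1, 2, 2, 3, 100, 4, 5, 200]}
df = pd.DataFrame(data)
# Flag values more than `threshold` standard deviations from the mean.
z = (df['value'] - df['value'].mean()) / df['value'].std()
threshold = 2  # 3 is the usual cutoff; 2 is used here because the sample is tiny
df_no_outliers = df[z.abs() <= threshold]
print(df_no_outliers)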
76. What is the purpose of the NumPy library in data analysis?
Explanation: NumPy (Numerical Python) is a library for numerical computations in Python. It’s used in data analysis for:
- Efficient Arrays: NumPy’s ndarray is faster than Python lists for math operations.
- Mathematical Functions: Supports operations like mean, median, matrix multiplication.
- Data Manipulation: Reshaping, slicing, and broadcasting arrays.
- Interoperability: Works with pandas, Matplotlib, and scikit-learn.
Example:
import numpy as np
# Create a NumPy array
data = np.array([1, 2, 3, 4, 5])
# Perform operations
mean = np.mean(data)
square = data ** 2
print("Array:", data)
print("Mean:", mean)
print("Squared:", square)
Output:
Array: [1 2 3 4 5]
Mean: 3.0
Squared: [ 1 4 9 16 25]
Details:
- NumPy arrays are fixed-size and homogeneous (same data type).
- Used as the backbone for pandas DataFrames and Series.
- Essential for large datasets and scientific computing.
77. Write a function to normalize a numeric column in a DataFrame.
Explanation: Normalization scales numeric values to a range (e.g., 0 to 1) to make them comparable. Common method: Min-Max normalization.
Function:
import pandas as pd
def normalize_column(df, column):
"""
Normalize a numeric column to range [0, 1] using Min-Max normalization.
"""
df[column + '_normalized'] = (df[column] - df[column].min()) / (df[column].max() - df[column].min())
return df
# Example usage
data = {'score': [10, 20, 15, 30, 25]}
df = pd.DataFrame(data)
df_normalized = normalize_column(df, 'score')
print(df_normalized)
Output:
score score_normalized
0 10 0.00
1 20 0.50
2 15 0.25
3 30 1.00
4 25 0.75
Details:
- Formula: (x – min) / (max – min).
- Adds a new column to avoid overwriting the original.
- Alternative: Standard scaling ((x – mean) / std) using df[column].sub(df[column].mean()).div(df[column].std()).
78. How do you use the apply() function in pandas?
Explanation: apply() runs a function on each element, row, or column of a DataFrame or Series.
Example:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
# Function to add 5
def add_five(x):
return x + 5
# Apply to a column (Series)
df['A_plus_5'] = df['A'].apply(add_five)
# Apply to entire DataFrame
df_plus_5 = df[['A', 'B']].apply(add_five)
# Apply to rows (axis=1)
df['row_sum'] = df[['A', 'B']].apply(sum, axis=1)
print("Original with column apply:\n", df)
print("\nDataFrame apply:\n", df_plus_5)
Output:
Original with column apply:
A B A_plus_5 row_sum
0 1 10 6 11
1 2 20 7 22
2 3 30 8 33
DataFrame apply:
A B
0 6 15
1 7 25
2 8 35
Details:
- apply() on a Series: Applies to each element.
- apply() on a DataFrame: Applies to each column (default) or row (axis=1).
- Use vectorized operations (e.g., df[‘A’] + 5) for better performance when possible.
79. Write a script to calculate correlation between two columns.
Explanation: Correlation measures how two numeric columns are related (e.g., Pearson correlation ranges from -1 to 1).
Script:
import pandas as pd
# Sample data
data = {'A': [1, 2, 3, 4, 5], 'B': [2, 4, 6, 8, 10]}
df = pd.DataFrame(data)
# Calculate Pearson correlation
correlation = df['A'].corr(df['B'])
print("Correlation between A and B:", correlation)
Output:
Correlation between A and B: 1.0
Details:
- corr() computes Pearson correlation by default.
- A value of 1 means perfect positive correlation; -1 means perfect negative correlation; 0 means no correlation.
- Use method=’spearman’ or ‘kendall’ for non-linear relationships.
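Two quick variations, sketched on illustrative data (the extra column C is made up): a Spearman correlation between two columns, and a full Pearson correlation matrix across all numeric columns.
import pandas as pd
data = {'A': [1, 2, 3, 4, 5], 'B': [2, 4, 6, 8, 10], 'C': [5, 3, 8, 1, 7]}
df = pd.DataFrame(data)
# Rank-based (Spearman) correlation between two columns.
print(df['A'].corr(df['B'], method='spearman'))
# Pearson correlation matrix for every pair of numeric columns.
print(df.corr())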
80. How do you convert a column to datetime in pandas?
Explanation: Converting a column to datetime allows date-based operations (e.g., extracting year, resampling).
Example:
import pandas as pd
# Sample data
data = {'date': ['2023-01-01', '2023-02-01', '2023-03-01']}
df = pd.DataFrame(data)
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
print(df)
print("\nData type:", df['date'].dtype)
Output:
date
0 2023-01-01
1 2023-02-01
2 2023-03-01
Data type: datetime64[ns]
Details:
- pd.to_datetime() handles various date formats.
- Specify format for speed: pd.to_datetime(df[‘date’], format=’%Y-%m-%d’).
- Access parts: df[‘date’].dt.year or df[‘date’].dt.month.
81. Write a function to create bigrams from a text column.
Explanation: Bigrams are pairs of consecutive words in text, useful for text analysis (e.g., “I love” and “love to” from “I love to learn”).
Function:
import pandas as pd
from nltk import bigrams
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
def create_bigrams(df, text_column):
"""
Create bigrams from a text column and add as a new column.
"""
df['bigrams'] = df[text_column].apply(lambda x: list(bigrams(word_tokenize(x.lower()))))
return df
# Example usage
data = {'text': ['I love to learn', 'Python is fun']}
df = pd.DataFrame(data)
df_with_bigrams = create_bigrams(df, 'text')
print(df_with_bigrams)
Output:
text bigrams
0 I love to learn [(i, love), (love, to), (to, learn)]
1 Python is fun [(python, is), (is, fun)]
Details:
- Requires nltk library (pip install nltk).
- word_tokenize splits text into words.
- bigrams() creates pairs of consecutive words.
- Convert text to lowercase for consistency.
82. How do you use the seaborn library for data visualization?
Explanation: Seaborn is a Python library built on Matplotlib for creating attractive statistical visualizations (e.g., heatmaps, box plots).
Example:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample data
data = {'category': ['A', 'A', 'B', 'B'], 'value': [10, 20, 15, 25]}
df = pd.DataFrame(data)
# Create a box plot
sns.boxplot(x='category', y='value', data=df)
plt.title('Box Plot with Seaborn')
plt.show()
Output: A box plot comparing ‘value’ distributions for categories ‘A’ and ‘B’.
Details:
- Install: pip install seaborn.
- Common plots: sns.histplot(), sns.scatterplot(), sns.heatmap().
- Seaborn automatically handles styling and color palettes.
- Use sns.set_style(‘darkgrid’) to customize appearance.
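For instance, a minimal sketch of a correlation heatmap, a plot analysts use often; the sales/visits/returns columns are illustrative:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Illustrative numeric data
df = pd.DataFrame({'sales': [10, 20, 15, 25, 30],
                   'visits': [100, 180, 160, 220, 260],
                   'returns': [1, 3, 2, 2, 4]})
# Heatmap of the correlation matrix, with the value printed in each cell.
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()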
83. Write a script to resample time-series data by day.
Explanation: Resampling aggregates time-series data over a time period (e.g., daily means from hourly data).
Script:
import pandas as pd
# Sample data
data = {'date': ['2023-01-01 01:00', '2023-01-01 02:00', '2023-01-02 01:00'],
'value': [10, 20, 15]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
# Set date as index
df.set_index('date', inplace=True)
# Resample by day and calculate mean
daily_df = df.resample('D').mean()
print(daily_df)
Output:
value
date
2023-01-01 15.0
2023-01-02 15.0
Details:
- resample(‘D’): Aggregates by day (‘H’ for hour, ‘M’ for month).
- .mean() computes the average; use .sum(), .count(), etc., as needed.
- Requires a datetime index.
84. How do you handle memory issues when working with large datasets in pandas?
Explanation: Large datasets can consume too much memory. Strategies:
- Use dtype optimization: Use smaller data types (e.g., int32 instead of int64).
- Load chunks: Read CSV in smaller chunks.
- Drop unnecessary columns: Keep only relevant data.
- Use sparse data structures: For datasets with many zeros.
Example (Chunking):
import pandas as pd
# Read large CSV in chunks
chunk_size = 1000
chunks = pd.read_csv('large_data.csv', chunksize=chunk_size)
# Process each chunk
for chunk in chunks:
# Example: Calculate mean of a column
print(chunk['column_name'].mean())
Details:
- Check memory usage: df.memory_usage(deep=True).sum().
- Optimize dtypes: df[‘col’] = df[‘col’].astype(‘float32’).
- Use usecols in read_csv to load only needed columns.
- For very large datasets, consider libraries like dask or vaex.
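A short sketch that combines the other strategies listed above; the file name and column names are placeholders:
import pandas as pd
# Load only the needed columns, with smaller data types, to cut memory use.
df = pd.read_csv('large_data.csv',
                 usecols=['order_id', 'amount', 'region'],
                 dtype={'order_id': 'int32', 'amount': 'float32', 'region': 'category'})
# Check how much memory the DataFrame actually uses (in bytes).
print(df.memory_usage(deep=True).sum())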
85. Write a function to calculate the z-score for a column.
Explanation: The z-score measures how many standard deviations a value is from the mean. Formula: (x – mean) / std.
Function:
import pandas as pd
def calculate_zscore(df, column):
"""
Calculate z-score for a numeric column.
"""
df[column + '_zscore'] = (df[column] - df[column].mean()) / df[column].std()
return df
# Example usage
data = {'value': [10, 20, 15, 30, 25]}
df = pd.DataFrame(data)
df_zscore = calculate_zscore(df, 'value')
print(df_zscore)
Output:
value value_zscore
0 10 -1.264911
1 20 0.000000
2 15 -0.632456
3 30 1.264911
4 25 0.632456
Details:
- Adds a new column with z-scores.
- Use to detect outliers (e.g., abs(z) > 3).
- Assumes numeric data; check for NaN with df[column].dropna() if needed.
86. How do you use the lambda function in pandas?
Explanation: A lambda function is an anonymous function used for quick operations in pandas, often with apply() or assign().
Example:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
# Use lambda with apply
df['A_plus_5'] = df['A'].apply(lambda x: x + 5)
# Use lambda with assign
df = df.assign(B_squared=lambda x: x['B'] ** 2)
print(df)
Output:
A B A_plus_5 B_squared
0 1 10 6 100
1 2 20 7 400
2 3 30 8 900
Details:
- Syntax: lambda x: expression.
- Useful for one-off calculations.
- Avoid overusing apply with lambda for large datasets; vectorized operations are faster.
87. Write a script to create a box plot for a dataset.
Explanation: A box plot shows the distribution of data, including median, quartiles, and outliers.
Script:
import matplotlib.pyplot as plt
import pandas as pd
# Sample data
data = {'category': ['A', 'A', 'B', 'B'], 'value': [10, 20, 15, 25]}
df = pd.DataFrame(data)
# Create box plot
df.boxplot(column='value', by='category')
plt.title('Box Plot of Values by Category')
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()
Output: A box plot comparing ‘value’ distributions for categories ‘A’ and ‘B’.
Details:
- Alternatively, use seaborn.boxplot(x=’category’, y=’value’, data=df) for a nicer look.
- Outliers appear as points outside the whiskers.
- Customize with plt.grid(True) or plt.suptitle('') to remove the default title.
88. How do you split a DataFrame into training and testing sets?
Explanation: Splitting data into training and testing sets is common in machine learning to evaluate models. Use train_test_split from sklearn.
Example:
import pandas as pd
from sklearn.model_selection import train_test_split
# Sample data
data = {'feature': [1, 2, 3, 4, 5], 'target': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Split into features (X) and target (y)
X = df[['feature']]
y = df['target']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training features:\n", X_train)
print("\nTesting features:\n", X_test)
Output:
Training features:
feature
4 5
2 3
0 1
3 4
Testing features:
feature
1 2
Details:
- test_size=0.2: 20% for testing, 80% for training.
- random_state=42: Ensures reproducibility.
- Use stratify=y for classification to balance classes.
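A brief sketch of a stratified split for a classification target, on made-up data; stratify=y keeps the class proportions similar in the training and testing sets:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame({'feature': [1, 2, 3, 4, 5, 6, 7, 8],
                   'label':   [0, 0, 0, 0, 1, 1, 1, 1]})
X = df[['feature']]
y = df['label']
# With a 4/4 class balance and test_size=0.25, each set keeps the 50/50 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
print(y_train.value_counts())
print(y_test.value_counts())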
89. Write a function to one-hot encode categorical variables.
Explanation: One-hot encoding converts categorical values into binary columns (0 or 1) for each category.
Function:
import pandas as pd
def one_hot_encode(df, columns):
"""
One-hot encode specified columns in a DataFrame.
"""
return pd.get_dummies(df, columns=columns)
# Example usage
data = {'name': ['Alice', 'Bob', 'Charlie'], 'category': ['A', 'B', 'A']}
df = pd.DataFrame(data)
df_encoded = one_hot_encode(df, ['category'])
print(df_encoded)
Output:
name category_A category_B
0 Alice 1 0
1 Bob 0 1
2 Charlie 1 0
Details:
- pd.get_dummies() creates a new column for each unique category.
- Use drop_first=True to avoid multicollinearity (dummy variable trap).
- Original categorical column is replaced unless specified otherwise.
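A quick sketch of the drop_first option mentioned above, using the same sample data:
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie'], 'category': ['A', 'B', 'A']}
df = pd.DataFrame(data)
# drop_first=True drops one dummy per feature ('category_A' here); its value is
# implied by the remaining dummy columns, which avoids the dummy variable trap.
print(pd.get_dummies(df, columns=['category'], drop_first=True))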
90. How do you export a DataFrame to an Excel file?
Explanation: Use to_excel() to save a DataFrame as an Excel file. Requires the openpyxl or xlsxwriter library.
Example:
import pandas as pd
# Sample data
data = {'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]}
df = pd.DataFrame(data)
# Export to Excel
df.to_excel('output.xlsx', index=False)
Details:
- Install: pip install openpyxl.
- index=False: Excludes the index column from the output.
- Specify sheet_name='Sheet1' to name the sheet.
- File is saved in the current working directory; use a full path if needed.