
Deloitte Data Engineer Interview Questions for 2-5 Years Experience: Complete Preparation Guide (2026)


This guide focuses on the questions most frequently asked in recent (2025–2026) Deloitte Data Engineer interviews and is aimed at candidates with 2–5 years of experience.


Deloitte Data Engineer interviews emphasize practical skills in building scalable, reliable data pipelines for enterprise clients. For 2–5 years experience, expect a mix of:

- SQL & query optimization
- PySpark / Spark concepts & coding
- ETL/ELT pipeline design
- Data modeling & architecture
- Cloud tools (Azure ADF, Databricks, Snowflake, etc.)
- Behavioral + project deep-dives

Interviews typically include two to three technical rounds plus a managerial round. Interviewers test not just the “how” but the “why”: they want clear explanations of trade-offs, performance considerations, and production best practices.

This guide covers the most common questions with answers, examples, and tips. It builds on topics like GROUP BY, CTEs, Indexes, Execution Plans, Views/Materialized Views, and Connection Pooling.

1. SQL Questions (Always Asked)

1. Difference between WHERE and HAVING? Give an example.

- `WHERE` filters rows before aggregation.
- `HAVING` filters groups after `GROUP BY`.

Example:
```sql
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2023-01-01'          -- Filters rows first
GROUP BY department
HAVING AVG(salary) > 80000
   AND COUNT(*) > 10;                    -- Filters groups
```

2. Write a query to find the 2nd highest salary per department (without LIMIT/TOP).

```sql
WITH ranked AS (
    SELECT 
        department,
        salary,
        DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
    FROM employees
)
SELECT department, salary AS second_highest
FROM ranked
WHERE rnk = 2;
```

3. Explain Window Functions. When do you use them over GROUP BY?

Window functions perform calculations across a set of rows related to the current row (without collapsing rows).  
Use cases: running totals, moving averages, rank, lag/lead.

Example (cumulative sales):
```sql
SELECT 
    order_date,
    SUM(total_amount) OVER (ORDER BY order_date) AS running_total
FROM orders;
```

4. How do you optimize a slow GROUP BY query on a large table?

- Add a composite index on the `GROUP BY` columns (most selective column first).
- Filter early with `WHERE` instead of `HAVING`.
- Check the execution plan for Hash Aggregate vs Stream Aggregate operators.
- Use covering indexes or materialized views for frequent reports.
- Consider approximate aggregate functions for very large datasets.

Other common SQL topics:
- Self-join (employees earning more than manager)
- Handling duplicates with `ROW_NUMBER()`
- CTE vs Subquery vs Temp Table
- Normalization vs Denormalization

2. PySpark / Spark Questions (Heavily Tested)

1. RDD vs DataFrame vs Dataset — When to use each?

- RDD: Low-level, no schema. Use for custom transformations that need fine-grained control.
- DataFrame: Structured and optimized (Catalyst + Tungsten). The default choice for most ETL (the sketch below contrasts the two APIs).
- Dataset: Type-safe API (Scala/Java only). Use when compile-time type checking matters.
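
As a quick illustration, here is a minimal PySpark word-count sketch in both APIs (the input path `data.txt` is a placeholder):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

# RDD API: low-level tuples, no schema, no Catalyst optimization
rdd_counts = (spark.sparkContext.textFile("data.txt")
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

# DataFrame API: declarative and schema-aware; Catalyst optimizes the plan
df_counts = (spark.read.text("data.txt")
             .select(F.explode(F.split("value", r"\s+")).alias("word"))
             .groupBy("word")
             .count())
```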

2. repartition() vs coalesce() — Which is better and why?

- `repartition(n)`: Full shuffle, evenly redistributes data. Use when you need to increase partitions or fix skew.
- `coalesce(n)`: Avoids a full shuffle by merging existing partitions. Use only to decrease partition count, typically just before a write (see the sketch below).
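
A minimal sketch of both calls (the DataFrame `df` and the output path are hypothetical):

```python
# repartition: full shuffle; increases parallelism or rebalances skewed data
df_balanced = df.repartition(200, "customer_id")

# coalesce: merges existing partitions without a full shuffle;
# use it only to reduce partition count, e.g. to write fewer output files
df.coalesce(8).write.mode("overwrite").parquet("/out/orders")
```

One caveat worth mentioning in an interview: `coalesce` to very few partitions can reduce parallelism in the upstream stages of the same job, so apply it as late as possible.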

3. Explain Spark Lazy Evaluation and its benefits.

Transformations (map, filter, groupBy) are lazy — they build a DAG but don’t execute until an action (count, write, show) is called.  
Benefits: Optimizer can combine operations, avoid unnecessary computation, and choose best execution plan.
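
A small sketch makes this concrete (the table path and column names are assumptions):

```python
import pyspark.sql.functions as F

df = spark.read.parquet("/data/orders")               # no job runs yet
filtered = df.filter(F.col("amount") > 0)             # transformation: plan only
daily = filtered.groupBy("order_date").sum("amount")  # still lazy

daily.explain()  # inspect the optimized plan without executing it
daily.count()    # action: Spark now runs the whole pipeline once
```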

4. How do you handle data skew in Spark joins?

- Identify skewed keys using `groupBy` + `count`.
- Salting technique: add a random suffix to skewed keys (see the sketch after this list).
- Broadcast the smaller table if it fits in executor memory.
- Increase partitions or use `repartitionByRange`.
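
A common version of the salting sketch, assuming two DataFrames `large_df` and `small_df` that join on a hypothetical `join_key` column:

```python
import pyspark.sql.functions as F

N = 10  # number of salt buckets; tune to the observed skew

# Spread each hot key on the large side across N sub-keys
large_salted = large_df.withColumn("salt", (F.rand() * N).cast("int"))

# Replicate the small side once per salt value so every salted key still matches
salts = spark.range(N).select(F.col("id").cast("int").alias("salt"))
small_salted = small_df.crossJoin(salts)

skew_safe = large_salted.join(small_salted, ["join_key", "salt"])
```

If the smaller table fits in executor memory, a plain broadcast join (`large_df.join(F.broadcast(small_df), "join_key")`) avoids the shuffle entirely and is usually the first option to try.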

5. Common PySpark coding tasks (a sketch of the window-function dedup pattern follows this list):
- Word count / daily aggregates
- Flatten nested JSON
- Handle nulls & duplicates
- Window functions in PySpark (`row_number`, `lag`, `lead`)
- Incremental load using Delta Lake
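
As a sketch of the window-function dedup pattern from the list above (the DataFrame and column names are hypothetical):

```python
from pyspark.sql import Window
import pyspark.sql.functions as F

# Keep only the most recent record per business key
w = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())

deduped = (events_df
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("rn"))
```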

3. ETL, Pipeline & Architecture Questions

1. Design an ETL pipeline for ingesting terabytes of daily data (common Deloitte scenario).

Key points to cover (an idempotent-load sketch follows this list):
- Extract: Use Azure Data Factory / Kafka / CDC (Debezium) for incremental load.
- Transform: PySpark on Databricks (handle schema evolution, cleansing, enrichment).
- Load: Write to Delta Lake (ACID, time travel, schema enforcement).
- Orchestration: Apache Airflow or ADF.
- Partitioning strategy, error handling, monitoring, and idempotency.
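
To illustrate the idempotency point, a minimal Delta Lake MERGE sketch (requires the `delta-spark` package; the path, source DataFrame, and key column are assumptions):

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/delta/orders")

# MERGE makes the daily load idempotent: re-running the same batch
# updates existing keys instead of inserting duplicates
(target.alias("t")
 .merge(daily_batch_df.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```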

2. Data Warehouse vs Data Lake vs Lakehouse?

- DW: Structured, schema-on-write, good for BI (Snowflake, Synapse).
- Data Lake: Schema-on-read, cheap storage for raw data (ADLS, S3).
- Lakehouse: Combines both using Delta Lake / Iceberg (ACID + BI support).

3. What is Delta Lake? Why use it?

Delta Lake adds reliability to data lakes (a time-travel sketch follows this list):
- ACID transactions
- Schema enforcement & evolution
- Time travel & versioning
- Optimized performance (Z-ordering, partitioning)
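
For example, time travel is a single read option (the table path and timestamp below are placeholders):

```python
# Read the table as it looked at an earlier version or point in time
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/orders")
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2026-01-01")
            .load("/delta/orders"))
```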

4. Batch vs Streaming pipelines — When to use each?

- Batch: Scheduled (daily/hourly), simpler, higher throughput. Use for reporting.
- Streaming: Near real-time (Spark Structured Streaming / Kafka). Use for fraud detection and alerts (see the sketch below).
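
A minimal Structured Streaming sketch reading from Kafka (the broker address, topic, and checkpoint path are assumptions; needs the `spark-sql-kafka` connector):

```python
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "transactions")
          .load())

query = (stream.selectExpr("CAST(value AS STRING) AS payload")
         .writeStream
         .format("console")
         .option("checkpointLocation", "/chk/transactions")
         .outputMode("append")
         .start())
```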

4. Cloud, Orchestration & Best Practices

- Explain Apache Airflow components (DAG, Scheduler, Executor, Web UI); a minimal DAG sketch follows this list.
- How do you monitor and handle failing production jobs?
- What is Change Data Capture (CDC)? Tools: Debezium, Kafka.
- How do you ensure data quality in pipelines? (Great Expectations, custom checks)
- Connection pooling in ETL jobs (why it matters for source databases).
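
To illustrate the Airflow item, a minimal DAG sketch (Airflow 2.4+ syntax; the task commands are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_orders_etl",
    schedule="@daily",            # the Scheduler triggers one run per day
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load  # dependency chain run by the Executor
```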

5. Behavioral & Project Questions

- Walk me through your most challenging data pipeline project.
- How did you optimize a slow Spark job?
- How do you handle conflicts with business/data science teams?
- Describe a time you implemented incremental loading instead of full refresh.

Preparation Tips for Deloitte Data Engineer Interview

1. Revisit our SQL series: execution plans, indexes, GROUP BY optimization, CTEs, and materialized views come up frequently.
2. Practice PySpark coding on Databricks Community Edition or local Spark.
3. Be ready to draw architecture diagrams (ingestion → processing → serving).
4. Focus on production aspects: monitoring, error handling, cost optimization, idempotency.
5. Know at least one cloud deeply (Azure is common at Deloitte: ADF + Databricks + Synapse).
6. Prepare STAR stories for behavioral questions.

Expected salary (India, 2–5 years): ₹12–25+ LPA, depending on skills, location, and negotiation.

Conclusion

Deloitte Data Engineer interviews test your ability to design reliable, scalable, and maintainable data systems. Strong fundamentals in SQL optimization, PySpark performance tuning, and end-to-end pipeline thinking give you a clear advantage.

Master the concepts from our earlier articles (Indexes, Execution Plans, GROUP BY, Connection Pooling) and combine them with Spark and ETL knowledge for success.

Pro Tip: Always explain trade-offs — “I chose repartition over coalesce because…” or “I used a materialized view here for performance because queries were read-heavy.”

Good luck with your Deloitte preparation!