How I Use SQL in My Modern Data Engineering Projects
Here’s where I actually use SQL in my daily data engineering projects
Hey friends - Happy Tuesday!
People often think that once you become a senior data engineer working with big data platforms, Spark, and orchestration tools, you stop using SQL.
It sounds logical at first. SQL feels simple, like something for beginners or small databases.
But that idea is completely wrong.
The truth is, the bigger and more complex a system gets, the more I rely on SQL. I use it almost every day.
So in this post, I’ll show you where I actually use SQL in real big data engineering projects.
1. SQL in Data Exploration
Whenever I connect to a new system, the first thing I do is look around.
I want to understand how the data is structured, what tables exist, how they connect, and what kind of content they hold.
To explore, I use SQL.
I run small queries to check columns, data types, record counts, and sample a few rows.
It feels similar to what data analysts do in exploratory data analysis, just from a data engineering angle.
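For example, a first look at a new source often starts with a handful of queries like these (the orders table is just a placeholder for the sketch):

```sql
-- What columns and data types does the table have? (ANSI information schema)
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'orders';

-- How much data are we dealing with?
SELECT COUNT(*) AS row_count
FROM orders;

-- Sample a few rows to get a feel for the content
SELECT *
FROM orders
LIMIT 10;
```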
This first step helps me understand data quality, potential issues, and which joins or transformations will make sense later.
Skipping this step is like building a house without checking the foundation.
2. SQL in Data Extraction
After exploring the system, the next step is to pull the data out.
That’s the extraction phase.
Here again, I use SQL to write queries that pull data from source systems.
Sometimes it’s a simple SELECT over an entire table; other times it involves joins or incremental logic to capture only the latest changes.
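A typical incremental pull might look like the sketch below; the table and the updated_at watermark column are assumptions, and the timestamp would normally come from the pipeline’s state rather than a hard-coded literal:

```sql
-- Extract only the rows that changed since the last successful load
SELECT
    order_id,
    customer_id,
    amount,
    updated_at
FROM source_db.orders
WHERE updated_at > '2024-06-01 00:00:00';  -- last load's watermark
```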
These queries usually live inside ETL pipelines and are often among the first steps in the process.
Even when you write code in Python, it’s still SQL running underneath. Python just sends the query to the database.
At this stage, SQL is what actually moves the data from where it lives to where it’s needed.
3. SQL in Data Transformation (Silver Layer)
Once the data is extracted, it’s rarely ready to use.
Raw data comes with missing values, inconsistent formats, and messy types.
The Silver Layer is where we clean and standardize everything.
This is usually where pipelines slow down.
When you work with large-scale data, you need distributed processing.
That’s why most teams use Spark, usually through platforms like Databricks with PySpark.
Python gives flexibility with loops, variables, and parameters, but the actual cleaning logic can still be written in SQL.
With PySpark SQL, you can apply functions like TRIM, COALESCE, UPPER, LOWER, or CAST while Spark takes care of scaling the work.
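As a rough sketch (the table and columns are made up), a Silver Layer cleaning step in Spark SQL could look like this:

```sql
-- Standardize messy raw columns into clean, typed ones
SELECT
    CAST(customer_id AS INT)   AS customer_id,
    TRIM(UPPER(country_code))  AS country_code,  -- ' de' -> 'DE'
    COALESCE(email, 'n/a')     AS email,         -- fill missing values
    CAST(order_date AS DATE)   AS order_date
FROM bronze_orders_raw;
```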
Even if your workflow runs in Python, the core transformations still run in SQL.
That mix gives you Python for control and SQL for clarity and speed.
If you want to go deeper into SQL, I made a 30-hour SQL course on YouTube where I shared everything I’ve learned from real projects, using clear, animated visuals.
In just six months, it reached 1.7 million views.
It’s basically all my SQL experience packed into one practical course.
4. SQL in Data Modeling (Gold Layer)
When I reach the Gold Layer, things start to get heavier.
This is where the business logic comes in and where I start modeling the data for real use.
Here, I build fact and dimension tables, apply business rules, and calculate metrics that reflect how the business actually works.
And again, I use SQL for that.
With PySpark SQL, I join tables, group data, and write CASE WHEN statements to build business logic.
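Here’s a simplified sketch of what that can look like; the tables, the join, and the segmentation rule are all invented for illustration:

```sql
-- Aggregate orders per customer and encode a business rule with CASE WHEN
SELECT
    c.customer_id,
    c.country,
    COUNT(o.order_id) AS total_orders,
    SUM(o.amount)     AS total_revenue,
    CASE
        WHEN SUM(o.amount) >= 10000 THEN 'VIP'
        WHEN SUM(o.amount) >= 1000  THEN 'Regular'
        ELSE 'New'
    END AS customer_segment
FROM silver_orders o
JOIN silver_customers c
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.country;
```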
Python stays around for flexibility, but SQL does the heavy lifting.
5. SQL as a Universal Language
After years of working in data, I’ve realized that SQL is more than just a query language.
It’s how data teams communicate.
When engineers, analysts, and data scientists discuss logic or KPIs, SQL is the common ground.
It’s clear, readable, and easy to follow.
You can see how data is joined, filtered, and aggregated in one place.
Even in AI and machine learning projects, the data that feeds the models almost always starts from SQL transformations upstream.
Tools change, frameworks evolve, but SQL stays the same.
It remains the one language everyone in data understands and trusts.
6. SQL Everywhere
Almost every modern data tool now supports SQL in some way because it makes them easier to use and faster to adopt.
PySpark – lets you run transformations using PySpark SQL
Databricks – you can build most of your workloads with SQL
Kafka (ksqlDB) – allows querying and processing streaming data with SQL (sketched after this list)
Azure Log Analytics (Kusto) – uses a SQL-like query language for logs
Snowflake – built entirely around SQL
BigQuery – fully driven by SQL for large-scale analytics
Power BI, Tableau, and Looker – rely on SQL to query and visualize data
Elasticsearch – supports SQL queries on top of search indexes
Jira and Notion – include SQL-style filters and queries
Data observability tools like Monte Carlo and Datafold – use SQL for validation and monitoring
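To give one flavor of the list above, here’s roughly what streaming SQL looks like in ksqlDB; the stream and its columns are made up for the sketch:

```sql
-- Declare a stream on top of an existing Kafka topic
CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
    WITH (KAFKA_TOPIC = 'pageviews', VALUE_FORMAT = 'JSON');

-- Continuously count views per user in one-minute windows
SELECT user_id, COUNT(*) AS views
FROM pageviews
WINDOW TUMBLING (SIZE 60 SECONDS)
GROUP BY user_id
EMIT CHANGES;
```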
No matter what platform you use, SQL is always there behind the scenes.
7. SQL in Many Other Daily Tasks
SQL quietly supports dozens of other workflows that keep real data platforms running:
Data validation and quality checks to make sure pipelines load correctly (see the sketch after this list)
Testing and monitoring with tools like dbt, Great Expectations, or Soda
Performance tuning by analyzing execution plans and optimizing joins
Reporting and dashboards through SQL-based models in BI tools
Data lineage and auditing where tools read SQL to track dependencies
Automation and scheduling since Airflow, ADF, and Databricks all trigger SQL jobs daily
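The validation and tuning items, for example, often come down to small queries like these; the tables and rules are hypothetical, and the exact EXPLAIN output depends on your engine:

```sql
-- Validation: did today's load arrive at all?
SELECT COUNT(*) AS loaded_rows
FROM fact_sales
WHERE load_date = CURRENT_DATE;

-- Validation: the business key must be unique
SELECT order_id, COUNT(*) AS dup_count
FROM fact_sales
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Tuning: ask the engine how it plans to execute a slow join
EXPLAIN
SELECT c.country, SUM(f.amount)
FROM fact_sales f
JOIN dim_customers c ON f.customer_id = c.customer_id
GROUP BY c.country;
```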
It’s not just for querying tables.
SQL is the backbone of nearly every process in a data platform.
So my friends…
After all these years working with big data platforms, Spark, and orchestration tools, I can tell you one thing with confidence.
You never move away from SQL. You actually get closer to it.
It shows up in every part of the job.
You use it to explore new systems, extract and clean data, model the business logic, validate pipelines, and build reports.
It’s in Databricks, Kafka, Azure, Tableau, and almost every other tool that touches data.
SQL is what keeps the entire data ecosystem connected.
It’s how engineers, analysts, and data scientists understand each other and make sense of complex systems together.
So don’t listen to anyone who says SQL is dead or outdated.
Most of the time, they’ve never worked on a real production data platform.
SQL isn’t going anywhere. It’s still the language that holds everything together.
Thanks for reading. ❤️
Baraa
News
🚨 I’ve been working quietly for months… and now it’s finally time.
This Thursday at 5 PM (CET) I’m going LIVE on YouTube!
I’ll be sharing 7 big announcements
🔔 Make sure you’re subscribed to Data With Baraa so you don’t miss it.
(Live link coming soon)
This week is about the most elegant, efficient, smart, and yet beautifully simple concept in Python data structures!
List Comprehension 💡
You can loop, filter, and transform data ... all in one line! How awesome is that?
Also, here are 3 complete roadmap videos if you’re figuring out where to start:
📌 Data Engineering Roadmap
📌 Data Science Roadmap
📌 Data Analyst Roadmap
Hey friends —
I’m Baraa. I’m an IT professional and YouTuber.
My mission is to share the knowledge I’ve gained over the years and to make working with data easier, more fun, and accessible to everyone through courses that are free, simple, and practical!