7. Build the first ETL pipeline
You'll learn how Lakeflow and Spark Declarative Pipelines work and build your first transformation pipeline in ~20 min.
Prereqs: Access your data
Why this matters
Raw data in storage answers nothing on its own. The job of a pipeline is to turn it into clean tables that an analyst or a dashboard can actually trust. On Databricks you do that with Spark Declarative Pipelines (SDP): you declare what the tables should look like, and the engine works out how to build them and how to keep them up to date as new data lands.
The payoff of "declarative" is that you stop hand-writing the plumbing. You no longer maintain checkpoints by hand or work out which rows are new. You describe the result, and the engine handles incremental processing and dependency order for you.
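To make "the plumbing" concrete, here is a minimal sketch of the bookkeeping you would otherwise write yourself. It is plain Python, not the SDP API: the `load_new_rows` helper, the integer `id` high-water mark, and the sample rows are all assumptions made up for illustration.

```python
# Hand-rolled incremental load: the kind of bookkeeping a declarative
# engine takes over. All names here are hypothetical, not the SDP API.

def load_new_rows(source_rows, checkpoint):
    """Return rows whose id is above the checkpoint, plus the advanced checkpoint."""
    new_rows = [r for r in source_rows if r["id"] > checkpoint]
    # If nothing is new, the checkpoint stays where it was.
    new_checkpoint = max((r["id"] for r in new_rows), default=checkpoint)
    return new_rows, new_checkpoint


source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 4.0}]
checkpoint = 0

# First run: every row is new.
first_batch, checkpoint = load_new_rows(source, checkpoint)

# New data lands; the second run should pick up only the new row.
source.append({"id": 3, "amount": 7.5})
second_batch, checkpoint = load_new_rows(source, checkpoint)

print([r["id"] for r in first_batch], [r["id"] for r in second_batch], checkpoint)
```

Even this toy version has to decide what counts as "new" and where to store the checkpoint. With a declarative pipeline, that state tracking is the engine's job rather than yours.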
Journey checklist
- Get started.
- Before you start.
- Infra setup.
- Cost monitoring.
- Data Governance Strategy.
- Access your data.
- Build the first pipeline.
- Automation and orchestration.
- Query and explore.
- Databricks AI/BI.
- Business semantics.
Data engineering on Databricks
Three Databricks tools split the work, each at a different stage of the data's life: one to ingest, one to transform, one to schedule. This section is about the transform step, but it helps to see where it sits.
| Tool | What it does | Where to learn more |
|---|---|---|
| Lakeflow Connect | Pulls data from external sources into Unity Catalog (UC) tables | Access your data: Databases and SaaS ingestion |
| Spark Declarative Pipelines | Transforms raw data through bronze, silver, and gold layers | This section |
| Lakeflow Jobs | Runs pipelines, notebooks, and tasks on a schedule | Automation and orchestration |
Lakeflow introduction
Lakeflow in action
How it works: Spark Declarative Pipelines
SDP is the Databricks implementation of Apache Spark Declarative Pipelines. You write declarations in Python (PySpark) or SQL. The engine reads them, works out the execution plan and the dependency order between tables, and runs the whole thing incrementally.
For core concepts (streaming tables, materialized views, flows, expectations) and why SDP over plain Spark, see What is Lakeflow Spark Declarative Pipelines.
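The "declare, then let the engine order the work" idea can be sketched in a few lines of plain Python. This is a toy model, not the SDP API: the `table` decorator, the `depends_on` argument, the registry, and the sample tables are all hypothetical, and the standard-library `graphlib` module stands in for the engine's planner.

```python
from graphlib import TopologicalSorter

# Hypothetical registry: table name -> (upstream table names, build function).
_registry = {}


def table(*, depends_on=()):
    """Register a function as a table declaration (toy stand-in, not the SDP API)."""
    def decorator(fn):
        _registry[fn.__name__] = (tuple(depends_on), fn)
        return fn
    return decorator


# Declarations are deliberately written out of order.
@table(depends_on=("silver_orders",))
def gold_revenue(silver_orders):
    return [{"total": sum(r["amount"] for r in silver_orders)}]


@table(depends_on=("bronze_orders",))
def silver_orders(bronze_orders):
    # Cleaning step: drop rows with a missing amount.
    return [r for r in bronze_orders if r["amount"] is not None]


@table()
def bronze_orders():
    return [
        {"id": 1, "amount": 10.0},
        {"id": 2, "amount": None},
        {"id": 3, "amount": 5.5},
    ]


def run_pipeline():
    """Resolve the dependency order, then build each table from its upstreams."""
    graph = {name: set(deps) for name, (deps, _) in _registry.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        deps, fn = _registry[name]
        results[name] = fn(*(results[d] for d in deps))
    return order, results


order, results = run_pipeline()
print(order)
print(results["gold_revenue"])
```

The declarations never say when to run. The order comes entirely from the declared dependencies, which is why `gold_revenue` can appear first in the file and still be built last.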
SDP reference material
Useful pages to have open while you build:
- Load data in pipelines: Auto Loader for incremental ingestion from cloud object storage.
- Transform data with pipelines: when to reach for views, materialized views, or streaming tables.
- Manage data quality with expectations: constraints and business rules.
- Python: Python language reference | Develop SDP with Python
- SQL: SQL language reference | Develop SDP with SQL
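As a rough mental model for the expectations page listed above: a rule is a named condition checked on every row, and a violation can be recorded as a warning, dropped, or treated as a failure. The sketch below mimics those three outcomes in plain Python. The `apply_expectation` helper and its `action` argument are hypothetical, so check the language references above for the real syntax.

```python
# Toy model of expectation actions (hypothetical helper, not the SDP API).

class ExpectationFailed(Exception):
    """Raised when a 'fail' expectation sees at least one bad row."""


def apply_expectation(rows, name, predicate, action="warn"):
    """Return (kept_rows, violation_count) for one named rule."""
    if action not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {action}")
    violations = sum(1 for r in rows if not predicate(r))
    if action == "fail" and violations:
        raise ExpectationFailed(f"{name}: {violations} row(s) violated the rule")
    if action == "drop":
        return [r for r in rows if predicate(r)], violations
    # warn: keep every row and only report the count.
    return list(rows), violations


def positive_amount(row):
    return row["amount"] > 0


orders = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -3.0}]

warned, n_warn = apply_expectation(orders, "positive_amount", positive_amount, "warn")
dropped, n_drop = apply_expectation(orders, "positive_amount", positive_amount, "drop")
print(len(warned), n_warn, len(dropped), n_drop)
```

Choosing the action is a business decision: warn when you want visibility without losing rows, drop when bad rows must not reach downstream tables, and fail when any violation should stop the update.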
Create the first pipeline
Pick a path based on where you are right now, not on which one seems "better."
| Path | What it does | When to use it |
|---|---|---|
| Hands-on lab | Install a pre-built demo pipeline with dashboards | You want to see a working pipeline before writing one yourself |
| Workspace + Genie Code | Build interactively in the browser | You are still figuring out the transformation |
| DABs | Deploy a full medallion pipeline as code | The shape is settled and you want it reproducible |
Next
- Do next: Hands-on lab
- Learn why: Data Governance Strategy
- Reference: Spark Declarative Pipelines (Databricks docs)