7. Build the first ETL pipeline

You'll learn how Lakeflow and Spark Declarative Pipelines work and build your first transformation pipeline in about 20 minutes.

Prerequisites: Access your data

Why this matters

Raw data in storage answers nothing on its own. The job of a pipeline is to turn it into clean tables that an analyst or a dashboard can actually trust. On Databricks you do that with Spark Declarative Pipelines (SDP): you declare what the tables should look like, and the engine works out how to build them and how to keep them up to date as new data lands.

The payoff of "declarative" is that you stop hand-writing the plumbing. You no longer maintain checkpoints by hand or work out which rows are new. You describe the result, and the engine handles incremental processing and dependency order for you.
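
To make the plumbing concrete, here is a minimal plain-Python sketch of the bookkeeping you would otherwise hand-write: remembering which input files have already been processed so that each run handles only the new ones. It does not use Spark or any SDP API. The function names (`load_checkpoint`, `save_checkpoint`, `new_files`, `demo`), the JSON checkpoint file, and the file names are all hypothetical, chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> set[str]:
    """Return the set of file names already processed (empty on first run)."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_checkpoint(path: Path, processed: set[str]) -> None:
    """Persist the processed file names so the next run can skip them."""
    path.write_text(json.dumps(sorted(processed)))


def new_files(landing_dir: Path, checkpoint: Path) -> list[str]:
    """List files not seen before, then record them as processed."""
    seen = load_checkpoint(checkpoint)
    fresh = sorted(
        p.name for p in landing_dir.iterdir()
        if p.is_file() and p.name not in seen
    )
    save_checkpoint(checkpoint, seen | set(fresh))
    return fresh


def demo() -> tuple[list[str], list[str]]:
    """Simulate two runs: one file lands before each run."""
    with tempfile.TemporaryDirectory() as tmp:
        landing = Path(tmp) / "landing"
        landing.mkdir()
        checkpoint = Path(tmp) / "checkpoint.json"  # kept outside the landing dir
        (landing / "orders_001.json").write_text("{}")
        first = new_files(landing, checkpoint)
        (landing / "orders_002.json").write_text("{}")
        second = new_files(landing, checkpoint)
        return first, second
```

Each hand-written pipeline ends up carrying some version of this bookkeeping. That is the code a declarative pipeline lets you drop.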

Journey checklist

Data engineering on Databricks

Three Databricks tools split the work, each at a different stage of the data's life: one to ingest, one to transform, one to schedule. This section is about the transform step, but it helps to see where it sits.

Tool	What it does	Where to learn more
Lakeflow Connect	Pulls data from external sources into UC tables	Access your data: Databases and SaaS ingestion
Spark Declarative Pipelines	Transforms raw data through bronze, silver, and gold layers	This section
Lakeflow Jobs	Runs pipelines, notebooks, and tasks on a schedule	Automation and orchestration

Lakeflow introduction

Lakeflow in action

How it works: Spark Declarative Pipelines

SDP is the Databricks implementation of Apache Spark Declarative Pipelines. You write declarations in Python (PySpark) or SQL. The engine reads them, works out the execution plan and the dependency order between tables, and runs the whole thing incrementally.
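
The dependency-ordering idea can be sketched with a toy model in plain Python. This is not the SDP API and does not run Spark. The `table` decorator, the `REGISTRY` dictionary, and the table names are hypothetical, invented here to show how an engine could derive a run order from declarations alone, using the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical registry: table name -> (upstream table names, build function).
REGISTRY: dict[str, tuple[tuple[str, ...], Callable[..., list[dict]]]] = {}


def table(name: str, depends_on: tuple[str, ...] = ()):
    """Toy decorator that records a table declaration instead of running it."""
    def register(fn):
        REGISTRY[name] = (depends_on, fn)
        return fn
    return register


# Declarations are deliberately written out of order; the toy engine sorts them.
@table("gold_revenue", depends_on=("silver_orders",))
def gold_revenue(silver_orders):
    return [{"total": sum(row["amount"] for row in silver_orders)}]


@table("silver_orders", depends_on=("bronze_orders",))
def silver_orders(bronze_orders):
    # Drop rows that fail a basic quality rule (missing amount).
    return [row for row in bronze_orders if row["amount"] is not None]


@table("bronze_orders")
def bronze_orders():
    # Stand-in for raw ingested data.
    return [{"amount": 10}, {"amount": None}, {"amount": 5}]


def run_pipeline() -> tuple[list[str], dict[str, list[dict]]]:
    """Resolve the dependency graph, then build each table after its inputs."""
    graph = {name: set(deps) for name, (deps, _) in REGISTRY.items()}
    order = list(TopologicalSorter(graph).static_order())
    results: dict[str, list[dict]] = {}
    for name in order:
        deps, fn = REGISTRY[name]
        results[name] = fn(*(results[d] for d in deps))
    return order, results
```

Calling `run_pipeline()` builds the bronze table first and the gold table last, even though the gold declaration appears first in the file. The sketch covers only ordering; it rebuilds everything on each run and does nothing incremental.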

For core concepts (streaming tables, materialized views, flows, expectations) and the case for SDP over plain Spark, see What is Lakeflow Spark Declarative Pipelines.

SDP reference material

Useful pages to keep open while you build:

Load data in pipelines: Auto Loader for incremental ingestion from cloud object storage.
Transform data with pipelines: when to reach for views, materialized views, or streaming tables.
Manage data quality with expectations: constraints and business rules.
Python: Python language reference | Develop SDP with Python
SQL: SQL language reference | Develop SDP with SQL

Create the first pipeline

Pick a path based on where you are right now, not on which is "better."

Path	What it does	When to use it
Hands-on lab	Install a pre-built demo pipeline with dashboards	You want to see a working pipeline before writing one yourself
Workspace + Genie Code	Build interactively in the browser	You are still figuring out the transformation
DABs	Deploy a full medallion pipeline as code	The shape is settled and you want it reproducible

Next

Do next: Hands-on lab
Learn why: Data Governance Strategy
Reference: Spark Declarative Pipelines (Databricks docs)
