Build the First Pipeline
You'll learn how Lakeflow and Spark Declarative Pipelines work and build your first transformation pipeline in ~20 min.
Prereqs: Access your data
Why this matters
Data is connected, governance is in place, but raw data sitting in storage does not answer business questions. Pipelines transform raw ingested data into clean, reliable tables that analysts and dashboards can trust. Databricks uses Spark Declarative Pipelines (SDP) for this — you declare what you want the data to look like, and the engine handles how to get there incrementally.
Journey checklist
- Get started.
- Before you start.
- Infra setup.
- Cost monitoring.
- Data Governance Strategy.
- Access your data.
- Build the first pipeline.
- Automation and orchestration.
- Query and explore.
- Databricks AI/BI.
- Business semantics.
Data engineering on Databricks
Databricks data engineering is built on three pillars. Each handles a different stage of the data lifecycle.
| Pillar | Purpose | Where to learn more |
|---|---|---|
| Lakeflow Connect | Ingest data from external sources into UC tables | Access your data — ingestion pipelines |
| Spark Declarative Pipelines | Transform raw data through bronze, silver, and gold layers | This section |
| Lakeflow Jobs | Orchestrate pipelines, notebooks, and tasks on a schedule | Automation and orchestration |
How it works — Spark Declarative Pipelines
SDP is the Databricks implementation of Apache Spark Declarative Pipelines. You write Python (PySpark) or SQL declarations, and the engine figures out the execution plan, dependency order, and incremental processing.
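To make "the engine figures out the dependency order" concrete, here is a minimal conceptual sketch in plain Python. It is not the SDP API: the dataset names and the `execution_order` helper are hypothetical, and the standard-library `graphlib` module stands in for the planner. It only shows the idea that you declare what each dataset reads from, and a valid run order can be derived from those declarations.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations: each dataset maps to the datasets it reads from.
# The declaration order is deliberately "wrong" to show that it does not matter.
declared = {
    "gold_daily_revenue": {"silver_orders"},
    "silver_orders": {"bronze_orders"},
    "bronze_orders": set(),
}


def execution_order(declarations):
    """Return dataset names so that every dataset comes after its inputs."""
    return list(TopologicalSorter(declarations).static_order())


order = execution_order(declared)
print(order)  # ['bronze_orders', 'silver_orders', 'gold_daily_revenue']
```

In a real pipeline you would not write this ordering logic yourself; the point is that the order is derived from what each declaration reads, not from the order in which you wrote the declarations.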
Key capabilities:
- Multi-source processing — stream from Kafka, batch-load from cloud storage, or query external databases in the same pipeline.
- Built-in incremental processing — SDP tracks changes and processes only new or modified data, reducing compute cost and runtime.
- Declarative data quality — define expectations (constraints and business rules) inline with transformations. Depending on the action you configure, rows that fail an expectation are kept and counted in pipeline metrics, dropped, or cause the update to fail.
- Unity Catalog integration — every table, view, and pipeline is governed by UC.
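The incremental-processing and data-quality ideas above can be illustrated with a small toy model in plain Python. This is a conceptual sketch only, not SDP code: `IncrementalTable`, `valid_amount`, and the sample rows are all hypothetical, and a simple list offset stands in for whatever change tracking the real engine uses. It shows a table that reads only rows it has not seen before and drops rows that fail an expectation.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalTable:
    """Toy stand-in for a table that remembers how far it has read."""

    rows: list = field(default_factory=list)
    dropped: int = 0      # rows rejected by the expectation so far
    last_offset: int = 0  # number of source rows already processed

    def update(self, source, expectation):
        """Process only source rows added since the previous update."""
        new_rows = source[self.last_offset:]
        for row in new_rows:
            if expectation(row):
                self.rows.append(row)
            else:
                self.dropped += 1  # "drop" action: discard and count the row
        self.last_offset = len(source)
        return len(new_rows)


def valid_amount(row):
    """Hypothetical expectation: amount must be present and non-negative."""
    return row["amount"] is not None and row["amount"] >= 0


source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -5.0}]
silver = IncrementalTable()

first = silver.update(source, valid_amount)   # reads both initial rows
source.append({"id": 3, "amount": 7.5})
second = silver.update(source, valid_amount)  # reads only the new row

print(first, second, [r["id"] for r in silver.rows], silver.dropped)
```

The second update touches one row rather than three, which is the cost saving that incremental processing is meant to deliver. The invalid row is counted instead of silently disappearing, which mirrors how expectation results surface as metrics.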
SDP reference material
Read these before writing your first pipeline:
- Load data in pipelines — includes Auto Loader for incremental ingestion from cloud object storage.
- Transform data with pipelines — when to use views, materialized views, and streaming tables.
- Manage data quality with expectations — data quality constraints and business rules.
- Python: Python language reference | Develop SDP with Python
- SQL: SQL language reference | Develop SDP with SQL
Create the first pipeline
There are two ways to build your first pipeline. Pick the one that matches your workflow:
- Workspace + Genie Code — Build a pipeline interactively using the Databricks Workspace and Genie Code.
- DABs — Deploy a complete medallion pipeline as code using Databricks Asset Bundles.
Next
- Do next: Workspace + Genie Code
- Learn why: Data Governance Strategy
- Reference: Spark Declarative Pipelines — Databricks docs