Build the First Pipeline

In about 20 minutes, you'll learn how Lakeflow and Spark Declarative Pipelines work, then build your first transformation pipeline.

Prereqs: Access your data

Why this matters

Your data is connected and governance is in place, but raw data sitting in storage does not answer business questions. Pipelines transform raw ingested data into clean, reliable tables that analysts and dashboards can trust. Databricks uses Spark Declarative Pipelines (SDP) for this: you declare what you want the data to look like, and the engine works out how to get there incrementally.

Journey checklist

Data engineering on Databricks

Databricks data engineering is built on three pillars. Each handles a different stage of the data lifecycle.

Pillar	Purpose	Where to learn more
Lakeflow Connect	Ingest data from external sources into Unity Catalog (UC) tables	Access your data — ingestion pipelines
Spark Declarative Pipelines	Transform raw data through bronze, silver, and gold layers	This section
Lakeflow Jobs	Orchestrate pipelines, notebooks, and tasks on a schedule	Automation and orchestration

Lakeflow introduction

Lakeflow in action

How it works — Spark Declarative Pipelines

SDP is the Databricks implementation of Apache Spark Declarative Pipelines. You write Python (PySpark) or SQL declarations, and the engine figures out the execution plan, dependency order, and incremental processing.
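To make "declare the datasets, let the engine derive the order" concrete, here is a minimal pure-Python sketch. It is a toy model and not the SDP API: the `dataset` decorator, the `reads` parameter, `run_pipeline`, and the sample rows are all hypothetical names invented for illustration. The real engine also plans incremental execution, which this sketch does not attempt.

```python
from graphlib import TopologicalSorter

# Toy registry, NOT the SDP API: dataset name -> (function, upstream dataset names).
_datasets = {}


def dataset(*, reads=()):
    """Hypothetical decorator: declare a dataset and the datasets it reads from."""
    def register(fn):
        _datasets[fn.__name__] = (fn, tuple(reads))
        return fn
    return register


# Declarations are deliberately written out of order; the run order comes from `reads`.
@dataset(reads=["silver_orders"])
def gold_revenue_by_country(silver_orders):
    # Aggregate cleaned rows into a per-country total.
    totals = {}
    for row in silver_orders:
        totals[row["country"]] = totals.get(row["country"], 0) + row["amount"]
    return totals


@dataset(reads=["bronze_orders"])
def silver_orders(bronze_orders):
    # Keep only rows with a usable amount.
    return [r for r in bronze_orders if r["amount"] is not None]


@dataset()
def bronze_orders():
    # Stand-in for raw ingested data (made-up sample rows).
    return [
        {"country": "DE", "amount": 10},
        {"country": "DE", "amount": None},
        {"country": "US", "amount": 5},
    ]


def run_pipeline():
    """Resolve the dependency graph, then compute each dataset exactly once."""
    graph = {name: set(reads) for name, (_, reads) in _datasets.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        fn, reads = _datasets[name]
        results[name] = fn(*(results[r] for r in reads))
    return order, results


order, results = run_pipeline()
print(order)
print(results["gold_revenue_by_country"])
```

The point of the sketch is that the author never states an execution order. The functions only say what each table contains and what it depends on, and the runner derives the bronze, silver, gold sequence from those declarations.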

Key capabilities:

Multi-source processing — stream from Kafka, batch-load from cloud storage, or query external databases in the same pipeline.
Built-in incremental processing — SDP tracks changes and processes only new or modified data, reducing compute cost and runtime.
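Declarative data quality — define expectations (constraints and business rules) inline with transformations. Depending on the action you choose for each expectation, rows that fail are kept and counted in quality metrics, dropped from the target table, or cause the update to fail.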
Unity Catalog integration — every table, view, and pipeline is governed by UC.
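The incremental-processing and data-quality capabilities above can be illustrated with a small pure-Python sketch. This is a toy model under stated assumptions, not the SDP API: `ToyStreamingTable`, its `checkpoint` counter, and the `non_negative_amount` rule are hypothetical, and the sketch implements only a "drop failing rows" behavior against an append-only source.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStreamingTable:
    """Toy model (not the SDP API): an append-only target with a checkpoint."""
    expectations: dict                      # rule name -> predicate(row) -> bool
    rows: list = field(default_factory=list)
    checkpoint: int = 0                     # source records already consumed
    violations: dict = field(default_factory=dict)

    def update(self, source):
        """Process only the records that arrived since the previous update."""
        new_records = source[self.checkpoint:]
        for row in new_records:
            failed = [name for name, ok in self.expectations.items() if not ok(row)]
            for name in failed:
                # Count every violation so quality can be monitored.
                self.violations[name] = self.violations.get(name, 0) + 1
            if not failed:
                # "Drop" behavior: only rows passing all rules are written.
                self.rows.append(row)
        self.checkpoint = len(source)
        return len(new_records)


# Made-up sample data: one valid row and one row that violates the rule.
source = [{"id": 1, "amount": 20}, {"id": 2, "amount": -5}]
table = ToyStreamingTable(
    expectations={"non_negative_amount": lambda r: r["amount"] >= 0}
)

first = table.update(source)        # reads both existing records
source.append({"id": 3, "amount": 7})
second = table.update(source)       # reads only the newly arrived record

print(first, second)
print([r["id"] for r in table.rows], table.violations)
```

The second update touches one record rather than three, which is the cost saving incremental processing is after. The violation counter shows how a quality rule declared next to the transformation can both filter rows and produce metrics.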

SDP reference material

Read these before writing your first pipeline:

Load data in pipelines — includes Auto Loader for incremental ingestion from cloud object storage.
Transform data with pipelines — when to use views, materialized views, and streaming tables.
Manage data quality with expectations — data quality constraints and business rules.
Python: Python language reference | Develop SDP with Python
SQL: SQL language reference | Develop SDP with SQL

Create the first pipeline

There are two paths to build your first pipeline. Pick the one that matches your workflow:

Workspace + Genie Code — Build a pipeline interactively using the Databricks Workspace and Genie Code.
DABs — Deploy a complete medallion pipeline as code using Databricks Asset Bundles.

Next

Do next: Workspace + Genie Code
Learn why: Data Governance Strategy
Reference: Spark Declarative Pipelines — Databricks docs
