Build the First Pipeline

In about 20 minutes, you'll learn how Lakeflow and Spark Declarative Pipelines work, then build your first transformation pipeline.

Prereqs: Access your data

Why this matters

Your data is connected and governance is in place, but raw data sitting in storage does not answer business questions. Pipelines transform raw ingested data into clean, reliable tables that analysts and dashboards can trust. Databricks uses Spark Declarative Pipelines (SDP) for this: you declare what you want the data to look like, and the engine works out how to get there incrementally.

Journey checklist

  • Get started.
  • Before you start.
  • Infra setup.
  • Cost monitoring.
  • Data Governance Strategy.
  • Access your data.
  • Build the first pipeline.
  • Automation and orchestration.
  • Query and explore.
  • Databricks AI/BI.
  • Business semantics.

Data engineering on Databricks

Databricks data engineering is built on three pillars. Each handles a different stage of the data lifecycle.

Pillar | Purpose | Where to learn more
Lakeflow Connect | Ingest data from external sources into Unity Catalog (UC) tables | Access your data (ingestion pipelines)
Spark Declarative Pipelines | Transform raw data through bronze, silver, and gold layers | This section
Lakeflow Jobs | Orchestrate pipelines, notebooks, and tasks on a schedule | Automation and orchestration

Lakeflow introduction

Lakeflow in action

How it works — Spark Declarative Pipelines

SDP is the Databricks implementation of Apache Spark Declarative Pipelines. You write Python (PySpark) or SQL declarations, and the engine figures out the execution plan, dependency order, and incremental processing.
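
To make "the engine figures out the dependency order" concrete, here is a toy model in plain Python. It is not the SDP API and it does not run Spark. It only shows how an execution order can be derived from declarations alone. The dataset names (bronze_orders, silver_orders, and so on) and the plan helper are hypothetical, chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations: each dataset lists the datasets it reads from.
# Note that nothing here says in which order to run them.
declarations = {
    "gold_daily_revenue": ["silver_orders"],
    "gold_top_customers": ["silver_orders"],
    "silver_orders": ["bronze_orders"],
    "bronze_orders": [],
}


def plan(declared):
    """Return an execution order in which every dataset follows its inputs."""
    return list(TopologicalSorter(declared).static_order())


order = plan(declarations)
print(order)  # bronze_orders comes first, silver_orders second, then the gold tables
```

The declarations are written gold-first on purpose: the order in the source does not matter, because the order of execution is computed from the dependencies. A real engine additionally has to decide what to recompute and what to process incrementally, which this sketch leaves out.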

Key capabilities:

  • Multi-source processing: stream from Kafka, batch-load from cloud storage, or query external databases in the same pipeline.
  • Built-in incremental processing: SDP tracks changes and processes only new or modified data, reducing compute cost and runtime.
  • Declarative data quality: define expectations (constraints and business rules) inline with transformations. Depending on the action you choose for each expectation, rows that fail are kept and recorded in quality metrics, dropped, or cause the update to fail.
  • Unity Catalog integration: every table, view, and pipeline is governed by UC.
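
The incremental processing and data quality capabilities above can be sketched in plain Python. This is a simplified model under stated assumptions, not the SDP API: the IncrementalTable class, the apply_expectation helper, the "warn" / "drop" / "fail" action names, and the use of an increasing id as a high-water mark are all hypothetical stand-ins for what the engine tracks internally.

```python
from dataclasses import dataclass, field


class ExpectationFailed(Exception):
    """Raised when an expectation with action='fail' is violated."""


def apply_expectation(rows, name, predicate, action="warn"):
    """Check rows against a rule; return (rows to keep, number of violations)."""
    passed = [r for r in rows if predicate(r)]
    failed = len(rows) - len(passed)
    if failed and action == "fail":
        raise ExpectationFailed(f"{name}: {failed} row(s) violated")
    # 'drop' removes violating rows; 'warn' keeps them and only reports the count.
    kept = passed if action == "drop" else list(rows)
    return kept, failed


@dataclass
class IncrementalTable:
    rows: list = field(default_factory=list)
    high_water_mark: int = 0  # largest source id already processed (assumption)

    def update(self, source):
        """Process only source rows not seen in a previous update."""
        new = [r for r in source if r["id"] > self.high_water_mark]
        kept, failed = apply_expectation(
            new, "valid_amount", lambda r: r["amount"] > 0, action="drop"
        )
        self.rows.extend(kept)
        if new:
            self.high_water_mark = max(r["id"] for r in new)
        return len(new), failed


source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -5.0}]
silver = IncrementalTable()
first = silver.update(source)
print(first)  # (2, 1): two new rows read, one dropped by the expectation

source.append({"id": 3, "amount": 7.5})
second = silver.update(source)
print(second)  # (1, 0): only the row added since the last update is read
```

The second update reads one row instead of three, which is where the savings in compute and runtime come from. In a real pipeline the engine maintains this state for you, so no such bookkeeping appears in your own code.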

SDP reference material

Read these before writing your first pipeline:

Create the first pipeline

There are two paths to build your first pipeline. Pick the one that matches your workflow:

  • Workspace + Genie Code: build a pipeline interactively using the Databricks Workspace and Genie Code.
  • DABs: deploy a complete medallion pipeline as code using Databricks Asset Bundles (DABs).
