7. Build the first ETL pipeline

In about 20 minutes, you'll learn how Lakeflow and Spark Declarative Pipelines work, then build your first transformation pipeline.

Prereqs: Access your data

Why this matters

Raw data in storage answers nothing on its own. The job of a pipeline is to turn it into clean tables that an analyst or a dashboard can actually trust. On Databricks you do that with Spark Declarative Pipelines (SDP): you declare what the tables should look like, and the engine works out how to build them and how to keep them up to date as new data lands.

The payoff of "declarative" is that you stop hand-writing the plumbing. You no longer maintain checkpoints by hand or work out which rows are new. You describe the result, and the engine handles incremental processing and dependency order for you.
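To make that plumbing concrete, here is a minimal plain-Python sketch of what an imperative job has to track for itself: a checkpoint and the logic for picking out new rows. It is a toy illustration, not SDP code. The function `process_new_rows`, the `last_id` checkpoint field, and the row layout are all assumptions made for this example.

```python
# Toy illustration of hand-written incremental plumbing (not SDP code).
# The row layout and the `last_id` checkpoint field are hypothetical.

def process_new_rows(source_rows, target_rows, checkpoint):
    """Append rows with an id above the checkpoint, then advance it."""
    last_id = checkpoint.get("last_id", 0)
    new_rows = [row for row in source_rows if row["id"] > last_id]
    # Example transformation: tidy up the name column.
    target_rows.extend(
        {"id": row["id"], "name": row["name"].strip().title()} for row in new_rows
    )
    if new_rows:
        checkpoint["last_id"] = max(row["id"] for row in new_rows)
    return len(new_rows)


source = [{"id": 1, "name": " ada "}, {"id": 2, "name": "grace"}]
target, checkpoint = [], {}

first_run = process_new_rows(source, target, checkpoint)   # picks up ids 1 and 2
source.append({"id": 3, "name": "LINUS"})
second_run = process_new_rows(source, target, checkpoint)  # picks up only id 3
```

Every line that touches `checkpoint` is bookkeeping rather than business logic. In a declarative pipeline you would write only the transformation and leave that tracking to the engine.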

Journey checklist

  • Get started.
  • Before you start.
  • Infra setup.
  • Cost monitoring.
  • Data Governance Strategy.
  • Access your data.
  • Build the first pipeline.
  • Automation and orchestration.
  • Query and explore.
  • Databricks AI/BI.
  • Business semantics.

Data engineering on Databricks

Three Databricks tools split the work, each at a different stage of the data's life: one to ingest, one to transform, one to schedule. This section is about the transform step, but it helps to see where it sits.

| Tool | What it does | Where to learn more |
| --- | --- | --- |
| Lakeflow Connect | Pulls data from external sources into UC tables | Access your data: Databases and SaaS ingestion |
| Spark Declarative Pipelines | Transforms raw data through bronze, silver, and gold layers | This section |
| Lakeflow Jobs | Runs pipelines, notebooks, and tasks on a schedule | Automation and orchestration |

Lakeflow introduction

Lakeflow in action

How it works: Spark Declarative Pipelines

SDP is the Databricks implementation of Apache Spark Declarative Pipelines. You write declarations in Python (PySpark) or SQL. The engine reads them, works out the execution plan and the dependency order between tables, and runs the whole thing incrementally.
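The idea of declaring tables and letting an engine order them can be sketched in a few lines of plain Python. This is a toy model only, not the SDP API: the `table` decorator, `run_pipeline`, and the `orders_*` table names are hypothetical, and the sketch rebuilds everything on each run instead of processing incrementally.

```python
# Toy model of a declarative engine (not the SDP API).
# `table`, `run_pipeline`, and the table names are hypothetical.
from graphlib import TopologicalSorter

REGISTRY = {}  # table name -> (upstream table names, builder function)


def table(*depends_on):
    """Register a function as a table declaration with its upstream tables."""
    def decorator(fn):
        REGISTRY[fn.__name__] = (depends_on, fn)
        return fn
    return decorator


# Declarations can appear in any order; the engine sorts them out.
@table("orders_silver")
def orders_gold(orders_silver):
    return {"total": sum(row["amount"] for row in orders_silver)}


@table("orders_bronze")
def orders_silver(orders_bronze):
    return [row for row in orders_bronze if row["amount"] is not None]


@table()
def orders_bronze():
    return [{"amount": 10}, {"amount": None}, {"amount": 5}]


def run_pipeline():
    """Build every declared table after the tables it depends on."""
    graph = {name: set(deps) for name, (deps, _) in REGISTRY.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        deps, build = REGISTRY[name]
        results[name] = build(*(results[dep] for dep in deps))
    return order, results


order, results = run_pipeline()
```

Although the gold table is declared first, `run_pipeline` builds bronze, then silver, then gold, because the order comes from the declared dependencies and not from the position of the code in the file.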

For core concepts (streaming tables, materialized views, flows, expectations) and the reasons to choose SDP over plain Spark, see What is Lakeflow Spark Declarative Pipelines.

SDP reference material

Awesome pages to have open while you build:

Create the first pipeline

Choose a path based on where you are right now, not on which one seems "better."

| Path | What it does | When to use it |
| --- | --- | --- |
| Hands-on lab | Install a pre-built demo pipeline with dashboards | You want to see a working pipeline before writing one yourself |
| Workspace + Genie Code | Build interactively in the browser | You are still figuring out the transformation |
| DABs | Deploy a full medallion pipeline as code | The shape is settled and you want it reproducible |
