6. Build the first pipeline
- Create the first Spark Declarative Pipeline on Databricks.
Journey checklist
- Identify target cloud tenant(s).
- Infra setup.
- Data Governance Strategy.
- Access your data.
- Build the first pipeline.
- Automation and orchestration.
- Query and explore.
- Databricks AI/BI
Data Engineering on Databricks
Lakeflow introduction
Lakeflow in action
Lessons learned from the previous videos
For data ingestion
- Use Lakeflow Connect.
For data transformation
- Use Spark Declarative Pipelines.
For orchestration
- Use Lakeflow Jobs.
Spark Declarative Pipelines
- Apache Spark 4.1.1 Spark Declarative Pipelines (SDP) documentation.
- Python (PySpark) and SQL support.
- Process multiple sources simultaneously, whether streaming from Kafka, batch-loading from cloud storage, or querying external databases.
- Built-in incremental processing intelligently tracks changes and processes only new or modified data, dramatically reducing compute costs and pipeline runtimes.
- Data quality is enforced through declarative expectations that you define inline with your transformations.
- Integrated with Unity Catalog (everything on Databricks is governed by Unity Catalog).
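As a minimal sketch of the multi-source point above (the broker, topic, path, and table names are hypothetical, and `read_kafka`/`read_files` availability depends on your Databricks runtime), one pipeline can declare streaming tables over both a Kafka topic and files landing in cloud object storage:

```sql
-- Hypothetical sketch: broker, topic, path, and table names are placeholders.
-- One pipeline, two sources: a Kafka topic and files in cloud object storage.
CREATE OR REFRESH STREAMING TABLE clickstream_raw
AS SELECT * FROM STREAM read_kafka(
  bootstrapServers => 'broker:9092',
  subscribe => 'clickstream'
);

CREATE OR REFRESH STREAMING TABLE orders_raw
AS SELECT * FROM STREAM read_files(
  '/Volumes/main/landing/orders/',
  format => 'json'
);
```

Because both datasets are streaming tables, each pipeline update picks up only new messages and files, which is the incremental behavior described above.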
SDP Features
Technical references before coding 🛡️
- Load data in pipelines.
- Auto Loader for incremental ingestion of data sitting in cloud object storage.
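A hedged sketch of Auto Loader-style ingestion in SDP SQL (the path, table name, and columns are hypothetical): `STREAM read_files(...)` discovers new files incrementally, so reruns only process data that has not been seen before.

```sql
-- Hypothetical path and names. New files are discovered incrementally;
-- already-processed files are not read again on the next update.
CREATE OR REFRESH STREAMING TABLE customers_bronze
AS SELECT
  *,
  _metadata.file_path AS source_file   -- keep ingestion lineage per row
FROM STREAM read_files(
  '/Volumes/main/landing/customers/',
  format => 'csv',
  header => true
);
```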
- Transform data with pipelines.
- When to use views, materialized views, and streaming tables!
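A rough rule of thumb, sketched below with hypothetical names (exact syntax may vary slightly by runtime): streaming tables for append-only ingestion, temporary views for intermediate logic you don't need to persist, and materialized views for transformed results that should be kept up to date incrementally.

```sql
-- Hypothetical names; three dataset flavors in one pipeline.
CREATE OR REFRESH STREAMING TABLE events_bronze      -- append-only ingestion
AS SELECT * FROM STREAM read_files(
  '/Volumes/main/landing/events/', format => 'json'
);

CREATE TEMPORARY VIEW events_deduped                 -- intermediate logic, not persisted
AS SELECT DISTINCT * FROM events_bronze;

CREATE OR REFRESH MATERIALIZED VIEW events_by_day    -- persisted, refreshed incrementally
AS SELECT date(event_ts) AS day, count(*) AS events
FROM events_deduped
GROUP BY date(event_ts);
```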
- Manage data quality with pipeline expectations.
- Data quality constraints and business rules defined as expectations.
- Python
- SQL
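A sketch of inline expectations (the constraint names, columns, and upstream table are hypothetical). Each expectation can record violations in metrics (the default), drop the offending rows, or fail the update:

```sql
-- Hypothetical constraints on a hypothetical orders_bronze table.
CREATE OR REFRESH STREAMING TABLE orders_silver (
  -- drop rows that violate the rule
  CONSTRAINT valid_order_id  EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  -- record violations in pipeline metrics but keep the rows (default policy)
  CONSTRAINT positive_amount EXPECT (amount > 0),
  -- stop the update entirely on violation
  CONSTRAINT known_currency  EXPECT (currency IN ('USD', 'EUR')) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM orders_bronze;
```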
Create the first pipeline 🛠️
- UI + Databricks Agent – Build a pipeline using the Databricks UI and Databricks Agent.
- DABs – Build a pipeline using Databricks Asset Bundles.
- MCP skills – Build a pipeline using MCP skills.
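For the DABs route, a minimal `databricks.yml` sketch might look like this (the bundle, catalog, schema, and file names are all placeholders; check the Databricks Asset Bundles reference for the exact fields supported by your CLI version):

```yaml
# Hypothetical databricks.yml; all names and paths are placeholders.
bundle:
  name: first_pipeline

resources:
  pipelines:
    first_pipeline:
      name: first_pipeline
      catalog: main
      schema: demo
      libraries:
        - file:
            path: ./transformations/orders.sql

targets:
  dev:
    default: true
```

With the Databricks CLI, `databricks bundle deploy -t dev` typically deploys the bundle, and `databricks bundle run first_pipeline` triggers a pipeline update.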