Supported Source Types

Applies To: Pipeline Bundle

Configuration Scope: Data Flow Spec

Databricks Docs: https://docs.databricks.com/en/delta-live-tables/python-ref.html

The Lakeflow Framework supports multiple source types. Each source type provides specific configuration options to handle different data ingestion scenarios.

Source Types


Batch Files

Reads data in batch mode from UC Volumes or cloud storage locations (e.g., S3, ADLS, GCS). Supports various file formats and provides options for filtering and transforming data during ingestion.

  • Flexible path-based file access

  • Reader options for different file formats

  • Optional select expressions and where clauses

  • Schema on read support
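
As an illustration, a minimal PySpark sketch of a batch file read exercising these options; the path, format, schema, and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical batch read from a UC Volume path; format, options,
# schema, select expressions, and where clause are illustrative.
df = (
    spark.read.format("csv")
    .option("header", "true")                                 # reader option
    .schema("order_id BIGINT, amount DOUBLE, ts TIMESTAMP")   # schema on read
    .load("/Volumes/main/raw/orders/")                        # path-based access
    .selectExpr("order_id", "amount", "ts")                   # select expressions
    .where("amount > 0")                                      # where clause
)
```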

Cloud Files

Incrementally ingests data from UC Volumes or cloud storage locations (e.g., S3, ADLS, GCS) using Auto Loader. Supports various file formats and provides options for filtering and transforming data during ingestion.

  • Flexible path-based file access

  • Reader options for different file formats

  • Optional select expressions and where clauses

  • Schema on read support
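
The Cloud Files source corresponds to Databricks Auto Loader (the cloudFiles stream source). A minimal sketch of an equivalent read, with hypothetical paths, format, and columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical incremental ingest with Auto Loader; the source path,
# file format, and schema location are illustrative.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                      # source file format
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/events")
    .load("s3://my-bucket/events/")                           # cloud storage path
    .selectExpr("user_id", "event_type", "ts")                # select expressions
    .where("event_type IS NOT NULL")                          # where clause
)
```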

Delta

Connects to existing Delta tables in the metastore, supporting both batch and streaming reads with change data feed (CDF) capabilities.

  • Database and table-based access

  • Change Data Feed (CDF) support

  • Optional path-based access

  • Configurable reader options
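
A minimal sketch of the underlying Delta streaming read with the Change Data Feed enabled; the table name and starting version are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical streaming read of an existing Delta table with CDF
# enabled; catalog, schema, and table names are illustrative.
cdf = (
    spark.readStream
    .option("readChangeFeed", "true")          # CDF support
    .option("startingVersion", 1)              # reader option
    .table("main.sales.orders")                # database/table-based access
)
```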

Delta Join

Enables joining multiple Delta tables, supporting both streaming and static join patterns.

  • Multiple source table configuration

  • Stream and static join modes

  • Left and inner join support

  • Flexible join conditions

  • Per-source CDF configuration
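
A minimal sketch of a stream-static left join over two Delta tables, the kind of pattern this source type configures; table names and join keys are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stream-static join of two Delta tables; table names,
# join keys, and the left join mode are illustrative.
orders = spark.readStream.table("main.sales.orders")    # streaming side
customers = spark.read.table("main.sales.customers")    # static side

joined = orders.join(
    customers,
    on=orders.customer_id == customers.id,              # join condition
    how="left",                                         # left or inner
)
```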

Kafka

Enables reading from Apache Kafka topics for real-time streaming data processing.

  • Kafka-specific reader options

  • Schema definition support

  • Filtering and transformation support

  • Topic-based configurations

  • Demux and fan-out support
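
A minimal sketch of the underlying Kafka read with a declared schema and a filter; the brokers, topic, and message schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.getOrCreate()

# Hypothetical Kafka topic read; brokers and topic are illustrative.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # Kafka reader option
    .option("subscribe", "orders")                      # topic-based configuration
    .load()
)

# Parse the message value against a declared schema, then filter.
events = (
    raw.select(from_json(col("value").cast("string"),
                         "order_id BIGINT, amount DOUBLE").alias("v"))
    .select("v.*")
    .where(col("amount") > 0)
)
```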

Python

Allows using a Python function as a data source, providing flexibility for complex data transformations.

  • Python file-based configuration

  • Functions stored in python_functions subdirectory

  • Full Python / PySpark capabilities

  • Detailed configuration: see Python Source
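
For illustration only, a sketch of what a function under the python_functions subdirectory might look like, assuming the framework expects a callable that returns a DataFrame; the function name and signature are hypothetical, not a confirmed framework contract:

```python
# Hypothetical contents of a file under python_functions/.
from pyspark.sql import DataFrame, SparkSession


def load_orders(spark: SparkSession) -> DataFrame:
    """Build a source DataFrame with arbitrary Python/PySpark logic."""
    df = spark.read.table("main.sales.orders")   # illustrative table name
    return df.where("amount > 0")                # arbitrary transformation
```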

SQL

Allows using SQL queries as data sources, providing flexibility for complex data transformations.

  • SQL file-based configuration

  • Queries stored in dml subdirectory

  • Full SQL transformation capabilities

  • Detailed configuration: see SQL Source
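
For illustration, a hypothetical query as it might live in a dml file, exercised here through spark.sql; the table and columns are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical query text as it might be stored in dml/; the table
# and columns are illustrative.
query = """
    SELECT order_id, SUM(amount) AS total
    FROM main.sales.orders
    GROUP BY order_id
"""
df = spark.sql(query)
```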

General Data Flow Spec Configuration

The source type is set as an attribute when creating your Data Flow Spec. Refer to the Data Flow Spec Reference documentation for more information.

Detailed Source Type Configuration