Change Data Capture (CDC) Configuration
The `cdcSettings` and `cdcSnapshotSettings` objects enable Change Data Capture and pass configuration to the CDC APIs.
| Field | Type | Description |
|---|---|---|
| `cdcSettings` | object | See cdcSettings for more information. |
| `cdcSnapshotSettings` | object | See cdcSnapshotSettings for more information. |
cdcSettings
The cdcSettings object contains the following properties:
| Parameter | Type | Description |
|---|---|---|
| `keys` | list | The column or combination of columns that uniquely identify a row in the source data. This is used to identify which CDC events apply to specific records in the target table. |
| `sequence_by` | str | The column name specifying the logical order of CDC events in the source data. Delta Live Tables uses this sequencing to handle change events that arrive out of order. |
| `scd_type` | str | Whether to store records as SCD type 1 or SCD type 2. Set to `1` for SCD type 1 or `2` for SCD type 2. |
| `apply_as_deletes` | str | (optional) Specifies when a CDC event should be treated as a DELETE rather than an upsert. |
| `where` | str | (optional) Filter the rows by a condition. |
| `ignore_null_updates` | bool | (optional) Allow ingesting updates containing a subset of the target columns. When a CDC event matches an existing row and `ignore_null_updates` is `True`, columns with a null retain their existing values in the target. This also applies to nested columns with a value of null. When `ignore_null_updates` is `False`, existing values are overwritten with null values. |
| `except_column_list` | list | (optional) A list of columns to exclude from the upsert into the target table. |
| `track_history_column_list` / `track_history_except_column_list` | list | A subset of output columns to be tracked for history in the target table. Use `track_history_column_list` to specify the complete list of columns to be tracked. Use `track_history_except_column_list` to specify the columns to be excluded from tracking. |
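As a sketch, a `cdcSettings` block combining the parameters above might look like the following. The column names and the delete condition are illustrative, not required values, and the exact enclosing structure depends on your dataflowspec:

```json
{
  "cdcSettings": {
    "keys": ["customer_id"],
    "sequence_by": "event_ts",
    "scd_type": "2",
    "apply_as_deletes": "operation = 'DELETE'",
    "ignore_null_updates": true,
    "except_column_list": ["operation", "event_ts"]
  }
}
```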
cdcSnapshotSettings
The cdcSnapshotSettings object contains the following properties:
CDC Historical Snapshot Source Configuration
The `source` object contains the following properties for file-based sources:

| Parameter | Type | Description |
|---|---|---|
| `format` | string | The format of the source data, e.g. `table`, `parquet`, `csv`, `json`. All formats supported by Spark are available; see the PySpark Data Sources API. |
| `path` | string | The location to load the source data from. This can be a table name or a path to a file or directory with multiple snapshots. Supports three path pattern styles for version extraction: the `{version}` placeholder (simple single-segment match), the `{fragment}` placeholder (for multi-file snapshots), and regex named capture groups (for complex partitioning). See File Path Patterns for details and examples. |
| `versionType` | string | The type of versioning to use. Can be either `int` or `datetime`. |
| `datetimeFormat` | string | (conditional) Required if `versionType` is `datetime`. The format of the `startingVersion` datetime value. |
| `microSecondMaskLength` | integer | (optional) **Warning: edge cases only.** The number of microsecond digits included at the end of the datetime value. Specify this if your `versionType` is `datetime` and your filename includes microseconds, but not the full 6 digits. The default value is 6. |
| `startingVersion` | string or integer | (optional) The version to start processing from. |
| `readerOptions` | object | (optional) Additional options to pass to the reader. |
| `schemaPath` | string | (optional) The schema path to use for the source data. |
| `selectExp` | list | (optional) A list of select expressions to apply to the source data. |
| `filter` | string | (optional) A filter expression to apply to the source data. This filter is applied to the dataframe as a WHERE clause when the source is read. The `{version}` placeholder can be used in this filter expression and will be substituted with the version value at run time (e.g. `"year = '{version}'"`). Not applicable when using regex named capture groups in `path`. |
| `recursiveFileLookup` | boolean | (optional) When set to `true`, enables recursive directory traversal to find snapshot files. Use this when snapshots are stored in a nested directory structure such as Hive-style partitioning (e.g. `/data/{version}/file.parquet`). When set to `false` (default), only files in the immediate directory are searched. Default: `false`. |

Note: If `recursiveFileLookup` is set to `true`, ensure that the `path` parameter is compatible with recursive directory traversal. When using the `{version}` placeholder, place it in the directory portion of the path rather than the filename (e.g. `/data/{version}/file.parquet`). When using regex named capture groups, the pattern spans the full relative path from the first dynamic segment, so `recursiveFileLookup` must be `true` if the version spans multiple directory levels.
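Putting several of the fields above together, a file-based source definition might look like the following sketch. The path, starting version, reader options, and select expressions are illustrative values, not prescribed names:

```json
{
  "source": {
    "format": "csv",
    "path": "/mnt/data/snapshots/customer_{version}.csv",
    "versionType": "datetime",
    "datetimeFormat": "%Y_%m_%d",
    "startingVersion": "2024_01_01",
    "readerOptions": { "header": "true" },
    "selectExp": ["customer_id", "name", "cast(amount as double) as amount"]
  }
}
```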
File Path Patterns
The `path` field supports three styles for expressing where the version (and optional fragment) appears in the file path. All styles can be combined with a static base path prefix that is resolved at run time (e.g. `{sample_file_location}`).
| Style | Syntax | When to Use |
|---|---|---|
| `{version}` placeholder | `{version}` | Version is contained in a single path segment or filename component. Simple and readable for flat or single-level partitioned layouts. |
| `{fragment}` placeholder | `{fragment}` | Snapshot data for a single version is split across multiple files. Use alongside `{version}` to group files sharing the same version together. |
| Regex named capture groups | `(?P<version_<name>>.+)` | Version is spread across multiple path segments or interleaved with other text. Supports complex partitioning schemes (e.g. Hive-style `YEAR=.../MONTH=.../DAY=...`) where the version cannot be expressed as a single placeholder. |

`{version}` — single-segment version
The `{version}` placeholder matches one path segment or filename component. It is internally converted to the regex named capture group `(?P<version_main>.+)`.

```json
{
  "path": "/mnt/data/snapshots/customer_{version}.csv",
  "versionType": "timestamp",
  "datetimeFormat": "%Y_%m_%d"
}
```

Files matched: `customer_2024_01_01.csv`, `customer_2024_01_02.csv`, …

For directory-partitioned layouts, place `{version}` in the directory portion and set `recursiveFileLookup` to `true`:

```json
{
  "path": "/mnt/data/snapshots/{version}/customer.csv",
  "versionType": "timestamp",
  "datetimeFormat": "YEAR=%Y/MONTH=%m/DAY=%d",
  "recursiveFileLookup": true
}
```

Files matched: `YEAR=2024/MONTH=01/DAY=01/customer.csv`, …

`{fragment}` — multi-file snapshots
Use `{fragment}` alongside `{version}` when a single snapshot version is split across multiple files. All files sharing the same version are read and unioned together before CDC processing.

```json
{
  "path": "/mnt/data/snapshots/customer_{version}_split_{fragment}.csv",
  "versionType": "timestamp",
  "datetimeFormat": "%Y_%m_%d"
}
```

Files matched and grouped by version: `customer_2024_01_01_split_1.csv`, `customer_2024_01_01_split_2.csv` → both ingested as version `2024-01-01`.

Regex named capture groups — multi-segment versions
For cases where the version is distributed across multiple directory levels or interleaved with fixed text, use Python regex named capture groups with the prefix `version_`. All groups whose names start with `version_` are extracted and concatenated in the order they appear in the pattern (left to right) to form the final version string, which is then parsed according to `datetimeFormat` or treated as an integer.

Group naming convention: `(?P<version_<name>>.+)`. The `<name>` suffix is arbitrary but must be unique within the pattern. The concatenation order is determined by the position of each group in the path expression, not by the name.

```json
{
  "path": "/mnt/data/snapshots/(?P<version_year>.+)/(?P<version_month>.+)/data/customer_(?P<version_day>.+).csv",
  "versionType": "timestamp",
  "datetimeFormat": "%Y%m%d",
  "recursiveFileLookup": true
}
```

For the file `2024/01/data/customer_15.csv`, the groups are captured left to right: `version_year=2024`, `version_month=01`, `version_day=15`. These are concatenated in pattern order to produce `"20240115"`, which is then parsed with `datetimeFormat: "%Y%m%d"`.

Tip: Arrange your `(?P<version_...>)` groups in the path from left to right in the same order that their values should be concatenated to match your `datetimeFormat`. The group names themselves only need to be unique; their position in the pattern controls concatenation order.

See `samples/bronze_sample/src/dataflows/feature_samples/dataflowspec/historical_snapshot_files_datetime_recursive_and_partitioned_regex_main.json` for a complete working example.

The `source` object contains the following properties for table-based sources:
| Parameter | Type | Description |
|---|---|---|
| `table` | string | The table name to load the source data from. |
| `versionColumn` | string | The column name to use for versioning. |
| `startingVersion` | string or integer | (optional) The version to start processing from. |
| `selectExp` | list | (optional) A list of select expressions to apply to the source data. |
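For comparison, a table-based source might be configured like the following sketch, assuming the same enclosing `source` key as the file-based form. The table and column names are illustrative:

```json
{
  "source": {
    "table": "raw.customer_snapshots",
    "versionColumn": "snapshot_version",
    "startingVersion": 1,
    "selectExp": ["customer_id", "name", "updated_at"]
  }
}
```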