Builder Parallelization
Applies To: Pipeline Bundle
Configuration Scope: Global Pipeline Configuration
Databricks Docs: NA
Overview
The Lakeflow Framework supports parallel processing during both the dataflow specification build phase and the pipeline initialization phase, which reduces initialization time. It uses Python's ThreadPoolExecutor to run multiple operations concurrently, which is particularly beneficial for:
Large pipelines with many dataflow specifications
Complex dataflow specifications requiring validation and transformation
The framework detects the number of logical CPU cores available on the Spark driver using os.cpu_count() and sets the default maximum number of workers to cores - 1. Reserving one core for system operations keeps the driver responsive while the remaining cores handle parallel work. If CPU core detection fails, the framework falls back to a default of 1 worker thread.
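The default described above can be sketched in a few lines of Python. This is an illustrative sketch, not the framework's actual code: the names default_max_workers and build_all are hypothetical, and clamping the result to at least 1 on a single-core driver is an assumption the text above does not state.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_max_workers() -> int:
    """Hypothetical helper: logical cores minus one, falling back to 1."""
    cores = os.cpu_count()  # returns None when the core count cannot be determined
    if cores is None:
        return 1
    # Assumption: clamp to 1 so a single-core driver still gets one worker.
    return max(1, cores - 1)


def build_all(specs, build_one, max_workers=None):
    """Apply build_one to every spec concurrently, preserving input order."""
    workers = max_workers or default_max_workers()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(build_one, specs))
```

Threads suit this kind of work when each operation spends most of its time on I/O, such as reading specification files; whether that holds for the framework's own build steps is not confirmed here.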
Parameters
| Parameter | Type | Default Value | Phase | Description |
|---|---|---|---|---|
| | Integer | NA | Pipeline Initialization | Should only be used if the auto-detected default is not working. Controls the maximum number of worker threads used when: |
| | Boolean | False | Pipeline Initialization | Disables threading when creating DataFlow objects |
Configuration
Global Configuration
Configure these parameters globally for all pipelines in your src/config/global.json|yaml file:
{
  "spark_config": {
    "spark.databricks.sql.streamingTable.cdf.applyChanges.returnPhysicalCdf": true,
    "pipelines.streamingFlowReadOptionsEnabled": true,
    "pipelines.externalSink.enabled": true
  },
  "override_max_workers": 4,
  "mandatory_table_properties": {
    "delta.logRetentionDuration": "interval 45 days",
    "delta.deletedFileRetentionDuration": "interval 45 days",
    "delta.enableRowTracking": "true"
  }
}
spark_config:
  spark.databricks.sql.streamingTable.cdf.applyChanges.returnPhysicalCdf: true
  pipelines.streamingFlowReadOptionsEnabled: true
  pipelines.externalSink.enabled: true
override_max_workers: 4
mandatory_table_properties:
  delta.logRetentionDuration: interval 45 days
  delta.deletedFileRetentionDuration: interval 45 days
  delta.enableRowTracking: 'true'
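To illustrate how an override such as override_max_workers might take precedence over the auto-detected default, here is a minimal Python sketch. The function resolve_max_workers and its validation rules are assumptions made for illustration; the framework's real resolution logic may differ.

```python
import json
import os


def resolve_max_workers(global_config: dict) -> int:
    """Hypothetical resolver: prefer the configured override, else cores - 1."""
    override = global_config.get("override_max_workers")
    if override is not None:
        # Assumption: only positive integers are accepted (bool is rejected
        # explicitly because it is a subclass of int in Python).
        if isinstance(override, bool) or not isinstance(override, int) or override < 1:
            raise ValueError("override_max_workers must be a positive integer")
        return override
    cores = os.cpu_count()  # None when detection fails
    # Assumption: never return fewer than 1 worker.
    return max(1, cores - 1) if cores else 1


config = json.loads('{"override_max_workers": 4}')
print(resolve_max_workers(config))  # prints 4
```

With no override present, the sketch falls back to the same cores - 1 rule described in the Overview.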
Troubleshooting
Debugging Core Detection:
The framework logs the detected core count and calculated default max workers during initialization:
INFO - Logical cores (threads): 4
INFO - Default max workers: 3