Spark Configuration
Applies To: Framework Bundle, Pipeline Bundle
Configuration Scope: Global, Pipeline
Databricks Docs:
The Spark Configuration feature allows you to define and manage Spark configurations either globally at the framework level, applied across all pipelines, or at the pipeline bundle level.
Configuration
For the framework bundle, Spark configurations are defined in src/config/global.json|yaml under the spark_config section. For a pipeline bundle, they are defined in src/pipeline_configs/global.json|yaml under the spark_config section.
Configuration Schema
The Spark configuration section must follow this schema:
JSON:

```json
{
  "spark_config": {
    "<configuration_key>": "<configuration_value>",
    ...
  }
}
```

YAML:

```yaml
spark_config:
  <configuration_key>: <configuration_value>
  ...
```
| Field | Description |
|---|---|
| configuration_key | The Spark configuration property name (e.g., "spark.sql.shuffle.partitions") |
| configuration_value | The value to set for the configuration property. Can be a string, number, or boolean, depending on the property. |
Common Configuration Properties
Here are some commonly used Spark configuration properties:
| Property | Description | Default Value |
|---|---|---|
| spark.sql.shuffle.partitions | Number of partitions to use for shuffle operations | 200 |
| spark.sql.files.maxPartitionBytes | Maximum size of a partition when reading files | 128MB |
| spark.sql.adaptive.enabled | Enable adaptive query execution | true |
| spark.sql.broadcastTimeout | Timeout in seconds for broadcast joins | 300 |
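To illustrate what spark.sql.shuffle.partitions controls, the snippet below is a minimal PySpark sketch (not part of the framework; the app name and column names are arbitrary). With adaptive query execution disabled, a wide transformation such as groupBy produces exactly that many shuffle partitions:

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("shuffle-partitions-demo")
    .config("spark.sql.adaptive.enabled", "false")   # disable AQE so the static value applies
    .config("spark.sql.shuffle.partitions", "8")
    .getOrCreate()
)

# Any wide transformation (groupBy, join, ...) triggers a shuffle.
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 16)
counts = df.groupBy("bucket").count()

# With AQE disabled, the shuffle produces spark.sql.shuffle.partitions partitions.
print(counts.rdd.getNumPartitions())  # 8
```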
Example Configuration
Below is an example of a typical Spark configuration in the global.json|yaml file:
JSON (global.json):

```json
{
  "spark_config": {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.files.maxPartitionBytes": "134217728",
    "spark.sql.broadcastTimeout": "300"
  }
}
```

YAML (global.yaml):

```yaml
spark_config:
  spark.sql.shuffle.partitions: '200'
  spark.sql.adaptive.enabled: 'true'
  spark.sql.files.maxPartitionBytes: '134217728'
  spark.sql.broadcastTimeout: '300'
```
Best Practices
- Start with the default Spark configurations and adjust based on your specific workload needs.
- Monitor query performance and resource utilization to optimize configurations.
- Document any non-standard configuration changes and their rationale.
- Test configuration changes in development before applying them to production.
Note
Some Spark configurations may be overridden by Databricks cluster configurations or job-specific settings. Refer to the Databricks documentation for the configuration precedence rules.
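Because of this precedence, it can be useful to confirm the values that are actually in effect at runtime. The following is a minimal PySpark sketch; in a Databricks notebook or job a SparkSession named spark already exists, so the builder call is only needed in a standalone script:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for key in (
    "spark.sql.shuffle.partitions",
    "spark.sql.adaptive.enabled",
    "spark.sql.files.maxPartitionBytes",
    "spark.sql.broadcastTimeout",
):
    # Returns the effective value after cluster, job, and framework settings are applied.
    print(key, "=", spark.conf.get(key, "<not set>"))
```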
Advanced Usage
Dynamic Configuration
For certain use cases, you may want to set different Spark configurations based on the environment or workload. This can be achieved using environment variables or the substitutions feature of the framework.
Example with environment-specific configurations:
JSON:

```json
{
  "spark_config": {
    "spark.sql.shuffle.partitions": "${SHUFFLE_PARTITIONS}",
    "spark.sql.adaptive.enabled": "${ADAPTIVE_EXECUTION_ENABLED}"
  }
}
```

YAML:

```yaml
spark_config:
  spark.sql.shuffle.partitions: ${SHUFFLE_PARTITIONS}
  spark.sql.adaptive.enabled: ${ADAPTIVE_EXECUTION_ENABLED}
```
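Conceptually, each ${VAR} placeholder is replaced with the corresponding environment value before the configuration is applied. The standalone Python sketch below only illustrates that idea; it is not the framework's substitutions implementation, and the variable names are the same hypothetical ones used above:

```python
import os
import re


def resolve_placeholders(config: dict) -> dict:
    """Replace ${VAR} placeholders with environment variable values (illustrative only)."""
    def resolve(value):
        if isinstance(value, str):
            # Leave the placeholder unchanged if the variable is not defined.
            return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), m.group(0)), value)
        return value

    return {key: resolve(value) for key, value in config.items()}


os.environ["SHUFFLE_PARTITIONS"] = "400"  # e.g. set by a production deployment
spark_config = {"spark.sql.shuffle.partitions": "${SHUFFLE_PARTITIONS}"}
print(resolve_placeholders(spark_config))  # {'spark.sql.shuffle.partitions': '400'}
```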
Performance Tuning
When tuning Spark configurations for performance:
1. Start with the defaults.
2. Monitor query performance and resource utilization.
3. Identify bottlenecks.
4. Adjust the relevant configurations.
5. Test and measure the impact (see the sketch below).
6. Document successful optimizations.
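As one way to measure the impact of a change, the following PySpark sketch times the same aggregation under different spark.sql.shuffle.partitions values; the dataset, column names, and values tested are arbitrary, and with AQE enabled the differences may be small because Spark coalesces shuffle partitions on its own:

```python
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000).withColumn("key", F.col("id") % 1000)


def timed_aggregation(shuffle_partitions: int) -> float:
    # spark.sql.shuffle.partitions can be changed at runtime via spark.conf.set.
    spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))
    start = time.perf_counter()
    df.groupBy("key").agg(F.sum("id")).collect()  # force execution
    return time.perf_counter() - start


for n in (50, 200, 800):
    print(f"{n:>4} shuffle partitions: {timed_aggregation(n):.2f}s")
```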
Important
Incorrect Spark configurations can significantly impact performance and stability. Always test configuration changes in a development environment first.