Spark Configuration

Applies To: Framework Bundle, Pipeline Bundle

Configuration Scope: Global, Pipeline

Databricks Docs: https://spark.apache.org/docs/latest/configuration.html

The Spark Configuration feature allows you to define and manage Spark configurations either globally at the framework level, where they apply across all pipelines, or at the pipeline bundle level.

Configuration

Scope: Global
In the Framework bundle, Spark configurations are defined in the global configuration file at src/config/global.json|yaml, under the spark_config section.

Scope: Bundle
In a Pipeline bundle, Spark configurations are defined in the bundle's global configuration file at src/pipeline_configs/global.json|yaml, under the spark_config section.

Configuration Schema

The Spark configuration section must follow this schema:

{
    "spark_config": {
        "<configuration_key>": "<configuration_value>",
        ...
    }
}

Field                    Description

configuration_key        The Spark configuration property name (e.g., "spark.sql.shuffle.partitions")

configuration_value      The value to set for the configuration property; can be a string, number,
                         or boolean, depending on the property
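For example, a numeric property can be given as a number and a boolean property as a boolean, although the quoted string forms used in the example further below are also common. Whether native JSON types are accepted depends on how the framework parses the file, so treat this snippet as illustrative:

{
    "spark_config": {
        "spark.sql.shuffle.partitions": 200,
        "spark.sql.adaptive.enabled": true
    }
}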

Common Configuration Properties

Here are some commonly used Spark configuration properties:

Property                             Description                                            Default Value

spark.sql.shuffle.partitions         Number of partitions to use for shuffle operations     200

spark.sql.files.maxPartitionBytes    Maximum size of a partition when reading files         128MB

spark.sql.adaptive.enabled           Enable adaptive query execution                        true

spark.sql.broadcastTimeout           Timeout in seconds for broadcast joins                 300

Example Configuration

Below is an example of a typical Spark configuration in the global.json|yaml file:

{
    "spark_config": {
        "spark.sql.shuffle.partitions": "200",
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.files.maxPartitionBytes": "134217728",
        "spark.sql.broadcastTimeout": "300"
    }
}
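If you use the YAML variant of the configuration file (global.yaml), the same settings can be expressed as follows. This is a direct translation of the JSON example above, assuming the YAML file mirrors the same structure:

spark_config:
  spark.sql.shuffle.partitions: "200"
  spark.sql.adaptive.enabled: "true"
  spark.sql.files.maxPartitionBytes: "134217728"
  spark.sql.broadcastTimeout: "300"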

Best Practice

  • Start with the default Spark configurations and adjust based on your specific workload needs

  • Monitor query performance and resource utilization to optimize configurations

  • Document any non-standard configuration changes and their rationale

  • Test configuration changes in development before applying to production

Note

Some Spark configurations may be overridden by Databricks cluster configurations or job-specific settings. Refer to the Databricks documentation for the configuration precedence rules.
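Because of this precedence, it can be useful to verify the effective values at runtime rather than assuming the file-level settings were applied. A minimal PySpark sketch, run in a notebook or job where a SparkSession is available:

from pyspark.sql import SparkSession

# Obtain the active session (on Databricks, `spark` is usually predefined).
spark = SparkSession.builder.getOrCreate()

# Read back the effective values of the properties set in spark_config.
for key in [
    "spark.sql.shuffle.partitions",
    "spark.sql.adaptive.enabled",
    "spark.sql.files.maxPartitionBytes",
    "spark.sql.broadcastTimeout",
]:
    print(key, "=", spark.conf.get(key))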

Advanced Usage

Dynamic Configuration

For certain use cases, you may want to set different Spark configurations based on the environment or workload. This can be achieved using environment variables or the substitutions feature of the framework.

Example with environment-specific configurations:

{
    "spark_config": {
        "spark.sql.shuffle.partitions": "${SHUFFLE_PARTITIONS}",
        "spark.sql.adaptive.enabled": "${ADAPTIVE_EXECUTION_ENABLED}"
    }
}
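The framework's substitutions feature is responsible for resolving these placeholders. Purely as an illustration of the pattern, the sketch below shows how ${VAR} placeholders are typically resolved from environment variables; resolve_placeholders is a hypothetical helper and not part of the framework, and the variable names are the examples from the snippet above:

import os
import re

# Matches ${NAME} placeholders in configuration values.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")

def resolve_placeholders(config: dict) -> dict:
    """Illustrative only: substitute ${NAME} with os.environ['NAME'],
    leaving the placeholder untouched if the variable is not set."""
    def resolve(value):
        if isinstance(value, str):
            return _PLACEHOLDER.sub(
                lambda m: os.environ.get(m.group(1), m.group(0)), value
            )
        return value

    return {key: resolve(value) for key, value in config.items()}

# Example environment, e.g. values exported by a dev deployment.
os.environ["SHUFFLE_PARTITIONS"] = "64"
os.environ["ADAPTIVE_EXECUTION_ENABLED"] = "true"

spark_config = {
    "spark.sql.shuffle.partitions": "${SHUFFLE_PARTITIONS}",
    "spark.sql.adaptive.enabled": "${ADAPTIVE_EXECUTION_ENABLED}",
}

print(resolve_placeholders(spark_config))
# {'spark.sql.shuffle.partitions': '64', 'spark.sql.adaptive.enabled': 'true'}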

Performance Tuning

When tuning Spark configurations for performance:

  1. Start with the defaults

  2. Monitor query performance and resource utilization

  3. Identify bottlenecks

  4. Adjust relevant configurations (see the worked example after this list)

  5. Test and measure impact

  6. Document successful optimizations
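As a worked illustration of steps 3 and 4 (the numbers are hypothetical): a common rule of thumb is to size shuffle partitions so that each one processes roughly 100–200 MB of shuffle data. If monitoring shows a stage shuffling about 100 GB, targeting ~200 MB per partition suggests on the order of 500 partitions:

{
    "spark_config": {
        "spark.sql.shuffle.partitions": "500"
    }
}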

Important

Incorrect Spark configurations can significantly impact performance and stability. Always test configuration changes in a development environment first.