Liquid Clustering
Applies To: |
Pipeline Bundle |
Configuration Scope: |
Data Flow Spec |
Databricks Docs: |
Liquid clustering is a Databricks feature that replaces traditional table partitioning and ZORDER to simplify data layout decisions and optimize query performance. It provides flexibility to redefine clustering keys without rewriting existing data, allowing data layout to evolve alongside analytic needs over time.
Use Cases
Databricks recommends liquid clustering for all new Delta tables, which includes both Streaming Tables (STs) and Materialized Views (MVs). The following are examples of scenarios that benefit from clustering:
Tables often filtered by high cardinality columns.
Tables with significant skew in data distribution.
Tables that grow quickly and require maintenance and tuning effort.
Tables with concurrent write requirements.
Tables with access patterns that change over time.
Tables where a typical partition key could leave the table with too many or too few partitions.
Selecting Clustering Keys
If clusterByAuto is set to true, the clustering keys will be automatically selected based on the data in the table. Otherwise, the clustering keys can be specified in the clusterByColumns attribute.
If both clusterByAuto and clusterByColumns are set, the columns specified in clusterByColumns will be used as the initial clustering keys and the keys will be automatically updated based on the data in the table over time.
Please refer to the Liquid Clustering documentation for more information on selecting clustering keys.
Configuration
Enabled by setting the clusterBy attribute as documented in Delta Target Details.
Additionally or alternatively, the clusterByAuto attribute can be enabled to allow for automatic clustering key selection as documented in Delta Target Details.