Versioning - DataFlow Specs

Applies To:

Pipeline Bundle

Configuration Scope:

Global, Individual DataFlow Specs

Databricks Docs:

NA

Overview

The Lakeflow Framework supports versioning of dataflow specifications to enable backwards compatibility and gradual migration. The feature applies versioned mappings that transform the structure and content of dataflow specifications during processing. This is particularly useful for:

  • Maintaining backwards compatibility when dataflow specification schemas evolve

  • Gradually migrating existing dataflow specifications to new formats

  • Testing new specification formats without breaking existing workflows

  • Restructuring dataflow specifications to accommodate schema changes

The versioning system applies transformation mappings that can rename fields, move content to different locations, and remove obsolete fields, allowing older specification formats to work with newer framework versions.

Mapping File Structure

DataFlow specification mappings are stored in version-specific directories, one mapping file per version: src/config/dataflow_spec_mapping/[version]/dataflow_spec_mapping.json

Each mapping file contains transformation rules organized by:

  • global: Mappings applied to all dataflow specification types

  • [dataflow_type]: Mappings applied only to specific dataflow types (e.g., "standard", "flow", "materialized_view")

The mapping file supports three types of transformations:

  1. Renaming: Change key names while preserving structure

  2. Moving: Relocate keys and values to different parts of the specification

  3. Deleting: Remove obsolete keys and their values

Mapping Operations

Rename Operations

Rename operations change key names while preserving the value and structure. Two types of renaming are supported:

  • rename_all: Recursively renames keys throughout the entire specification structure

  • rename_specific: Renames keys at specific nested paths

{
    "global": {
        "rename_all": {
            "oldKeyName": "newKeyName",
            "cdcApplyChanges": "cdcSettings"
        },
        "rename_specific": {
            "targetDetails.topic": "targetDetails.name"
        }
    }
}
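
For illustration, applying the mapping above to a hypothetical specification fragment (the field names are examples, not part of any framework schema) would yield:

Before:

{
    "oldKeyName": "value",
    "cdcApplyChanges": {
        "keys": ["id"]
    },
    "targetDetails": {
        "topic": "orders"
    }
}

After:

{
    "newKeyName": "value",
    "cdcSettings": {
        "keys": ["id"]
    },
    "targetDetails": {
        "name": "orders"
    }
}

Because rename_all is recursive, cdcApplyChanges would be renamed wherever it appears in the structure, while rename_specific renames topic only at the targetDetails path.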

Move Operations

Move operations relocate keys and their values to different locations within the specification structure using dot notation for nested paths:

{
    "global": {
        "move": {
            "legacyConfig": "settings.advanced.legacyConfig",
            "sourceMetadata": "targetDetails.metadata",
            "targetDetails.topic": "targetDetails.sinkOptions.topic"
        }
    }
}
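
As a sketch of the effect, the mapping above would transform a hypothetical specification as follows (this assumes the framework creates missing intermediate objects such as settings.advanced, per the path guidance under Best Practices):

Before:

{
    "legacyConfig": {
        "retries": 3
    },
    "sourceMetadata": {
        "owner": "team-a"
    },
    "targetDetails": {
        "topic": "orders"
    }
}

After:

{
    "settings": {
        "advanced": {
            "legacyConfig": {
                "retries": 3
            }
        }
    },
    "targetDetails": {
        "metadata": {
            "owner": "team-a"
        },
        "sinkOptions": {
            "topic": "orders"
        }
    }
}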

Delete Operations

Delete operations remove keys and their values from the specification:

{
    "global": {
        "delete": [
            "deprecatedField",
            "obsoleteTimestamp"
        ]
    }
}
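
A minimal illustration with a hypothetical specification:

Before:

{
    "dataFlowId": "example_flow",
    "deprecatedField": "unused",
    "obsoleteTimestamp": "2023-01-01T00:00:00Z"
}

After:

{
    "dataFlowId": "example_flow"
}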

Complete Example

A comprehensive mapping file combining all operation types:

{
    "global": {
        "rename_all": {
            "cdcApplyChanges": "cdcSettings",
            "cdcApplyChangesFromSnapshot": "cdcSnapshotSettings"
        },
        "rename_specific": {
            "targetDetails.topic": "targetDetails.name"
        },
        "move": {
            "legacyConfig": "settings.advanced.legacyConfig",
            "targetDetails.topic": "targetDetails.sinkOptions.topic"
        },
        "delete": [
            "deprecatedField",
            "obsoleteTimestamp"
        ]
    },
    "flow": {
        "rename_all": {
            "flowSpecificOldField": "flowSpecificNewField"
        },
        "move": {
            "flowConfig": "flowGroups.0.configuration"
        },
        "delete": [
            "temporaryField"
        ]
    }
}
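
The flow section applies only to specifications whose dataFlowType is "flow"; a specification of type "standard" would be transformed by the global section alone. As an illustration, assuming numeric path segments index into lists (so flowGroups.0.configuration addresses the first flowGroups entry), a hypothetical flow specification would be transformed like this:

Before:

{
    "dataFlowType": "flow",
    "flowSpecificOldField": "value",
    "flowConfig": {
        "parallelism": 2
    },
    "flowGroups": [
        {}
    ],
    "temporaryField": true
}

After:

{
    "dataFlowType": "flow",
    "flowSpecificNewField": "value",
    "flowGroups": [
        {
            "configuration": {
                "parallelism": 2
            }
        }
    ]
}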

Configure Global DataFlow Version

To set a global dataflow specification version that applies to all specifications in a pipeline, configure the dataflow_spec_version parameter in your pipeline configuration.

This can be set in your pipeline substitutions file:

{
    "dataflow_spec_version": "0.1.0"
}
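
With this setting, the framework loads the mapping file at src/config/dataflow_spec_mapping/0.1.0/dataflow_spec_mapping.json, matching the directory layout described above.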

When a global version is set, all dataflow specifications in the pipeline will use this mapping version unless overridden at the individual specification level.

Configure Individual DataFlow Specification Version

Individual dataflow specifications can override the global version by setting the dataFlowVersion field in their specification file:

{
    "dataFlowId": "customer_data_flow",
    "dataFlowGroup": "customers",
    "dataFlowType": "standard",
    "dataFlowVersion": "0.1.0",
    "data": {
        "sourceType": "delta",
        "targetFormat": "delta",
        "targetDetails": {
            "table": "customers"
        }
    }
}

The individual specification version takes precedence over the global version for that specific dataflow.

Note

If neither a global nor an individual dataflow version is specified, no mappings are applied and specifications are processed with their original structure.

Transformation Order

The framework applies transformations in a specific order to ensure consistent results:

  1. Move Operations: Keys are copied to their new locations first

  2. Rename Specific Operations: Keys are renamed using rename_specific (targeted)

  3. Rename All Operations: Keys are renamed using rename_all (recursive)

  4. Move Cleanup: Original source keys from move operations are removed

  5. Delete Operations: Keys specified in delete operations are removed
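
As a sketch of how these steps compose, consider a hypothetical mapping combining all three operation types (field names are illustrative):

{
    "global": {
        "rename_all": {
            "cdcApplyChanges": "cdcSettings"
        },
        "move": {
            "legacyConfig": "settings.legacyConfig"
        },
        "delete": [
            "deprecatedField"
        ]
    }
}

Input:

{
    "legacyConfig": {
        "retries": 3
    },
    "cdcApplyChanges": {
        "keys": ["id"]
    },
    "deprecatedField": true
}

The move first copies legacyConfig to settings.legacyConfig, the renames then produce cdcSettings, the move cleanup removes the original legacyConfig, and finally the delete removes deprecatedField:

{
    "settings": {
        "legacyConfig": {
            "retries": 3
        }
    },
    "cdcSettings": {
        "keys": ["id"]
    }
}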

Best Practices

  1. Default Behavior

      • Only use dataflow versioning when backwards compatibility is required

      • New specifications should use the current schema format without versioning

  2. Version Consistency

      • Use consistent version numbers across related dataflow specifications

  3. Migration Strategy

      • Start with global version configuration for bulk migrations

      • Use individual specification versions for gradual, selective migration

      • Remove version specifications once migration is complete

      • Plan transformation order carefully when combining multiple operation types

  4. Path Notation

      • Use dot notation for nested paths (e.g., "parentKey.childKey.grandchildKey")

      • Ensure target paths exist or can be created for move operations

      • Be careful with path conflicts when moving and renaming the same keys

  5. Testing

      • Thoroughly test mapping transformations in development environments

      • Validate that transformed specifications maintain expected functionality

      • Test all operation types (rename, move, delete) individually and in combination

      • Keep original specifications as a backup during migration

Version Management

  1. Mapping versions should follow semantic versioning (MAJOR.MINOR.PATCH)

  2. Each mapping version should be stored in its own directory under src/config/dataflow_spec_mapping/

  3. Maintain documentation of what each version transforms and why

  4. Keep mapping files immutable once deployed to ensure consistency

  5. Create new mapping versions rather than modifying existing ones

  6. Archive obsolete mapping versions only after confirming no pipelines depend on them

Troubleshooting

Common Issues:

  • Mapping not applied: Verify the version string exactly matches the directory name under dataflow_spec_mapping/

  • Key not transformed: Check that the key exists in the appropriate section (global or type-specific) of the mapping file

  • Specification validation errors: Ensure transformed keys match the expected schema after mapping application

  • Move operation fails: Verify source keys exist and target paths are valid using dot notation

  • Path conflicts: Check for conflicts between move targets and existing keys

  • Delete operation errors: Ensure keys to be deleted exist at the specified paths

Debugging: Enable debug logging to see transformation application details:

# Framework logs will show:
# "Global Dataflow Spec Mapping Version: [version]"
# "Retrieved Dataflow Spec Specific Mapping Version: [version]. Dataflow Spec ID: [id]"
# "Mapping applied to spec: [spec_path]"
# "New spec: [spec_data]"