Validation
Applies To: Pipeline Bundle
Configuration Scope: Pipeline Bundle
Databricks Docs: NA
Overview
The framework uses the Python jsonschema library to define the schema and validation rules for the following:
- Data Flow Specifications
- Expectations
- Secrets Configurations
This provides the following functionality:
- Validation in your CI/CD pipelines
- Validation at Spark Declarative Pipeline initialization time
How Validation Works
The framework uses the jsonschema library to validate the Data Flow Specifications, Expectations, and Secrets Configurations.
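As a rough illustration of this kind of check, the sketch below validates a dictionary against a JSON Schema with jsonschema and collects every violation instead of stopping at the first. The schema and the field names `dataFlowId` and `dataFlowGroup` are hypothetical stand-ins chosen for the example; the framework's real schemas are not reproduced here.

```python
from jsonschema import Draft7Validator

# Hypothetical, heavily simplified schema used only for illustration.
# It is NOT the framework's actual Data Flow Specification schema.
SPEC_SCHEMA = {
    "type": "object",
    "required": ["dataFlowId", "dataFlowGroup"],
    "properties": {
        "dataFlowId": {"type": "string"},
        "dataFlowGroup": {"type": "string"},
    },
}


def validation_errors(spec: dict, schema: dict = SPEC_SCHEMA) -> list[str]:
    """Return one readable message per schema violation (empty if valid)."""
    validator = Draft7Validator(schema)
    # Sort by location so the report order is stable between runs.
    errors = sorted(
        validator.iter_errors(spec),
        key=lambda e: [str(part) for part in e.absolute_path],
    )
    messages = []
    for error in errors:
        location = "/".join(str(part) for part in error.absolute_path) or "<root>"
        messages.append(f"{location}: {error.message}")
    return messages


good = {"dataFlowId": "customers", "dataFlowGroup": "base_samples"}
bad = {"dataFlowId": 42}  # wrong type, and dataFlowGroup is missing

print(validation_errors(good))
for line in validation_errors(bad):
    print(line)
```

Collecting all errors with `iter_errors` rather than raising on the first one tends to suit CI reports, where seeing every problem in a single run is more useful than fixing them one at a time.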
Each time a pipeline executes, the following steps are performed:
| Step | Name | Description |
|---|---|---|
| 1 | Load and Initialize Framework | Load and initialize the Framework |
| 2 | Retrieve Data Flow Specifications | |
| 3 | Generate Pipeline Definition | The Framework will then use the in memory dictionary to initialize the Spark Declarative Pipeline. |
| 4 | Execute Pipeline | The pipeline will then execute the logic defined in the Data Flow Specifications. |
Ignoring Validation Errors
Ignoring validation errors can be useful when iterating in Dev or SIT environments and you want to focus on specific Data Flow Specs (selected by your pipeline filters), without being blocked by validation errors.
You can ignore validation errors by setting the pipeline.ignoreValidationErrors configuration to True.
You can do this in the pipeline resource YAML file or via the Databricks UI in the Spark Declarative Pipeline Settings.
resources:
  pipelines:
    dlt_framework_samples_bronze_base_pipeline:
      name: Lakeflow Framework Samples - Bronze - Base Pipeline (${var.logical_env})
      channel: CURRENT
      serverless: true
      catalog: ${var.catalog}
      schema: ${var.schema}
      libraries:
        - notebook:
            path: ${var.framework_source_path}/dlt_pipeline
      configuration:
        bundle.sourcePath: ${workspace.file_path}/src
        bundle.target: ${bundle.target}
        framework.sourcePath: ${var.framework_source_path}
        workspace.host: ${var.workspace_host}
        pipeline.layer: ${var.layer}
        logicalEnv: ${var.logical_env}
        pipeline.dataFlowGroupFilter: base_samples
        pipeline.ignoreValidationErrors: True
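To make the effect of this setting concrete, here is a minimal sketch of how a flag like pipeline.ignoreValidationErrors could gate pipeline start-up. It is not the framework's implementation: the helper names `as_bool` and `handle_validation` are hypothetical, and the sketch assumes that invalid specs are skipped with a warning when the flag is set, which the framework may handle differently.

```python
def as_bool(value: object) -> bool:
    """Interpret a configuration value such as True, "True" or "false"."""
    return str(value).strip().lower() in {"true", "1", "yes"}


def handle_validation(
    errors_by_spec: dict[str, list[str]],
    config: dict[str, object],
) -> list[str]:
    """Return the names of specs that passed validation.

    Raises ValueError if any spec failed, unless the (assumed) setting
    pipeline.ignoreValidationErrors is truthy, in which case failing
    specs are reported and left out of the result.
    """
    ignore = as_bool(config.get("pipeline.ignoreValidationErrors", False))
    failed = {name: errs for name, errs in errors_by_spec.items() if errs}
    if failed and not ignore:
        details = "; ".join(f"{name}: {errs[0]}" for name, errs in failed.items())
        raise ValueError(f"Validation failed for {len(failed)} spec(s): {details}")
    for name, errs in failed.items():
        print(f"WARNING: skipping {name} ({len(errs)} validation error(s))")
    return [name for name, errs in errors_by_spec.items() if not errs]


results = {"a_main.json": [], "b_main.json": ["'dataFlowGroup' is a required property"]}
print(handle_validation(results, {"pipeline.ignoreValidationErrors": "True"}))
```

Parsing the value through `as_bool` reflects the fact that configuration values often arrive as strings, so both `True` and `"True"` are treated the same way in this sketch.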
Validation via CI/CD
You can validate Data Flow Specification *_main.json files in CI without running a Spark Declarative Pipeline. The framework repository includes a scripts/validate_dataflows.py script. Run it from a checkout of the Lakeflow Framework repo so that it can load the JSON Schemas (under src/schemas/) and the optional dataflow spec version mappings, then point it at a directory or a single *_main.json file in your pipeline bundle or monorepo.
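The general shape of such a CI check can be sketched as follows: locate the *_main.json files under a directory (or accept a single file), parse each one, and return a non-zero exit code if anything fails. This is an illustration only and not the contents of scripts/validate_dataflows.py; the function names and the `validate` callback are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


def find_main_specs(target: Path) -> list[Path]:
    """Accept a single *_main.json file or a directory to search recursively."""
    if target.is_file():
        return [target] if target.name.endswith("_main.json") else []
    return sorted(target.rglob("*_main.json"))


def check_specs(target: Path, validate: Callable[[dict], list[str]]) -> int:
    """Validate every spec under target; return an exit code (0 means clean)."""
    failures = 0
    for path in find_main_specs(target):
        try:
            spec = json.loads(path.read_text(encoding="utf-8"))
            errors = validate(spec)
        except json.JSONDecodeError as exc:
            # A file that is not valid JSON counts as a validation failure.
            errors = [f"invalid JSON: {exc}"]
        for message in errors:
            print(f"{path}: {message}")
        failures += bool(errors)
    return 1 if failures else 0
```

A CI step could then call `sys.exit(check_specs(Path("src"), validate))`, where `validate` is any function that returns a list of error messages for one parsed spec.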
Validating in GitHub Actions
If you use GitHub Actions for your CI/CD pipelines, you can use the composite action in this repository at .github/actions/validate-dataflows/action.yaml. It installs Python and jsonschema and invokes scripts/validate_dataflows.py for you, so you do not need to check out the Lakeflow Framework repo explicitly.
Requirements
- Check out your repository (the pipeline bundle or project that contains the dataflow specs) before running the action, so the path input resolves under github.workspace.
- The action installs Python and jsonschema, then runs the validator with your chosen options.
Inputs
| Input | Description |
|---|---|
| path | Directory or single |
| no-mapping | If |
| verbose | If |
Example workflow
jobs:
  validate-dataflows:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks-solutions/lakeflow_framework/.github/actions/validate-dataflows@main
        with:
          path: src
          # no-mapping: 'false'
          # verbose: 'false'
The uses line can reference either the upstream databricks-solutions/lakeflow_framework repository or your organisation’s fork or clone of it. The action path .github/actions/validate-dataflows is the same in both cases; set the owner and pin @<ref> to a tag or commit that exists in whichever repository you rely on.
Pin to a tag or a commit instead of a moving branch for reproducible CI (e.g., replace @main with @v0.11.0).
Note
If the framework repository is private, ensure the calling workflow has permission to read it (for example contents: read and access via the default GITHUB_TOKEN or a PAT as required by your org).