Schemas

Applies To: Pipeline Bundle

Configuration Scope: Data Flow Spec

Databricks Docs: NA

The Framework supports the definition of schemas in the following ways:

  • Schema on Read:
    • A schema can be specified on most sources using JSON StructType format

  • Schema on Write:
    • A schema can be specified for Staging or Target tables using JSON StructType format or text DDL format.

Schema Types

JSON StructType

Allows you to specify the schema as a StructType in JSON format.

  • Can be used for both Schema on Read and Schema on Write.

  • Can be used to define columns only (not constraints or generated columns).

Text DDL

Allows you to specify the schema in text DDL format.

  • Can be used for Schema on Write only (Staging or Target tables).

  • Can be used to define columns, constraints, and generated columns.

Schema File Location

Schemas must be specified in their own dedicated files, located in a schemas folder. The exact location depends on your chosen bundle structure, as discussed in the Bundle Scope and Structure section.

Data Flow Spec Configuration

Schema files are then referenced in the Data Flow Spec configuration for the source, staging table or target table they apply to. Refer to the data_flow_spec section for more information.
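
For illustration only, a reference from a Data Flow Spec might look like the snippet below. The key names shown here are hypothetical; the actual configuration keys are documented in the data_flow_spec section.

{
    "target_table": {
        "schema_path": "schemas/customer_schema.json"
    }
}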

StructType JSON Format

Generating the Schema Definition

PySpark:

If you have your data in Databricks, you can read your source into a DataFrame and then use the following code to generate the JSON schema format:

df.schema.jsonValue()
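
For example, the following sketch reads a hypothetical CSV source and prints its schema in StructType JSON format. It assumes a Databricks notebook, where spark is predefined; the format, options, and path are placeholders.

import json

# Read the source into a DataFrame (format, options, and path are placeholders)
df = spark.read.format("csv").option("header", "true").load("/path/to/source")

# Pretty-print the schema in StructType JSON format
print(json.dumps(df.schema.jsonValue(), indent=4))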

LLM:

In Perplexity or ChatGPT, the following prompt can be used to generate the JSON schema format:

Prompt:

Convert the following schema definition into the equivalent Databricks StructType JSON format. The output should be a valid JSON object representing the schema, including all field names, data types, nullability, and nested structures where applicable. Do not include any explanatory text—just the JSON output.

Input schema:
[Insert schema definition here]

For Example:

{
    "type": "struct",
    "fields": [
        {
            "name": "CUSTOMER_ID",
            "type": "integer",
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "FIRST_NAME",
            "type": "string",
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "LAST_NAME",
            "type": "string",
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "EMAIL",
            "type": "string",
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "DELETE_FLAG",
            "type": "boolean",
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "LOAD_TIMESTAMP",
            "type": "timestamp",
            "nullable": true,
            "metadata": {}
        }
    ]
}

Important

The wrapping declaration, i.e. the "type": "struct" object with its "fields" array, is mandatory.
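
A quick way to verify that a schema file is valid is to load it back into a StructType. The following is a minimal sketch assuming a Databricks notebook; the file path is hypothetical:

import json
from pyspark.sql.types import StructType

# Parse the schema file; this raises an error if the JSON is not a valid StructType
with open("schemas/customer_schema.json") as f:
    schema = StructType.fromJson(json.load(f))

print(schema.simpleString())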

Text DDL Format

DDL Format Rules

  • Each column must be defined on a new line

  • Use -- to comment out columns (they will be removed from the schema)

  • Column names and data types must be valid according to Databricks SQL specifications

Examples

Basic Schema Definition:

CUSTOMER_ID integer NOT NULL,
FIRST_NAME string,
LAST_NAME string,
EMAIL string,
DELETE_FLAG boolean,
LOAD_TIMESTAMP timestamp

Schema Definition with Constraint and Generated Column:

CUSTOMER_ID integer NOT NULL,
FIRST_NAME string,
LAST_NAME string,
EMAIL string,
DELETE_FLAG boolean,
LOAD_TIMESTAMP timestamp,
LOAD_YEAR int GENERATED ALWAYS AS (YEAR(LOAD_TIMESTAMP)),
CONSTRAINT pk_customer PRIMARY KEY(CUSTOMER_ID)
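
To sanity-check a DDL schema before using it, you can paste the column list into a CREATE TABLE statement in Databricks SQL; the catalog, schema, and table names below are hypothetical. Note that in Databricks, PRIMARY KEY constraints require Unity Catalog and are informational rather than enforced.

CREATE TABLE main.silver.customer (
    CUSTOMER_ID integer NOT NULL,
    FIRST_NAME string,
    LAST_NAME string,
    EMAIL string,
    DELETE_FLAG boolean,
    LOAD_TIMESTAMP timestamp,
    LOAD_YEAR int GENERATED ALWAYS AS (YEAR(LOAD_TIMESTAMP)),
    CONSTRAINT pk_customer PRIMARY KEY (CUSTOMER_ID)
);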

Schema Definition with Comments:

CUSTOMER_ID integer NOT NULL,
FIRST_NAME string,
LAST_NAME string,
EMAIL string,
DELETE_FLAG boolean,
LOAD_TIMESTAMP timestamp,
-- CONSTRAINT pk_customer PRIMARY KEY(CUSTOMER_ID)