Schemas
Applies To: Pipeline Bundle
Configuration Scope: Data Flow Spec
Databricks Docs: NA
The Framework supports the definition of schemas in the following ways:
- Schema on Read:
A schema can be specified on most sources using the JSON StructType format.
- Schema on Write:
A schema can be specified for staging or target tables using the JSON StructType format or the text DDL format.
Schema Types
- JSON StructType:
Allows you to specify the schema as a StructType in JSON format. Supports both Schema on Read and Schema on Write.
- Text DDL:
Allows you to specify the schema in text DDL format. Supports Schema on Write only (staging or target tables).
Schema File Location
Schemas must be specified in their own dedicated files and are located in a schemas folder; the exact location depends on your chosen bundle structure, as discussed in the Bundle Scope and Structure section.
Data Flow Spec Configuration
Schema files are then referenced in the Data Flow Spec configuration for the source, staging table or target table they apply to. Refer to the data_flow_spec section for more information.
StructType JSON Format
Can be used for both Schema on Read and Schema on Write.
File name:
<name>.json; the file MUST have a .json extension.
Documentation:
PySpark StructType documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.StructType.html
Databricks Data Types documentation: https://docs.databricks.com/en/sql/language-manual/sql-ref-datatypes.html
Generating the Schema Definition
PySpark:
If you have your data in Databricks, you can read your source into a DataFrame and then use the following code to generate the JSON schema format:
df.schema.jsonValue()
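For example, a minimal end-to-end sketch (the source format, path, and output file name below are placeholders for illustration; spark is the ambient SparkSession in a Databricks notebook):

import json

# Read a sample of the source so Spark can infer its schema.
# The format and path are placeholder values.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/Volumes/dev/raw/customers/")
)

# Serialize the inferred schema to StructType JSON and save it as a schema file.
with open("schemas/customers.json", "w") as f:
    json.dump(df.schema.jsonValue(), f, indent=2)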
LLM:
In Perplexity or ChatGPT, the following prompt will generate the JSON schema format:
Prompt:
Convert the following schema definition into the equivalent Databricks StructType JSON format. The output should be a valid JSON object representing the schema, including all field names, data types, nullability, and nested structures where applicable. Do not include any explanatory text—just the JSON output.
Input schema:
[Insert schema definition here]
For example:
{
"type": "struct",
"fields": [
{
"name": "CUSTOMER_ID",
"type": "integer",
"nullable": true,
"metadata": {}
},
{
"name": "FIRST_NAME",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "LAST_NAME",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "EMAIL",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "DELETE_FLAG",
"type": "boolean",
"nullable": true,
"metadata": {}
},
{
"name": "LOAD_TIMESTAMP",
"type": "timestamp",
"nullable": true,
"metadata": {}
}
]
}
Important
The wrapping struct declaration (the outer object with "type": "struct" and its "fields" array) is mandatory.
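You can sanity-check a schema file by round-tripping it through PySpark's StructType.fromJson, which expects exactly this wrapping object (a quick verification sketch; the file name is a placeholder):

import json
from pyspark.sql.types import StructType

# Load the schema file and rebuild the StructType from it.
with open("schemas/customers.json") as f:
    schema = StructType.fromJson(json.load(f))

print(schema.simpleString())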
Text DDL Format
Can only be used to specify the schemas for your staging or target tables.
- Feature support:
Constraints: https://docs.databricks.com/aws/en/tables/constraints
Generated Columns: https://docs.databricks.com/aws/en/delta/generated-columns
Column Masking Functions: https://docs.databricks.com/aws/en/dlt/unity-catalog#row-filters-and-column-masks
File name:
<name>.ddl; the file MUST have a .ddl extension.
Documentation:
CREATE TABLE Documentation: https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using
DDL Format Rules
- Each column must be defined on a new line
- Use -- to comment out columns (they will be removed from the schema)
- Column names and data types must be valid according to Databricks SQL specifications
Examples
Basic Schema Definition:
CUSTOMER_ID integer NOT NULL,
FIRST_NAME string,
LAST_NAME string,
EMAIL string,
DELETE_FLAG boolean,
LOAD_TIMESTAMP timestamp
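Because PySpark's reader accepts DDL-formatted schema strings, you can roughly sanity-check the column definitions in a schema file like this one (a sketch under some assumptions: the file name and source path are placeholders, only plain column name/type pairs are validated, and NOT NULL is assumed to use the uppercase spelling shown above; constraint, generated-column, and commented lines are skipped because those are framework-level features):

# Rough sanity check of the plain column definitions in a DDL schema file.
# Constraint lines, generated-column clauses, comments, and NOT NULL markers
# are stripped, since spark.read.schema() only accepts "name type" pairs.
columns = []
with open("schemas/customers.ddl") as f:
    for line in f:
        line = line.strip().rstrip(",")
        if not line or line.startswith("--") or line.upper().startswith("CONSTRAINT"):
            continue
        line = line.replace(" NOT NULL", "")  # assumes uppercase, as in the examples
        line = line.split(" GENERATED ")[0]   # drop any generated-column clause
        columns.append(line)

ddl = ", ".join(columns)
df = spark.read.schema(ddl).format("csv").load("/Volumes/dev/raw/customers/")
df.printSchema()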
Schema Definition with Constraint and Generated Column:
CUSTOMER_ID integer NOT NULL,
FIRST_NAME string,
LAST_NAME string,
EMAIL string,
DELETE_FLAG boolean,
LOAD_TIMESTAMP timestamp,
LOAD_YEAR int GENERATED ALWAYS AS (YEAR(LOAD_TIMESTAMP)),
CONSTRAINT pk_customer PRIMARY KEY(CUSTOMER_ID)
Schema Definition with Comments:
CUSTOMER_ID integer NOT NULL,
FIRST_NAME string,
LAST_NAME string,
EMAIL string,
DELETE_FLAG boolean,
LOAD_TIMESTAMP timestamp,
-- CONSTRAINT pk_customer PRIMARY KEY(CUSTOMER_ID)