Creating a Materialized View Data Flow Spec Reference
A Materialized View Data Flow Spec is designed for creating and maintaining materialized views that aggregate or transform data from source tables.
Schema
The following schema details the configuration for a Materialized View Data Flow Spec:
{
"dataFlowId": "feature_materialized_views",
"dataFlowGroup": "feature_samples",
"dataFlowType": "materialized_view",
"materializedViews": {
"mv_name": {
"sourceView": {
"sourceViewName": "",
"sourceType": "[delta|python|sql]",
"sourceDetails": {}
},
"sqlPath": "",
"sqlStatement": "",
"tableDetails": {
"database": "",
"schemaPath": "",
"tableProperties": {},
"path": "",
"partitionColumns": [],
"clusterByColumns": []
},
"dataQualityExpectationsEnabled": false,
"dataQualityExpectationsPath": "",
"quarantineMode": "off",
"quarantineTargetDetails": {}
}
}
}
dataFlowId: feature_materialized_views
dataFlowGroup: feature_samples
dataFlowType: materialized_view
materializedViews:
mv_name:
sourceView:
sourceViewName: ''
sourceType: '[delta|python|sql]'
sourceDetails: {}
sqlPath: ''
sqlStatement: ''
tableDetails:
database: ''
schemaPath: ''
tableProperties: {}
path: ''
partitionColumns: []
clusterByColumns: []
dataQualityExpectationsEnabled: false
dataQualityExpectationsPath: ''
quarantineMode: 'off'
quarantineTargetDetails: {}
Example:
The below demonstrates a Materialized View Data Flow Spec:
{
"dataFlowId": "feature_materialized_views",
"dataFlowGroup": "feature_samples",
"dataFlowType": "materialized_view",
"materializedViews": {
"mv_from_source_view": {
"sourceView": {
"sourceViewName": "v_mv_source_view",
"sourceType": "delta",
"sourceDetails": {
"database": "{staging_schema}",
"table": "customer",
"cdfEnabled": true
}
},
"tableDetails": {
"database": "{gold_schema}",
"tableProperties": {
"delta.autoOptimize.optimizeWrite": "true",
"delta.autoOptimize.autoCompact": "true"
},
"clusterByColumns": ["year", "month"],
"schemaPath": "schemas/customer_metrics_mv.json"
},
},
"mv_from_sql_path": {
"sqlPath": "./mv_from_sql_path.sql"
},
"mv_from_sql_statement": {
"sqlStatement": "SELECT * FROM {staging_schema}.customer"
},
"mv_with_quarantine": {
"sqlStatement": "SELECT * FROM {staging_schema}.customer_address",
"dataQualityExpectationsEnabled": true,
"dataQualityExpectationsPath": "./customer_address_dqe.json",
"quarantineMode": "table",
"quarantineTargetDetails": {
"targetFormat": "delta"
}
}
}
}
dataFlowId: feature_materialized_views
dataFlowGroup: feature_samples
dataFlowType: materialized_view
materializedViews:
mv_from_source_view:
sourceView:
sourceViewName: v_mv_source_view
sourceType: delta
sourceDetails:
database: '{staging_schema}'
table: customer
cdfEnabled: true
tableDetails:
database: '{gold_schema}'
tableProperties:
delta.autoOptimize.optimizeWrite: 'true'
delta.autoOptimize.autoCompact: 'true'
clusterByColumns:
- year
- month
schemaPath: schemas/customer_metrics_mv.json
mv_from_sql_path:
sqlPath: ./mv_from_sql_path.sql
mv_from_sql_statement:
sqlStatement: SELECT * FROM {staging_schema}.customer
mv_with_quarantine:
sqlStatement: SELECT * FROM {staging_schema}.customer_address
dataQualityExpectationsEnabled: true
dataQualityExpectationsPath: ./customer_address_dqe.json
quarantineMode: table
quarantineTargetDetails:
targetFormat: delta
The above dataflow spec sample contains the following core components:
Dataflow metadata configuration
Source configuration
Table configuration
Data quality and quarantine settings
The following sections detail each of the above components.
Dataflow Metadata Configuration
These properties define the basic identity and type of the dataflow:
Field |
Type |
Description |
|---|---|---|
dataFlowId |
|
A unique identifier for the data flow. |
dataFlowGroup |
|
The group to which the data flow belongs, can be the same as dataFlowId if there is no group. |
dataFlowType |
|
The type of data flow. Must be materialized_view for materialized view dataflows. |
Source Configuration
These properties define the source of the data:
Field |
Type |
Description |
|---|---|---|
sourceSystem (optional) |
|
The source system name. Value is not used to determine or change any behaviour. |
sourceType |
|
The type of source.
Supported: |
sourceViewName |
|
The name to assign the source view. String Pattern: v_([A-Za-z0-9_]+) |
sourceDetails |
|
See Data Flow Spec - Source Details for more information. |
Table Configuration
These properties define the materialized view specific configuration:
Field |
Type |
Description |
|---|---|---|
tableDetails |
|
Configuration specific to materialized views. |
database |
|
The schema to write the materialized view to. |
schemaPath |
|
The path to the schema file for the materialized view. |
tableProperties |
|
The table properties to set on the materialized view. |
path |
|
A storage location for table data. If not set, use the managed storage location for the schema containing the table. |
partitionColumns |
|
The columns to partition the materialized view by. |
clusterByColumns |
|
The suggested columns to cluster the materialized view by. |
comment |
|
A description for the materialized view. |
spark_conf (optional) |
|
A list of Spark configurations for the execution of this query. |
private (optional) |
|
Create a table, but do not publish the table to the metastore. |
Data Quality and Quarantine Configuration
These properties control how data quality issues are handled:
Field |
Type |
Description |
|---|---|---|
dataQualityExpectationsEnabled (optional) |
|
A flag indicating whether data quality expectations are enabled (see Data Quality - Quarantine). |
dataQualityExpectationsPath (optional) |
|
Either a relative path or filename for the expectations file. Note that the framework automatically calculates all relative paths from the appropriate expectations sub-folder, in the Pipeline Bundle. Examples:
|
quarantineMode (optional) |
|
The mode for handling quarantined data. It can be off, flag, or table. Supported: [“off”, “flag”, “table”] |
quarantineTargetDetails (optional) |
|
Details about the quarantine target, only required if |
quarantineTargetDetails
The quarantineTargetDetails object contains the following properties:
Parameter |
Type |
Description |
|---|---|---|
targetFormat |
|
The format of the quarantine target. Currently, only Supported:
["delta"]Default:
"delta" |
table |
|
(conditional) The table name, required if |
tableProperties |
|
(conditional) Additional properties for the table, required if |
path |
|
(conditional) The path to the table, required if |