SQL Source

Applies To:	Pipeline Bundle
Configuration Scope:	Data Flow Spec
Databricks Docs:	NA

Overview

You can specify a SQL query as a source type in your Data Flow Specs. These allow for flexibility and more complex transformations to be supported, as needed, without overly complicating the Framework.

Sample Bundle

A sample is available in the gold_sample bundle in the src/dataflows/stream_static_samples folder and can be seen in the dim_customer_sql_main.json file.

Configuration

SQL Query Definition

To define a SQL query, you need to create a dml folder under the base folder for your given Data Flow Spec. You can then create a .sql file for your query under this folder.

For example:

my_pipeline_bundle/
├── src/
│   ├── dataflows
│   │   ├── use_case_1
│   │   │   ├── my_data_flow_spec_main.json
│   │   │   ├── dml
│   │   │   │   └── my_query.sql
│   │   │   ├── expectations
│   │   │   ├── python_functions
│   │   │   └── schemas

Your file can contain any SQL supported by Databricks but must ultimately return a dataset as a Single query. You can use CTEs, subqueries, joins, etc.

Substitution Variables

You can use substitution variables in your SQL query by using the {var} syntax. These will be substituted per the :doc:`feature_substitutions documentation.

For example:

SELECT * FROM {bronze_schema}.my_table

Referencing the Python Source in a Data Flow Spec

To reference the Python source in a Data Flow Spec, you need to specify a Python source type in your Data Flow Spec. Refer to the Data Flow Spec - Source Details section of the Data Flow Spec Reference documentation for more information.