Bundle Scope and Structure

When creating a Pipeline Bundle, it is important to decide on the scope and structure of the bundle.

This will be informed by the following factors:

Your organizational structure.
Your operational standards, practices and your CI/CD processes.
The size and complexity of your data estate.
The Use Case
The Layer of your Lakehouse you are targeting

Ultimately you will need to determine the best way to scope your Pipeline Bundles for your deployment.

Important

Per the concepts section of this documentation:

A data flow, and its Data Flow Spec, defines the source(s) and logic required to generate a single target table.
A Pipeline Bundle can contain multiple Data Flow Specs, and a Pipeline deployed by the bundle may execute the logic for one or more Data Flow Specs.

For the above reasons the smallest possible unit of logic that can be deployed by a Pipeline Bundle is a single Pipeline, executing a single data flow, that populates a single target table.

Bundle Scope

Bundle Scope simply refers to the level of the logical grouping of Data Flows and Pipeline resources within a Pipeline Bundle.

Some of the most common groupings strategies are shown below:

Logical Grouping	Description
Monolithic	A single Pipeline Bundle containing all ata flows and Pipeline definitions. Only suitable for smaller and simpler deployments.
Bronze	Source System - A Pipeline per Source System or application
Silver / Enterprise Models	Subject Area / Sub-Domain - A Pipeline per Subject Area, or Sub-Domain Use Case - A Pipeline per Use Case Target Table - A Pipeline per target table, the most granular level for complex data flows
Gold / Dimensional Models	Data Mart - A Pipeline per Data Mart Common Dimensions - A Pipeline for your Common Dimensions Target Table - A Pipeline for complex Facts or target tables.
Use Case	You may choose to have an end to end pipeline for given Use Cases

Once you have determined the scope of your Pipeline Bundle, you can move on to determining its structure.

Bundle Structure

The high-level structure of a Pipeline Bundle never changes and is as follows:

my_pipeline_bundle/
├── fixtures/
├── resources/
│   └── my_first_pipeline.yml
├── scratch/
├── src/
│   ├── dataflows
│   └── pipeline_configs
├── databricks.yml
└── README.md

Note

Refer to the Framework Concepts section for more details on the different components of a Pipeline Bundle.

It is the structure of the src/dataflows directory that is flexible and can be organised in the way that best suits your standards and ways of working. The Framework will:

Read all the Data Flow Spec files under the src/dataflows directory, regardless of the folder structure. Filtering of the Dataflows is done when defining your Pipeline and is discussed in the Building a Pipeline Bundle section.
Expect that the schemas, transforms and expectations related to a Data Flow Spec are located in their respective schemas, dml and expectations sub-directories within the Data Flow Spec’s home directory.

The most common ways to organize your src/dataflows directory are:

Flat:

my_pipeline_bundle/
├── src/
│   ├── dataflows
│   │   ├── table_1_data_flow_spec_main.json
│   │   ├── table_2_data_flow_spec_main.json
│   │   ├── dml
│   │   │   ├── table_1_tfm.sql
│   │   │   ├── table_2_tfm_1.sql
│   │   │   └── table_2_tfm_2.sql
│   │   ├── expectations
│   │   │   └── table_2_dqe.json
│   │   ├── python_functions
│   │   └── schemas
│   │       ├── table_1.json
│   │       └── table_2.json

By Use Case:

my_pipeline_bundle/
├── src/
│   ├── dataflows
│   │   ├── use_case_1
│   │   │   ├── table_1_data_flow_spec_main.json
│   │   │   ├── table_2_data_flow_spec_main.json
│   │   │   ├── dml
│   │   │   │   ├── table_1_tfm.sql
│   │   │   │   ├── table_2_tfm_1.sql
│   │   │   │   └── table_2_tfm_2.sql
│   │   │   ├── expectations
│   │   │   ├── python_functions
│   │   │   └── schemas
│   │   │       ├── table_1.json
│   │   │       └── table_2.json
│   │   └── use_case_2
│   │       ├── table_1_data_flow_spec_main.json
│   │       ├── table_2_data_flow_spec_main.json
│   │       ├── dml
│   │       ├── expectations
│   │       ├── python_functions
│   │       └── schemas

By Target Table:

my_pipeline_bundle/
├── src/
│   ├── dataflows
│   │       ├── table_1
│   │       │   ├── table_1_data_flow_spec_main.json
│   │       │   ├── dml
│   │       │   │   ├── table_1_tfm.sql
│   │       │   ├── expectations
│   │       │   ├── python_functions
│   │       │   └── schemas
│   │       │       └── table_1.json
│   │       └── table_2
│   │           ├── table_2_data_flow_spec_main.json
│   │           ├── dml
│   │           │   ├── table_2_tfm_1.sql
│   │           │   └── table_2_tfm_2.sql
│   │           ├── expectations
│   │           │   └── table_2_dqe.json
│   │           ├── python_functions
│   │           └── schemas
│   │               └── table_2.json
│   └── pipeline_configs