Templates
=========

.. list-table::
   :header-rows: 0

   * - **Applies To:**
     - :bdg-info:`Pipeline Bundle`
   * - **Configuration Scope:**
     - :bdg-info:`Pipeline`

Overview
--------

The Dataflow Spec Templates feature allows data engineers to create reusable templates for
dataflow specifications. This significantly reduces code duplication when multiple dataflows
share similar structures but differ only in specific parameters (e.g., table names, columns, etc.).

.. important::

   Templates provide a powerful mechanism for standardizing dataflow patterns across your
   organization while maintaining flexibility for specific implementations.

This feature allows development teams to:

- **Reduce Code Duplication**: Write once, reuse many times
- **Ensure Consistency**: Similar dataflows follow the same structure
- **Improve Productivity**: Quickly create multiple similar specifications
- **Reduce Errors**: Less copy-paste reduces human error
- **Make Patterns Explicit**: Templates make organizational patterns discoverable

.. note::

   Template processing happens during the initialization phase of pipeline execution as the
   dataflow specs are loaded. Each processed spec is validated using the standard validation
   process.

How It Works
------------

The template system consists of three main components:

1. **Template Definitions**: JSON files containing template definitions with placeholders
2. **Template Dataflow Specifications**: A dataflow specification that references a template and provides parameter sets
3. **Template Processing**: Framework logic that processes the template dataflow specifications and generates one dataflow spec per parameter set

Anatomy of a Template Definition
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A template definition is a JSON file that defines a reusable dataflow pattern. It consists of
three main components:

.. tabs::

   .. tab:: JSON

      .. code-block:: json

         {
           "name": "standard_cdc_template",
           "parameters": {
             "dataFlowId": { "type": "string", "required": true },
             "sourceDatabase": { "type": "string", "required": true },
             "sourceTable": { "type": "string", "required": true },
             "targetTable": { "type": "string", "required": true }
           },
           "template": {
             "dataFlowId": "${param.dataFlowId}",
             "sourceDetails": {
               "database": "${param.sourceDatabase}",
               "table": "${param.sourceTable}"
             },
             "targetDetails": {
               "table": "${param.targetTable}"
             }
           }
         }

   .. tab:: YAML

      .. code-block:: yaml

         name: standard_cdc_template
         parameters:
           dataFlowId:
             type: string
             required: true
           sourceDatabase:
             type: string
             required: true
           sourceTable:
             type: string
             required: true
           targetTable:
             type: string
             required: true
         template:
           dataFlowId: ${param.dataFlowId}
           sourceDetails:
             database: ${param.sourceDatabase}
             table: ${param.sourceTable}
           targetDetails:
             table: ${param.targetTable}

**Key Components:**

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Component
     - Description
   * - **name**
     - The unique name for the template. Make this the same as the filename. This is currently
       a placeholder for future functionality.
   * - **parameters**
     - An object defining all parameters that can be used in the template. Each parameter has a
       ``type`` (string, list, object, integer, boolean) and a ``required`` flag (defaults to
       true). Optional ``default`` values can be specified.
   * - **template**
     - The dataflow specification template containing placeholders in the format
       ``${param.<name>}``, where ``<name>`` is the name of a parameter defined in the
       ``parameters`` object. This can be any valid dataflow specification structure, with
       parameters substituted at processing time.

.. important::

   - Placeholders can be used in keys, as full values, or as part of a string value.
   - In JSON specs, placeholders must always be wrapped in quotes: ``"${param.name}"``
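These substitution rules can be made concrete with a short sketch. The following Python
function is an illustration only, assuming a recursive walk over the template structure and a
regex for ``${param.<name>}``; it is not the framework's actual implementation, and all names
in it are hypothetical:

.. code-block:: python

   import re

   PLACEHOLDER = re.compile(r"\$\{param\.([A-Za-z0-9_]+)\}")

   def substitute(node, params):
       """Recursively resolve ${param.<name>} placeholders in a template node (sketch)."""
       if isinstance(node, dict):
           # Placeholders may appear in keys as well as values.
           return {substitute(k, params): substitute(v, params) for k, v in node.items()}
       if isinstance(node, list):
           return [substitute(item, params) for item in node]
       if isinstance(node, str):
           full = PLACEHOLDER.fullmatch(node)
           if full:
               # A placeholder that is the entire value keeps the parameter's
               # native type (list, object, integer, boolean, ...).
               return params[full.group(1)]
           # A placeholder embedded in a longer string is interpolated as text.
           return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), node)
       return node

   # A full-value placeholder keeps its type; an embedded one becomes part of a string.
   template = {"sourceViewName": "v_${param.sourceTable}", "keys": "${param.keyColumns}"}
   print(substitute(template, {"sourceTable": "customer", "keyColumns": ["ID", "DATE"]}))
   # {'sourceViewName': 'v_customer', 'keys': ['ID', 'DATE']}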
**File Location:**

- Template definitions: ``<pipeline_bundle>/templates/<template_name>.json``

Anatomy of a Template Dataflow Specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A template dataflow specification is a simplified file that references a template and provides
parameter sets for instantiation. Instead of writing full dataflow specs, data engineers create
a template reference:

.. tabs::

   .. tab:: JSON

      .. code-block:: json

         {
           "template": "standard_cdc_template",
           "parameterSets": [
             {
               "dataFlowId": "customer_scd2",
               "sourceDatabase": "{bronze_schema}",
               "sourceTable": "customer_raw",
               "targetTable": "customer_scd2"
             },
             {
               "dataFlowId": "customer_address_scd2",
               "sourceDatabase": "{bronze_schema}",
               "sourceTable": "customer_address_raw",
               "targetTable": "customer_address_scd2"
             }
           ]
         }

   .. tab:: YAML

      .. code-block:: yaml

         template: standard_cdc_template
         parameterSets:
           - dataFlowId: customer_scd2
             sourceDatabase: '{bronze_schema}'
             sourceTable: customer_raw
             targetTable: customer_scd2
           - dataFlowId: customer_address_scd2
             sourceDatabase: '{bronze_schema}'
             sourceTable: customer_address_raw
             targetTable: customer_address_scd2

**Key Components:**

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Component
     - Description
   * - **template**
     - The filename of the template definition to use (without the ``.json`` extension). The
       framework will search for this template in the configured template directories.
   * - **parameterSets**
     - An array of parameter sets. Each object in the array represents one set of parameter
       values that will generate one complete dataflow specification. Each parameter set must
       include all required parameters defined in the template definition.

.. important::

   - Each parameter set must include a unique ``dataFlowId`` value
   - The array must contain at least one parameter set
   - All required parameters from the template definition must be provided in each parameter set

**File Location:**

Template dataflow specifications follow the standard dataflow specification naming convention:
``<pipeline_bundle>/dataflows/<dataflow_group>/dataflowspec/*_main.json``

**Processing Result:**

A template dataflow specification with N parameter sets will generate N complete dataflow
specifications at runtime, each validated independently.

Template Processing
^^^^^^^^^^^^^^^^^^^

During the dataflow spec build process, the template processor:

1. Detects spec files containing a ``template`` key
2. Loads the referenced template file
3. Creates one concrete spec per parameter set in ``parameterSets`` by replacing all
   ``${param.<name>}`` placeholders
4. Validates each expanded spec using the existing schema validators
5. Returns the expanded specs with unique internal identifiers

A simplified sketch of this expansion flow is shown below.
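As a rough illustration of these five steps, the following Python sketch wires the earlier
``substitute`` helper into a build-time expansion loop. Everything here is assumed for
illustration: ``expand_template_spec``, the ``validate`` callback, and the JSON-only file
handling are hypothetical, not the framework's real API.

.. code-block:: python

   import json
   from pathlib import Path
   from typing import Callable

   def expand_template_spec(spec_path: Path,
                            template_dirs: list[Path],
                            validate: Callable[[dict], None]) -> dict:
       """Expand one template dataflow spec into N concrete specs (illustrative only)."""
       spec = json.loads(spec_path.read_text())
       if "template" not in spec:
           # Step 1: not a template spec, pass it through unchanged.
           return {str(spec_path): spec}

       # Step 2: locate and load the template definition from the search directories.
       for directory in template_dirs:
           candidate = directory / f"{spec['template']}.json"
           if candidate.exists():
               template = json.loads(candidate.read_text())
               break
       else:
           raise FileNotFoundError(
               f"Template '{spec['template']}' not found; searched: {template_dirs}")

       expanded = {}
       for i, params in enumerate(spec["parameterSets"]):
           # Step 3: substitute() from the placeholder sketch above.
           concrete = substitute(template["template"], params)
           # Step 4: each expanded spec is validated individually.
           validate(concrete)
           # Step 5: unique internal key in the documented path#template_<i> format.
           expanded[f"{spec_path}#template_{i}"] = concrete
       return expanded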
Example Usage
-------------

Example: Basic File Source Ingestion Template
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This example shows a template for basic file source ingestion from a hypothetical source
system called "erp_system".

**Template Definition** (``src/templates/bronze_erp_system_file_ingestion_template.json|yaml``):

.. tabs::

   .. tab:: JSON

      .. code-block:: json

         {
           "name": "bronze_erp_system_file_ingestion_template",
           "parameters": {
             "dataFlowId": { "type": "string", "required": true },
             "sourceTable": { "type": "string", "required": true },
             "schemaPath": { "type": "string", "required": true },
             "targetTable": { "type": "string", "required": true }
           },
           "template": {
             "dataFlowId": "${param.dataFlowId}",
             "dataFlowGroup": "bronze_erp_system",
             "dataFlowType": "standard",
             "sourceSystem": "erp_system",
             "sourceType": "cloudFiles",
             "sourceViewName": "v_${param.sourceTable}",
             "sourceDetails": {
               "path": "{landing_erp_file_location}/${param.sourceTable}/",
               "readerOptions": {
                 "cloudFiles.format": "csv",
                 "header": "true"
               },
               "schemaPath": "${param.schemaPath}"
             },
             "mode": "stream",
             "targetFormat": "delta",
             "targetDetails": {
               "table": "${param.targetTable}"
             }
           }
         }

   .. tab:: YAML

      .. code-block:: yaml

         name: bronze_erp_system_file_ingestion_template
         parameters:
           dataFlowId:
             type: string
             required: true
           sourceTable:
             type: string
             required: true
           schemaPath:
             type: string
             required: true
           targetTable:
             type: string
             required: true
         template:
           dataFlowId: ${param.dataFlowId}
           dataFlowGroup: bronze_erp_system
           dataFlowType: standard
           sourceSystem: erp_system
           sourceType: cloudFiles
           sourceViewName: v_${param.sourceTable}
           sourceDetails:
             path: '{landing_erp_file_location}/${param.sourceTable}/'
             readerOptions:
               cloudFiles.format: csv
               header: 'true'
             schemaPath: ${param.schemaPath}
           mode: stream
           targetFormat: delta
           targetDetails:
             table: ${param.targetTable}

**Template Dataflow Specification**
(``src/dataflows/bronze_erp_system/dataflowspec/bronze_erp_system_file_ingestion_main.json|yaml``):

.. tabs::

   .. tab:: JSON

      .. code-block:: json

         {
           "template": "bronze_erp_system_file_ingestion_template",
           "parameterSets": [
             {
               "dataFlowId": "customer_file_source",
               "sourceTable": "customer",
               "schemaPath": "customer_schema.json",
               "targetTable": "customer"
             },
             {
               "dataFlowId": "customer_address_file_source",
               "sourceTable": "customer_address",
               "schemaPath": "customer_address_schema.json",
               "targetTable": "customer_address"
             },
             {
               "dataFlowId": "supplier_file_source",
               "sourceTable": "supplier",
               "schemaPath": "supplier_schema.json",
               "targetTable": "supplier"
             }
           ]
         }

   .. tab:: YAML

      .. code-block:: yaml

         template: bronze_erp_system_file_ingestion_template
         parameterSets:
           - dataFlowId: customer_file_source
             sourceTable: customer
             schemaPath: customer_schema.json
             targetTable: customer
           - dataFlowId: customer_address_file_source
             sourceTable: customer_address
             schemaPath: customer_address_schema.json
             targetTable: customer_address
           - dataFlowId: supplier_file_source
             sourceTable: supplier
             schemaPath: supplier_schema.json
             targetTable: supplier

**Result**: This template dataflow specification generates **3 concrete dataflow specs**, one
for each parameter set in the ``parameterSets`` array.
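To make the result tangible, substituting the first parameter set into the template yields a
concrete spec equivalent to the following (written as a Python literal for brevity; this is
the author's illustration of the substitution rules, not captured framework output):

.. code-block:: python

   # First expanded spec: every ${param.<name>} placeholder resolved with the first
   # parameter set. '{landing_erp_file_location}' is left untouched because it is
   # not a ${param.*} placeholder.
   expected_first_spec = {
       "dataFlowId": "customer_file_source",
       "dataFlowGroup": "bronze_erp_system",
       "dataFlowType": "standard",
       "sourceSystem": "erp_system",
       "sourceType": "cloudFiles",
       "sourceViewName": "v_customer",
       "sourceDetails": {
           "path": "{landing_erp_file_location}/customer/",
           "readerOptions": {"cloudFiles.format": "csv", "header": "true"},
           "schemaPath": "customer_schema.json",
       },
       "mode": "stream",
       "targetFormat": "delta",
       "targetDetails": {"table": "customer"},
   }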
Parameter Types
---------------

Parameters support multiple data types and structures:

.. list-table::
   :header-rows: 1
   :widths: 20 30 50

   * - Type
     - Template Usage
     - Example
   * - **Strings**
     - ``"${param.tableName}"``
     - ``"tableName": "customer"``
   * - **Numbers**
     - ``${param.batchSize}``
     - ``"batchSize": 1000``
   * - **Booleans**
     - ``${param.enabled}``
     - ``"enabled": true``
   * - **Arrays**
     - ``${param.keyColumns}``
     - ``"keyColumns": ["ID", "DATE"]``
   * - **Objects**
     - ``${param.config}``
     - ``"config": {"key": "value"}``

Key Features
------------

Python Function Path Search Priority
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The framework uses an enhanced fallback chain when resolving Python function path values. It
searches for these values in the following order:

1. In the pipeline bundle base path of the dataflow spec file
2. Under the templates directory of the pipeline bundle
3. Under the extensions directory of the pipeline bundle
4. Under the framework extensions directory

Error Handling
^^^^^^^^^^^^^^

The framework provides clear error messages for common issues:

- **Missing template file**: Lists all searched locations
- **Missing parameters**: Warns about unreplaced placeholders
- **Invalid JSON**: Shows parsing errors with context
- **Validation errors**: Each expanded spec is validated individually

Validation
^^^^^^^^^^

Each expanded spec is validated using the existing schema validators to ensure correctness.
Template dataflow specifications themselves are validated against the schema at
``src/schemas/spec_template.json``:

- ``template``: Required string (the template name without the ``.json`` extension)
- ``parameterSets``: Required array with at least one parameter set
- Each parameter set must be an object with at least one key-value pair

Unique Identifiers
^^^^^^^^^^^^^^^^^^

Generated specs receive unique internal keys in the format ``path#template_0``,
``path#template_1``, etc., to ensure proper tracking and debugging.

Best Practices
--------------

Naming Conventions
^^^^^^^^^^^^^^^^^^

1. **Template Files**: Use descriptive names ending with ``_template`` (e.g., ``standard_cdc_template.json``)
2. **Parameter Names**: Use clear, descriptive names (e.g., ``sourceTable`` instead of ``st``)
3. **Consistency**: Maintain consistent naming patterns across related templates

Development and Testing
^^^^^^^^^^^^^^^^^^^^^^^

1. **Concrete First**: Develop a concrete dataflow spec first, get it working, and then turn it into a template definition
2. **Validation**: Always test processed specs by running the pipeline with a small subset of data
3. **Version Control**: Track templates in version control to maintain a history of changes
4. **Iterative Development**: Start with a simple template and enhance it as patterns emerge

Maintainability
^^^^^^^^^^^^^^^

1. **Template Updates**: When updating a template, test all usages to ensure compatibility
2. **Parameter Validation**: Document required parameters for each template
3. **Backwards Compatibility**: Consider versioning templates if making breaking changes

Limitations
-----------

The current template implementation has the following limitations, which may be addressed in
future versions:

1. **No Template Sub-Components (Blocks)**: Templates cannot reference other templates or smaller template blocks
2. **No Conditional Logic**: Complex conditional logic is not supported (consider using multiple templates)

.. note::

   For complex conditional logic requirements, create multiple templates that represent
   different scenarios rather than trying to implement logic within a single template.