Python Dependency Management

Applies To: Framework Pipeline Bundle

Configuration Scope: Pipeline

Overview

The Lakeflow Framework provides flexible Python dependency management at two levels:

  1. Framework Level: Global dependencies required by the framework or custom extensions (requirements.txt)

  2. Pipeline Bundle Level: Bundle-specific dependencies configured via Databricks Asset Bundles

This separation allows the framework to maintain its core dependencies independently while enabling pipeline developers to add custom packages for their specific use cases.

Important

Databricks recommends using the pipeline environment settings to manage Python dependencies.

Framework Dependencies

The framework includes a requirements.txt file at the root of the repository that defines global dependencies required for the framework to function.

Location

dlt_framework/
├── requirements.txt          # Framework dependencies
├── requirements-dev.txt      # Development dependencies (testing, docs, etc.)
└── src/
    └── ...

Framework requirements.txt

requirements.txt
 ## requirements.txt: dependencies for runtime.
 ## Core dependencies
 jsonschema

 ## Add any additional dependencies needed for custom functionality below here

Note

The framework’s core dependencies are intentionally minimal. Add any dependencies needed for custom functionality below the core dependencies; do not modify the core dependencies themselves.
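
For example, a framework checkout that adds one package for a custom extension might look like this (the openpyxl entry below is purely illustrative):

requirements.txt
 ## requirements.txt: dependencies for runtime.
 ## Core dependencies
 jsonschema

 ## Add any additional dependencies needed for custom functionality below here
 ## Excel parsing for a custom ingestion extension (illustrative)
 openpyxl>=3.1.0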

Pipeline Bundle Dependencies

For pipeline-specific Python dependencies, Databricks recommends using the pipeline environment configuration in your Databricks Asset Bundle. For detailed information, see the official Databricks documentation on pipeline environments and dependencies.

Configuring Pipeline Environment

Add the environment section to your pipeline resource definition in your Databricks Asset Bundle:

resources/pipeline.yml
 resources:
   pipelines:
     my_pipeline:
       name: My Pipeline (${var.logical_env})
       channel: CURRENT
       serverless: true
       catalog: ${var.catalog}
       schema: ${var.schema}

       environment:
         dependencies:
           - -r ${workspace.file_path}/requirements.txt

       libraries:
         - notebook:
             path: ${var.framework_source_path}/dlt_pipeline
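
After adding the environment section, redeploy the bundle so the new dependencies are installed on the next pipeline update. A typical workflow looks like this (dev is an example target name from your databricks.yml):

# Check the bundle configuration, then deploy it to a target
databricks bundle validate
databricks bundle deploy -t dev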

Using a Requirements File

The recommended approach is to reference a requirements.txt file in your pipeline bundle:

Step 1: Create a requirements.txt in your pipeline bundle

For example:

my_pipeline_bundle/requirements.txt
 requests>=2.28.0
 openpyxl

Step 2: Reference it in your pipeline environment

environment:
  dependencies:
    - -r ${workspace.file_path}/requirements.txt

Important

The -r flag tells pip to read requirements from a file. The path ${workspace.file_path} is substituted with the deployed bundle location in the Databricks workspace.

Inline Dependencies

For simple cases with few dependencies, you can specify packages inline:

environment:
  dependencies:
    - requests>=2.28.0
    - pandas>=2.0.0
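
Each entry in the dependencies list uses pip requirement syntax, so inline packages and a requirements file reference can be combined in a single list when that is more convenient (a sketch, assuming requirements.txt sits at the bundle root):

environment:
  dependencies:
    - requests>=2.28.0
    - pandas>=2.0.0
    - -r ${workspace.file_path}/requirements.txt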

Installing from Unity Catalog Volumes

You can also install Python wheel packages stored in Unity Catalog volumes:

environment:
  dependencies:
    - /Volumes/my_catalog/my_schema/my_volume/my_package-1.0-py3-none-any.whl
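
The wheel must already be present in the volume before the pipeline starts. One way to upload it, assuming the Databricks CLI is configured and the volume from the example above exists:

# Copy a locally built wheel into a Unity Catalog volume
# (catalog, schema, and volume names match the example above)
databricks fs cp dist/my_package-1.0-py3-none-any.whl \
    dbfs:/Volumes/my_catalog/my_schema/my_volume/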

Best Practices

Version Pinning

Pin dependency versions so that deployments do not pick up unexpected upgrades; use exact pins when you need fully reproducible builds:

# Recommended: Pin to minimum version
requests>=2.28.0

# For strict reproducibility
pandas==2.0.3

# Avoid: Unpinned versions
requests  # Not recommended
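
If every transitive package also needs an exact pin, one option is to compile a fully pinned file from loose constraints with pip-tools (a third-party tool, shown only as an example):

# Compile loose constraints (requirements.in) into exact pins (requirements.txt)
pip install pip-tools
pip-compile requirements.in --output-file requirements.txt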

Documentation

Add comments to explain why each dependency is needed:

# HTTP client for external API integrations
requests>=2.28.0

# JSON schema validation for custom specs
jsonschema>=4.0.0

# Date parsing utilities for transform functions
python-dateutil>=2.8.0

Testing Dependencies Locally

Before deploying, test that dependencies install correctly:

# Create a virtual environment
python -m venv test_env
source test_env/bin/activate

# Install dependencies
pip install -r requirements.txt

# Verify imports work
python -c "import requests; import pandas; print('Success!')"

Limitations

  1. JVM Libraries Not Supported: Lakeflow Declarative Pipelines only support SQL and Python. JVM libraries (Scala/Java) cannot be used and may cause unpredictable behavior.

  2. Startup Time Impact: Each additional dependency increases pipeline startup time. Keep dependencies minimal for faster pipeline starts.

  3. No Hot Reloading: Dependencies are installed at pipeline startup. Adding new dependencies requires a pipeline restart.

  4. Cluster-Wide Scope: Dependencies are installed for the entire pipeline cluster. Be mindful of potential conflicts between packages.

Troubleshooting

Dependencies Not Found

If packages aren’t being installed:

  1. Verify the environment section is correctly indented in your YAML

  2. Check that the path to requirements.txt is correct

  3. Ensure the requirements file is included in your bundle deployment

# Verify correct path substitution
environment:
  dependencies:
    - -r ${workspace.file_path}/requirements.txt  # Points to bundle root
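
It can also help to confirm that the file actually reached the workspace after deployment. One way, assuming the default bundle workspace layout (the path below is illustrative):

# List the deployed bundle files; replace the path with your bundle's
# workspace root path (for example, as shown by `databricks bundle summary`)
databricks workspace list /Workspace/Users/someone@example.com/.bundle/my_pipeline_bundle/dev/files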

Version Conflicts

If you encounter version conflicts:

  1. Check for conflicting versions between framework and bundle requirements

  2. Use pip check locally to identify conflicts

  3. Consider pinning specific versions to resolve conflicts

pip install -r requirements.txt
pip check  # Shows any dependency conflicts
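
When pip check reports a conflict, pinning the shared package to a version range that satisfies both the framework and the bundle usually resolves it (the package and range below are illustrative):

# Pin to a range acceptable to both the framework and the bundle
jsonschema>=4.0.0,<5.0.0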