Python Dependency Management

Applies To: Framework Pipeline Bundle

Configuration Scope: Pipeline

Overview

The Lakeflow Framework provides flexible Python dependency management at two levels:

  1. Framework Level: Global dependencies required by the framework or custom extensions (requirements.txt)

  2. Pipeline Bundle Level: Bundle-specific dependencies configured via Databricks Asset Bundles

This separation allows the framework to maintain its core dependencies independently while enabling pipeline developers to add custom packages for their specific use cases.

Important

Databricks recommends using the pipeline environment settings to manage Python dependencies.

Framework Dependencies

The framework includes a requirements.txt file at the root of the repository that defines global dependencies required for the framework to function.

Location

dlt_framework/
├── requirements.txt          # Framework dependencies
├── requirements-dev.txt      # Development dependencies (testing, docs, etc.)
└── src/
    └── ...

Framework requirements.txt

requirements.txt
 ## requirements.txt: dependencies for runtime.
 ## Core dependencies
 jsonschema

 ## Add any additional dependencies needed for custom functionality below here

Note

The framework’s core dependencies are intentionally minimal. Add any dependencies needed for custom functionality below the core dependencies; do not modify the core dependencies themselves.
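
For example, a framework checkout that adds one package for a custom extension might look like this (the openpyxl entry below is purely illustrative):

requirements.txt
 ## requirements.txt: dependencies for runtime.
 ## Core dependencies
 jsonschema

 ## Add any additional dependencies needed for custom functionality below here
 ## Excel parsing for a custom ingestion extension (illustrative)
 openpyxl>=3.1.0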

Pipeline Bundle Dependencies

For pipeline-specific Python dependencies, Databricks recommends using the pipeline environment configuration in your Databricks Asset Bundle. For detailed information, see the official Databricks documentation on pipeline environments and dependencies.

Configuring Pipeline Environment

Add the environment section to your pipeline resource definition in your Databricks Asset Bundle:

resources/pipeline.yml
 resources:
   pipelines:
     my_pipeline:
       name: My Pipeline (${var.logical_env})
       channel: CURRENT
       serverless: true
       catalog: ${var.catalog}
       schema: ${var.schema}

       environment:
         dependencies:
           - -r ${workspace.file_path}/requirements.txt

       libraries:
         - notebook:
             path: ${var.framework_source_path}/dlt_pipeline
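
After adding the environment section, redeploy the bundle so the new dependencies are installed on the next pipeline update. A typical workflow looks like this (dev is an example target name from your databricks.yml):

# Check the bundle configuration, then deploy it to a target
databricks bundle validate
databricks bundle deploy -t dev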

Using a Requirements File

The recommended approach is to reference a requirements.txt file in your pipeline bundle:

Step 1: Create a requirements.txt in your pipeline bundle

For example:

my_pipeline_bundle/requirements.txt
 requests>=2.28.0
 openpyxl

Step 2: Reference it in your pipeline environment

environment:
  dependencies:
    - -r ${workspace.file_path}/requirements.txt

Important

The -r flag tells pip to read requirements from a file. The path ${workspace.file_path} is substituted with the deployed bundle location in the Databricks workspace.

Inline Dependencies

For simple cases with few dependencies, you can specify packages inline:

environment:
  dependencies:
    - requests>=2.28.0
    - pandas>=2.0.0
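
Each entry in the dependencies list uses pip requirement syntax, so inline packages and a requirements file reference can be combined in a single list when that is more convenient (a sketch, assuming requirements.txt sits at the bundle root):

environment:
  dependencies:
    - requests>=2.28.0
    - pandas>=2.0.0
    - -r ${workspace.file_path}/requirements.txt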

Installing from Unity Catalog Volumes

You can also install Python wheel packages stored in Unity Catalog volumes:

environment:
  dependencies:
    - /Volumes/my_catalog/my_schema/my_volume/my_package-1.0-py3-none-any.whl
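
The wheel must already be present in the volume before the pipeline starts. One way to upload it, assuming the Databricks CLI is configured and the volume from the example above exists:

# Copy a locally built wheel into a Unity Catalog volume
# (catalog, schema, and volume names match the example above)
databricks fs cp dist/my_package-1.0-py3-none-any.whl \
    dbfs:/Volumes/my_catalog/my_schema/my_volume/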

Best Practices

Version Pinning

Pin dependency versions so that deployments do not pick up unexpected upgrades; use exact pins when you need fully reproducible builds:

# Recommended: Pin to minimum version
requests>=2.28.0

# For strict reproducibility
pandas==2.0.3

# Avoid: Unpinned versions
requests  # Not recommended
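
If every transitive package also needs an exact pin, one option is to compile a fully pinned file from loose constraints with pip-tools (a third-party tool, shown only as an example):

# Compile loose constraints (requirements.in) into exact pins (requirements.txt)
pip install pip-tools
pip-compile requirements.in --output-file requirements.txt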

Documentation

Add comments to explain why each dependency is needed:

# HTTP client for external API integrations
requests>=2.28.0

# JSON schema validation for custom specs
jsonschema>=4.0.0

# Date parsing utilities for transform functions
python-dateutil>=2.8.0

Testing Dependencies Locally

Before deploying, test that dependencies install correctly:

# Create a virtual environment
python -m venv test_env
source test_env/bin/activate

# Install dependencies
pip install -r requirements.txt

# Verify imports work
python -c "import requests; import pandas; print('Success!')"

Limitations

  1. JVM Libraries Not Supported: Lakeflow Declarative Pipelines only support SQL and Python. JVM libraries (Scala/Java) cannot be used and may cause unpredictable behavior.

  2. Startup Time Impact: Each additional dependency increases pipeline startup time. Keep dependencies minimal for faster pipeline starts.

  3. No Hot Reloading: Dependencies are installed at pipeline startup. Adding new dependencies requires a pipeline restart.

  4. Cluster-Wide Scope: Dependencies are installed for the entire pipeline cluster. Be mindful of potential conflicts between packages.

Troubleshooting

Dependencies Not Found

If packages aren’t being installed:

  1. Verify the environment section is correctly indented in your YAML

  2. Check that the path to requirements.txt is correct

  3. Ensure the requirements file is included in your bundle deployment

# Verify correct path substitution
environment:
  dependencies:
    - -r ${workspace.file_path}/requirements.txt  # Points to bundle root
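
It can also help to confirm that the file actually reached the workspace after deployment. One way, assuming the default bundle workspace layout (the path below is illustrative):

# List the deployed bundle files; replace the path with your bundle's
# workspace root path (for example, as shown by `databricks bundle summary`)
databricks workspace list /Workspace/Users/someone@example.com/.bundle/my_pipeline_bundle/dev/files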

Version Conflicts

If you encounter version conflicts:

  1. Check for conflicting versions between framework and bundle requirements

  2. Use pip check locally to identify conflicts

  3. Consider pinning specific versions to resolve conflicts

pip install -r requirements.txt
pip check  # Shows any dependency conflicts
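
When pip check reports a conflict, pinning the shared package to a version range that satisfies both the framework and the bundle usually resolves it (the package and range below are illustrative):

# Pin to a range acceptable to both the framework and the bundle
jsonschema>=4.0.0,<5.0.0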