Python Dependency Management
Applies To: Framework, Pipeline Bundle
Configuration Scope: Pipeline
Overview
The Lakeflow Framework provides flexible Python dependency management at two levels:
Framework Level: Global dependencies required by the framework or custom extensions (requirements.txt)
Pipeline Bundle Level: Bundle-specific dependencies configured via Databricks Asset Bundles
This separation allows the framework to maintain its core dependencies independently while enabling pipeline developers to add custom packages for their specific use cases.
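For illustration, the two levels might sit side by side in a project like this (the bundle layout is hypothetical):

dlt_framework/
├── requirements.txt        # Framework-level dependencies
└── src/
    └── ...

my_pipeline_bundle/
├── databricks.yml          # Asset Bundle definition
├── requirements.txt        # Bundle-level dependencies
└── resources/
    └── my_pipeline.yml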
Important
Databricks recommends using the pipeline environment settings to manage Python dependencies.
Framework Dependencies
The framework includes a requirements.txt file at the root of the repository that defines global dependencies required for the framework to function.
Location
dlt_framework/
├── requirements.txt # Framework dependencies
├── requirements-dev.txt # Development dependencies (testing, docs, etc.)
└── src/
└── ...
Framework requirements.txt
# requirements.txt: dependencies for runtime
# Core dependencies
jsonschema
# Add any additional dependencies needed for custom functionality below
Note
The framework’s core dependencies are intentionally minimal. Add any additional dependencies needed for custom functionality below the core dependencies; do not modify the core dependencies themselves.
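For example, a framework requirements.txt extended with custom dependencies might look like this (the added packages are hypothetical):

# Core dependencies
jsonschema

# Added for custom functionality
requests>=2.28.0        # HTTP client used by a custom extension
python-dateutil>=2.8.0  # Date parsing in custom transforms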
Pipeline Bundle Dependencies
For pipeline-specific Python dependencies, Databricks recommends using the pipeline environment configuration in your Databricks Asset Bundle. For detailed information, see the official Databricks documentation on pipeline dependencies.
Configuring Pipeline Environment
Add the environment section to your pipeline resource definition in your Databricks Asset Bundle:
resources:
  pipelines:
    my_pipeline:
      name: My Pipeline (${var.logical_env})
      channel: CURRENT
      serverless: true
      catalog: ${var.catalog}
      schema: ${var.schema}
      environment:
        dependencies:
          - -r ${workspace.file_path}/requirements.txt
      libraries:
        - notebook:
            path: ${var.framework_source_path}/dlt_pipeline
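After updating the pipeline resource, a typical validate-and-deploy sequence with the Databricks CLI looks like this (the target name dev is illustrative):

# Check the bundle configuration for errors
databricks bundle validate

# Deploy; dependencies are installed on the next pipeline update
databricks bundle deploy -t dev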
Using a Requirements File
The recommended approach is to reference a requirements.txt file in your pipeline bundle:
Step 1: Create a requirements.txt in your pipeline bundle
For example, my_pipeline_bundle/requirements.txt:

requests>=2.28.0
openpyxl
Step 2: Reference it in your pipeline environment
environment:
  dependencies:
    - -r ${workspace.file_path}/requirements.txt
Important
The -r flag tells pip to read requirements from a file. The path ${workspace.file_path} is substituted with the deployed bundle location in the Databricks workspace.
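For a bundle deployed to a user's workspace folder, the resolved entry might look roughly like this (the exact path depends on your bundle name, target, and workspace setup):

- -r /Workspace/Users/someone@example.com/.bundle/my_pipeline_bundle/dev/files/requirements.txt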
Inline Dependencies
For simple cases with few dependencies, you can specify packages inline:
environment:
  dependencies:
    - requests>=2.28.0
    - pandas>=2.0.0
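Inline entries accept standard pip requirement specifiers, so you can express ranges and exact pins; for example:

environment:
  dependencies:
    - requests>=2.28.0,<3.0  # Any 2.x release from 2.28 onward
    - pandas==2.0.3          # Exact pin
    - openpyxl~=3.1          # Compatible release: >=3.1, <4.0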
Installing from Unity Catalog Volumes
You can also install Python wheel packages stored in Unity Catalog volumes:
environment:
  dependencies:
    - /Volumes/my_catalog/my_schema/my_volume/my_package-1.0-py3-none-any.whl
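As a sketch, assuming you build the wheel locally and upload it with the Databricks CLI (the fs cp form may vary with your CLI version):

# Build the wheel (requires: pip install build)
python -m build

# Copy it into the Unity Catalog volume
databricks fs cp dist/my_package-1.0-py3-none-any.whl \
  dbfs:/Volumes/my_catalog/my_schema/my_volume/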
Best Practices
Version Pinning
Always constrain dependency versions: set a minimum version at least, and pin exact versions when you need strictly reproducible builds:
# Recommended: Pin to minimum version
requests>=2.28.0
# For strict reproducibility
pandas==2.0.3
# Avoid: Unpinned versions
requests # Not recommended
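For strict reproducibility across the full dependency tree, one option is to compile a fully pinned file with pip-tools (an assumption; any lock tool works):

pip install pip-tools

# requirements.in holds loose constraints; the output pins every
# package, including transitive dependencies
pip-compile requirements.in -o requirements.txt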
Documentation
Add comments to explain why each dependency is needed:
# HTTP client for external API integrations
requests>=2.28.0
# JSON schema validation for custom specs
jsonschema>=4.0.0
# Date parsing utilities for transform functions
python-dateutil>=2.8.0
Testing Dependencies Locally
Before deploying, test that dependencies install correctly:
# Create a virtual environment
python -m venv test_env
source test_env/bin/activate
# Install dependencies
pip install -r requirements.txt
# Verify imports work
python -c "import requests; import pandas; print('Success!')"
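To go one step further, a short script can verify installed versions against your minimums. This is a minimal sketch using importlib.metadata and the packaging library (the package list is hypothetical):

# check_deps.py: verify required packages are installed at acceptable versions
from importlib.metadata import version, PackageNotFoundError
from packaging.version import Version

# Distribution name -> minimum version (hypothetical examples)
REQUIRED = {"requests": "2.28.0", "jsonschema": "4.0.0"}

for name, minimum in REQUIRED.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"MISSING: {name}")
        continue
    status = "OK" if Version(installed) >= Version(minimum) else "TOO OLD"
    print(f"{status}: {name} {installed} (need >= {minimum})")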
Limitations
JVM Libraries Not Supported: Lakeflow Declarative Pipelines only support SQL and Python. JVM libraries (Scala/Java) cannot be used and may cause unpredictable behavior.
Startup Time Impact: Each additional dependency increases pipeline startup time. Keep dependencies minimal for faster pipeline starts.
No Hot Reloading: Dependencies are installed at pipeline startup. Adding new dependencies requires a pipeline restart.
Cluster-Wide Scope: Dependencies are installed for the entire pipeline cluster. Be mindful of potential conflicts between packages.
Troubleshooting
Dependencies Not Found
If packages aren’t being installed:
Verify the environment section is correctly indented in your YAML
Check that the path to requirements.txt is correct
Ensure the requirements file is included in your bundle deployment
# Verify correct path substitution
environment:
  dependencies:
    - -r ${workspace.file_path}/requirements.txt  # Points to bundle root
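When in doubt, check what was actually deployed. A sketch using the Databricks CLI (the workspace path is illustrative):

# Validate the bundle configuration locally
databricks bundle validate

# List deployed bundle files to confirm requirements.txt is present
databricks workspace list /Users/someone@example.com/.bundle/my_pipeline_bundle/dev/files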
Version Conflicts
If you encounter version conflicts:
Check for conflicting versions between framework and bundle requirements
Use pip check locally to identify conflicts
Consider pinning specific versions to resolve conflicts
pip install -r requirements.txt
pip check # Shows any dependency conflicts
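If a transitive package is the culprit, pip's constraints mechanism can pin it without declaring it as a direct dependency:

# constraints.txt pins versions without adding new direct dependencies
pip install -r requirements.txt -c constraints.txt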