Python Dependency Management
============================

.. list-table::
   :header-rows: 0

   * - **Applies To:**
     - :bdg-info:`Framework` :bdg-info:`Pipeline Bundle`
   * - **Configuration Scope:**
     - :bdg-info:`Pipeline`

Overview
--------

The Lakeflow Framework provides flexible Python dependency management at two levels:

1. **Framework Level**: Global dependencies required by the framework or custom extensions (``requirements.txt``)
2. **Pipeline Bundle Level**: Bundle-specific dependencies configured via Databricks Asset Bundles

This separation allows the framework to maintain its core dependencies independently while enabling pipeline developers to add custom packages for their specific use cases.

.. important::

   Databricks recommends using the **pipeline environment settings** to manage Python dependencies.

Framework Dependencies
----------------------

The framework includes a ``requirements.txt`` file at the root of the repository that defines the global dependencies required for the framework to function.

Location
^^^^^^^^

::

   dlt_framework/
   ├── requirements.txt        # Framework dependencies
   ├── requirements-dev.txt    # Development dependencies (testing, docs, etc.)
   └── src/
       └── ...

Framework requirements.txt
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: text
   :caption: requirements.txt

   ## requirements.txt: dependencies for runtime.

   ## Core dependencies
   jsonschema

   ## Add any additional dependencies needed for custom functionality below here

.. note::

   The framework's core dependencies are intentionally minimal. Add any additional dependencies needed for custom functionality below the core dependencies; do not change the core dependencies themselves.

Pipeline Bundle Dependencies
----------------------------

For pipeline-specific Python dependencies, Databricks recommends using the **pipeline environment** configuration in your Databricks Asset Bundle.

For detailed information, see the official Databricks documentation:

- `Manage Python dependencies for pipelines `_
- `Databricks Asset Bundles - Pipeline Environment `_

Configuring Pipeline Environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Add the ``environment`` section to your pipeline resource definition in your Databricks Asset Bundle:

.. code-block:: yaml
   :caption: resources/pipeline.yml
   :emphasize-lines: 9-11

   resources:
     pipelines:
       my_pipeline:
         name: My Pipeline (${var.logical_env})
         channel: CURRENT
         serverless: true
         catalog: ${var.catalog}
         schema: ${var.schema}
         environment:
           dependencies:
             - -r ${workspace.file_path}/requirements.txt
         libraries:
           - notebook:
               path: ${var.framework_source_path}/dlt_pipeline

Using a Requirements File
^^^^^^^^^^^^^^^^^^^^^^^^^

The recommended approach is to reference a ``requirements.txt`` file in your pipeline bundle.

**Step 1: Create a requirements.txt in your pipeline bundle**

For example:

.. code-block:: text
   :caption: my_pipeline_bundle/requirements.txt

   requests>=2.28.0
   openpyxl

**Step 2: Reference it in your pipeline environment**

.. code-block:: yaml

   environment:
     dependencies:
       - -r ${workspace.file_path}/requirements.txt

.. important::

   The ``-r`` flag tells pip to read requirements from a file. The path ``${workspace.file_path}`` is substituted with the deployed bundle location in the Databricks workspace.
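After the bundle is deployed, you can sanity-check from pipeline code that the packages declared in the requirements file are actually available in the pipeline environment. The snippet below is a minimal sketch that uses only the Python standard library; the package names simply mirror the example ``requirements.txt`` above and should be replaced with your own.

.. code-block:: python

   # Illustrative runtime check: report which of the expected packages are
   # importable in the current environment, and at which version.
   from importlib import metadata

   # Package names mirror the example requirements.txt above (adjust as needed).
   EXPECTED_PACKAGES = ["requests", "openpyxl"]

   for name in EXPECTED_PACKAGES:
       try:
           print(f"{name}=={metadata.version(name)}")
       except metadata.PackageNotFoundError:
           print(f"{name} is NOT installed in this environment")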
Inline Dependencies
^^^^^^^^^^^^^^^^^^^

For simple cases with few dependencies, you can specify packages inline:

.. code-block:: yaml

   environment:
     dependencies:
       - requests>=2.28.0
       - pandas>=2.0.0

Installing from Unity Catalog Volumes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can also install Python wheel packages stored in Unity Catalog volumes:

.. code-block:: yaml

   environment:
     dependencies:
       - /Volumes/my_catalog/my_schema/my_volume/my_package-1.0-py3-none-any.whl

Best Practices
--------------

Version Pinning
^^^^^^^^^^^^^^^

Constrain dependency versions so that builds are predictable: set a lower bound at a minimum, and pin exact versions where strict reproducibility matters:

.. code-block:: text

   # Recommended: pin to a minimum version
   requests>=2.28.0

   # For strict reproducibility
   pandas==2.0.3

   # Avoid: unpinned versions
   requests  # Not recommended

Documentation
^^^^^^^^^^^^^

Add comments to explain why each dependency is needed:

.. code-block:: text

   # HTTP client for external API integrations
   requests>=2.28.0

   # JSON schema validation for custom specs
   jsonschema>=4.0.0

   # Date parsing utilities for transform functions
   python-dateutil>=2.8.0

Testing Dependencies Locally
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before deploying, test that dependencies install correctly:

.. code-block:: bash

   # Create a virtual environment
   python -m venv test_env
   source test_env/bin/activate

   # Install dependencies
   pip install -r requirements.txt

   # Verify imports work
   python -c "import requests; import pandas; print('Success!')"

Limitations
-----------

1. **JVM Libraries Not Supported**: Lakeflow Declarative Pipelines support only SQL and Python. JVM libraries (Scala/Java) cannot be used and may cause unpredictable behavior.
2. **Startup Time Impact**: Each additional dependency increases pipeline startup time. Keep dependencies minimal for faster pipeline starts.
3. **No Hot Reloading**: Dependencies are installed at pipeline startup. Adding new dependencies requires a pipeline restart.
4. **Cluster-Wide Scope**: Dependencies are installed for the entire pipeline cluster. Be mindful of potential conflicts between packages.

Troubleshooting
---------------

Dependencies Not Found
^^^^^^^^^^^^^^^^^^^^^^

If packages aren't being installed:

1. Verify that the ``environment`` section is correctly indented in your YAML
2. Check that the path to ``requirements.txt`` is correct
3. Ensure the requirements file is included in your bundle deployment

.. code-block:: yaml

   # Verify correct path substitution
   environment:
     dependencies:
       - -r ${workspace.file_path}/requirements.txt  # Points to bundle root

Version Conflicts
^^^^^^^^^^^^^^^^^

If you encounter version conflicts:

1. Check for conflicting versions between the framework and bundle requirements
2. Use ``pip check`` locally to identify conflicts
3. Consider pinning specific versions to resolve conflicts

.. code-block:: bash

   pip install -r requirements.txt
   pip check  # Shows any dependency conflicts
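When a conflict involves a package that appears in both the framework ``requirements.txt`` and your bundle's requirements file, comparing the two files directly usually pinpoints it. The sketch below is one way to do that locally; it assumes the third-party ``packaging`` library is installed, and the file paths are examples based on the layouts shown on this page.

.. code-block:: python

   # Illustrative sketch: list packages declared in both the framework and the
   # bundle requirements files, along with their version specifiers.
   # Assumes the third-party "packaging" library is installed; paths are examples.
   from pathlib import Path
   from packaging.requirements import Requirement

   def parse_requirements(path):
       """Return {package name: specifier string} for a pip requirements file."""
       requirements = {}
       for line in Path(path).read_text().splitlines():
           line = line.strip()
           # Skip blank lines, comments, and options such as "-r other.txt"
           if not line or line.startswith(("#", "-")):
               continue
           req = Requirement(line)
           requirements[req.name.lower()] = str(req.specifier)
       return requirements

   framework_reqs = parse_requirements("dlt_framework/requirements.txt")    # example path
   bundle_reqs = parse_requirements("my_pipeline_bundle/requirements.txt")  # example path

   for name in sorted(framework_reqs.keys() & bundle_reqs.keys()):
       print(f"{name}: framework '{framework_reqs[name]}' vs bundle '{bundle_reqs[name]}'")

Any package reported with two different specifiers is a candidate for pinning to a single, compatible version in one of the two files.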