dbx-unifiedchat

ETL Pipeline

⚠️ IMPORTANT: Run the ETL pipeline BEFORE developing or deploying agents. The agent system depends on the data that the ETL pipeline prepares.

What ETL Does

The ETL pipeline prepares the enriched metadata and the vector search index that agents use for semantic query routing:

  1. Export Genie Spaces (01_export_genie_spaces.py)
    • Exports Genie space metadata to a Unity Catalog volume
    • Prerequisite: Genie spaces must exist in your workspace
  2. Enrich Table Metadata (02_enrich_table_metadata.py)
    • Enriches table metadata with samples, statistics, and column details
    • Creates the enriched_genie_docs table with comprehensive metadata
  3. Build Vector Search Index (03_build_vector_search_index.py)
    • Creates a vector search index from the enriched metadata
    • Enables semantic search for agent planning

Three ETL Workflows

Workflow 1: Local ETL Testing 🧪

Test ETL transformations locally with sample data before running on the full dataset.

# Test individual steps
python local_dev_etl.py --step export --sample-size 10
python local_dev_etl.py --step enrich --sample-size 10
python local_dev_etl.py --step vectorize --sample-size 10

# Test complete pipeline
python local_dev_etl.py --all --sample-size 10
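For orientation, the flags used above could be parsed along the following lines. This is only a sketch of one possible command-line layout, not the actual contents of local_dev_etl.py: the function names `build_parser` and `steps_to_run`, the default sample size, and the rule that `--step` and `--all` are mutually exclusive are all assumptions made for illustration.

```python
import argparse
from typing import List, Tuple

# Step names taken from the commands shown above.
STEPS = ["export", "enrich", "vectorize"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for --step / --all / --sample-size (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Local ETL test runner (sketch)")
    # Assumption: exactly one of --step or --all must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--step", choices=STEPS, help="run a single step")
    group.add_argument("--all", action="store_true", help="run every step in order")
    # Assumption: the default of 10 mirrors the examples, not a documented default.
    parser.add_argument("--sample-size", type=int, default=10,
                        help="number of sample records to process")
    return parser


def steps_to_run(argv: List[str]) -> Tuple[List[str], int]:
    """Return the ordered list of steps and the sample size for the given arguments."""
    args = build_parser().parse_args(argv)
    selected = list(STEPS) if args.all else [args.step]
    return selected, args.sample_size
```

Making the two modes mutually exclusive keeps an invocation unambiguous: a run either targets one step or walks the whole pipeline in order.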

When to use:

Time: ~5 minutes


Workflow 2: Databricks Testing (Small Sample) 🔬

Test on real Databricks services with a small dataset to validate the pipeline before a production run.

# In Databricks, run notebooks with test parameters:

# 01_export_genie_spaces.py
dbutils.widgets.text("sample_size", "10")
dbutils.widgets.text("test_mode", "True")

# 02_enrich_table_metadata.py
dbutils.widgets.text("sample_size", "10")
dbutils.widgets.text("test_mode", "True")

# 03_build_vector_search_index.py
dbutils.widgets.text("sample_size", "10")
dbutils.widgets.text("test_mode", "True")
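Widget values always come back as strings, so a notebook has to convert them before use. The helper below is a minimal sketch of that conversion, assuming an empty `sample_size` means "process everything" and that `test_mode` is compared case-insensitively against "true". The name `parse_test_params` is hypothetical and does not come from the notebooks themselves.

```python
from typing import Callable, Optional, Tuple


def parse_test_params(get_widget: Callable[[str], str]) -> Tuple[Optional[int], bool]:
    """Convert the string widget values into typed parameters.

    get_widget is any callable that maps a widget name to its string value;
    inside a Databricks notebook this would be dbutils.widgets.get.
    """
    raw_size = get_widget("sample_size").strip()
    raw_mode = get_widget("test_mode").strip().lower()

    # Assumption: an empty sample_size means no sampling (full run).
    sample_size = int(raw_size) if raw_size else None
    # Assumption: only the literal "true" (any casing) enables test mode.
    test_mode = raw_mode == "true"
    return sample_size, test_mode


# In a notebook (not executed here):
# sample_size, test_mode = parse_test_params(dbutils.widgets.get)
```

Passing the getter in as an argument keeps the helper testable outside Databricks, where `dbutils` is not available.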

When to use:

Time: ~10-15 minutes


Workflow 3: Production ETL 🚀

Run the full ETL pipeline on the complete dataset for production use.

# In Databricks, run notebooks with production parameters:

# 01_export_genie_spaces.py
# (No sample_size - processes all Genie spaces)

# 02_enrich_table_metadata.py
# (No sample_size - enriches all tables)

# 03_build_vector_search_index.py
# (Builds full vector search index)

When to use:

Time: Varies by data size (typically 30-60 minutes)

Execution Order

ETL scripts must be run in order:

1. export_genie_spaces
   ↓
2. enrich_table_metadata (depends on step 1)
   ↓
3. build_vector_search_index (depends on step 2)
   ↓
4. Agent development can begin ✅
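The ordering above can be enforced in code rather than by convention. The sketch below assumes each step is wrapped in a zero-argument Python callable. The `run_in_order` helper and the way steps are registered are illustrative only and are not part of this repository.

```python
from typing import Callable, Dict, List

# Step names taken from the execution order listed above.
ORDER = [
    "export_genie_spaces",
    "enrich_table_metadata",
    "build_vector_search_index",
]


def run_in_order(steps: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every step in the fixed order and return the names that completed.

    An exception raised by a step propagates immediately, so later steps
    never run against missing or partial upstream output.
    """
    missing = [name for name in ORDER if name not in steps]
    if missing:
        raise KeyError(f"missing ETL steps: {missing}")

    completed: List[str] = []
    for name in ORDER:
        steps[name]()  # a failure here stops the pipeline
        completed.append(name)
    return completed
```

Checking for missing steps before running anything avoids a half-finished pipeline caused by a simple registration mistake.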

Prerequisites

Before running ETL:

Expected Outputs

After a successful ETL run, you should have:

Files in This Directory

| File                            | Purpose               | When to Run         |
|---------------------------------|-----------------------|---------------------|
| local_dev_etl.py                | Local ETL testing     | During development  |
| 01_export_genie_spaces.py       | Export Genie metadata | Step 1 (Databricks) |
| 02_enrich_table_metadata.py     | Enrich table metadata | Step 2 (Databricks) |
| 03_build_vector_search_index.py | Build vector index    | Step 3 (Databricks) |
| test_etl.py                     | ETL integration tests | Validation          |

Configuration

ETL uses the same configuration as the agent system:

Key configuration values:

Troubleshooting

Common Issues

Issue: ModuleNotFoundError: No module named 'config'

Issue: PermissionError: Access denied to Unity Catalog

Issue: Vector Search index not syncing

Issue: Genie space not found

Getting Help

For detailed troubleshooting:

Next Steps

After ETL completes successfully:

  1. ✅ Verify that the outputs exist (tables and vector search index)
  2. ✅ Test the vector search index with a sample query
  3. ✅ Proceed to Agent Development:

Remember: ETL is a prerequisite. Agents cannot function without the enriched metadata and the vector search index! 🎯
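The sample-query check from the next steps can be sketched as a small helper. It assumes an index object that exposes `similarity_search(query_text=..., columns=..., num_results=...)`, as the Databricks Vector Search Python client does, and a response holding column names under `manifest` and rows under `result.data_array`. The helper name `sample_query`, the column `table_name`, and the endpoint and index placeholders are hypothetical, so verify the response shape against your client version.

```python
from typing import Any, Dict, List


def sample_query(index: Any, text: str, columns: List[str], k: int = 3) -> List[Dict[str, Any]]:
    """Run one similarity search and return each hit as a column-name -> value dict."""
    response = index.similarity_search(query_text=text, columns=columns, num_results=k)

    # Assumption: the response lists column names under manifest.columns
    # and the matching rows under result.data_array.
    names = [col["name"] for col in response.get("manifest", {}).get("columns", [])]
    rows = response.get("result", {}).get("data_array") or []
    return [dict(zip(names, row)) for row in rows]


# In Databricks (not executed here; endpoint and index names are placeholders):
# from databricks.vector_search.client import VectorSearchClient
# index = VectorSearchClient().get_index(
#     endpoint_name="<your-endpoint>", index_name="<catalog.schema.index>")
# for hit in sample_query(index, "monthly revenue by region", ["table_name"]):
#     print(hit)
```

An empty result for a query that should clearly match is a hint that the index has not finished syncing or was built from an empty source table.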