⚠️ IMPORTANT: Run the ETL pipeline BEFORE developing or deploying agents. The agent system depends on the data that the ETL prepares.
The ETL pipeline prepares the enriched metadata and the vector search index that agents use for semantic query routing:

1. Export Genie metadata (`01_export_genie_spaces.py`)
2. Enrich table metadata (`02_enrich_table_metadata.py`): creates the `enriched_genie_docs` table with comprehensive metadata
3. Build the vector index (`03_build_vector_search_index.py`)
Test ETL transformations locally with sample data before running on the full dataset.
# Test individual steps
python local_dev_etl.py --step export --sample-size 10
python local_dev_etl.py --step enrich --sample-size 10
python local_dev_etl.py --step vectorize --sample-size 10
# Test complete pipeline
python local_dev_etl.py --all --sample-size 10
When to use:
Time: ~5 minutes
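The commands above imply a small command-line interface with a `--step` choice, an `--all` switch, and a `--sample-size` option. The following is a minimal sketch of how such flags could be parsed with `argparse`; it is not the actual contents of `local_dev_etl.py`. The function names `build_parser` and `plan_run` and the default sample size are assumptions made for illustration.

```python
import argparse

# Step names as they appear in the commands above.
STEPS = ("export", "enrich", "vectorize")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags used in the commands above (hypothetical)."""
    parser = argparse.ArgumentParser(description="Local ETL test runner (illustrative sketch)")
    # Exactly one of --step or --all must be given.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--step", choices=STEPS, help="run a single step")
    mode.add_argument("--all", action="store_true", help="run every step in order")
    # The default of 10 is an assumption, not a documented value.
    parser.add_argument("--sample-size", type=int, default=10, help="records to process")
    return parser


def plan_run(argv: list[str]) -> tuple[list[str], int]:
    """Return the ordered steps to run and the sample size for the given arguments."""
    args = build_parser().parse_args(argv)
    steps = list(STEPS) if args.all else [args.step]
    return steps, args.sample_size


if __name__ == "__main__":
    print(plan_run(["--step", "enrich", "--sample-size", "10"]))
```

Making `--step` and `--all` mutually exclusive means an ambiguous invocation fails at parse time rather than partway through a run.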
Test against real Databricks services with a small dataset to validate the pipeline before a production run.
# In Databricks, run notebooks with test parameters:
# 01_export_genie_spaces.py
dbutils.widgets.text("sample_size", "10")
dbutils.widgets.text("test_mode", "True")
# 02_enrich_table_metadata.py
dbutils.widgets.text("sample_size", "10")
dbutils.widgets.text("test_mode", "True")
# 03_build_vector_search_index.py
dbutils.widgets.text("sample_size", "10")
dbutils.widgets.text("test_mode", "True")
When to use:
Time: ~10-15 minutes
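Widget values are always returned as strings, so `"10"` and `"True"` need to be converted before a notebook uses them. The sketch below shows one way to do that. The helper name `parse_widget_params`, and the rule that an empty `sample_size` means "process everything", are assumptions for illustration and may not match what the notebooks do.

```python
from typing import Optional


def parse_widget_params(raw: dict[str, str]) -> dict[str, object]:
    """Convert raw string widget values into typed parameters (hypothetical helper)."""
    sample_text = raw.get("sample_size", "").strip()
    test_text = raw.get("test_mode", "False").strip().lower()

    # Assumption: an empty sample_size means no sampling.
    sample_size: Optional[int] = int(sample_text) if sample_text else None

    # bool("False") is True in Python, so compare the text explicitly.
    test_mode = test_text in ("true", "1", "yes")

    return {"sample_size": sample_size, "test_mode": test_mode}


# In a notebook, the raw values would come from the widgets, for example:
#   raw = {name: dbutils.widgets.get(name) for name in ("sample_size", "test_mode")}
params = parse_widget_params({"sample_size": "10", "test_mode": "True"})
print(params)
```

Keeping the parsing in a plain function makes it testable outside Databricks, where `dbutils` is not available.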
Run the full ETL pipeline on the complete dataset for production use.
# In Databricks, run notebooks with production parameters:
# 01_export_genie_spaces.py
# (No sample_size - processes all Genie spaces)
# 02_enrich_table_metadata.py
# (No sample_size - enriches all tables)
# 03_build_vector_search_index.py
# (Builds full vector search index)
When to use:
Time: Varies by data size (typically 30-60 minutes)
ETL scripts must be run in order:
1. export_genie_spaces
   ↓
2. enrich_table_metadata (depends on step 1)
   ↓
3. build_vector_search_index (depends on step 2)
   ↓
4. Agent development can begin ✅
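Because each step depends on the previous one, a runner should stop at the first failure instead of continuing with stale or missing input. The sketch below illustrates that ordering rule with placeholder callables. `run_in_order` is a hypothetical helper, and the lambdas stand in for the real notebooks.

```python
from typing import Callable


def run_in_order(steps: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Run steps sequentially and stop at the first failure.

    Returns the names of the steps that completed.
    """
    completed: list[str] = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Later steps depend on this one, so do not continue.
            raise RuntimeError(
                f"Step '{name}' failed; completed so far: {completed}"
            ) from exc
        completed.append(name)
    return completed


# Placeholder callables stand in for the three ETL notebooks.
pipeline = [
    ("export_genie_spaces", lambda: None),
    ("enrich_table_metadata", lambda: None),
    ("build_vector_search_index", lambda: None),
]

if __name__ == "__main__":
    print(run_in_order(pipeline))
```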
Before running ETL:
After a successful ETL run, you should have:
- `{catalog}.{schema}.volume`
- `{catalog}.{schema}.enriched_genie_docs`
- `{catalog}.{schema}.enriched_genie_docs_chunks`
- `{catalog}.{schema}.enriched_genie_docs_chunks_vs_index`

| File | Purpose | When to Run |
|---|---|---|
| `local_dev_etl.py` | Local ETL testing | During development |
| `01_export_genie_spaces.py` | Export Genie metadata | Step 1 (Databricks) |
| `02_enrich_table_metadata.py` | Enrich table metadata | Step 2 (Databricks) |
| `03_build_vector_search_index.py` | Build vector index | Step 3 (Databricks) |
| `test_etl.py` | ETL integration tests | Validation |
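The four objects that a successful run should leave behind can be checked with a short script. The sketch below builds the fully qualified names from a catalog and schema and reports which ones are missing. The existence check is passed in as a callable because the right check depends on the object type. The helper names and the `main` / `genie` values are hypothetical.

```python
from typing import Callable

# Suffixes taken from the expected-output list in this document.
OUTPUT_SUFFIXES = (
    "volume",
    "enriched_genie_docs",
    "enriched_genie_docs_chunks",
    "enriched_genie_docs_chunks_vs_index",
)


def expected_etl_outputs(catalog: str, schema: str) -> list[str]:
    """Return the fully qualified names a successful ETL run should produce."""
    return [f"{catalog}.{schema}.{suffix}" for suffix in OUTPUT_SUFFIXES]


def missing_outputs(expected: list[str], exists: Callable[[str], bool]) -> list[str]:
    """Return the expected names for which the supplied check reports absence."""
    return [name for name in expected if not exists(name)]


if __name__ == "__main__":
    names = expected_etl_outputs("main", "genie")  # hypothetical catalog and schema
    # Stand-in check: pretend only the first two objects exist.
    present = set(names[:2])
    print(missing_outputs(names, lambda name: name in present))
```

On Databricks, something like `spark.catalog.tableExists(name)` might serve as the check for the two tables; the volume and the vector search index would likely need their own checks.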
ETL uses the same configuration as the agent system:

- `dev_config.yaml` or `prod_config.yaml`
- `config.py` + `.env`

Key configuration values:

- `CATALOG_NAME`: Unity Catalog catalog name
- `SCHEMA_NAME`: Schema name for tables
- `GENIE_SPACE_IDS`: List of Genie space IDs to process
- `SQL_WAREHOUSE_ID`: SQL Warehouse for queries
- `VS_ENDPOINT_NAME`: Vector Search endpoint name

Issue: ModuleNotFoundError: No module named 'config'
Issue: PermissionError: Access denied to Unity Catalog
Issue: Vector Search index not syncing
Issue: Genie space not found
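Some failures, such as a Genie space that cannot be found, may come from a wrong or missing configuration value, so it can help to validate the key configuration values before starting a run. The sketch below checks a mapping such as `os.environ` for the five keys named in this document. `load_etl_config` is a hypothetical helper, and the comma-separated format for `GENIE_SPACE_IDS` is an assumption; the project's `config.py` may work differently.

```python
import os
from typing import Mapping

# Key names taken from the configuration list in this document.
REQUIRED_KEYS = (
    "CATALOG_NAME",
    "SCHEMA_NAME",
    "GENIE_SPACE_IDS",
    "SQL_WAREHOUSE_ID",
    "VS_ENDPOINT_NAME",
)


def load_etl_config(env: Mapping[str, str]) -> dict[str, object]:
    """Validate and return the ETL configuration from a mapping of settings."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
    if missing:
        raise ValueError(f"Missing configuration values: {', '.join(missing)}")

    config: dict[str, object] = {key: env[key].strip() for key in REQUIRED_KEYS}
    # Assumption: space IDs are stored as a comma-separated string.
    config["GENIE_SPACE_IDS"] = [
        part.strip() for part in env["GENIE_SPACE_IDS"].split(",") if part.strip()
    ]
    return config


if __name__ == "__main__":
    try:
        print(load_etl_config(os.environ))
    except ValueError as err:
        print(err)
```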
For detailed troubleshooting:
After ETL completes successfully:
Remember: ETL is a prerequisite. Agents cannot function without the enriched metadata and the vector search index! 🎯