Comprehensive guide for the ETL (Extract, Transform, Load) pipeline that prepares data for the multi-agent system.
The ETL pipeline enriches Genie space metadata with table schemas, column details, and sample data, then builds a vector search index for semantic retrieval. This is a prerequisite for the agent system to function.
Script: etl/01_export_genie_spaces.py
What it does:
Outputs:
{catalog}.{schema}.volume/genie_exports/Configuration:
GENIE_SPACE_IDS: List of space IDs to exportGENIE_EXPORTS_VOLUME: UC volume pathRuntime: ~5-10 minutes (depends on number of spaces)
Script: etl/02_enrich_table_metadata.py
What it does:
Outputs:
{catalog}.{schema}.enriched_genie_docs: Enriched metadata table{catalog}.{schema}.enriched_genie_docs_chunks: Chunked documents for vector searchConfiguration:
SAMPLE_SIZE: Number of sample rows per table (default: 100)MAX_UNIQUE_VALUES: Max unique values to capture per column (default: 50)SQL_WAREHOUSE_ID: SQL Warehouse for queryingRuntime: ~15-30 minutes (depends on table sizes)
Script: etl/03_build_vector_search_index.py
What it does:
Outputs:
{catalog}.{schema}.enriched_genie_docs_chunks_vs_index: Vector search indexConfiguration:
VS_ENDPOINT_NAME: Vector search endpointEMBEDDING_MODEL: Embedding model (e.g., databricks-gte-large-en)PIPELINE_TYPE: TRIGGERED or CONTINUOUSRuntime: ~10-20 minutes (includes index sync)
Purpose: Test ETL logic locally before running on Databricks
Setup:
# Install ETL dependencies
pip install -r requirements.txt
# Run local ETL tester
python etl/local_dev_etl.py --step export --sample-size 10
python etl/local_dev_etl.py --step enrich --sample-size 10
python etl/local_dev_etl.py --step vectorize --sample-size 10
# Or run all steps
python etl/local_dev_etl.py --all --sample-size 10
What gets tested:
What doesn’t work locally:
Use cases:
Purpose: Test ETL on real Databricks services with small dataset
Setup:
# In each ETL notebook, set test parameters:
dbutils.widgets.text("test_mode", "True")
dbutils.widgets.text("sample_size", "10")
dbutils.widgets.text("max_spaces", "3")
# Run notebooks in order
What gets tested:
Use cases:
Purpose: Run complete ETL pipeline on full dataset
Setup:
# In each ETL notebook, set production parameters:
# (Remove or set to False)
dbutils.widgets.text("test_mode", "False")
# (Remove sample_size to process all data)
# Run notebooks in order
What runs:
Use cases:
Create a job with three tasks:
name: "Multi-Agent ETL Pipeline"
schedule:
quartz_cron_expression: "0 0 2 * * ?" # 2 AM daily
timezone_id: "UTC"
tasks:
- task_key: "export"
notebook_task:
notebook_path: "/Workspace/etl/01_export_genie_spaces"
- task_key: "enrich"
depends_on: ["export"]
notebook_task:
notebook_path: "/Workspace/etl/02_enrich_table_metadata"
- task_key: "vectorize"
depends_on: ["enrich"]
notebook_task:
notebook_path: "/Workspace/etl/03_build_vector_search_index"
Use Databricks Workflows UI:
ETL uses the same configuration as the agent system:
Uses dev_config.yaml or prod_config.yaml:
# Genie Configuration
genie_space_ids:
- space_id_1
- space_id_2
# Table Metadata
sample_size: 100
max_unique_values: 50
sql_warehouse_id: warehouse_id
# Vector Search
vs_endpoint_name: endpoint_name
embedding_model: databricks-gte-large-en
Uses config.py + .env:
GENIE_SPACE_IDS=space_id_1,space_id_2
SAMPLE_SIZE=100
MAX_UNIQUE_VALUES=50
SQL_WAREHOUSE_ID=warehouse_id
-- Monitor enrichment progress
SELECT
chunk_type,
COUNT(*) as count,
MAX(timestamp) as last_update
FROM {catalog}.{schema}.enriched_genie_docs_chunks
GROUP BY chunk_type;
-- Check specific Genie space
SELECT *
FROM {catalog}.{schema}.enriched_genie_docs_chunks
WHERE space_id = 'your_space_id'
LIMIT 10;
from databricks.vector_search.client import VectorSearchClient
client = VectorSearchClient()
index = client.get_index(
endpoint_name="your_endpoint",
index_name=f"{catalog}.{schema}.enriched_genie_docs_chunks_vs_index"
)
print(f"Index status: {index.describe()}")
Issue: Genie space not accessible
Issue: SQL Warehouse not found
SQL_WAREHOUSE_ID in configurationIssue: Vector Search endpoint not found
Issue: Unity Catalog permission denied
Issue: Enrichment taking too long
SAMPLE_SIZE for testing, or use smaller table subsetAfter ETL, validate:
-- Check all Genie spaces are represented
SELECT DISTINCT space_id, space_name
FROM {catalog}.{schema}.enriched_genie_docs;
-- Check chunk types distribution
SELECT chunk_type, COUNT(*)
FROM {catalog}.{schema}.enriched_genie_docs_chunks
GROUP BY chunk_type;
-- Verify sample data exists
SELECT COUNT(*)
FROM {catalog}.{schema}.enriched_genie_docs_chunks
WHERE chunk_type = 'sample_data';
Monitor these metrics:
After ETL completes successfully:
ETL is critical! Make sure it completes successfully before developing agents. 🎯