Data Product Lifecycle¶

Follow this path from setup to production. Each step builds on the previous one.

Before you start

This guide assumes you've already completed the Get Started guide. You should have Docker running, the Vulcan stack up, and the CLI alias configured. If vulcan --help doesn't work yet, it would be ideal to finish the setup first.

Phase 1: Setup & Infrastructure¶

Get your environment ready.

Step 1: Start the Full Stack¶

make up

This starts the complete Vulcan stack: - Docker network vulcan - Statestore (PostgreSQL) on port 5431 - stores Vulcan's internal state - MinIO on ports 9000/9001 - stores query results and artifacts - Transpiler API on port 8100 - converts semantic queries to SQL - Vulcan API on port 8000 - REST API for querying - GraphQL on port 3000 - GraphQL interface - MySQL proxy on port 3306 - for BI tool connectivity

Step 2: Configure CLI Access¶

Create an alias to access the Vulcan CLI. The alias uses an engine-specific Docker image. Postgres is shown by default (recommended for most users). If you're using a different engine, select it from the tabs below:

Automatic Updates

Docker image versions in this section are automatically synchronized with the engine configuration files. When engine image versions are updated, this section is automatically updated as well.

Postgres (Default)BigQueryDatabricksFabricMSSQLMySQLRedshiftSnowflakeSparkTrino

alias vulcan="docker run -it --network=vulcan --rm -v .:/workspace tmdcio/vulcan-postgres:0.228.1.19 vulcan"

Image version from Postgres engine configuration

alias vulcan="docker run -it --network=vulcan --rm -v .:/workspace tmdcio/vulcan-bigquery:0.228.1.10 vulcan"

Image version from BigQuery engine configuration

alias vulcan="docker run -it --network=vulcan --rm -v .:/workspace tmdcio/vulcan-databricks:0.228.1.19 vulcan"

Image version from Databricks engine configuration

alias vulcan="docker run -it --network=vulcan --rm -v .:/workspace tmdcio/vulcan-fabric:0.228.1.6 vulcan"

Image version from Fabric engine configuration

alias vulcan="docker run -it --network=vulcan --rm -v .:/workspace tmdcio/vulcan-mssql:0.228.1.6 vulcan"

Image version from MSSQL engine configuration

alias vulcan="docker run -it --network=vulcan --rm -v .:/workspace tmdcio/vulcan-mysql:0.228.1.6 vulcan"

Image version from MySQL engine configuration

alias vulcan="docker run -it --network=vulcan --rm -v .:/workspace tmdcio/vulcan-redshift:0.228.1.6 vulcan"

Image version from Redshift engine configuration

alias vulcan="docker run -it --network=vulcan --rm -v .:/workspace tmdcio/vulcan-snowflake:0.228.1.19 vulcan"

Image version from Snowflake engine configuration

alias vulcan="docker run -it --network=vulcan --rm -v .:/workspace tmdcio/vulcan-spark:0.228.1.19 vulcan"

Image version from Spark engine configuration

alias vulcan="docker run -it --network=vulcan --rm -v .:/workspace tmdcio/vulcan-trino:0.228.1.19 vulcan"

Image version from Trino engine configuration

Verify it works:

vulcan --help

Phase 2: Project Initialization¶

Create your project structure.

Step 3: Initialize Project¶

vulcan init

This creates:

your-project/
├── models/              # SQL/Python transformation models
│   ├── dq/              # Data Quality rule packs (kind: dq)
│   ├── semantics/       # Semantic models (kind: semantic)
│   └── metrics/         # Per-metric files
├── seeds/               # CSV files for static data
├── audits/              # Data quality assertions (blocking)
├── tests/               # Unit tests for models
├── macros/              # Reusable SQL patterns
└── config.yaml          # Project configuration

Step 4: Configure Project¶

Edit config.yaml: - Set database connections - Define model defaults (dialect, start date, cron schedule) - Configure linting rules

Step 5: Verify Setup¶

vulcan info

Checks connection status, project structure, and configuration.

Phase 3: Model Development¶

Write your data transformation logic.

Step 6: Write Models¶

SQL Models (models/example.sql):

MODEL (
  name warehouse.users,
  start '2024-01-01',
  cron '@daily'
);

SELECT 
  user_id,
  email,
  created_at
FROM raw.users
WHERE status = 'active';

Python Models (models/example.py):

def execute(context, start, end):
    # Complex logic, API calls, ML models
    return pd.DataFrame(...)

Most teams start with SQL for transformations, then add Python when they hit logic that's painful in SQL: calling external APIs, running machine learning models, or handling complex business rules.

Step 7: Lint Your Code¶

vulcan info

Vulcan checks for syntax errors, ambiguous columns, and invalid SQL patterns. Code is validated before execution.

Phase 4: Testing & Validation¶

Ensure your models work correctly.

Step 8: Write Tests¶

vulcan create_test model_name

Tests validate model logic locally. They run without touching your warehouse, so you get fast feedback without warehouse costs.

Step 9: Run Tests¶

vulcan test

Tests pass when model logic is correct.

Phase 5: Semantic Layer¶

Define business metrics and dimensions.

Step 10: Define Semantic Models¶

Create models/semantics/users.yml:

kind: semantic
name: users
depends_on: warehouse.users

dimensions:
  - plan_type

measures:
  - name: total_users
    type: count

You get: - Business-friendly query interface - Automatic API generation - Single source of truth for metrics

Phase 6: Planning¶

Review and apply changes safely.

Step 11: Create a Plan¶

vulcan plan

The plan: 1. Validates models and dependencies 2. Calculates which intervals need backfill 3. Shows full impact of changes 4. Creates isolated environment for testing

Plan output shows: - Models that will be created/modified - Data intervals that need processing - Dependencies and execution order

Step 12: Review Plan¶

Check: - Are the right models affected? - Is the backfill scope correct? - Any breaking changes?

Step 13: Apply Plan¶

# When prompted, enter 'y'

Applying the plan: 1. Creates model variants (with unique fingerprints) 2. Creates physical tables in warehouse 3. Backfills historical data 4. Creates/updates views (virtual layer) 5. Updates environment references

Changes are deployed to the target environment.

Phase 7: Running & Scheduling¶

Process new data on schedule.

Step 14: Run Scheduled Execution¶

vulcan run

Running checks for missing intervals (compares with state), filters models by cron schedule (only processes due models), executes missing intervals, and updates state database.

Difference from plan: - plan = Apply code changes - run = Process new data intervals

Step 15: Schedule for Production¶

Set up automation: - Cron job: 0 * * * * vulcan run - CI/CD pipeline: Scheduled workflows - Kubernetes CronJob: Container orchestration

Models run automatically on schedule.

Phase 8: Data Quality¶

Ensure data quality at every step.

Step 16: Write Audits¶

Create audits/unique_users.sql:

SELECT user_id, COUNT(*) as count
FROM warehouse.users
GROUP BY user_id
HAVING COUNT(*) > 1

Audits block bad data before it reaches production. They stop execution if data quality fails. The query returns rows when it finds bad data.

Step 17: Write Data Quality Rules¶

Create models/dq/completeness.yml:

kind: dq
name: users_completeness
depends_on: warehouse.users

rules:
  - missing_count(email) = 0:
      name: user_email_completeness
      dimension: completeness

Data Quality rule packs monitor data quality over time. They're non-blocking (warnings, not failures) and track quality metrics.

Bad data is caught and blocked.

Phase 9: API Access¶

Expose your data via APIs.

Step 18: Query via REST API¶

If you ran make up in Step 1, all API services are already running. You can query your data immediately.

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "measures": ["users.total_users"],
      "dimensions": ["users.plan_type"]
    }
  }'

Step 19: Query via Semantic Layer¶

vulcan transpile --format sql "SELECT MEASURE(total_users) FROM users"

Data is accessible via REST, GraphQL, Python APIs, and semantic queries.

Phase 10: Monitoring & Iteration¶

Monitor, debug, and improve.

Step 20: Monitor Execution¶

# Check project status
vulcan info

# View logs
cat .logs/vulcan_*.log

# Render SQL to debug
vulcan render model_name

# Test queries
vulcan fetchdf "SELECT * FROM schema.model_name LIMIT 10"

Step 21: Iterate¶

When you need to change models: 1. Edit model files 2. Run vulcan plan (applies changes) 3. Run vulcan run (processes new data)

When you need to add features: 1. Add new models → vulcan plan 2. Add semantic definitions → vulcan plan 3. Add audits/checks → vulcan plan 4. Test → vulcan test

Complete Lifecycle Flow¶

graph LR
    A[Setup Infrastructure] --> B[Initialize Project]
    B --> C[Write Models]
    C --> D[Lint Code]
    D --> E[Write Tests]
    E --> F[Run Tests]
    F --> G[Define Semantic Layer]
    G --> H[Create Plan]
    H --> I[Review Plan]
    I --> J[Apply Plan]
    J --> K[Schedule Runs]
    K --> L[Add Audits & Checks]
    L --> M[Start APIs]
    M --> N[Monitor & Iterate]

    style A fill:#e1f5fe,stroke:#0277bd
    style G fill:#fff9c4,stroke:#fbc02d
    style J fill:#c8e6c9,stroke:#2e7d32
    style N fill:#f3e5f5,stroke:#7b1fa2

Key Commands Reference¶

Phase	Command	Purpose
Setup	`make up`	Start full stack (infra + APIs)
Setup	`alias vulcan=...`	Configure CLI
Shutdown	`make down`	Stop all services
Init	`vulcan init`	Create project
Init	`vulcan info`	Verify setup
Develop	`vulcan lint`	Check code quality
Test	`vulcan test`	Run unit tests
Plan	`vulcan plan`	Create & apply changes
Run	`vulcan run`	Process new data
Query	`vulcan fetchdf`	Execute SQL queries
Semantic	`vulcan transpile`	Convert semantic to SQL
Debug	`vulcan render`	See generated SQL

Key Concepts¶

Plans vs Runs: - vulcan plan = Apply code/model changes (use when you modify code) - vulcan run = Process new data intervals (use for scheduled execution)

Environments: - Dev/Staging: Test changes safely - Production: Deploy validated changes - Plans create isolated environments for testing

Data Quality: - Audits: Block bad data (stops execution) - Checks: Monitor quality (warnings only) - Tests: Validate logic (before execution)

Semantic Layer: - Define metrics once, use everywhere - Automatic API generation - Business-friendly query interface

Summary¶

Vulcan's lifecycle:

Setup → Infrastructure and project initialization
Develop → Write models, tests, semantics
Validate → Lint, test, audit, check
Plan → Review and apply changes safely
Run → Process data on schedule
Expose → APIs and semantic queries
Monitor → Logs, status, debugging
Iterate → Continuous improvement

The flow is linear and predictable. Each phase builds on the previous one, and you can trace back to see what happened at each step.

This lifecycle ensures code quality, data quality, safe deployments, and continuous operation of your data pipeline.