Deployment Steps

This guide provides step-by-step instructions for deploying Vulcan data products in a DataOS environment.


Prerequisites

Before deploying a Vulcan data product, ensure you have the following resources configured in your DataOS environment:

1. DataOS CLI

Ensure you have the DataOS CLI installed and configured:

# Verify CLI installation
ds version

# Login to your DataOS instance
ds login
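A script that wraps these commands can fail early when the CLI is not installed. The following is a minimal sketch; `require_cmd` is a hypothetical helper name chosen for illustration and is not part of the DataOS CLI.

```shell
# require_cmd: succeed only if the named command is available on PATH.
# Hypothetical helper for illustration; not part of the DataOS CLI.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "error: '$1' not found on PATH" >&2
  return 1
}
```

For example, `require_cmd ds && ds version` only calls the CLI once the binary has been found.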

2. Depot (Data Source Connection)

A depot must be configured to connect to your data warehouse (e.g., Snowflake, BigQuery, Databricks).

List available depots:

ds resource -t depot get -a

Note: Ensure the depot has read/write permissions for your data warehouse schema.

3. Engine Stack

An engine stack defines the execution environment for Vulcan operations (e.g., Snowflake, BigQuery, Spark).

List available stacks:

ds resource -t stack get -a

Supported engines:

  • snowflake
  • bigquery
  • databricks
  • postgres
  • redshift
  • trino
  • mysql
  • mssql

4. Compute Resource

A compute resource provides the execution environment for running Vulcan workflows.

List available compute resources:

ds resource -t compute get -a

Example compute resources:

  • cyclone-compute (general purpose)
  • minerva-compute (query engine)
  • Custom compute clusters

5. Git-Sync Secret

A secret is required to access your private Git repository containing Vulcan models and configurations.

Create a git-sync secret:

ds resource apply -f git-sync-secret.yaml

Example secret configuration:

name: git-sync
version: v2alpha
type: secret
workspace: system
layer: user
description: "Secret for git-sync authentication (Bitbucket)"
secret:
  type: key-value
  data:
    GITSYNC_USERNAME: "<your-git-username>"
    GITSYNC_PASSWORD: "<your-git-token-or-password>"

Important: Replace the placeholder values of GITSYNC_USERNAME and GITSYNC_PASSWORD with your actual Git repository credentials or access token.
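To avoid hard-coding credentials in a file you might commit, one option is to render the manifest from environment variables at apply time. This is a sketch under stated assumptions: `render_git_sync_secret` is a hypothetical helper, and it performs no YAML escaping, so it assumes the values contain no double quotes or backslashes.

```shell
# render_git_sync_secret: print the secret manifest shown above, filling the
# two credential fields from environment variables of the same name.
# Assumption: values contain no double quotes or backslashes (no escaping is done).
render_git_sync_secret() {
  if [ -z "${GITSYNC_USERNAME:-}" ] || [ -z "${GITSYNC_PASSWORD:-}" ]; then
    echo "error: set GITSYNC_USERNAME and GITSYNC_PASSWORD first" >&2
    return 1
  fi
  cat <<EOF
name: git-sync
version: v2alpha
type: secret
workspace: system
layer: user
description: "Secret for git-sync authentication (Bitbucket)"
secret:
  type: key-value
  data:
    GITSYNC_USERNAME: "${GITSYNC_USERNAME}"
    GITSYNC_PASSWORD: "${GITSYNC_PASSWORD}"
EOF
}
```

A typical use would be `render_git_sync_secret > git-sync-secret.yaml` followed by `ds resource apply -f git-sync-secret.yaml`, keeping the generated file out of version control.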


Configuration Files

Vulcan deployments require two key configuration files:

1. config.yaml - Vulcan Configuration

This file contains Vulcan-specific configurations including model defaults, gateways, notifications, and metadata.

Location: <project-root>/config.yaml

Key sections:

Basic Metadata

name: <data-product-name>
display_name: <Data Product Title>
description: <Description .... >

# Catalog metadata
discoverable: true
version: 0.1.0
alignment: consumer_aligned

# Environment behaviour
vde: false   # set to true to enable Virtual Data Environments; not supported on spark/trino gateways

tags:
  - <tag1>
  - <tag2>

terms:
  - glossary.<term1>
  - glossary.<term2>

metadata:
  domain: <business-domain>
  use_cases:
    - <use-case-1>
    - <use-case-2>
  limitations:
    - <limitation-1>
    - <limitation-2>

Tenant comes from the environment

Set DATAOS_TENANT_ID in your shell or .env. It's no longer a YAML key.
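A small guard can confirm the variable is present before any Vulcan command runs. The sketch below assumes the .env file contains plain shell-compatible `VAR=value` lines, which this guide does not specify; `load_dotenv` and `require_tenant` are hypothetical helper names.

```shell
# load_dotenv: export every assignment from a .env file, if the file exists.
# Assumption: the file holds plain shell-compatible VAR=value lines.
load_dotenv() {
  _dotenv_file="${1:-./.env}"
  # A bare filename would make `.` search PATH, so force a path with a slash.
  case "$_dotenv_file" in
    */*) ;;
    *) _dotenv_file="./$_dotenv_file" ;;
  esac
  if [ -f "$_dotenv_file" ]; then
    set -a              # auto-export variables assigned while sourcing
    . "$_dotenv_file"
    set +a
  fi
}

# require_tenant: fail with a message when DATAOS_TENANT_ID is unset or empty.
require_tenant() {
  if [ -z "${DATAOS_TENANT_ID:-}" ]; then
    echo "error: DATAOS_TENANT_ID is not set" >&2
    return 1
  fi
}
```

Calling `load_dotenv && require_tenant` at the top of a deploy script surfaces a missing tenant before anything is applied.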

Model Defaults

model_defaults:
  dialect: <engine-dialect>          # Database dialect eg. snowflake, bigquery
  start: '2025-01-01'        # Start date for time-based models
  cron: '<cron>'             # Default scheduling cadence eg. '@daily'

Gateway Configuration

gateways:
  default:
    connection:
      type: depot
      address: dataos://<depot-name>  # Reference to your depot

Users and Ownership

users:
  - username: <username1>
    github_username: <gh-username1>
    email: <username1@email.id>
    type: OWNER
  - username: <username2>
    github_username: <gh-username2>
    email: <username2@email.id>
    type: OWNER

type: OWNER marks the user as a data product owner. List one entry per owner. github_username drives PR/CI bot interactions; leave it out for users who don't have a GitHub account.

Complete config.yaml Example

name: user-engagement
display_name: User Engagement Analytics
description: User Engagement Analytics delivers insights into user engagement patterns.

# Catalog metadata
discoverable: true
version: 0.1.0
alignment: consumer_aligned

# Environment behaviour
vde: false   # set to true to enable Virtual Data Environments; not supported on spark/trino gateways

tags:
  - snowflake
  - user_engagement
  - device_analytics

terms:
  - glossary.data_product
  - glossary.analytics_platform
  - glossary.user_engagement

metadata:
  domain: product_analytics
  use_cases:
    - User engagement tracking and analysis
    - Device usage analytics
    - Session and activity monitoring
  limitations:
    - Data available from 2025 onwards
    - Refreshes daily at midnight UTC

model_defaults:
  dialect: snowflake
  start: '2025-01-01'
  cron: '@daily'

gateways:
  default:
    connection:
      type: depot
      address: dataos://snowflakevulcan2

notification_targets:
  - type: console
    notify_on:
      - apply_failure
      - run_failure
      - check_failure

users:
  - username: <owner-username-1>
    github_username: <owner-gh-username-1>
    email: <owner-email-1@example.com>
    type: OWNER
  - username: <owner-username-2>
    github_username: <owner-gh-username-2>
    email: <owner-email-2@example.com>
    type: OWNER

2. domain-resource.yaml - DataOS Resource Configuration

This file defines the DataOS-specific resource configuration for deploying Vulcan as a managed service.

Location: <project-root>/domain-resource.yaml

You can create this file manually, or generate a starter deploy manifest using the Vulcan CLI:

vulcan create_deploy_yaml

Key sections:

Resource Metadata

version: v1alpha
type: vulcan
name: <data-product-name>
tags:
  - <tag1>
  - <tag2>

Execution Configuration

spec:
  runAsUser: "<dataos-username>"     # DataOS user identity
  compute: <compute-name>            # Compute cluster name eg. cyclone-compute
  engine: <engine-name>              # Execution engine eg. snowflake, bigquery

Repository Configuration

  repo:
    url: <git-repository-url>                # eg. https://github.com/org/repo
    syncFlags:
      - '--ref=<branch-name>'                # Git branch eg. main
      - '--submodules=off'
    baseDir: <path-to-project-in-repo>       # Path to project folder
    secret: <workspace>:<secret>          # Git credentials secret eg. engineering:git-sync-name

Depot References

  depots:
    - dataos://<depot-name>?purpose=rw      # Read-write depot access
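In the complete examples in this guide, config.yaml and domain-resource.yaml reference the same depot, so a quick comparison can catch a typo in either file. The sketch below is illustrative only: `depot_names` and `same_depots` are hypothetical helpers, and the pattern assumes depot names use only letters, digits, underscores, and hyphens, with at most one address per line.

```shell
# depot_names: print the unique depot names referenced as dataos://<name> in a file.
# Assumptions: names match [A-Za-z0-9_-]* and each line has at most one address.
depot_names() {
  sed -n 's|.*dataos://\([A-Za-z0-9_-]*\).*|\1|p' "$1" | sort -u
}

# same_depots: succeed when both files reference the same set of depots.
same_depots() {
  [ "$(depot_names "$1")" = "$(depot_names "$2")" ]
}
```

For example, `same_depots config.yaml domain-resource.yaml || echo "depot mismatch"` flags a difference between the two files.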

Workflow Configuration

  workflow:
    schedule:
      crons:
        - '<cron-expression>'  # eg. '*/45 * * * *' (Every 45 minutes)
      endOn: '<end-date>'      # eg. '2027-01-01T00:00:00-00:00'
      timezone: '<timezone>'   # eg. 'US/Pacific'
      concurrencyPolicy: Forbid

    logLevel: INFO

    resource:                   # Resource allocation
      request:
        cpu: "<cpu-request>"   # eg. "200m"
        memory: "<memory-request>"  # eg. "512Mi"
      limit:
        cpu: "<cpu-limit>"     # eg. "1000m"
        memory: "<memory-limit>"    # eg. "1Gi"
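Before committing a schedule, it can help to confirm the expression has the five fields of a standard cron line (minute, hour, day of month, month, day of week). This sketch only counts fields and does not validate their values; whether the workflow schedule also accepts macros such as @daily is not stated here, so the check treats them as non-matching. The function names are hypothetical.

```shell
# cron_fields: print the number of whitespace-separated fields in an expression.
cron_fields() {
  printf '%s\n' "$1" | awk '{ print NF }'
}

# is_five_field_cron: succeed for an expression with exactly five fields.
# This checks shape only, not whether each field holds a legal value.
is_five_field_cron() {
  [ "$(cron_fields "$1")" -eq 5 ]
}
```

For example, `is_five_field_cron '*/45 * * * *'` succeeds, while an expression with a missing field fails.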

Vulcan Commands

    plan:                       # Plan changes
      command: [vulcan]
      arguments:
        - --log-to-stdout
        - plan
        - --auto-apply

    run:                        # Execute models
      command: [vulcan]
      arguments:
        - --log-to-stdout
        - run

API Configuration

  api:
    replicas: <replica-count>     # eg. 1
    logLevel: INFO
    resource:
      request:
        cpu: "<cpu-request>"      # eg. "200m"
        memory: "<memory-request>"     # eg. "512Mi"
      limit:
        cpu: "<cpu-limit>"        # eg. "5000m"
        memory: "<memory-limit>"       # eg. "4Gi"

Complete domain-resource.yaml Example

version: v1alpha
type: vulcan
name: user-engagement
tags:
  - snowflake-analytics
  - user_engagement
  - device_analytics
spec:
  runAsUser: "<dataos-username>"
  compute: cyclone-compute
  engine: snowflake
  repo:
    url: https://bitbucket.org/rubik_/vulcan-examples
    syncFlags:
      - '--ref=main'
      - '--submodules=off'
    baseDir: vulcan-examples/customer-usecase/usdk
    secret: engineering:git-sync
  depots:
    - dataos://snowflakevulcan2?purpose=rw
  workflow:
    schedule:
      crons:
        - '*/45 * * * *'
      endOn: '2027-01-01T00:00:00-00:00'
      timezone: 'US/Pacific'
      concurrencyPolicy: Forbid
    logLevel: INFO
    resource:
      request:
        cpu: "200m"
        memory: "512Mi"
      limit:
        cpu: "1000m"
        memory: "1Gi"
    plan:
      command:
        - vulcan
      arguments:
        - --log-to-stdout
        - plan
        - --auto-apply
    run:
      command:
        - vulcan
      arguments:
        - --log-to-stdout
        - run
  api:
    replicas: 1
    logLevel: INFO
    resource:
      request:
        cpu: "200m"
        memory: "512Mi"
      limit:
        cpu: "5000m"
        memory: "4Gi"

Deployment Steps

Step 1: Prepare Your Repository

  1. Create your Vulcan project structure:

    your-project/
    ├── config.yaml              # Vulcan configuration
    ├── domain-resource.yaml     # DataOS resource definition
    ├── models/                  # SQL model files
    │   ├── staging/
    │   ├── marts/
    │   ├── dq/                  # Data Quality rule packs (kind: dq)
    │   ├── semantics/           # Semantic models (kind: semantic)
    │   └── metrics/             # Per-metric files
    ├── seeds/                   # Static data files
    └── audits/                  # Audit queries (blocking)
    

  2. Configure config.yaml with your project settings

  3. Generate domain-resource.yaml with vulcan create_deploy_yaml or configure it manually with your DataOS settings
  4. Push your code to a Git repository
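Before pushing, a short check can confirm the project root contains the pieces listed above. This is a minimal sketch; `check_layout` is a hypothetical helper, and it checks only the two configuration files and the models/ directory, not the optional folders.

```shell
# check_layout: verify that a project directory contains config.yaml,
# domain-resource.yaml, and a models/ directory. Reports every missing item.
check_layout() {
  _missing=0
  for _f in config.yaml domain-resource.yaml; do
    if [ ! -f "$1/$_f" ]; then
      echo "missing file: $1/$_f" >&2
      _missing=1
    fi
  done
  if [ ! -d "$1/models" ]; then
    echo "missing directory: $1/models" >&2
    _missing=1
  fi
  return "$_missing"
}
```

Running `check_layout your-project` returns a non-zero status and lists what is missing when the layout is incomplete.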

Step 2: Create Required Secrets

# Create git-sync secret (if not exists)
ds resource apply -f git-sync-secret.yaml

Step 3: Verify Prerequisites

# Verify depot exists
ds resource -t depot get -n <depot-name> -a

# Verify compute exists
ds resource -t compute get -n <compute-name> -a

# Verify stack exists
ds resource -t stack get -a 
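The three lookups can be chained so that a deploy script stops at the first missing prerequisite. This sketch assumes `ds` exits with a non-zero status when a lookup fails, which this guide does not state; `verify_prereqs` is a hypothetical wrapper around the commands shown above.

```shell
# verify_prereqs: run the depot, compute, and stack lookups in order and stop
# at the first failure.
# Assumption: `ds` exits non-zero when a lookup fails (not confirmed here).
verify_prereqs() {
  ds resource -t depot get -n "$1" -a \
    || { echo "depot check failed: $1" >&2; return 1; }
  ds resource -t compute get -n "$2" -a \
    || { echo "compute check failed: $2" >&2; return 1; }
  ds resource -t stack get -a \
    || { echo "stack check failed" >&2; return 1; }
}
```

For example, `verify_prereqs <depot-name> <compute-name> && ds resource apply -f domain-resource.yaml` applies the manifest only after all three checks pass.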

Step 4: Deploy Vulcan Resource

# Generate the deploy manifest if you haven't created it yet
vulcan create_deploy_yaml

# Apply the domain-resource configuration
ds resource apply -f domain-resource.yaml

Step 5: Monitor Deployment

# Get resource status
ds resource -t vulcan -n <data-product-name> get

# Check logs
ds resource -t vulcan -n <data-product-name> logs

Understanding Runtime Entries

Vulcan doesn't run as a single container. When you deploy, DataOS splits it into three components, each with its own runtime and logs:

  • plan - handles deployment preparation (vulcan plan --auto-apply)
  • run - executes your models on schedule (vulcan run)
  • api - serves queries and exposes endpoints (long-running service)

Open the Runtime tab in your DataOS instance and you'll see entries for all three. This is expected.

Which Log to Check

| What you're investigating | Look at | Runtime entry pattern |
| --- | --- | --- |
| Model execution results | run logs | *-r-execute, workflow...run... |
| Migration, planning, auto-apply | plan logs | *-mgrt-execute, *-plan-execute |
| API availability, query issues | api logs | *-api-*, service...api... |

For example, if your resource is called orders-analytics:

  • orders-analyticsv1-mgrt-execute and orders-analyticsv1-plan-execute belong to plan
  • orders-analyticsv1-r-execute and workflowv2alpha...run... entries belong to run
  • orders-analyticsv1-api-* and servicev2alpha...api... entries belong to api (check *-main for API logs, *-sc-1 for GraphQL, *-sc-2 for MySQL)

Fetching Logs via CLI

Use the DataOS CLI to pull logs from a specific component and container:

dataos-ctl resource -t Vulcan -n <resource-name> logs \
  --container-group <container-group> -c <container-name>

| What you need | --container-group | -c |
| --- | --- | --- |
| Planning / migration logs | <name>-plan-execute | main |
| Model execution logs | <name>-run-execute | main |
| API service logs | <name>-api | main |
| GraphQL sidecar logs | <name>-api | sc-1 |
| MySQL sidecar logs | <name>-api | sc-2 |

For example, to check execution logs for a resource called orders-analytics:

dataos-ctl resource -t Vulcan -n orders-analytics logs \
  --container-group orders-analytics-run-execute -c main
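The mapping from component to container group and container in the table above can be wrapped in a small helper so you don't retype the suffixes. This is a sketch: the component labels (plan, run, api, graphql, mysql) and the function names are hypothetical, while the container groups, container names, and command form are taken from this section.

```shell
# log_target: print "<container-group> <container>" for a component, following
# the table above. The component labels are hypothetical names for this sketch.
log_target() {
  case "$2" in
    plan)    echo "$1-plan-execute main" ;;
    run)     echo "$1-run-execute main" ;;
    api)     echo "$1-api main" ;;
    graphql) echo "$1-api sc-1" ;;
    mysql)   echo "$1-api sc-2" ;;
    *)       echo "unknown component: $2" >&2; return 1 ;;
  esac
}

# vulcan_logs: fetch logs for one component of a Vulcan resource.
vulcan_logs() {
  _target="$(log_target "$1" "$2")" || return 1
  # Split "<group> <container>" on the single space between the two parts.
  dataos-ctl resource -t Vulcan -n "$1" logs \
    --container-group "${_target% *}" -c "${_target#* }"
}
```

With this in place, `vulcan_logs orders-analytics run` issues the same command as the example above.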

Why Multiple Entries Appear

You'll often see more than three entries. Here's why:

  • Scheduled runs create new pods. Each time the cron fires, DataOS creates a new workflow pod for the run. Five "Succeeded" entries mean five completed scheduled runs. This is normal.
  • API replicas and sidecars. The API pod has multiple containers, each with its own logs:

    | Container | Log suffix | Use it for |
    | --- | --- | --- |
    | Main API container | *-main | Core API/service behavior |
    | GraphQL sidecar | *-sc-1 | GraphQL-related investigation |
    | MySQL sidecar | *-sc-2 | MySQL wire protocol or client connection issues |
  • Plan also runs as a workflow. Migration and planning each get their own pod, so you'll see separate entries for those too.

Quick rule of thumb

To verify a scheduled execution went through, open the most recent "Succeeded" run workflow pod. That has the latest vulcan run output.

Spark engines: driver vs executor logs

If your gateway uses Spark, the runtime entries above only tell half the story. Vulcan's run (and plan) pod is the Spark driver: it builds the query plan, ships tasks to your cluster, and collects results. The actual work runs on executors that live on your Spark cluster, not on DataOS.

That split changes where you go to debug:

| Symptom | Where the log lives | How to read it |
| --- | --- | --- |
| Vulcan can't reach Spark, auth errors, version mismatches, scheduler exceptions | DataOS *-run-execute or *-plan-execute pod | dataos-ctl resource -t Vulcan -n <name> logs --container-group <name>-run-execute -c main |
| Task failed inside a UDF, OOM on a worker, shuffle fetch failures | Spark cluster, executor logs | Spark master UI at http://<spark-master>:8080, then drill into the application, then its executors |
| Driver-side stack trace that points into executor code | Both: DataOS shows the symptom, Spark shows the cause | Start in DataOS, follow the executor ID in the trace to the Spark UI |

A common pattern: a vulcan run in DataOS fails with a multi-line Java stack trace. The top frames are driver-side and visible in *-run-execute logs; the root cause sits in an executor and is only retrievable from the Spark UI. Don't waste cycles re-running the DataOS pod when the answer is in the executor logs.

For the symmetric "is my driver Spark version actually the same as my cluster's?" question, see Verifying Spark version alignment. A version skew is the single most common reason a Spark-backed run pod blows up at startup, and it surfaces as java.io.InvalidClassException in the *-run-execute logs.
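Since the version-skew case has a recognizable marker, a saved or piped log can be scanned for it before digging further. This sketch only looks for the exception name mentioned above; `has_version_skew` is a hypothetical helper, and it assumes the log text arrives on standard input.

```shell
# has_version_skew: read log text on stdin and succeed if it contains the
# java.io.InvalidClassException marker described above.
has_version_skew() {
  grep -q 'java\.io\.InvalidClassException'
}
```

For example, if you have saved the *-run-execute output to a file, `has_version_skew < run.log && echo "check Spark version alignment"` points you at the version check first (run.log is a placeholder file name).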

Sidecars don't apply to Spark workloads

The sc-1 (GraphQL) and sc-2 (MySQL) sidecars are part of the api pod, not run. Spark workloads don't add new container groups to DataOS. The driver still runs inside the existing *-run-execute container.


Verification

Verify Models in Data Warehouse

Connect to your data warehouse and verify that tables/views have been created:

-- For Snowflake
SHOW TABLES IN SCHEMA <database>.<schema>;

-- Check specific table
SELECT * FROM <database>.<schema>.<table-name> LIMIT 10;

Access Vulcan API

# Test API (if exposed)
curl --location 'https://<env-fqn>/<tenant>/vulcan/<data-product-name>/livez' \
  --header 'Authorization: Bearer <your-token>'
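In a script, it is often more convenient to read just the HTTP status code. The sketch below builds the same URL and prints the code returned by curl; `livez_url` and `livez_status` are hypothetical helpers, VULCAN_TOKEN is a hypothetical variable name for the bearer token, and treating 200 as healthy is an assumption rather than documented behavior.

```shell
# livez_url: build the health-check URL from environment FQN, tenant, and
# data product name, matching the curl example above.
livez_url() {
  printf 'https://%s/%s/vulcan/%s/livez\n' "$1" "$2" "$3"
}

# livez_status: print the HTTP status code returned by the endpoint.
# VULCAN_TOKEN is a hypothetical variable name holding the bearer token.
livez_status() {
  curl --silent --output /dev/null --write-out '%{http_code}' \
    --location "$(livez_url "$1" "$2" "$3")" \
    --header "Authorization: Bearer ${VULCAN_TOKEN:-}"
}
```

For example, `[ "$(livez_status <env-fqn> <tenant> <data-product-name>)" = "200" ]` can gate a later step, under the assumption stated above.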