Data Quality¶

This guide explains how to use Audits, Checks, and Tests together to ensure data quality in your Orders360 project. You'll learn when to use each tool and see complex examples where they work together.

These three tools work as layers of protection for your data. Each serves a different purpose. Together they provide comprehensive coverage.

The Three-Layer Quality Strategy¶

flowchart TB
    subgraph "Layer 1: Audits - Critical Blocking"
        AUDIT[Audits<br/>Block invalid data<br/>Run with model]
        EXAMPLES1["• Primary keys unique<br/>• Revenue non-negative<br/>• Foreign keys valid"]
    end

    subgraph "Layer 2: Checks - Monitoring"
        CHECK[Checks<br/>Track quality trends<br/>Non-blocking]
        EXAMPLES2["• Row count anomalies<br/>• Completeness trends<br/>• Cross-model validation"]
    end

    subgraph "Layer 3: Tests - Logic Validation"
        TEST[Tests<br/>Validate transformations<br/>Unit testing]
        EXAMPLES3["• SQL logic correct<br/>• Expected outputs<br/>• Edge cases"]
    end

    AUDIT --> EXAMPLES1
    CHECK --> EXAMPLES2
    TEST --> EXAMPLES3

    style AUDIT fill:#ffebee,stroke:#d32f2f,stroke-width:3px,color:#000
    style CHECK fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:#000
    style TEST fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000

When to Use Each:

Here's a quick guide to help you choose:

Tool	Purpose	Blocks Pipeline?	Best For
Audits	Critical validation	Yes (always)	Business rules, data integrity
Checks	Quality monitoring	No	Trends, anomalies, monitoring
Tests	Logic validation	No	SQL correctness, edge cases

The key difference: Audits stop everything if they fail. Checks and tests warn you, so you can investigate without blocking production.

Quick Reference¶

Audits: Critical Blocking Validation¶

Use audits when: Data must be correct or the pipeline should stop. These are your "must never fail" rules, if an audit fails, something is seriously wrong and you don't want that bad data flowing downstream.

MODEL (
  name sales.daily_sales,
  assertions (
    -- Primary key validation
    not_null(columns := (order_date)),
    unique_values(columns := (order_date)),

    -- Business rules
    positive_values(column := total_revenue),
    accepted_range(column := total_orders, min_v := 0, max_v := 10000)
  )
);

Why audits are useful:

Always blocking - if they fail, execution stops immediately. No bad data gets through
Run automatically with model execution - you don't have to remember to run them
Fast feedback during development - catch issues before they hit production
Use for critical business rules like "revenue must be positive" or "primary keys must be unique"

Think of audits as your bouncer, they check IDs at the door and don't let anyone sketchy in.

Checks: Quality Monitoring¶

Use checks when: You want to monitor trends and detect anomalies over time. Unlike audits, checks don't block your pipeline, they just keep an eye on things and warn you if something looks off.

# checks/daily_sales.yml
checks:
  sales.daily_sales:
    completeness:
      - row_count > 0:
          name: daily_records_exist
          attributes:
            description: "At least one record per day"

    accuracy:
      - anomaly detection for total_revenue:
          name: revenue_anomaly
          attributes:
            description: "Detect unusual revenue patterns"

Why checks are useful:

Non-blocking - warnings, not failures. Your pipeline keeps running even if a check flags something
Historical tracking - see trends over time. You can spot patterns like "revenue always drops on weekends" or "row counts are trending down"
Anomaly detection - statistical analysis. Checks can detect when something is statistically unusual, even if it's not technically wrong
Use for monitoring and alerting. Set up alerts for when checks fail, so you know to investigate.

Checks monitor everything and alert you if something suspicious happens, but they don't stop execution.

Tests: Logic Validation¶

Use tests when: You need to validate SQL transformations and edge cases. These are your unit tests for SQL, they make sure your logic is correct before you deploy it.

# tests/test_daily_sales.yaml
tests:
  - name: test_daily_sales_aggregation
    model: sales.daily_sales
    inputs:
      raw.raw_orders:
        - order_id: ORD-001
          order_date: 2025-01-15
          total_amount: 100.50
    outputs:
      - order_date: 2025-01-15
        total_orders: 1
        total_revenue: 100.50

Why tests are useful:

Unit testing for SQL logic - test your transformations in isolation
Validates expected outputs - make sure you're getting the results you expect
Tests edge cases - what happens with empty data? Null values? Boundary conditions?
Use for development. Catch bugs before they make it to production.

Tests run in a controlled environment before production.

Complex Example: Orders360 Daily Sales¶

Let's see how all three tools work together for the sales.daily_sales model. This is a real-world example that shows you how to layer these tools for maximum protection.

The Model¶

MODEL (
  name sales.daily_sales,
  kind FULL,
  cron '@daily',
  grains (order_date),
  tags ('silver', 'sales', 'aggregation'),
  terms ('sales.daily_metrics'),
  description 'Daily sales summary with order counts and revenue',
  assertions (
    -- Audit: Critical validations
    not_null(columns := (order_date, total_orders, total_revenue)),
    unique_values(columns := (order_date)),
    positive_values(column := total_orders),
    positive_values(column := total_revenue),
    accepted_range(column := total_revenue, min_v := 0, max_v := 1000000)
  )
);

SELECT
  CAST(order_date AS TIMESTAMP)::TIMESTAMP AS order_date,
  COUNT(order_id)::INTEGER AS total_orders,
  SUM(total_amount)::FLOAT AS total_revenue,
  MAX(order_id)::VARCHAR AS last_order_id
FROM raw.raw_orders
GROUP BY order_date
ORDER BY order_date

Layer 1: Audits (Critical Blocking)¶

Why: These rules must never fail. Invalid data should not flow downstream. If revenue is negative or primary keys aren't unique, that's a critical problem that needs to stop everything immediately.

-- audits/revenue_consistency.sql
AUDIT (name assert_revenue_consistency);
-- Ensure revenue matches sum of individual orders
SELECT 
  ds.order_date,
  ds.total_revenue,
  SUM(o.total_amount) as calculated_revenue
FROM @this_model ds
JOIN raw.raw_orders o ON DATE(o.order_date) = ds.order_date
GROUP BY ds.order_date, ds.total_revenue
HAVING ABS(ds.total_revenue - SUM(o.total_amount)) > 0.01;

Attach to model:

MODEL (
  name sales.daily_sales,
  assertions (
    -- ... other audits ...
    assert_revenue_consistency  -- Custom audit
  )
);

Layer 2: Checks (Monitoring)¶

Why: Monitor trends and detect anomalies without blocking the pipeline. You want to know if revenue spikes unexpectedly or if row counts drop, but these might be legitimate business events, so you investigate rather than blocking.

# checks/daily_sales.yml
checks:
  sales.daily_sales:
    # Completeness: Ensure data exists
    completeness:
      - row_count > 0:
          name: daily_records_exist
          attributes:
            description: "At least one record per day"

      - missing_count(order_date) = 0:
          name: no_missing_dates
          attributes:
            description: "All dates must be present"

    # Validity: Check data ranges
    validity:
      - failed rows:
          name: revenue_outliers
          fail query: |
            SELECT order_date, total_revenue
            FROM sales.daily_sales
            WHERE total_revenue > 500000 OR total_revenue < 0
          samples limit: 10
          attributes:
            description: "Revenue outside expected range"

    # Accuracy: Anomaly detection
    accuracy:
      - anomaly detection for total_revenue:
          name: revenue_anomaly
          attributes:
            description: "Detect unusual revenue patterns"

      - anomaly detection for total_orders:
          name: order_count_anomaly
          attributes:
            description: "Detect unusual order volume"

    # Consistency: Cross-model validation
    consistency:
      - failed rows:
          name: revenue_mismatch_with_raw
          fail query: |
            SELECT 
              ds.order_date,
              ds.total_revenue as daily_revenue,
              SUM(o.total_amount) as raw_revenue
            FROM sales.daily_sales ds
            LEFT JOIN raw.raw_orders o 
              ON DATE(o.order_date) = ds.order_date
            GROUP BY ds.order_date, ds.total_revenue
            HAVING ABS(ds.total_revenue - SUM(o.total_amount)) > 1.0
          samples limit: 5
          attributes:
            description: "Daily revenue should match sum of raw orders"

    # Timeliness: Check data freshness
    timeliness:
      - change for row_count >= -20%:
          name: row_count_drop_alert
          attributes:
            description: "Alert if daily records drop more than 20%"

Layer 3: Tests (Logic Validation)¶

Why: Validate SQL logic and edge cases during development. Before you even deploy, you want to make sure your SQL is doing what you think it's doing. Tests catch logic errors early.

# tests/test_daily_sales.yaml
tests:
  - name: test_daily_sales_single_order
    model: sales.daily_sales
    inputs:
      raw.raw_orders:
        - order_id: ORD-001
          order_date: 2025-01-15
          customer_id: CUST-001
          product_id: PROD-001
          total_amount: 100.50
    outputs:
      - order_date: 2025-01-15
        total_orders: 1
        total_revenue: 100.50
        last_order_id: ORD-001

  - name: test_daily_sales_multiple_orders
    model: sales.daily_sales
    inputs:
      raw.raw_orders:
        - order_id: ORD-001
          order_date: 2025-01-15
          total_amount: 100.00
        - order_id: ORD-002
          order_date: 2025-01-15
          total_amount: 200.00
        - order_id: ORD-003
          order_date: 2025-01-15
          total_amount: 50.00
    outputs:
      - order_date: 2025-01-15
        total_orders: 3
        total_revenue: 350.00
        last_order_id: ORD-003

  - name: test_daily_sales_empty_day
    model: sales.daily_sales
    inputs:
      raw.raw_orders: []
    outputs: []

  - name: test_daily_sales_date_grouping
    model: sales.daily_sales
    inputs:
      raw.raw_orders:
        - order_id: ORD-001
          order_date: 2025-01-15 10:00:00
          total_amount: 100.00
        - order_id: ORD-002
          order_date: 2025-01-15 15:30:00
          total_amount: 200.00
        - order_id: ORD-003
          order_date: 2025-01-16 09:00:00
          total_amount: 150.00
    outputs:
      - order_date: 2025-01-15
        total_orders: 2
        total_revenue: 300.00
      - order_date: 2025-01-16
        total_orders: 1
        total_revenue: 150.00

How They Work Together¶

flowchart TB
    subgraph "Development Workflow"
        DEV[Developer writes model]
        TEST[Run Tests<br/>Validate logic]
        PLAN[Run Plan<br/>Apply changes]
    end

    subgraph "Execution Flow"
        EXEC[Model Executes]
        AUDIT_RUN[Audits Run<br/>Block if fail]
        CHECK_RUN[Checks Run<br/>Track trends]
    end

    subgraph "Results"
        PASS[Pass<br/>Data flows]
        FAIL[Fail<br/>Pipeline stops]
        TREND[Trends<br/>Monitor quality]
    end

    DEV --> TEST
    TEST --> PLAN
    PLAN --> EXEC
    EXEC --> AUDIT_RUN
    EXEC --> CHECK_RUN

    AUDIT_RUN -->|Pass| PASS
    AUDIT_RUN -->|Fail| FAIL
    CHECK_RUN --> TREND

    style DEV fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    style TEST fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px,color:#000
    style AUDIT_RUN fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#000
    style CHECK_RUN fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:#000
    style PASS fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px,color:#000
    style FAIL fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#000

Execution Order:

Tests run during development (validate logic) - catch bugs before deployment
Plan applies changes to environment - your changes go live
Model executes transformation - data gets processed
Audits run immediately (block if fail) - critical validation happens right away
Checks run (track trends, don't block) - monitoring happens in the background

Tests happen first, then audits catch critical issues, and checks monitor everything. This layered approach provides comprehensive coverage.

Complex Scenario: Revenue Validation¶

Here's a complex example where audits and checks work together to validate revenue data. This shows you how to use both tools for the same concern, critical blocking vs. monitoring.

The Problem¶

We need to ensure:

Critical: Revenue is always positive (audit - blocks)
Critical: Daily totals match raw order sums (audit - blocks)
Monitoring: Revenue trends are normal (check - warns)
Monitoring: Detect unusual spikes/drops (check - warns)

Solution: Combined Approach¶

Audits (Critical - Blocking):

MODEL (
  name sales.daily_sales,
  assertions (
    -- Basic validation
    positive_values(column := total_revenue),
    not_null(columns := (order_date, total_revenue)),

    -- Complex validation: Revenue consistency
    assert_revenue_matches_raw_orders
  )
);

-- audits/revenue_matches_raw.sql
AUDIT (name assert_revenue_matches_raw_orders);
SELECT 
  ds.order_date,
  ds.total_revenue as daily_total,
  COALESCE(SUM(o.total_amount), 0) as raw_total,
  ABS(ds.total_revenue - COALESCE(SUM(o.total_amount), 0)) as difference
FROM @this_model ds
LEFT JOIN raw.raw_orders o 
  ON DATE(o.order_date) = ds.order_date
GROUP BY ds.order_date, ds.total_revenue
HAVING ABS(ds.total_revenue - COALESCE(SUM(o.total_amount), 0)) > 0.01;

Checks (Monitoring - Non-Blocking):

# checks/revenue_monitoring.yml
checks:
  sales.daily_sales:
    accuracy:
      # Anomaly detection for revenue
      - anomaly detection for total_revenue:
          name: revenue_anomaly_detection
          attributes:
            description: "Detect statistically unusual revenue"

      # Trend monitoring
      - change for total_revenue >= 50%:
          name: revenue_spike_alert
          attributes:
            description: "Alert if revenue increases >50% day-over-day"

      - change for total_revenue <= -30%:
          name: revenue_drop_alert
          attributes:
            description: "Alert if revenue drops >30% day-over-day"

    consistency:
      # Cross-model validation (non-blocking)
      - failed rows:
          name: revenue_vs_raw_check
          fail query: |
            SELECT 
              ds.order_date,
              ds.total_revenue,
              SUM(o.total_amount) as raw_sum,
              ABS(ds.total_revenue - SUM(o.total_amount)) as diff
            FROM sales.daily_sales ds
            LEFT JOIN raw.raw_orders o 
              ON DATE(o.order_date) = ds.order_date
            GROUP BY ds.order_date, ds.total_revenue
            HAVING ABS(ds.total_revenue - SUM(o.total_amount)) > 10.0
          samples limit: 5
          attributes:
            description: "Monitor revenue consistency (wider tolerance than audit)"

Why Both?

You might wonder why you need both an audit and a check for revenue. Here's the thing:

Audit: Stops pipeline if revenue is wrong (critical). If daily totals don't match raw orders, that's a data integrity issue and everything stops.
Check: Warns about trends and anomalies (monitoring) - if revenue spikes 50% day-over-day, that might be legitimate (big sale!) or it might be a problem, but you want to investigate, not block
Together: Critical issues blocked, trends monitored - you get both immediate protection and ongoing visibility

The audit has a tight tolerance (0.01) because it's checking for correctness. The check has a wider tolerance (10.0) because it's looking for trends, not exact matches. Pretty clever, right?

Running Quality Tools¶

Run Tests¶

# Run all tests
vulcan test

# Run specific test
vulcan test tests/test_daily_sales.yaml::test_daily_sales_single_order

# Run tests matching pattern
vulcan test tests/test_daily*

Run Audits¶

# Run all audits
vulcan audit

# Audits also run automatically with plan
vulcan plan dev

Run Checks¶

# Run all checks
vulcan check

# Run checks for specific model
vulcan check --select sales.daily_sales

# Checks also run automatically with plan/run
vulcan plan dev

Best Practices¶

DO:¶

Here are some tips to help you use these tools effectively:

Start with Audits - Add critical blocking validations first. Get your safety net in place before worrying about trends.
Add Checks Gradually - Monitor trends, then add anomaly detection. Don't try to check everything at once, build up your monitoring over time.
Test During Development - Write tests before deploying. Catch logic errors before they hit production.
Use Descriptive Names - Makes debugging easier. Names like revenue_mismatch_with_raw are much better than check_1.
Order Audits Efficiently - Fast checks first, slow checks last. If you have multiple audits, put the quick ones first so you fail fast.

DON'T:¶

And here's what to avoid:

Don't use Checks for Critical Rules - Use audits instead. If it's critical, it should block. Checks are for monitoring, not blocking.
Don't Skip Audit Failures - Fix the root cause. If an audit fails, something is wrong. Don't just disable it, fix the problem.
Don't Over-Audit - Focus on critical business rules. Too many audits can slow things down. Only audit what really matters.
Don't Ignore Check Trends - They indicate data quality issues. If checks are consistently failing, there's probably a real problem you need to address.

Summary¶

Three-Layer Strategy:

Audits = Critical blocking validation (must pass)
Checks = Quality monitoring (trends, anomalies)
Tests = Logic validation (development)

Use Together:

Audits block invalid data
Checks monitor quality trends
Tests validate SQL logic

Orders360 Example:

Audits ensure revenue is positive and matches raw data
Checks detect anomalies and trends
Tests validate aggregation logic

Next Steps¶

Learn about Built-in Audits
Explore Check Dimensions
Read about Testing