Python¶
Sometimes you need Python instead of SQL. Use Python models for machine learning, calling external APIs, or implementing complex business logic that's difficult to express in SQL.
Vulcan supports Python models: as long as your function returns a Pandas, Spark, Bigframe, or Snowpark DataFrame, you can write your model in Python.
When to use Python models:

- Building machine learning workflows
- Integrating with external APIs
- Complex transformations that are easier in Python
- Data processing that benefits from Python libraries
Unsupported Model Kinds
Python models don't support these model kinds. If you need one of these, use a SQL model instead:
- `VIEW` - Views need to be SQL
- `SEED` - Seed models load CSV files (SQL only)
- `MANAGED` - Managed models require SQL
- `EMBEDDED` - Embedded models inject SQL subqueries
Definition¶
Create a Python model by adding a .py file to your models/ directory and defining an execute function.
Here's what a basic Python model looks like:
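A minimal sketch of what that file might contain (the model name and columns are illustrative):

```python
import typing as t
from datetime import datetime

import pandas as pd
from vulcan import ExecutionContext, model


@model(
    "my_schema.my_model",  # illustrative model name
    columns={
        "id": "int",
        "name": "text",
    },
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    return pd.DataFrame([{"id": 1, "name": "one"}])
```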
How it works:
The @model decorator captures your model's metadata (just like the MODEL DDL in SQL models). You specify column names and types in the columns argument; this is required because Vulcan needs to create the table before your function runs.
Function signature: Your execute function receives:

- `context: ExecutionContext` - For running queries and getting time intervals
- `start`, `end` - Time range for incremental models
- `execution_time` - When the model is running
- `**kwargs` - Any other runtime variables
Return types: You can return Pandas, PySpark, Bigframe, or Snowpark DataFrames. If your output is large, you can also use Python generators to return data in chunks for memory management.
@model Specification¶
The @model decorator accepts the same properties as SQL models (name, kind, cron, grains, and so on); just use Python syntax instead of SQL DDL. They all work the same way.
Python model kinds are specified with a Python dictionary containing the kind's name and arguments. All model kind arguments are listed in the models configuration reference page.
```python
from vulcan import ModelKindName

@model(
    "sales.daily_sales",
    kind=dict(
        name=ModelKindName.INCREMENTAL_BY_TIME_RANGE,
        time_column="order_date",
    ),
)
```
All model kind properties are documented in the model configuration reference.
Supported kind dictionary name values are:

- `ModelKindName.VIEW`
- `ModelKindName.FULL`
- `ModelKindName.SEED`
- `ModelKindName.INCREMENTAL_BY_TIME_RANGE`
- `ModelKindName.INCREMENTAL_BY_UNIQUE_KEY`
- `ModelKindName.INCREMENTAL_BY_PARTITION`
- `ModelKindName.SCD_TYPE_2_BY_TIME`
- `ModelKindName.SCD_TYPE_2_BY_COLUMN`
- `ModelKindName.EMBEDDED`
- `ModelKindName.CUSTOM`
- `ModelKindName.MANAGED`
- `ModelKindName.EXTERNAL`
Execution Context¶
Python models can do anything you want, but we strongly recommend making all models idempotent. Python models can fetch data from upstream models or even from data sources outside of Vulcan.
Fetching data: Use context.fetchdf() to run SQL queries and get DataFrames:
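For example, inside an execute function (the query and table name are illustrative):

```python
def execute(context, start, end, execution_time, **kwargs):
    # Run SQL against the warehouse; fetchdf returns a Pandas DataFrame
    df = context.fetchdf("SELECT id, order_date, amount FROM sales.orders")
    return df
```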
Resolving table names: Use context.resolve_table() to get the correct table name for the current environment (handles dev/prod prefixes automatically):
```python
table = context.resolve_table("vulcan_demo.products")
df = context.fetchdf(f"SELECT * FROM {table}")
```
Best practice: Make your models idempotent; running them multiple times should produce the same result. This makes debugging and restatements much easier.
Optional Pre/Post-Statements¶
You can run SQL commands before and after your Python model executes. This is useful for setting session parameters, creating indexes, or running data quality checks.
Pre-statements: Run before your execute function
Post-statements: Run after your execute function completes
You can pass SQL strings, SQLGlot expressions, or macro calls as lists to pre_statements and post_statements.
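A sketch, assuming both properties are passed as keyword arguments to the decorator (the statements themselves are illustrative):

```python
@model(
    "sales.daily_sales",
    columns={"id": "int", "order_date": "date", "total": "double"},
    pre_statements=[
        "SET timezone = 'UTC'",            # plain SQL string (session setting)
    ],
    post_statements=[
        "@CREATE_INDEX(@this_model, id)",  # user-defined macro call
    ],
)
def execute(context, start, end, execution_time, **kwargs):
    ...
```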
Concurrency
Be careful with pre-statements that create or alter physical tables: if multiple models run concurrently, you could get conflicts. Stick to session settings, UDFs, and temporary objects in pre-statements.
Project-level defaults: You can also define pre/post-statements in model_defaults for consistent behavior across all models. Default statements run first, then model-specific ones. Learn more in the model configuration reference.
The previous example's post_statements called the user-defined Vulcan macro @CREATE_INDEX(@this_model, id).
We could define the CREATE_INDEX macro in the project's macros directory like this. The macro creates a table index on a single column, and only when the runtime stage is `creating` (table creation time).
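A sketch of such a macro, assuming Vulcan exposes a macro decorator and a runtime_stage attribute on the evaluator (those names are assumptions):

```python
from vulcan import macro


@macro()
def create_index(evaluator, model_name, column):
    # Only emit the index statement at table creation time
    if evaluator.runtime_stage == "creating":
        return f"CREATE INDEX IF NOT EXISTS {column}_idx ON {model_name} ({column})"
    return None
```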
Alternative approach: Instead of using the @model decorator's pre_statements and post_statements, you can execute SQL directly in your function using context.engine_adapter.execute().
Important: If you want post-statements to run after your function completes, you need to use yield instead of return. Post-statements specified after a yield will execute after the function finishes.
This example function includes both pre- and post-statements:
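A sketch of that pattern using context.engine_adapter.execute (the statements are illustrative):

```python
@model("sales.daily_sales", columns={"order_date": "date", "total": "double"})
def execute(context, start, end, execution_time, **kwargs):
    # Pre-statement: runs before the model's output is produced
    context.engine_adapter.execute("SET timezone = 'UTC'")

    df = context.fetchdf(
        "SELECT order_date, SUM(amount) AS total FROM sales.orders GROUP BY 1"
    )
    yield df  # yield (not return) so the code below runs after completion

    # Post-statement: runs after the DataFrame has been written
    context.engine_adapter.execute("ANALYZE sales.daily_sales")
```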
Optional On-Virtual-Update Statements¶
On-virtual-update statements run when views are created or updated in the virtual layer. This happens after your model's physical table is created and the view pointing to it is set up.
Common use case: Granting permissions on views so users can query them.
You can set on_virtual_update in the @model decorator to a list of SQL strings, SQLGlot expressions, or macro calls.
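For example (the grant statement and role are illustrative):

```python
@model(
    "vulcan_demo.model_with_grants",
    columns={"id": "int"},
    on_virtual_update=[
        "GRANT SELECT ON VIEW @this_model TO ROLE analysts",
    ],
)
def execute(context, start, end, execution_time, **kwargs):
    ...
```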
Project-level defaults: You can also define on-virtual-update statements at the project level using model_defaults in your configuration. These will be applied to all models in your project (including Python models) and merged with any model-specific statements. Default statements are executed first, followed by model-specific statements. Learn more about this in the model configuration reference.
Virtual Layer Resolution
These statements run at the virtual layer, so table names resolve to view names, not physical table names. For example, in a dev environment, vulcan_demo.model_with_grants and @this_model resolve to vulcan_demo__dev.model_with_grants (the view), not the physical table.
Dependencies¶
In order to fetch data from an upstream model, you first get the table name using context's resolve_table method. This returns the appropriate table name for the current runtime environment:
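For example (the upstream model name is illustrative):

```python
def execute(context, start, end, execution_time, **kwargs):
    # Returns e.g. the dev-prefixed table name when running in a dev environment
    table = context.resolve_table("sales.orders")
    return context.fetchdf(f"SELECT * FROM {table}")
```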
The resolve_table method will automatically add the referenced model to the Python model's dependencies.
The only other way to set a Python model's dependencies is to define them explicitly in the @model decorator using the depends_on keyword. Dependencies defined in the model decorator take precedence over any dynamic references inside the function.
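A sketch of an explicit dependency declaration (names are illustrative):

```python
@model(
    "sales.daily_sales",
    columns={"order_date": "date", "total": "double"},
    depends_on=["sales.orders"],  # explicit; takes precedence over dynamic references
)
def execute(context, start, end, execution_time, **kwargs):
    ...
```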
You can use global variables or blueprint variables in resolve_table calls. Here's how:
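A sketch using a global variable read via context.var (the variable name and default are illustrative):

```python
def execute(context, start, end, execution_time, **kwargs):
    # Pick the upstream schema from a user-defined global variable
    schema = context.var("source_schema", "vulcan_demo")
    table = context.resolve_table(f"{schema}.products")
    return context.fetchdf(f"SELECT * FROM {table}")
```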
Returning Empty DataFrames¶
Python models can't return empty DataFrames directly. If your model might return empty data, use yield instead of return:
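A sketch; the inline DataFrame stands in for a real query result that may contain no rows:

```python
import pandas as pd


def execute(context, start, end, execution_time, **kwargs):
    df = pd.DataFrame({"id": []})  # stand-in for a query that returned no rows
    if not df.empty:
        yield df
    # if df is empty, the generator yields nothing instead of
    # returning an empty DataFrame (which would error)
```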
Why? This allows Vulcan to handle the empty case properly. If you return an empty DataFrame, Vulcan will error. If you yield an empty generator or conditionally yield, it works fine.
User-defined variables¶
User-defined global variables can be accessed from within the Python model with the context.var method.
For example, this model accesses the user-defined variables var and var_with_default. It specifies a default value of default_value to use if var_with_default resolves to a missing value.
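A sketch:

```python
@model("my_schema.my_model", columns={"id": "int"})
def execute(context, start, end, execution_time, **kwargs):
    var_value = context.var("var")
    other_value = context.var("var_with_default", "default_value")
    ...
```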
Alternatively, you can access global variables via execute function arguments, where the name of the argument corresponds to the name of a variable key.
For example, this model specifies my_var as an argument to the execute method. The model code can reference the my_var object directly:
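A sketch (note the default value on the argument):

```python
@model("my_schema.my_model", columns={"id": "int"})
def execute(context, start, end, execution_time, my_var=None, **kwargs):
    # my_var receives the value of the `my_var` global variable, or None if unset
    ...
```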
Make sure the argument has a default value if it's possible for the variable to be missing.
Note that arguments must be specified explicitly - variables cannot be accessed using kwargs.
Python Model Blueprinting¶
Python models can serve as templates for creating multiple models. This is called "blueprinting": you define one model template, and Vulcan creates multiple models from it.
How it works: You parameterize the model name with a variable (using @{variable} syntax) and provide a list of mappings in blueprints. Vulcan creates one model for each mapping.
Use case: When you have similar models that differ only by a few parameters (like different schemas, regions, or customers).
Here's an example that creates two models:
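A sketch that would create customer1.daily_sales and customer2.daily_sales (the names and columns are illustrative):

```python
from vulcan import ModelKindName, model


@model(
    "@{customer}.daily_sales",  # @{} renders the variable as a SQL identifier
    kind=dict(name=ModelKindName.FULL),
    blueprints=[
        {"customer": "customer1"},
        {"customer": "customer2"},
    ],
    columns={"order_date": "date", "total": "double"},
)
def execute(context, start, end, execution_time, **kwargs):
    ...
```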
Important: Notice the @{customer} syntax in the model name. The curly braces tell Vulcan to treat the variable value as a SQL identifier (not a string literal). Learn more about this syntax here.
Dynamic blueprints: You can generate blueprints dynamically using macros. This is handy when your blueprint list comes from external sources (like CSV files or API calls):
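A sketch of what referencing such a macro might look like, assuming blueprints accepts a macro-call string:

```python
@model(
    "@{customer}.daily_sales",
    blueprints="@gen_blueprints()",  # macro call that produces the mappings
    columns={"order_date": "date", "total": "double"},
)
def execute(context, start, end, execution_time, **kwargs):
    ...
```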
For example, the definition of the gen_blueprints macro might look like this:
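A sketch, assuming Vulcan exposes a macro decorator and that the macro returns the list of blueprint mappings:

```python
from vulcan import macro


@macro()
def gen_blueprints(evaluator):
    # Could just as well be read from a CSV file or an API call
    return [
        {"customer": "customer1"},
        {"customer": "customer2"},
    ]
```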
It's also possible to use the @EACH macro, combined with a global list variable (@values):
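A sketch, assuming @values is a user-defined global list variable and that the lambda syntax matches Vulcan's SQL macro syntax:

```python
@model(
    "@{customer}.daily_sales",
    blueprints="@EACH(@values, x -> (customer := @x))",
    columns={"order_date": "date", "total": "double"},
)
def execute(context, start, end, execution_time, **kwargs):
    ...
```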
Using Macros in Model Properties¶
Python models support macro variables in model properties, but there's a gotcha when macros appear inside strings.
The issue: Cron expressions often use @ (like @daily, @hourly), which conflicts with Vulcan's macro syntax.
The solution: Wrap the entire expression in quotes and prefix it with @. This prevents cron aliases like @daily and @hourly from being parsed as Vulcan macros.
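Following that rule, the property might be written like this (a sketch; the exact quoting is an assumption):

```python
@model(
    "my_schema.my_model",
    cron="@'@daily'",  # quoted and @-prefixed so @daily isn't parsed as a macro
    columns={"id": "int"},
)
def execute(context, start, end, execution_time, **kwargs):
    ...
```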
Examples¶
Here are some practical examples showing different ways to use Python models.
Basic¶
A simple Python model that returns a static Pandas DataFrame. All the metadata properties work the same as in SQL models; just use Python syntax.
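A sketch (the model name and owner are illustrative):

```python
import typing as t
from datetime import datetime

import pandas as pd
from vulcan import ExecutionContext, model


@model(
    "vulcan_demo.basic_model",
    owner="data_team",
    columns={"id": "int", "name": "text"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # Static data: the same frame is returned on every run (idempotent)
    return pd.DataFrame([
        {"id": 1, "name": "one"},
        {"id": 2, "name": "two"},
    ])
```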
SQL Query and Pandas¶
A more realistic example: query upstream models, do some pandas processing, and return the result. This shows how you'd typically use Python models in practice:
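A sketch (the upstream model, columns, and aggregation are illustrative):

```python
@model(
    "sales.customer_totals",
    columns={"customer_id": "int", "total": "double"},
)
def execute(context, start, end, execution_time, **kwargs):
    # Query the upstream model via its environment-aware name
    orders = context.resolve_table("sales.orders")
    df = context.fetchdf(f"SELECT customer_id, amount FROM {orders}")

    # From here it's ordinary pandas
    totals = df.groupby("customer_id", as_index=False)["amount"].sum()
    return totals.rename(columns={"amount": "total"})
```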
PySpark¶
If you're using Spark, use the PySpark DataFrame API instead of Pandas. PySpark DataFrames compute in a distributed fashion (across your Spark cluster), which is much faster for large datasets.
Why PySpark over Pandas: Pandas loads everything into memory on a single machine. PySpark distributes the work across your cluster, so you can handle much larger datasets.
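A sketch, assuming the execution context exposes a SparkSession as context.spark (that attribute name is an assumption):

```python
from pyspark.sql import DataFrame, functions as F


@model(
    "sales.customer_totals_spark",  # illustrative name
    columns={"customer_id": "int", "total": "double"},
)
def execute(context, start, end, execution_time, **kwargs) -> DataFrame:
    orders = context.spark.table(context.resolve_table("sales.orders"))
    # Computation is distributed across the Spark cluster
    return orders.groupBy("customer_id").agg(F.sum("amount").alias("total"))
```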
Snowpark¶
If you're using Snowflake, use the Snowpark DataFrame API. Like PySpark, Snowpark DataFrames compute on Snowflake's servers (not locally), which is much more efficient.
Why Snowpark over Pandas: All computation happens in Snowflake, so you're not moving data to your local machine. Faster, cheaper, and can handle huge datasets.
Bigframe¶
If you're using BigQuery, use the Bigframe DataFrame API. Bigframe looks like Pandas but runs everything in BigQuery.
Why Bigframe over Pandas: All computation happens in BigQuery, so you get BigQuery's scale and performance. Plus, you can use BigQuery remote functions (like in the example below) for custom Python logic.
Batching¶
If your Python model outputs a huge DataFrame and you can't use Spark/Bigframe/Snowpark, you can batch the output using Python generators.
The problem: With Pandas, everything loads into memory. If your output is too large, you'll run out of memory.
The solution: Use yield to return DataFrames in chunks. Vulcan processes them one at a time, so you never have more than one chunk in memory at once.
Here's how you'd do it:
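A self-contained sketch; the inline DataFrame stands in for a large result (in practice each chunk would be built or fetched separately, so only one chunk is in memory at a time):

```python
import pandas as pd


def execute(context, start, end, execution_time, **kwargs):
    data = pd.DataFrame({"id": range(6)})  # stand-in for a large result
    chunk_size = 2
    # Yield the output in fixed-size chunks instead of one big DataFrame
    for row in range(0, len(data), chunk_size):
        yield data.iloc[row : row + chunk_size]
```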
Serialization¶
Vulcan executes Python models locally (wherever Vulcan is running) using a custom serialization framework. This means your Python code runs on your machine or CI/CD environment, not in the database.
Why this matters: You have full access to Python libraries, can make API calls, do ML processing, etc. The database just receives the final DataFrame.