Building a data pipeline using Snowflake primitives
You can get pretty far using Snowflake's core capabilities without needing to fall into the trap of a dedicated data modeling or orchestration tool.
In a previous blog post, I described a data architecture that allows us to pick and choose the optimal compute environment by leveraging Iceberg’s open table format. One of the obvious takeaways was that it was complicated. In this post, I want to explore the opposite approach and highlight how much can be done within the data warehouse itself. I will use Snowflake as the example, given its comprehensive support for these kinds of features, though other modern data warehouses offer similar capabilities.
Loading data
The initial step is bringing the data into Snowflake, and there are several ways to do it. One of the simplest is to have the data available on R2/S3 and ingest it from there. You can query it directly using the external table functionality, or set the bucket up as a stage and load the data into Snowflake’s internal tables. Additionally, Snowflake provides Snowpipe, which can automatically load the data as soon as it lands in your storage layer. In short, there are plenty of ways to get data into Snowflake, and to keep things simple we haven’t even looked at third-party vendors, such as Fivetran, Airbyte, Portable, and others, which specialize in pulling data from various sources.
-- From the Snowflake docs
CREATE STAGE my_ext_stage
URL='s3://load/files/'
CREDENTIALS=(AWS_KEY_ID='1a2b3c' AWS_SECRET_KEY='4x5y6z');
COPY INTO mycsvtable
FROM @my_ext_stage/tutorials/dataloading/contacts1.csv;
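For continuous ingestion, Snowpipe essentially wraps a COPY statement in a pipe object. Here’s a minimal sketch building on the stage above; the pipe name is hypothetical, and AUTO_INGEST assumes event notifications have been configured on the bucket:
-- Hypothetical example: continuous loading with Snowpipe
-- AUTO_INGEST assumes the bucket publishes event notifications to Snowflake
CREATE PIPE contacts_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO mycsvtable
  FROM @my_ext_stage/tutorials/dataloading/;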
Transformations
Once the data is in Snowflake, or at least accessible by Snowflake, we need to process it. This involves defining the required transformations and handling their orchestration. The standard practice today is to use a tool like dbt to define the models (tables) and how they are built, with scheduling handled by an orchestration tool such as Airflow. But these tools add complexity, and Snowflake offers capabilities that let us move both transformation and orchestration into the warehouse: Dynamic Tables, Materialized Views, and Tasks.
Dynamic Tables let us define a table as a query plus a target lag, which tells Snowflake how stale the result is allowed to be relative to its sources; Snowflake then schedules refreshes to meet that lag. For instance, you can specify that table A is generated by a query that combines data from table B and table C, along with additional transformations. The benefit is twofold: you replace dbt by writing an explicit query, and you avoid a separate orchestration layer by letting Snowflake handle the scheduling and execution. However, managing complex dependency chains and having limited visibility into the inner workings of the process can prove challenging. There is also the overhead of maintaining a CI/CD pipeline to roll out any required changes.
-- From the Snowflake docs
CREATE OR REPLACE DYNAMIC TABLE product
TARGET_LAG = '20 minutes'
WAREHOUSE = mywh
REFRESH_MODE = auto
INITIALIZE = on_create
AS
SELECT product_id, product_name FROM staging_table;
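Dependency chains are expressed simply by having one dynamic table query another. As a rough sketch (the downstream table and the orders source are hypothetical, not from the original post), and note that the upstream table could instead declare TARGET_LAG = DOWNSTREAM so it refreshes only as often as its consumers need:
-- Hypothetical example: chaining dynamic tables
-- product could instead declare TARGET_LAG = DOWNSTREAM and inherit
-- its refresh cadence from the tables that depend on it
CREATE OR REPLACE DYNAMIC TABLE product_order_counts
  TARGET_LAG = '1 hour'
  WAREHOUSE = mywh
AS
  SELECT p.product_id, p.product_name, COUNT(*) AS order_count
  FROM product p
  JOIN orders o ON o.product_id = p.product_id
  GROUP BY p.product_id, p.product_name;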
Snowflake’s Materialized Views differ from materialized views in most other databases because they require neither explicit refreshing nor explicit referencing. Once a materialized view is defined, Snowflake takes care of keeping it up to date with the underlying data. On top of that, Snowflake’s query planner will automatically rewrite queries to use these Materialized Views when that is the optimal plan. This is incredibly powerful: you can define a series of Materialized Views for common query patterns and let Snowflake handle both their maintenance and their usage. Query authors benefit from them without even knowing they exist.
-- From the Snowflake docs
-- Example of a materialized view with a range filter
create materialized view v1 as
select * from table1 where column_1 between 100 and 400;
-- Example of a query that might be rewritten to use MV
select * from table1 where column_1 between 200 and 300;
Although Dynamic Tables and Materialized Views cover a lot of ground, they aren’t always sufficient. Sometimes you want to execute a specific job at a consistent interval, whether because the dependency chain of dynamic tables is hard to reason about or because performance variations are too difficult to track, monitor, and control. For those cases Snowflake offers Tasks, which execute a query on a fixed cadence.
-- From the Snowflake docs
CREATE TASK exttable_refresh_task
WAREHOUSE=mywh
SCHEDULE='5 minutes'
AS
ALTER EXTERNAL TABLE mydb.myschema.exttable REFRESH;
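Tasks can also be chained into a small DAG, where a child task runs after its parent instead of on its own schedule. A minimal sketch, with a hypothetical downstream task and target table; note that tasks are created suspended and need to be resumed:
-- Hypothetical example: a child task that runs after the refresh completes
CREATE TASK load_contacts_task
  WAREHOUSE = mywh
  AFTER exttable_refresh_task
AS
  INSERT INTO mydb.myschema.contacts_clean
  SELECT * FROM mydb.myschema.exttable;

-- Tasks are created suspended; resume the child first, then the root
ALTER TASK load_contacts_task RESUME;
ALTER TASK exttable_refresh_task RESUME;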
Putting it all together
Here’s a breakdown of how it fits together.
The data is stored on R2, an S3 alternative, and Snowflake offers a variety of ways to have the data loaded. We can use Snowpipe to automatically ingest new data as it arrives, manually load it using external stages, or simply keep the data on R2 and reference it via external tables.
We use Snowflake’s Dynamic Tables, Materialized Views, and Scheduled Tasks to handle all our data transformation and modeling requirements.
Metabase is a lightweight BI tool that we point at the views and tables within Snowflake. To keep things controlled and well governed, we limit the assets that Metabase can query (see the sketch after this breakdown).
Python, or any other programming language, can be used to query the data within Snowflake.
Similarly, we can provide programmatic access to the raw data when necessary.
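One way to enforce that boundary is a dedicated read-only role scoped to the curated schema, which is the only role granted to the Metabase connection. A sketch with hypothetical role, warehouse, database, and user names:
-- Hypothetical example: a read-only role for Metabase
CREATE ROLE metabase_role;
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE metabase_role;
GRANT USAGE ON DATABASE analytics TO ROLE metabase_role;
GRANT USAGE ON SCHEMA analytics.marts TO ROLE metabase_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO ROLE metabase_role;
GRANT SELECT ON ALL VIEWS IN SCHEMA analytics.marts TO ROLE metabase_role;
GRANT ROLE metabase_role TO USER metabase_svc;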
This approach has significantly fewer pieces than the architecture from last week but functions in much the same way. In fact, I’d start with something like this before going down the rabbit hole of the modern data stack, and only switch to an alternative when warranted.
The main reasons to move away from this approach are cost and complexity. By relying on a single tool, we essentially “flatten” our data architecture. This works when there are a few nodes with clear relationships between them, but as the number of nodes increases it becomes harder and harder to understand what’s happening. Since so much takes place behind the scenes, visibility into the process and its failure modes is limited. To scale this approach, you need a good understanding of both your workloads and how Snowflake triggers the refreshes. At some point, it may be more effective to move the pipeline out of Snowflake and build an explicit, version-controlled solution with a more mature CI/CD process.