Same Query, Different Hash: An exploration into Snowflake's query hashing blind spots
We dug into cases where query_parameterized_hash falls down, so you don't have to
If you’re a Snowflake user, chances are you’ve encountered the QUERY_HISTORY table. This table contains valuable information about every query that ran in your data warehouse, from the query text, the start and end times, the warehouse and size it ran on, to various query performance metrics.
To provide additional insight, Snowflake also offers the query_hash and the query_parameterized_hash fields. Query_hash lets you quickly filter to all runs of the same query text, while query_parameterized_hash replaces parameters and allows you to look at all queries with the same structure but different literal values.
A simple example is “select name from customer where id = 1” and “select name from customer where id = 2”. These two queries would have different query_hash values but the same query_parameterized_hash.
Given how useful the query_parameterized_hash could be, we wanted to dig in and understand its limitations. We went through a variety of queries, and looked at the query history to see how Snowflake handled the query hashing. We found that, other than the simplest cases, Snowflake takes a rudimentary approach that misses many query structures that should have generated the same query_parameterized_hash.
IN statements
SELECT a from b where id in (1);
SELECT a from b where id in (2);
SELECT a from b where id in (1,2,3);
We would expect that the above would all have the same query_parameterized_hash; instead, Snowflake treats them as unique. Even if you claim that the third query with multiple IN values is different than the others, the first two should still have the same query_parameterized_hash.
Table aliases
SELECT name from customers where customer_id = 1;
SELECT c.name from customers as c where c.customer_id = 1;
SELECT d.name from customers as d where d.customer_id = 1;
In this case, we are aliasing the customers table within the query, but the same physical table is being referenced each time. Unfortunately, Snowflake assigns each of these a different query_parameterized_hash.
Table qualification
SELECT name from customers where customer_id = 1;
SELECT name from public.customers where customer_id = 1;
SELECT name from testing.public.customers where customer_id = 1;
Similar to other data warehouses, Snowflake tables are qualified with a catalog (“database”), a schema, and a table name. If you run a query within a database and a default schema, you don’t need to explicitly qualify the table in the query and can references it via short form: customers vs testing.public.customers. These queries show the same query being executed on the same physical table—but yet again, Snowflake gives each a unique query_parameterized_hash.
Column aliases
select name from customers where customer_id = 1;
select name as other_name from customers where customer_id = 1;
In this case, simply aliasing the column causes an entirely different query_parameterized_hash to be generated.
Column order
select name, city from customers where customer_id = 1;
select city, name from customers where customer_id = 1;
The same data is returned, but the column order is different. Once again, Snowflake gives these different query_parameterized_hash values.
CTE Order
with c as (select name from customers where customer_id = 1),
d as (select name from customers where customer_id = 2)
select c.name, d.name from c join d on 1=1;
with d as (select name from customers where customer_id = 2),
c as (select name from customers where customer_id = 1)
select c.name, d.name from c join d on 1=1;
If Snowflake couldn’t handle the simpler cases above, it’s clear that it won’t handle this one—but we included it for the sake of completeness. The only difference in the queries above is the order of the CTEs. The actual query being executed is the same, and Snowflake should have given them the same query_parameterized_hash.
Twing Data’s Approach to Query Hashing
Snowflake has significant limitations on how it generates the query_parameterized_hash: it doesn’t do a full equivalence check of the queries against on another, and instead takes a shortcut by simply replacing the literal values (and it doesn’t even do that consistently—such as the “IN” example above).
In light of these shortcomings, Twing Data implements a more thoughtful and aggressive approach to query hashing. We use the sqlglot library to translate every SQL query into an abstract syntax tree. Once we have a tree representation of each query we walk through each of the nodes, running a variety of cleaning and canonicalization functions: we qualify the tables that aren’t qualified, collapse IN statements, and re-order clauses to get to the essence of each query. Only then do we generate the query hash, which allows us to enable better visibility into query patterns.
At Twing Data, we understand that proper visibility into your workloads is necessary for optimization, cost reduction, and effective data modeling. If you're interested in better managing your warehouse costs, or comparing your data infrastructure metrics against some industry benchmarks, sign up for a trial here, or simply email me at dan@twingdata.com.