Exploring Redshift’s Approach to Query Hashing
We dug into Redshift's query hashing to see how it stacked up against Snowflake's approach. Spoiler alert: Redshift came out ahead.
A few weeks ago, we looked at examples where Snowflake’s query_parameterized_hash failed to identify simple variants of a query pattern as the same query. This week, we did the same exercise for Redshift expecting roughly the same behavior – but we were pleasantly surprised. Redshift’s query hashing is much more thoughtful than Snowflake’s: it’s clear they actually parse and analyze the query plan in order to generate the query hash.
There’s a hint that they take a more thoughtful approach in their docs, where they mention that “select * from table” and “select id from table” will generate the same hash if that table has a single column. That wouldn’t be possible from looking solely at the query text without a deeper understanding of the table structure.
Now on to the test cases:
Single values comparisons
SELECT a from (select 1 as id, 'a' as a) b where id in (2);
SELECT a from (select 1 as id, 'a' as a) b where id in (1);
SELECT a from (select 1 as id, 'a' as a) b where id = 2;
SELECT a from (select 1 as id, 'a' as a) b where id = 1;
In this case, Redshift correctly recognized that these were the same query, unlike Snowflake which only correctly handled the strict equality condition. But as soon as we expanded it to “where id in (1, 2),” Redshift generated a different query hash..
Table aliasing
SELECT name from (select 1 as id, 'a' as a, 'Test' as name) where id = 1;
SELECT d.name from (select 1 as id, 'a' as a, 'Test' as name) as d where d.id = 1;
SELECT c.name from (select 1 as id, 'a' as a, 'Test' as name) as c where c.id = 1;
SELECT b.name from (select 1 as id, 'a' as a, 'Test' as name) as b where b.id = 1;
SELECT name from (select 1 as id, 'a' as a, 'Test' as name) b where id = 1;
Redshift detected that these were all the same table and gave each the same hash. Snowflake, on the other hand, generated different query hashes for each.
Table qualification
SELECT name from customers where customer_id = 1;
SELECT name from public.customers where customer_id = 1;
SELECT name from testing.public.customers where customer_id = 1;
Unlike Snowflake, Redshift detected that these queries were all using the same table and gave each the same hash.
Column aliasing
select name as other_name from (select 1 as id, 'a' as a, 'Test' as name) where id = 1;
select name from (select 1 as id, 'a' as a, 'Test' as name) where id = 1;
In this case, Redshift generated different query hashes despite the structure of the query being identical.
Column order
select city, name from (select 1 as id, 'a' as city, 'Test' as name) where id = 1;
select name, city from (select 1 as id, 'a' as city, 'Test' as name) where id = 1;
Similarly, here, Redshift didn’t generate the same query hash here despite the same data being returned.
CTE order
with d as (select name from (select 1 as id, 'a' as a, 'Test' as name) where id = 2),
c as (select name from (select 1 as id, 'a' as a, 'Test' as name) where id = 1)
select c.name, d.name from c join d on 1=1;
with c as (select name from (select 1 as id, 'a' as a, 'Test' as name) where id = 1),
d as (select name from (select 1 as id, 'a' as a, 'Test' as name) where id = 2)
select c.name, d.name from c join d on 1=1;
Once again, Redshift had trouble with the order.
Overall, Redshift did a better job of parsing compared to Snowflake, correctly mapping the cases where the differences were in the table structure (aliasing) – but didn’t do as good of a job for column renaming, or when the the clauses were reordered.
Here’s a quick side by side comparison of Redshift vs Snowflake for all the test cases.