The Curious Case of a Snowflake CTE
Snowflake has an odd behavior when using a HAVING clause within a CTE definition that doesn't exist in other SQL dialects.
One of our core beliefs is that there’s an incredible amount of value locked within the queries we’re running every day, and the way to unlock it is to understand what each query is doing. This means we need to extract all of the tables and columns from a query and how they’re being used - for example whether they’re part of a selection, a join condition, or a where clause, and so on. We use the wonderful sqlglot library (by the team behind SQLMesh) to help us with the parsing and we strive towards successfully parsing and analyzing every query we see.
Unsurprisingly, parsing is tough. There are few things as humbling as trying to parse the wide range of queries generated by a variety of tools (and people) across dozens of different SQL dialects. Before going down this road I never knew about the EXCEPT keyword, and now I do, in fact I now know that it’s synonymous with MINUS.
In any case, a key part of our approach is to infer the schema based on the queries we analyze. For instance, if we see a query such as select a from b
that succeeded, we know that table b
has a column a
, at the time of execution at least. It can get ambiguous though. Take the query select a from b join c on b.id = c.id
. What can you infer from that? You know that tables b
and c
each have an id
column but you do not know which of them has a column named a
. What you can do is see if there’s another query that unambiguously tells you which of the two tables has the a
column and then apply that knowledge to the earlier query. We run a version of this process at scale to infer the schema.
During this inference process we discovered an interesting behavior in Snowflake that we couldn’t find in any other dialect we tested. Take this seemingly simple query:
WITH y (c) AS (
SELECT
SUM(a) AS b
FROM (
SELECT 1 a
) AS x
HAVING c > 0
)
SELECT * FROM y;
It’s a toy example with dummy data but we’re defining a CTE with table name y
and column c
. What’s tricky here is that the HAVING
clause is somehow referencing column c
that was declared outside the CTE definition. The parser tries to be smart and infers the schema based on the query. In this case, it assumes that c
is a column in b
- what else could the HAVING
clause be referring to? You can guess where this is headed. It turns out that in Snowflake the HAVING
clause can refer to a column defined outside the inner scope. In this case the HAVING c > 0
is equivalent to HAVING SUM(a) > 0
. The fix was to have the CTE column aliasing occur before the rest of the processing, but only for Snowflake.
What’s even more interesting about it is that the following query that is referencing an alias declared within the CTE fails to execute:
WITH y (c) AS (
SELECT
SUM(a) AS b
FROM (
SELECT 1 a
) AS x
HAVING b > 0
)
SELECT * FROM y;
But then the following works - because the b
is an implicit column in y
.
WITH y AS (
SELECT
SUM(a) AS b
FROM (
SELECT 1 a
) AS x
HAVING b > 0
)
SELECT * FROM y;
I’d love to know why Snowflake’s implementation here varies from the other dialects since there’s likely an interesting story behind it . If you have any thoughts, ideas, or examples of this behavior in other dialects I’d love to see it. As we discover more examples of similar dialect nuances and challenges we’ll be sure to share them here.