The three Snowflake queries big data doesn’t want you to know about
Leverage Snowflake’s metadata for big cost savings. It’s all there; you just need to know where to look.
We’ve been talking about some shortcomings of Snowflake recently (see this post about query hashing blind spots), so today we’re changing tack to give credit where credit is due. Snowflake does a great job capturing all sorts of query metadata that you can use to explore your warehouse usage. Below, we’re sharing three queries that have been incredibly helpful in analyzing our customers’ data warehouses and identifying optimization opportunities.
Table size vs. table usage
-- Per-table usage stats derived from access history
with usage as (
    select
        obj.value:objectId as table_id,
        obj.value:objectName::string as table_name,
        APPROX_COUNT_DISTINCT(query_id) as num_queries,
        -- queries that modified nothing are reads (SELECTs)
        APPROX_COUNT_DISTINCT(case when array_size(objects_modified) = 0 then query_id end) as num_select_queries,
        min(query_start_time) as first_usage,
        max(query_start_time) as last_usage
    from snowflake.account_usage.access_history,
        table(flatten(base_objects_accessed)) obj
    where query_start_time >= '2025-02-01'
    group by all
)
select
    at.table_id, at.table_catalog, at.table_schema, at.table_name, at.table_owner, at.table_type, at.is_transient, at.clustering_key,
    at.row_count, at.bytes / 1e6 as megabytes, at.created, at.last_altered, at.last_ddl,
    at.table_catalog || '.' || at.table_schema || '.' || at.table_name as fully_qualified_name,
    u.num_queries, u.num_select_queries, u.first_usage, u.last_usage
from (
    select *
    from snowflake.account_usage.TABLES
    where deleted is null  -- tables that still exist
) at
left join usage u on at.table_id = u.table_id
order by at.bytes desc;
Snowflake provides a neat view, SNOWFLAKE.ACCOUNT_USAGE.TABLES, that has all sorts of metadata about each table, ranging from its size and row count to the most recent times its structure and contents were modified. What makes this even more powerful is joining it against SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY. This lets you see how often each table is actually being queried, and whether it’s the target of a modifying operation (e.g. UPDATE, INSERT) or only reads (e.g. SELECT).
You will likely discover tables that haven’t been touched in years and can safely be removed. You will also notice that some tables have a disconnect between how often they’re written versus read, another optimization candidate. Finally, you can take the additional step of looking at how the largest tables are used to see whether they actually need to be as big as they are. Two examples here are columns that are rarely queried and the lookback period for time series data: you may have a table that goes back multiple years and discover that no one has written a query looking back more than 6 months.
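To make the “safe to remove” list concrete, here’s a minimal sketch built on the same views, using arbitrary thresholds (no reads in 90 days, roughly 1 GB) that you’d tune for your own account:

-- Sketch: large tables with no recorded access since an arbitrary cutoff.
-- The 90-day and ~1 GB thresholds are placeholders, not recommendations.
with last_access as (
    select
        obj.value:objectId as table_id,
        max(query_start_time) as last_usage
    from snowflake.account_usage.access_history,
        table(flatten(base_objects_accessed)) obj
    group by all
)
select
    t.table_catalog || '.' || t.table_schema || '.' || t.table_name as fully_qualified_name,
    t.bytes / 1e6 as megabytes,
    a.last_usage
from snowflake.account_usage.TABLES t
left join last_access a on t.table_id = a.table_id
where t.deleted is null
  and t.bytes > 1e9  -- arbitrary "large" cutoff: ~1 GB
  and (a.last_usage is null or a.last_usage < dateadd(day, -90, current_timestamp()))
order by t.bytes desc;

Keep in mind that ACCESS_HISTORY only retains a limited window of history, so “no recorded access” means no access within that window, not necessarily never.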
Usage of columns for a single table
WITH parsed_access_history AS (
    SELECT
        QUERY_ID,
        query_start_time,
        OBJECT_NAME,
        COLUMN_NAME
    FROM (
        SELECT
            QUERY_ID,
            query_start_time,
            OBJECTS.VALUE:objectName::STRING AS OBJECT_NAME,
            COLUMNS.VALUE:columnName::STRING AS COLUMN_NAME
        FROM
            snowflake.account_usage.access_history,
            -- one row per accessed object (table or view)...
            TABLE(FLATTEN(INPUT => BASE_OBJECTS_ACCESSED)) OBJECTS,
            -- ...then one row per accessed column
            TABLE(FLATTEN(INPUT => OBJECTS.VALUE:columns)) COLUMNS
        WHERE query_start_time >= '2025-02-01'
    )
)
SELECT
    column_name,
    count(1) as cnt
FROM
    parsed_access_history
WHERE
    OBJECT_NAME = 'DB.SCHEMA.TABLE'
GROUP BY all;
Another use of the SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY view is to look at the usage of a table by column. The view has both a BASE_OBJECTS_ACCESSED column and a DIRECT_OBJECTS_ACCESSED column, which lets you see both the views (direct) and the underlying tables (base) being accessed. You can also join these against the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view to pull the query type and query text and dig in further.
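Here’s a minimal sketch of that join, reusing the same flattening as above and keeping the DB.SCHEMA.TABLE placeholder; it counts column usage broken down by query type (you could also pull query_text for a closer look):

-- Sketch: column usage by query type, via a join to QUERY_HISTORY.
WITH column_access AS (
    SELECT
        query_id,
        objects.value:objectName::string AS object_name,
        columns.value:columnName::string AS column_name
    FROM snowflake.account_usage.access_history,
        TABLE(FLATTEN(INPUT => base_objects_accessed)) objects,
        TABLE(FLATTEN(INPUT => objects.value:columns)) columns
    WHERE query_start_time >= '2025-02-01'
)
SELECT
    c.column_name,
    q.query_type,
    count(1) AS cnt
FROM column_access c
JOIN snowflake.account_usage.query_history q
    ON c.query_id = q.query_id
WHERE c.object_name = 'DB.SCHEMA.TABLE'
GROUP BY ALL
ORDER BY cnt DESC;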
This is useful for spotting columns with drastically different usage patterns, which can signal that it’s worth creating other versions of the table, especially if those columns are high cardinality.
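As an illustration, “another version of the table” could be as simple as a slimmed-down copy of the hot columns. A sketch with hypothetical table and column names (in practice you might prefer a materialized view or a scheduled task to keep it fresh):

-- Hypothetical example: a narrow copy containing only the frequently queried
-- columns, keeping the wide, high-cardinality ones out of the common path.
CREATE OR REPLACE TABLE db.schema.events_slim AS
SELECT event_id, event_ts, user_id, event_type
FROM db.schema.events;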
Inefficient warehouse idle settings
WITH warehouse_events AS (
    -- Resume/suspend events for a single warehouse over the analysis window
    SELECT *
    FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_EVENTS_HISTORY e
    WHERE event_name IN ('RESUME_WAREHOUSE', 'SUSPEND_WAREHOUSE')
        AND event_state = 'STARTED'
        AND e.warehouse_name = 'WH_NAME'
        AND e.timestamp >= '2025-02-01'
        AND e.timestamp < '2025-03-01'
),
minute_bounds AS (
    -- First and last event, truncated to the minute
    SELECT
        MIN(DATE_TRUNC('MINUTE', CAST(timestamp AS timestamp))) as min_ts,
        MAX(DATE_TRUNC('MINUTE', CAST(timestamp AS timestamp))) as max_ts
    FROM warehouse_events
),
minute_sequence AS (
    -- One row per minute between the first and last event
    SELECT
        DATEADD(minute, seq4(), (SELECT min_ts FROM minute_bounds))::timestamp as minute_start
    FROM TABLE(GENERATOR(ROWCOUNT => 50000))
    WHERE minute_start <= (SELECT max_ts FROM minute_bounds)
),
active_periods AS (
    -- Pair each event with the next one; a RESUME row's interval is an active period
    SELECT
        warehouse_id,
        warehouse_name,
        event_name,
        CAST(timestamp AS timestamp) as start_time,
        LEAD(CAST(timestamp AS timestamp)) OVER (PARTITION BY warehouse_id ORDER BY timestamp) as end_time
    FROM warehouse_events
),
minute_activity AS (
    -- Overlap each active (resumed) interval with each minute
    SELECT
        m.minute_start,
        DATEADD(minute, 1, m.minute_start) as minute_end,
        a.warehouse_id,
        a.warehouse_name,
        GREATEST(a.start_time, m.minute_start) as period_start,
        LEAST(COALESCE(a.end_time, CURRENT_TIMESTAMP()), DATEADD(minute, 1, m.minute_start)) as period_end
    FROM minute_sequence m
    LEFT JOIN active_periods a
        ON a.event_name = 'RESUME_WAREHOUSE'
        AND m.minute_start < COALESCE(a.end_time, CURRENT_TIMESTAMP())
        AND DATEADD(minute, 1, m.minute_start) > a.start_time
),
base_activity AS (
    -- Keep only minutes where the warehouse was partially active (0 < pct_active < 1)
    SELECT
        COALESCE(warehouse_id, (SELECT warehouse_id FROM warehouse_events LIMIT 1)) as warehouse_id,
        COALESCE(warehouse_name, 'WH_NAME') as warehouse_name,
        minute_start as timestamp,
        period_end > period_start as is_active,
        SUM(COALESCE(
            DATEDIFF('microsecond', period_start, period_end) / 60000000.0,
            0
        )) as pct_active
    FROM minute_activity
    GROUP BY ALL
    HAVING pct_active > 0 AND pct_active < 1
),
grouped_activity AS (
    -- Flag minutes that are not consecutive with the previous one
    SELECT
        *,
        DATEDIFF('minute', LAG(timestamp) OVER (ORDER BY timestamp), timestamp) != 1 AS new_group
    FROM base_activity
),
numbered_groups AS (
    -- Running count of flags gives each consecutive run an id
    SELECT
        *,
        SUM(CASE WHEN new_group THEN 1 ELSE 0 END) OVER (ORDER BY timestamp) as group_id
    FROM grouped_activity
),
counts_for_min AS (
    -- One row per consecutive run of partially active minutes
    SELECT
        warehouse_id,
        warehouse_name,
        MIN(timestamp) as start_timestamp,
        COUNT(*) as minutes_active,
        AVG(pct_active) as avg_pct_active
    FROM numbered_groups
    GROUP BY warehouse_id, warehouse_name, group_id
    ORDER BY start_timestamp
)
-- Distribution of run lengths, per day
SELECT
    DATE(start_timestamp) as ymd,
    minutes_active,
    COUNT(1) as cnt
FROM counts_for_min
GROUP BY ALL
ORDER BY ymd, minutes_active;
This one is long, but it does something simple: it finds runs of consecutive minutes where a warehouse has been cycling through a resume/suspend loop. You probably already know that you can reduce your Snowflake cost with a smarter idle timeout than the default, which at 5 minutes is almost always too high. What you may not know (it’s buried in the Snowflake docs) is that every resume is billed for at least 60 seconds, even if the warehouse only runs for 10. That means if your warehouse goes through multiple resume/suspend cycles within a single minute, you will be charged for multiple minutes of activity. The query above identifies these cases where a warehouse resumes, runs a query, and suspends, only to be resumed again within the same minute. This is a sign that the idle timeout may be too low. That doesn’t mean increasing the timeout is always better, since at other times a lower timeout may be ideal, but it usually warrants another look.
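Whichever direction the analysis points, adjusting the timeout itself is a one-line change. A minimal sketch, assuming a warehouse named WH_NAME and a hypothetical 60-second value (AUTO_SUSPEND is specified in seconds):

-- Hypothetical example: set the idle timeout on WH_NAME to 60 seconds.
-- Pick the actual value based on the resume/suspend analysis above.
ALTER WAREHOUSE WH_NAME SET AUTO_SUSPEND = 60;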
In an ideal world, you’d have the following flow for each warehouse session: start the warehouse, run whatever queries you need, then explicitly suspend the warehouse without relying on the auto-idle settings. This is only possible for well-defined, controlled workloads. Any sort of BI tool usage means you have limited ability to predict usage, so your warehouse will inevitably have periods where it fluctuates between on and off.
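For those controlled workloads, the explicit flow is straightforward. A minimal sketch, again assuming a warehouse named WH_NAME wrapped around a batch job:

-- Hypothetical batch-job wrapper: resume, do the work, then suspend explicitly
-- instead of waiting for the auto-suspend timer.
ALTER WAREHOUSE WH_NAME RESUME IF SUSPENDED;

-- ... run the batch queries here ...

ALTER WAREHOUSE WH_NAME SUSPEND;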
These are just a few example queries to explore what’s happening under the hood. Snowflake captures rich metadata, though its dashboards don’t always make it easily accessible. With some effort, you really can answer an unlimited number of questions — writing the query is the easy part. The real challenge is knowing what to ask.