Extracts, PDTs, and Embedded DBs: How BI Tools Shape Data Warehouse Economics
Understanding these caching differences can help optimize query performance and control expenses, especially in consumption-based warehouses like Snowflake
A common trap we’ve fallen into is to separate the cost of the data warehouse and their associated business intelligence tools. It’s natural to think of them as separate, but the workings of a BI tool can have a major impact on your data warehouse operations and the associated bill.
This became clear to us when we spoke to someone who used Microstrategy (it’s not just Bitcoin) as their BI tool. It’s not the hippest tool, but what it does well is a concept called “extracts” – a way of caching the underlying data. Most enterprise-focused tools offer the same functionality:
Tableau and Power BI (traditional BI apps that offer a desktop client) store some data on the client, which avoids the need to hit a database on every request
Looker offers similar functionality via their PDTs, but the PDTs are stored within the database rather than on the client
Newer tools, such as Omni and Rill, improve on this functionality by embedding DuckDB in their webclient, or in a dedicated proxy layer
These sorts of decisions end up affecting the overall cost and performance of the BI system, especially consumption based warehouses like Snowflake.
One dimension to think about here is how intelligent the caching is, and where the cache is stored. The more intelligent the caching, the fewer requests your system will need to make back to the data warehouse – but there’s a limit based on the memory available. An aggressive caching strategy will take a superset of dimensions and metrics and store them client side in case they need to be pulled. This will increase the hit rate of subsequent requests, but requires more memory. A less aggressive caching strategy will reduce the memory footprint but reduce the hit rate. The ideal solution here finds the optimal balance based on predicted query behavior.
The other dimension is how the extract or cached data is stored. As mentioned, Looker will keep it in the underlying data warehouse while others may keep it in an intermediate cache while others may keep it on the client. Tools like Greybeam can intercept your queries and route them to an internal cache and compute layer if they deem it will provide a cost-effective performance boost.
You can imagine an optimal DW/BI system looking like this:
Your data warehouse and BI tools aren’t standalone components. The most effective implementations treat the data warehouse and BI tool as a unified system, where each component's strengths amplify the other's performance.