Your Data Warehouse is a Hammer — But Not Everything is a Nail
At scale, specialized workloads demand specialized tools beyond a single data warehouse
Imagine a company that’s collecting tons of data points per day. They start with a traditional OLTP database, but inevitably realize they need OLAP for their analytics workloads – so they start using a data warehouse.
From there, they start building more and more workloads on top of the data warehouse:
The product and application teams are running queries to feed reporting data into customer-facing dashboards.
The data science team is querying raw events to feed their models.
The analytics team is using dbt for additional transformations and Looker to build dashboards for executives and internal teams.
The data team is running complex data transformations in the warehouse to support the use cases above.
The finance team is extracting data into Power BI via some arcane Looker report.
All teams and stakeholders are using the same data tool – that’s the dream, right? Well, in our experience, this setup will only take you so far. At a certain point, you’ll be better served by picking the right tool(s) for the job.
Snowflake, or basically any other data warehouse, acts as a pretty good initial data transformation layer – these platforms make it dead easy to get up and running at scale. But the tradeoff is that it will be expensive unless you tune it carefully, and it will grow complex, since you’ll likely have multiple stakeholders trying to achieve different ends. It’s akin to using the same car to cart the family to weekend sports, go offroading in the Sierra Nevada, and then race Formula One.
Additionally, it won’t give you the same low-level control that a more specialized tool would. Spark, for example, tends to be better for very large data tasks that you genuinely need to optimize at a low level, or for working with data scientists who want to bring in custom logic and custom libraries. ClickHouse, on the other hand, is built for very fast querying: it excels at quick aggregations over large, often time-series, datasets, but it isn’t the place for heavy joins or sprawling, complex SQL.
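To make that contrast concrete, here’s a minimal PySpark sketch of the kind of job a warehouse handles awkwardly: a large batch transformation that pulls in custom Python logic. The bucket paths, table, and column names are hypothetical placeholders, not anything from a real pipeline.

```python
# Minimal PySpark sketch: a large batch transform with custom Python logic.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("event-enrichment").getOrCreate()

# Read raw events from object storage rather than from the warehouse.
events = spark.read.parquet("s3://example-bucket/raw/events/")

# Custom scoring logic that would be painful to express in warehouse SQL;
# in practice this is where a team's own libraries or models plug in.
def engagement_score(clicks, seconds_on_page):
    if clicks is None or seconds_on_page is None:
        return 0.0
    return float(clicks) * 0.7 + float(seconds_on_page) * 0.3

score_udf = F.udf(engagement_score, DoubleType())

enriched = (
    events
    .withColumn("engagement_score", score_udf(F.col("clicks"), F.col("seconds_on_page")))
    .repartition("event_date")  # explicit, low-level control over partitioning
)

# Write back to the lake, partitioned for downstream consumers.
enriched.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/enriched/events/"
)
```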
For the company in our example, the solution might be to keep Snowflake for the last mile of reports that don’t have sub-second performance requirements, introduce Spark to handle the expensive upfront data transformations and data science modeling tasks, and, if you’re feeling ambitious, bring in ClickHouse and a tool such as Rill for interactive internal analytics.
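And here’s a rough sketch of the interactive-analytics side, using the clickhouse-connect Python driver against a hypothetical events table; the host, table, and columns are placeholders, not a reference deployment.

```python
# Sketch: fast interactive aggregation against ClickHouse.
# Host, table name, and columns are hypothetical placeholders.
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.internal.example.com")

# The kind of narrow, time-bounded aggregation ClickHouse answers in
# milliseconds -- the sort of query a general-purpose warehouse struggles
# to serve at dashboard refresh rates.
result = client.query(
    """
    SELECT toStartOfHour(event_time) AS hour,
           count() AS events,
           uniq(user_id) AS users
    FROM events
    WHERE event_time >= now() - INTERVAL 1 DAY
    GROUP BY hour
    ORDER BY hour
    """
)

for hour, events, users in result.result_rows:
    print(hour, events, users)
```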
There’s no one-size-fits-all solution in big data, as much as we may want one.
That’s why we’re always reminding clients to be intentional about what they’re using the data for, so they can implement the stack that best fits their needs. We’re also excited about tools like Apache Iceberg and the trend toward “bringing your own compute” to the data at hand. Iceberg keeps your data in a neutral, open table format rather than locked into a single warehouse; you can then pick whichever warehouse or compute engine you need for each processing job.
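As a rough sketch of what “bring your own compute” looks like in practice, the same Iceberg table can be scanned by different engines without copying the data. The catalog names ("default", "lake") and the table identifier (analytics.events) below are hypothetical, and both snippets assume an Iceberg catalog is already configured.

```python
# Sketch: one Iceberg table, two compute engines. Catalog and table names
# are hypothetical and assume an existing Iceberg catalog configuration.

# Lightweight access from Python via PyIceberg, e.g. in a data science notebook.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
events = catalog.load_table("analytics.events")
recent = events.scan(row_filter="event_date >= '2024-01-01'").to_pandas()

# Heavier batch processing of the same table from Spark, assuming the
# Spark session is configured with an Iceberg catalog named "lake".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-batch").getOrCreate()
daily_counts = (
    spark.table("lake.analytics.events")
    .groupBy("event_date")
    .count()
)
```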
If you’re beginning to feel pain from a single-warehouse data stack, start with an audit of workloads and analyze each function’s cost-performance dynamics. What are the actual use cases and workloads that your warehouse is powering? For example, maybe data transformation is responsible for 40% of your data budget, and BI accounts for another 20%. Make sure to weigh the costs, in terms of both infrastructure and people, against the relative value of the output. Once you have a good sense of that breakdown, you’ll be better positioned to look into more specialized solutions that can efficiently handle the specific workloads. And that means getting to work with the precision of a scalpel, rather than blindly hammering away.
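One lightweight way to start that audit is to group an export of your warehouse’s query history by some kind of workload tag (team, dbt job, or warehouse name) and sum up the spend. The file name and columns below (workload, query_id, credits_used, total_elapsed_ms) are hypothetical; substitute whatever your warehouse actually exposes.

```python
# Sketch of a workload audit over an exported query-history log.
# The CSV name and its columns are hypothetical placeholders.
import pandas as pd

history = pd.read_csv("query_history_export.csv")

audit = (
    history
    .groupby("workload")
    .agg(
        credits=("credits_used", "sum"),
        queries=("query_id", "count"),
        avg_runtime_s=("total_elapsed_ms", lambda ms: ms.mean() / 1000),
    )
    .sort_values("credits", ascending=False)
)

# Share of total spend per workload, e.g. "transformation: 40%, BI: 20%".
audit["pct_of_spend"] = (audit["credits"] / audit["credits"].sum() * 100).round(1)

print(audit)
```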
Unsure of where to start? We’re here to help – just hit reply to get in touch.