10,000 database tables isn’t cause for celebration
It’s a sign that your data team doesn’t have enough business context
Teams are storing petabytes of data, yet stakeholders don’t trust the reports, or even the data points underlying them. Unfortunately, this is a common outcome when data teams are handed an endless queue of isolated requests to complete without question.
Our best advice for making data teams strategic partners, rather than mere receivers of requirements, may sound counterintuitive: start manually. To understand why, let’s look at the biggest challenges we’ve come across.
Siloed data teams
We see this often in our work: data teams are too far removed from the product, without strong product management. One team will produce a data point, and it then falls to data engineering to transform it into a table or report that can inform business decisions. But the feedback loop between getting the data and making it actionable is frequently broken.
Ramp, recognizing this challenge, has successfully embedded data skills within its business-aligned teams. By allowing more contributors to the code base, the company has tightened the feedback loop between data insights and the business decisions that follow from them.
Requirements without context
Data teams are often on the receiving end of requests from other teams, forced to support a growing list of requirements without the context to understand the business case. As a result, they can’t make reasonable tradeoffs, and end up developing Frankenstein models that grow ever more costly to maintain.
We’ve seen cases where teams are proud that they have thousands of dbt models, but we think teams should be aiming for the opposite: doing more with less. The benefit is obvious – by simplifying your data, you’ll have less of it to manage. Queries will run faster, insights will be generated faster, and you’ll be able to move faster since you’re dealing with less overhead.
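To make “doing more with less” concrete, here’s a hedged sketch of the kind of consolidation we mean: several single-metric models collapsed into one wider model. All table, column, and model names below are invented for illustration, not taken from any real project.

```sql
-- Hypothetical consolidation: one wide daily_user_metrics model in place of
-- separate daily_signups, daily_activations, and daily_churn models.
-- stg_user_events and its columns are illustrative names, not a real schema.
select
    date_trunc('day', event_at) as activity_date,
    sum(case when event_type = 'signup'     then 1 else 0 end) as signups,
    sum(case when event_type = 'activation' then 1 else 0 end) as activations,
    sum(case when event_type = 'churn'      then 1 else 0 end) as churned_users
from {{ ref('stg_user_events') }}
group by 1
```

That’s one model to test, document, and schedule instead of three, and one scan of the source table instead of three.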
Our approach: Start manually
The best advice we can give is to validate both the need and the solution manually at the outset. Data storage is much cheaper than compute, so rely on the fact that you can always rederive the necessary data rather than automating the end-to-end flow. By starting manually, you validate that there is a recurring need; only then should you focus on productionalizing the process.
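In practice, the manual step can be as small as an ad-hoc query run directly against the warehouse and rerun by hand whenever the question comes up. The schema below is a made-up example, not a prescribed setup:

```sql
-- One-off query answering a stakeholder question directly.
-- No scheduling, no pipeline: rerun by hand while the need is still unproven.
-- raw.orders and its columns are hypothetical names.
select
    date_trunc('month', created_at) as order_month,
    sum(order_total)                as revenue
from raw.orders
where created_at >= date '2024-01-01'
group by 1
order by 1;
```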
As Kent Beck famously said, “Make it work, make it right, make it fast.” Here’s how to put that into action:
Make it work: Validate the need manually.
Make it right: Productionalize; design for maintainability and scalability, and see how it fits in with your other data flows (see the sketch after this list).
Make it fast: Optimize, reduce costs, and reduce latency after you have real validation that the output is being used.
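As a hedged illustration of the last two steps, here’s what the ad-hoc query from earlier might look like once its output has proven useful: promoted to a version-controlled dbt model (“make it right”), with incremental materialization so repeated runs only scan new data (“make it fast”). The model and column names are assumptions for the example, not a reference implementation.

```sql
-- models/revenue_monthly.sql (hypothetical)
-- "Make it right": the validated query lives in version control and reads
-- from a staging model instead of raw tables.
{{ config(
    materialized='incremental',
    unique_key='order_month'
) }}

select
    date_trunc('month', created_at) as order_month,
    sum(order_total)                as revenue
from {{ ref('stg_orders') }}
{% if is_incremental() %}
  -- "Make it fast": on incremental runs, rescan only the latest month;
  -- unique_key lets dbt overwrite that month's row rather than duplicate it.
  where created_at >= (select max(order_month) from {{ this }})
{% endif %}
group by 1
```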
There are obvious wins in optimizing the data warehouse, but the real win comes from getting crystal clear on the requirements and developing a deeper understanding of the tradeoffs.
Unlike other software engineering disciplines, data has much more inertia: it’s difficult to remove something once it’s built. If you haven’t gone through the exercise of validating the need, it’s often easier never to build the thing in the first place than to remove it later, especially if there’s a chance it balloons into 10,000 tables.