The evolution of a data stack
This post describes the seven stages a company's data stack goes through as it matures from a startup to a large enterprise.
I made a short post on LinkedIn about the evolution of a data stack and wanted to use this post to do a deeper dive.
As an early-stage entrepreneur, you need to be talking to customers, and I’ve had hundreds of conversations with prospects and data practitioners across multiple industries, companies, and roles. As expected, we might all be “in the data space,” but we each have unique challenges, approaches, and tools.
Despite this diversity, a common pattern emerged around the data stack a company was using. This observation led me to categorize these data stacks into distinct stages, each marked by a set of business needs and corresponding tools. Whether you're just starting out or looking to optimize your existing data infrastructure, understanding these stages can provide valuable insights into how best to evolve your data stack in tandem with your company's growth.
OLTP
Likely MySQL or PostgreSQL. You just started out, and you use your database to power your business. You have a small team that’s mostly engineers, all of whom know how to write SQL or use an ORM. Over time you’ve built up a series of files with names such as “helpful_queries.sql” or “reports.sql” that collect predefined queries you copy and paste from whenever you need some data. You don’t need a BI team; instead, engineers paste query results into a Google Sheets spreadsheet. You don’t have the scale or complexity where performance matters, and it’s not worth setting up a BI tool, so you just give read-only access along with some predefined queries.
OLTP → BI
You’ve successfully scaled and now have enough non-engineers who need access to the data. You want to build a data-driven culture and democratize data access, but you also want the engineering team to stay focused on building the product. To solve this, you introduce a BI tool that connects directly to your transactional database. Over time, you build up official views that have standard definitions for your metrics. At some point, you realize that it’s not great having reporting queries run against your production database, so you create a read replica and point your BI tool at that instead.
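To make "official views" concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the transactional database, and the `orders` table, the `revenue` view, and the rule that only paid orders count as revenue are all hypothetical choices for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the production database (an assumption).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL)")
db.executemany(
    "INSERT INTO orders (status, amount) VALUES (?, ?)",
    [("paid", 100.0), ("refunded", 40.0), ("paid", 60.0)],
)

# The "official" definition of revenue lives in one view, so every
# dashboard that reads from it agrees on what counts as revenue.
db.execute(
    """
    CREATE VIEW revenue AS
    SELECT SUM(amount) AS total_revenue, COUNT(*) AS paid_orders
    FROM orders
    WHERE status = 'paid'
    """
)

total_revenue, paid_orders = db.execute(
    "SELECT total_revenue, paid_orders FROM revenue"
).fetchone()
print(total_revenue, paid_orders)
```

The point of the view is that the definition changes in one place: if refunds should later be netted out, you edit the view rather than every report.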
OLTP → OLAP → BI
Now you have a lot of data, and people are complaining about the time it takes reports to run and dashboards to load. You’ve invested in materialized views and put together a Rube Goldberg machine of jobs to maintain these views, but it’s still not enough. Now it’s time to invest in an OLAP database, such as Snowflake. You have to do a bit of manipulation to get the data into Snowflake, and there are a variety of tools that will do this in exchange for dollars, but you decide you can do it yourself via some simple cron jobs. You now have a high-performance data warehouse that’s designed for analytics and automatically scales up and down when you need it. The data is still the same as what exists in your original database, but no one is complaining about performance issues.
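The "simple cron job" in this stage often amounts to an incremental copy: remember the last row you loaded and fetch only what is newer. Below is a rough sketch of that idea, not a real Snowflake loader. Both databases are in-memory SQLite stand-ins, and the `orders` table and the `copy_new_rows` helper are hypothetical.

```python
import sqlite3

def copy_new_rows(source, warehouse, last_id):
    """Copy rows with id > last_id into the warehouse; return the new high-water mark."""
    rows = source.execute(
        "SELECT id, customer, amount FROM orders WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    warehouse.executemany(
        "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)", rows
    )
    warehouse.commit()
    # If nothing new arrived, keep the previous watermark.
    return rows[-1][0] if rows else last_id

# Two in-memory databases stand in for the OLTP source and the OLAP target.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
for db in (source, warehouse):
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )

source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 10.0), (2, "globex", 25.5)],
)
source.commit()

watermark = copy_new_rows(source, warehouse, 0)          # first run copies ids 1 and 2
source.execute("INSERT INTO orders VALUES (3, 'initech', 7.0)")
source.commit()
watermark = copy_new_rows(source, warehouse, watermark)  # second run copies only id 3

warehouse_count = warehouse.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

An id-based watermark only picks up inserts. Updated or deleted rows need something more, such as an `updated_at` column or change data capture, which is part of why the paid tools exist.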
OLTP → OLAP → ETL → BI
Now you’re in the big leagues. You discover that to take full advantage of your data warehouse, you need to pull in data from your various partners so your warehouse becomes a “single source of truth.” You invest in an ETL tool, such as Fivetran or Portable, to help you pull these disparate datasets into your data warehouse. The data is simple enough that you create views of common joins and schedule various rollups and materializations via cron jobs.
OLTP → OLAP → ETL → Modeling → BI
At this point, you realize that the data you copied over from your application database isn’t really designed for analytics-style querying: it needs to be transformed to take full advantage of your data warehouse. You don’t want to do this manually, so you invest in a proper analytics engineering function that brings in a data modeling tool, such as dbt or SQLMesh, as well as an orchestration layer, such as Airflow. You don’t like the fact that you’ve introduced multiple new technologies, vendors, and roles into your stack, but you’re happy with the results.
OLTP → OLAP → ETL → Modeling → Reverse ETL + BI
You quickly discover that your data warehouse isn’t just for reporting. Due to your efforts, you have meticulously cleaned data coming in from various sources and you’re able to generate insights for multiple business functions. You can take your customer data and feed it back into your CRM tools. You can take your advertising performance data and feed it back into your advertising tools. You can also share this clean and enhanced data with your vendors and partners. There’s no limit to where your data can go. At this point you also discover that different teams have started using their own BI tools.
OLTP → OLAP → ETL → Modeling → BI + Governance + Observability
You start having nightmares about the modern data stack. You worry that, given the complexity of the flows, one small mistake in an input will cascade into incorrect reports and faulty decision making. You can throw more people at the problem, but what will they do? Instead, you remember that you went to a conference and saw a pretty cool presentation by an observability vendor that continues to send you emails you’ve been ignoring. Maybe now’s the time to reach out and see if it makes sense to give them a shot; worst case, you get a free dinner out of it. While you’re looking at vendors, you also decide to introduce a governance and cataloging tool to ensure teams understand the data they’re using. You’re still not sleeping soundly every night, but these tools, while not perfect, are useful and do reduce the risk. You can rest assured that you’ve taken reasonable precautions in case there is a problem.
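At its simplest, what an observability tool automates is a set of assertions that run against your tables on a schedule. The sketch below is not any vendor's API. It assumes a hypothetical `customers` table in an in-memory SQLite database, and the three checks (non-empty table, no missing emails, no duplicate emails) are illustrative.

```python
import sqlite3

def run_checks(db):
    """Return a list of human-readable failures for the customers table."""
    failures = []

    total = db.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    if total == 0:
        failures.append("customers is empty")

    missing = db.execute(
        "SELECT COUNT(*) FROM customers WHERE email IS NULL"
    ).fetchone()[0]
    if missing:
        failures.append(f"{missing} customers have no email")

    # Count distinct emails that appear on more than one row.
    dupes = db.execute(
        """
        SELECT COUNT(*) FROM (
            SELECT email FROM customers
            WHERE email IS NOT NULL
            GROUP BY email
            HAVING COUNT(*) > 1
        )
        """
    ).fetchone()[0]
    if dupes:
        failures.append(f"{dupes} emails appear more than once")

    return failures

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
db.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "a@example.com")],
)

failures = run_checks(db)
for failure in failures:
    print(failure)
```

Checks like these catch a bad input before it cascades into a report. Commercial tools add scheduling, alerting, and lineage on top of the same basic idea.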
How did we end up here and where do we go?
As your company grew, so did the demands on your data. Each decision made sense individually, but they collectively created a complex system that cannot be easily rebuilt. Starting from scratch is tempting, but there are countless use cases that you are responsible for (and sometimes unaware of). So you make changes around the edges to reduce costs and improve performance, without delving too deep due to critical processes and stakeholders involved.
This is a perfect representation of Gall’s Law, which states that a complex system that works has evolved from a simple system that worked. A complex system designed from scratch never works and cannot be fixed. Starting over with a working simple system is necessary.
But it's not all doom and gloom. While things may have been simpler in the "good old days," you were constantly paged due to missing data. Now, with a mature data team and infrastructure, you understand the contours of the business, even if some details are sometimes lost.
Your focus should be on simplifying as much as possible. To achieve this, it is important to instrument your data to gain insights into its usage. Once you understand how customers, both internal and external, interact with your data, you can take necessary actions. For example, get rid of unused dashboards and reports (safely deprecating them), eliminate orphaned tables and columns, consolidate similar tables created by different teams, and define officially supported data sources to promote their usage.
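As one way to start instrumenting usage, you can compare the tables you know about against the tables that actually show up in recent queries. This is a deliberately crude sketch: the table names and the query log are made up, `find_unused_tables` is a hypothetical helper, and the regular expression only handles simple `FROM` and `JOIN` clauses. Most warehouses expose some form of query or access history that would be a more reliable input.

```python
import re

# Matches the identifier that follows FROM or JOIN; a rough heuristic,
# not a SQL parser.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def find_unused_tables(known_tables, query_log):
    """Return known tables that never appear in the logged queries."""
    used = set()
    for query in query_log:
        for name in TABLE_REF.findall(query):
            used.add(name.lower())
    return sorted(t for t in known_tables if t.lower() not in used)

known_tables = ["orders", "customers", "orders_v2_backup", "tmp_marketing_export"]
query_log = [
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "select count(*) from Orders",
]

unused = find_unused_tables(known_tables, query_log)
print(unused)
```

Treat the output as a list of candidates for deprecation rather than a delete list: a table that is read once a quarter will look unused in a one-month log.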
The process of managing data is never-ending, as requirements continually evolve and the industry churns out new tools and innovations. The booming field of AI is powered by – you guessed it – data. Amidst this constant change, it is crucial to think about what will remain stable. Jeff Bezos said it best:
I very frequently get the question: 'What's going to change in the next 10 years?' And that is a very interesting question; it's a very common one. I almost never get the question: 'What's not going to change in the next 10 years?' And I submit to you that that second question is actually the more important of the two -- because you can build a business strategy around the things that are stable in time.
So, what will not change in data? What will remain true in 10 years? For me, it’s the perpetual need for simplicity and efficiency.