Embrace the Differences Between Development and Production Environments for Data Engineering
Rather than trying to align development and production environments for data engineering, we should move toward a world where SQL is the universal interface and compute is a commodity.
The common wisdom is that your development environment should match your production environment as closely as possible, which increases the confidence that your code will work in production. To make this happen, we have created a variety of tools and techniques, such as containerization, infrastructure as code, CI/CD, and configuration as code, and incorporated them into our development environments and workflows.
Unfortunately, this kind of environment parity doesn't translate as nicely to data engineering. Data pipelines are often complex, multi-step processes that require a variety of tools and vendors choreographed with precision over multiple years. Vendors also encourage lock-in and make it difficult to work locally; Snowflake, for example, still doesn't provide a locally hosted version that can be used for testing. Additionally, data volumes significantly impact performance: code that runs efficiently in development may fail to meet production SLAs because of differences in data volume, value distributions, and query plans. Compared to non-data teams, data teams often require more iteration and more time to deploy to production properly. There is a great talk titled "Data - The Land DevOps Forgot" that discusses this in detail.
However, I want to propose embracing the different needs of each environment. The goals of development and production are different. In development, we optimize for iteration speed and for maximizing the likelihood of success in production. In production, we prioritize consistent quality, performance, and cost. Imagine being able to develop a data application locally using DuckDB and then deploy it to Snowflake. Or take the opposite extreme: use the polished user experience of Snowflake to develop your code, then deploy it to DuckDB and Lambda with tools like BoilingData.
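As a minimal sketch of what that local loop could look like, the snippet below runs a pipeline query against a DuckDB file seeded with a small sample extract; the table name, file name, and query are hypothetical, and the same SQL would later be pointed at Snowflake for production.

```python
import duckdb

# Engine-agnostic SQL for one pipeline step (hypothetical table and columns).
QUERY = """
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
"""

# Develop against a local DuckDB file seeded with a small sample extract,
# giving a fast, free iteration loop instead of round-tripping to the warehouse.
con = duckdb.connect("dev_sample.duckdb")
print(con.execute(QUERY).fetchall())
```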
The trends are moving in this direction. We have moved from the separation of storage and compute to their full unbundling. It's not difficult to imagine a world where compute becomes a commodity and we choose the optimal compute engine for a given job based on our needs for cost, performance, and data volume. As long as we can guarantee the results, it shouldn't matter where the query runs, since the data will be stored in an open storage format anyway.
We're not there yet, and while SQL should be a universal interface, each engine still has its own dialect nuances that encourage lock-in. However, progress is being made. Tools such as sqlglot are making it easier to transpile from one dialect to another. I love the idea of going back to basic SQL, tweaking parameters around cost and performance tradeoffs, and then letting a magical system figure out where the computation should take place. Small jobs can run on a single machine, larger jobs can use modern data warehouses, and massive jobs can run on GPUs. The key idea here is that one should simply write SQL without worrying about where it will be executed.
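To make that transpilation step concrete, here is a small sketch using sqlglot's transpile function. The DuckDB query is just an illustrative one-liner, and the exact Snowflake output will depend on the sqlglot version.

```python
import sqlglot

# SQL written against DuckDB during local development.
duckdb_sql = "SELECT epoch_ms(1618088028295) AS event_ts"

# Rewrite it in Snowflake's dialect before deploying to production.
snowflake_sql = sqlglot.transpile(duckdb_sql, read="duckdb", write="snowflake")[0]
print(snowflake_sql)
```

A step like this could sit in the deployment pipeline, so the same source SQL targets whichever engine a job happens to be routed to.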