Embrace the Differences Between Development and Production Environments for Data Engineering

Rather than trying to align development and production environments for data engineering, we should instead move to a world where SQL is the universal interface and compute is a commodity.

Dan Goldin
Apr 11, 2024

The common wisdom is that your development environment should match your production environment as closely as possible. This increases the confidence that your code will work in production. To make this happen, we have created a variety of tools and techniques such as containerization, infrastructure as code, CI/CD, and configuration as code and incorporated them into our development environments and workflows.

Unfortunately, this kind of environment parity doesn’t work as nicely when it comes to data engineering. Data pipelines are often complex, multi-step processes that require a variety of tools and vendors choreographed with perfect precision over multiple years. Vendors also encourage lock-in and make it difficult to work locally. For example, Snowflake still doesn’t provide a locally hosted version that can be used for testing. Additionally, data volumes significantly impact performance, and code that runs efficiently in development may not meet production SLAs because production data has different volumes and value distributions. Compared to non-data teams, data teams often require more iteration and time to properly deploy to production. There is a great talk titled "Data - The Land DevOps Forgot" that discusses this in detail.

However, I want to propose embracing the different needs of each environment. The goals of development and production are different. In development, we aim to optimize for iteration speed and maximize the likelihood of success in production. In production, we prioritize consistent quality, performance, and cost. Imagine being able to develop a data application locally using DuckDB and then deploy it to Snowflake. Or take the extreme and opposite approach by using the polished user experience of Snowflake to develop your code but then deploy it to DuckDB and Lambda with tools like BoilingData.

The trends are moving in this direction. We have moved from the separation of storage and compute to their unbundling. It’s not difficult to imagine a world where compute becomes a commodity and we choose the optimal compute engine given our needs for cost, performance, and data volumes. As long as we can guarantee the results, it shouldn't matter where the computation runs since the data will be stored in an open storage format anyway.
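To make the idea of choosing a compute engine a bit more concrete, here is a minimal sketch of what such a choice could look like. Everything in it is an assumption for illustration: the engine names, the size limits, and the cost and speed figures are made up and do not describe any real product or pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    """A hypothetical compute engine profile (all numbers are illustrative)."""
    name: str
    max_gb: float          # largest input the engine handles comfortably
    cost_per_gb: float     # made-up cost of scanning one GB
    seconds_per_gb: float  # made-up time to process one GB


# Hypothetical catalog of engines; not real benchmarks or prices.
ENGINES = [
    Engine("single_machine", max_gb=50, cost_per_gb=0.001, seconds_per_gb=2.0),
    Engine("warehouse", max_gb=50_000, cost_per_gb=0.01, seconds_per_gb=0.2),
]


def choose_engine(data_gb: float, max_seconds: float, engines=ENGINES) -> Engine:
    """Pick the cheapest engine that fits the data and meets the time budget."""
    candidates = [
        e for e in engines
        if data_gb <= e.max_gb and data_gb * e.seconds_per_gb <= max_seconds
    ]
    if not candidates:
        raise ValueError("no engine satisfies the size and time constraints")
    return min(candidates, key=lambda e: data_gb * e.cost_per_gb)


print(choose_engine(data_gb=10, max_seconds=60).name)
print(choose_engine(data_gb=10, max_seconds=5).name)
print(choose_engine(data_gb=1000, max_seconds=600).name)
```

The point of the sketch is only that the inputs are the ones named above (cost, performance, and data volume) and that the SQL itself does not appear in the decision at all.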

We're not there yet, and while SQL should be a universal interface, each engine still has its own dialect nuances that encourage lock-in. However, progress is being made. Tools such as sqlglot are making it easier to transpile from one dialect to another. I love the idea of going back to basic SQL, tweaking parameters around cost and performance tradeoffs, and then letting a magical system figure out where the computations should take place. Small jobs can run on a single machine, larger jobs can use modern data warehouses, and massive jobs can run on GPUs. The key idea here is that one should simply write SQL without worrying about where it will be executed.

