<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Twing Data]]></title><description><![CDATA[Newsletter from Twing Data which will contain product and feature announcements as well as thoughts on the data world.]]></description><link>https://blog.twingdata.com</link><image><url>https://substackcdn.com/image/fetch/$s_!Ir9z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb34bb89b-4528-44c1-9466-d2cb29766379_490x490.png</url><title>Twing Data</title><link>https://blog.twingdata.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 19:26:05 GMT</lastBuildDate><atom:link href="https://blog.twingdata.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Twing Data, Inc]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[twingdata@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[twingdata@substack.com]]></itunes:email><itunes:name><![CDATA[Dan Goldin]]></itunes:name></itunes:owner><itunes:author><![CDATA[Dan Goldin]]></itunes:author><googleplay:owner><![CDATA[twingdata@substack.com]]></googleplay:owner><googleplay:email><![CDATA[twingdata@substack.com]]></googleplay:email><googleplay:author><![CDATA[Dan Goldin]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[AWS Cost Monitoring with Rill]]></title><description><![CDATA[A lightweight dashboard for AWS cost monitoring using Rill for AWS CUR data]]></description><link>https://blog.twingdata.com/p/aws-cost-monitoring-with-rill</link><guid 
isPermaLink="false">https://blog.twingdata.com/p/aws-cost-monitoring-with-rill</guid><dc:creator><![CDATA[Dan Goldin]]></dc:creator><pubDate>Mon, 02 Feb 2026 17:31:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2Lt4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If your finance team asked you tomorrow if there&#8217;s more you could do to cut AWS costs, what would you say? Be honest.</p><p>Everyone knows they should be doing a better job monitoring AWS costs. Almost nobody actually does. It&#8217;s not for lack of interest: even basic tools like AWS Cost Explorer make it surprisingly hard to understand what&#8217;s going on, especially once usage grows beyond a small setup.</p><p>At Twing Data, we set out to build something dead simple: point a dashboard at S3, automatically ingest AWS Cost and Usage Report (CUR) files, and instantly get a useful set of dashboards. We wanted to avoid complex modeling, heavy configuration, and layers of ingestion and transformation.
The goal was to keep it as simple as possible so you can quickly see where your money is going, how it&#8217;s trending, and where to dig deeper.</p><p>We&#8217;ve gotten to know the <a href="https://www.rilldata.com/">Rill</a> team from working with them on some projects and thought they&#8217;d be a great fit for this.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Lt4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Lt4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png 424w, https://substackcdn.com/image/fetch/$s_!2Lt4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png 848w, https://substackcdn.com/image/fetch/$s_!2Lt4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png 1272w, https://substackcdn.com/image/fetch/$s_!2Lt4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Lt4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png" width="1456" height="826" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:826,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2Lt4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png 424w, https://substackcdn.com/image/fetch/$s_!2Lt4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png 848w, https://substackcdn.com/image/fetch/$s_!2Lt4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png 1272w, https://substackcdn.com/image/fetch/$s_!2Lt4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d5b5d-df2c-4c6c-ab2e-2cdabe8779fc_1600x908.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>What you get out of the box</h2><p>With this setup, you get a working cost dashboard almost immediately:</p><ul><li><p>Automatic ingestion and partitioning of AWS Cost and Usage Report (CUR) data from S3</p></li><li><p>Default dashboards showing cost trends and breakdowns</p></li><li><p>Fast slice-and-dice without writing SQL</p></li><li><p>An AI chat layer for ad-hoc questions and analysis</p></li></ul><p>The goal is to give a V1 of a robust cost monitoring system that gives you a clear, usable snapshot of costs with minimal effort and then lets you go deeper if and when you need to.</p><h2>Why Rill works well here</h2><p>A few core Rill features do most of the heavy lifting:</p><ol><li><p><strong>Automatic ingestion from S3: </strong> Point Rill at your CUR bucket and it handles loading new files as they arrive.</p></li><li><p><strong>Partition-aware updates: </strong>New cost files are ingested incrementally on a schedule, so the dashboard stays current without
reprocessing everything.</p></li><li><p><strong>ClickHouse under the hood: </strong>This scales far beyond &#8220;toy demo&#8221; size. You can throw very large CUR datasets at it without worrying about performance. Rill also provides a nice way to test changes on &#8220;dev&#8221; before pushing them to production.</p></li></ol><h2>Getting started</h2><p>Note: If you do not already export your Cost and Usage Reports to a bucket, follow this quick setup.</p><ol><li><p>In the AWS console, search for `Billing and Cost Management` &gt; `Data Exports`</p></li><li><p>Select Create</p><ol><li><p>Legacy CUR export</p></li><li><p>Include resource IDs</p></li><li><p>Refresh automatically at your expected cadence (Hourly, Daily, or Monthly)</p></li></ol></li><li><p>Configure the bucket to share with Rill</p></li></ol><p>The repo lives here: <a href="https://github.com/Twing-Data/aws-cost-dashboard">https://github.com/Twing-Data/aws-cost-dashboard</a></p><p>The setup is deliberately minimal:</p><ol><li><p>Clone the repo</p></li><li><p>Configure your .env file with AWS and S3 settings</p></li><li><p>Run rill start</p></li></ol><p>Or in code:</p><p>Install Rill if you don&#8217;t have it:</p><pre><code>curl https://rill.sh | sh
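# Clone the dashboard repo and move into it
git clone https://github.com/Twing-Data/aws-cost-dashboard.git
cd aws-cost-dashboard
# Fill in .env with your AWS and S3 settings (step 2 above) before starting Rill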
rill start</code></pre><p>That&#8217;s it. Once Rill starts, it ingests the data and serves the dashboard locally.</p><h2>Defaults first, simple customization</h2><p>The dashboards are built on top of standard AWS CUR fields so they&#8217;re useful right away. In practice, most teams also tag their resources to get more visibility and measurement across applications, teams, environments, and projects, and will want those tags in the dashboard.</p><p>Rill makes this easy to extend. You modify the YAML files in the sources, metrics, and dashboards folders:</p><ul><li><p>Define new fields in the source</p></li><li><p>Reference them in metrics</p></li><li><p>Use them in dashboards</p></li></ul><p>Once added, those fields are immediately available everywhere. Rill avoids the overhead of complex modeling or orchestration and quickly gives you a useful dashboard. You can read the docs to see it in action and explore the options Rill offers: <a href="https://docs.rilldata.com/build/dashboards">https://docs.rilldata.com/build/dashboards</a></p><h2>Augmenting with AI</h2><p>Everyone is incorporating AI into their products and Rill is no exception. Rill includes an AI chat layer that lets you ask questions directly against the dataset.
The nice thing is that, because it&#8217;s tightly integrated with the rest of the platform, you get dashboards that link back to the original data sources so you can verify the numbers. That is a good way to handle hallucination risk and to keep an audit trail for AI-generated answers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lc1o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lc1o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png 424w, https://substackcdn.com/image/fetch/$s_!Lc1o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png 848w, https://substackcdn.com/image/fetch/$s_!Lc1o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png 1272w, https://substackcdn.com/image/fetch/$s_!Lc1o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lc1o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png" width="1456" height="945"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:945,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lc1o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png 424w, https://substackcdn.com/image/fetch/$s_!Lc1o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png 848w, https://substackcdn.com/image/fetch/$s_!Lc1o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png 1272w, https://substackcdn.com/image/fetch/$s_!Lc1o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f0e0b7f-cea5-457e-9e69-9a111e15159e_1600x1039.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You can also do things like select a portion of the chart that has noisy or interesting data and ask Rill to do a root cause analysis.</p><p>This project was about minimizing setup cost while still delivering something useful. You can be up and running quickly, get useful visibility into AWS costs, and still have a path to deeper analysis and customization as needs grow.</p><p>If AWS cost monitoring has felt heavier than it should be, this is a much lighter starting point.
Let us know if you end up giving it a try.</p>]]></content:encoded></item><item><title><![CDATA[Apache Iceberg in Modern Data Architectures: A Comprehensive Report ]]></title><description><![CDATA[Apache Iceberg has rapidly become one of the most talked-about technologies in data architecture over the past few years.]]></description><link>https://blog.twingdata.com/p/apache-iceberg-in-modern-data-architectures</link><guid isPermaLink="false">https://blog.twingdata.com/p/apache-iceberg-in-modern-data-architectures</guid><dc:creator><![CDATA[Adam Kabak]]></dc:creator><pubDate>Wed, 28 May 2025 18:11:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4opu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4opu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4opu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!4opu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!4opu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!4opu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4opu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2998280,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/164664013?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4opu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!4opu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!4opu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!4opu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07b545de-4446-40b5-8b0f-ea3323fc61df_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>Apache Iceberg has rapidly become one of the most talked-about technologies in data architecture over the past few years. As organizations shift towards <strong>data lakehouse</strong> designs that combine the scalability of data lakes with the reliability of data warehouses, open table formats like Iceberg play a pivotal role. <strong>Apache Iceberg</strong> is<a href="https://www.metaplane.dev/blog/apache-iceberg#:~:text=At%20its%20core%2C%20Apache%20Iceberg,and%20your%20query%20engines"> an open-source table format that brings SQL table-like functionality (transactions, schema management, etc.) to large analytic datasets on inexpensive storage (e.g. cloud object stores)</a>. Its rise in popularity is driven by the promise of <strong>high-performance analytics without vendor lock-in</strong>, allowing multiple processing engines to work reliably on the same data. This report delves into what Iceberg is and why it&#8217;s in high demand, examines the latest integration with DuckDB (as of April 2025), explores ideal technology combinations for lakehouse, streaming, and analytics use cases (in cloud and hybrid environments), and surveys other key technologies in the Iceberg ecosystem (Spark, Trino, Flink, Snowflake, BigQuery, etc.). We also highlight tooling and adoption by major organizations (especially for AI/ML workflows), and provide summary-level insights into Iceberg&#8217;s architecture and performance that explain how it all works together.</p>
<h2><strong>What is Apache Iceberg and Why Is It in High Demand?</strong></h2><p><strong>Apache Iceberg</strong> is a high-performance <strong>open table format</strong> for huge analytic tables, <a href="https://www.starburst.io/blog/introduction-to-apache-iceberg-in-trino/#:~:text=What%20is%20Apache%20Iceberg%3F">originally developed at Netflix and now an Apache Software Foundation project</a>. In essence, Iceberg is a &#8220;smart&#8221; layer between raw storage (files in formats like Parquet or ORC) and compute engines. <a href="https://www.dremio.com/resources/guides/apache-iceberg/#:~:text=evolve%20and%20change%20over%20time">It tracks table data and metadata to ensure that multiple analytics applications can all see a consistent view of data and perform </a><strong><a href="https://www.dremio.com/resources/guides/apache-iceberg/#:~:text=evolve%20and%20change%20over%20time">ACID transactions</a></strong><a href="https://www.dremio.com/resources/guides/apache-iceberg/#:~:text=evolve%20and%20change%20over%20time"> on a data lake</a>. This effectively brings the reliability and simplicity of traditional SQL tables to big data stored in distributed files.</p><p><strong>Core Features of Apache Iceberg:</strong></p><ul><li><p><strong>ACID Transactions &amp; Snapshot Isolation:</strong> Iceberg allows <strong>atomic adds, deletes, and updates</strong> to table data, even with concurrent writers, ensuring readers never see partial updates.
Each commit creates a new <strong>snapshot</strong> of the table, and each reader is pinned to a specific snapshot, so queries are always consistent and <strong>never see &#8220;in-progress&#8221; data</strong>. This means multiple engines or jobs can write to the same table safely, and readers won&#8217;t get corrupted or &#8220;zombie&#8221; data.<br></p></li><li><p><strong>Schema Evolution:</strong> Iceberg natively supports evolving the table schema over time (adding, dropping, or renaming columns) without cumbersome rebuilds. It tracks schema changes as part of the metadata and identifies columns by ID rather than by name, so a rename does not disturb existing data files. This reliable schema evolution avoids the headaches of older Hive-style tables where schema changes could be error-prone.<br></p></li><li><p><strong>Time Travel &amp; Rollbacks:</strong> Snapshots of the table are retained until they are explicitly expired, enabling <strong>time travel queries</strong> &#8211; you can query data <strong>as of</strong> a particular snapshot or time in the past. This is invaluable for auditing and reproducing experiments. It&#8217;s also possible to <strong>roll back</strong> a table to a previous state instantly if a bad data load occurs.<br></p></li><li><p><strong>Hidden Partitioning &amp; Evolution:</strong> Iceberg manages partition information internally rather than relying on physical folder structures. This means users get automatic partition pruning for queries, and partition schemes can be changed as data and query patterns evolve. There&#8217;s no need for manual &#8220;repair&#8221; or handling of Hive partitions &#8211; <strong><a href="https://www.upsolver.com/blog/iceberg-architecture-examples#:~:text=Iceberg%20is%20designed%20for%20slow,have%20historically%20plagued%20Apache%20Hive">no more expensive directory listings or strict naming conventions</a></strong>.
This dramatically reduces the load on metastores and speeds up query planning for large tables.<br></p></li><li><p><strong>Metadata for Performance:</strong> Iceberg maintains detailed metadata (in <strong>manifest files</strong> and <strong>manifest lists</strong>) about the locations and statistics of data files, such as per-column value ranges. Query engines leverage this for <strong>advanced planning and filtering</strong>, skipping over unrelated files and only reading the necessary subsets of data. For example, Iceberg can avoid scanning thousands of file paths on cloud storage and instead quickly identify which files contain relevant data, eliminating the costly &#8220;full directory scan&#8221; that plagued older data lakes.<br></p></li><li><p><strong>Scalability to Petabytes:</strong> Iceberg is designed for <strong>petabyte-scale datasets</strong> that are &#8220;slow-moving&#8221; (very large, but appended to or modified only occasionally). By tracking individual data files and optimizing how new data is added, Iceberg can handle tables with millions of files and enormous throughput. It also supports features like <strong>compaction</strong> (merging small files) and <strong>split planning</strong> to maintain performance at scale.</p></li></ul><p>These features make Iceberg extremely attractive to data engineering teams. It directly addresses the limitations of earlier &#8220;open table&#8221; approaches (e.g. Hive tables or home-grown metadata), delivering <strong>database-like reliability on a data lake</strong>. As a result, demand for Iceberg has surged. Organizations want the flexibility of keeping data in open file formats on cheap storage, <strong>but with the manageability and trust of a warehouse</strong>.
Iceberg provides this balance in a <strong>fully open standard</strong>, avoiding lock-in to any single vendor or engine.</p><p><strong>Why the High Demand?</strong> A few key reasons stand out:</p><ul><li><p><strong>Multi-Engine Interoperability:</strong> Iceberg was built to be engine-agnostic &#8211; many different processing engines can read and write Iceberg tables and <strong>all see the same consistent data</strong>. This &#8220;define once, use anywhere&#8221; approach means companies can adopt best-of-breed tools for each job (Spark for ETL, Trino for SQL, etc.) without duplicating data. It unlocks the elusive goal of a true <strong>open lakehouse</strong>. The excitement around Iceberg has grown as major engines and platforms announced support, confirming it&#8217;s not a niche technology. By 2025, <em>virtually every big data platform either supports Iceberg or has plans to</em>:<a href="https://duckdb.org/2025/03/14/preview-amazon-s3-tables.html#:~:text=In%20recent%20years%2C%20the%20Iceberg,Iceberg%20tables%20grouped%20by%20namespaces"> Databricks, Snowflake, Google BigQuery, Amazon Athena/Glue, Apache Spark, Flink, Trino, Presto, and more have all added Iceberg compatibility</a>. This broad adoption across the industry has accelerated interest in Iceberg as a new standard.<br></p></li><li><p><strong>Reliability and Performance Improvements:</strong> Early adopters like Netflix, Airbnb, and LinkedIn report significant improvements in data reliability and query performance after migrating to Iceberg. For example, Airbnb eliminated heavy Hive Metastore load and latency by replacing Hive tables with Iceberg (no more listing thousands of S3 partitions). Netflix was able to build a streamlined <strong>incremental ETL pipeline</strong> with Iceberg that improved data freshness and accuracy for its analysts and data scientists. 
These success stories have made Iceberg a go-to solution for organizations struggling with &#8220;data swamp&#8221; issues in their data lakes.<br></p></li><li><p><strong>Scale for Modern Analytics and AI:</strong> <a href="https://www.datagravity.dev/p/databricks-acquires-tabular-what#:~:text=Apache%20Iceberg%20is%20an%20open,processing%20needs%2C%20including%20AI%20workloads">Iceberg&#8217;s ability to </a><strong><a href="https://www.datagravity.dev/p/databricks-acquires-tabular-what#:~:text=Apache%20Iceberg%20is%20an%20open,processing%20needs%2C%20including%20AI%20workloads">efficiently manage petabyte-scale datasets</a></strong><a href="https://www.datagravity.dev/p/databricks-acquires-tabular-what#:~:text=Apache%20Iceberg%20is%20an%20open,processing%20needs%2C%20including%20AI%20workloads"> and keep query speeds high is crucial for modern analytics, including AI/ML workloads</a>. Machine learning models often need access to large historical datasets for training; Iceberg&#8217;s time travel and optimized storage layout make those accesses both <strong>faster and cheaper</strong>. The format ensures data consistency and quality (no partial updates, no corruption), which is critical for reproducible ML experiments. As one analysis notes, Iceberg&#8217;s robust framework for managing large datasets leads to faster and more accurate model training and inference, and this is a big reason it&#8217;s becoming a <strong>preferred choice for data lakes aimed at AI</strong>. <a href="https://www.dremio.com/open-source/apache-iceberg/#:~:text=Exploring%20the%20Depths%20of%20Apache,Unlike%20Delta%20Lake%2C%20Hive">Companies like Apple, Stripe, Expedia, Adobe, and LinkedIn have all embraced Iceberg in parts of their data platforms</a>, underlining its effectiveness and versatility.<br></p></li></ul><p>In short, Apache Iceberg brings <em>order</em> to the chaos of big data file lakes. 
It provides <strong>warehouse-grade features in an open format</strong>, enabling a new generation of data architecture where various tools can share a single source of truth. This unique value proposition explains the intense demand and rapid growth in adoption we&#8217;re witnessing.</p><h2><strong>DuckDB Meets Iceberg: New Integration in April 2025</strong></h2><p>One of the most exciting recent developments is the integration of <strong>DuckDB</strong> with Apache Iceberg. DuckDB is an in-process analytical SQL database (often called &#8220;SQLite for analytics&#8221;) known for its speed and ease of use, especially in data science workflows. As of early 2025, DuckDB gained the ability to <strong>read Iceberg tables directly</strong>, including those stored remotely on cloud object storage. <a href="https://motherduck.com/blog/duckdb-ecosystem-newsletter-april-2025/#:~:text=An%20alternative%20connection%20method%20uses,changes%20in%20the%20table%27s%20schema">This new feature, introduced as a preview in DuckDB version 1.2.x, allows users to attach Iceberg catalogs and query them with DuckDB&#8217;s SQL engine</a>.</p><p><strong>Technical Capabilities:</strong> DuckDB&#8217;s Iceberg integration is provided via an <strong>iceberg extension</strong>. Initially, DuckDB could read Iceberg tables (since late 2023) if pointed to local metadata. The April 2025 update adds support for <strong>Iceberg REST Catalogs</strong>, which is a big leap forward. An Iceberg REST catalog is a service (defined by the Iceberg community) that exposes table metadata through a REST API. DuckDB can now <strong>ATTACH</strong> to such a catalog with a single command, even if the data and metadata reside in cloud storage. In practical terms, this means DuckDB can connect to <strong>Amazon S3-hosted Iceberg tables</strong> or those managed via <strong>AWS Glue Catalog</strong> (which is used by AWS&#8217;s &#8220;S3 Tables&#8221; and SageMaker Lakehouse) as if they were local databases. 
For example, a user can run:</p><pre><code>ATTACH 'arn:aws:s3tables:us-east-1:123456789012:bucket/my-iceberg-bucket' AS mydb
(TYPE iceberg, ENDPOINT_TYPE s3_tables);</code></pre><p>and immediately query mydb.my_table in DuckDB, pulling data from S3. DuckDB handles the Iceberg metadata and reads the Parquet files over HTTP/S3. It even supports Iceberg&#8217;s <strong>schema evolution</strong> &#8211; if the table&#8217;s schema is altered by another engine, DuckDB will pick up the change on the next query, as demonstrated in tests that add new columns.</p><p>This integration was developed in collaboration with AWS and coincided with AWS&#8217;s own announcements of Iceberg support. AWS introduced <strong>Amazon S3 Tables</strong> (an Iceberg-backed table service on S3) in December 2024, and in March 2025 announced general availability of its integration with <strong>SageMaker Lakehouse (AWS Glue as the metadata catalog)</strong>. DuckDB&#8217;s team worked with AWS to ensure DuckDB could serve as an <strong>&#8220;end-to-end solution&#8221;</strong> for reading those Iceberg tables. The result is that <strong>DuckDB can query Iceberg data natively, without any Spark or Hive</strong> in the middle, opening lakehouse data to anyone who can run a SQL query.</p><p><strong>Practical Implications:</strong> This development has several important implications:</p><ul><li><p><em>Lightweight Analytics on Data Lakes:</em> A data analyst with a Jupyter notebook can now connect to a massive Iceberg table on S3 using DuckDB on their laptop. There&#8217;s no need to spin up a heavy Spark cluster or Trino service for interactive analysis. This lowers the barrier to entry for ad-hoc analytics on the data lake. DuckDB&#8217;s efficient vectorized engine can handle surprisingly large queries, especially with its ability to push down filters to the Iceberg metadata (e.g. it will only fetch relevant Parquet files). In short, <strong>cheap, user-friendly SQL analytics directly on the data lake</strong> becomes possible.<br></p></li><li><p><em>Integration with Python/AI Workflows:</em> DuckDB is often embedded in Python environments. 
With Iceberg support, Python data science workflows (Pandas, Polars, etc.) can pull in fresh data from the lakehouse on demand via DuckDB, all while respecting snapshot consistency. For AI/ML experiments, this means you can directly query feature data or training data that&#8217;s stored in Iceberg format, using simple SQL, and get the benefit of Iceberg&#8217;s versioning (e.g. load the exact snapshot of data that a model was trained on).<br></p></li><li><p><em>Cloud/On-Prem Hybrid Access:</em> Because DuckDB is just a library, you could run it anywhere &#8211; on your local machine, on an edge server, or within a cloud function &#8211; and attach to the same Iceberg table in the cloud. This is a form of &#8220;serverless&#8221; or on-the-fly data access. For instance, a nightly on-prem job could use DuckDB to pull some metrics from a cloud data lake table. The Iceberg abstraction makes the remote data appear as a regular SQL table.<br></p></li><li><p><em>Still Read-Only (for Now):</em> The current DuckDB Iceberg integration is focused on reading tables. The <strong>preview is experimental</strong> and does not yet support writing back to Iceberg. In practice, this is not a huge limitation &#8211; writes are typically handled by heavier ETL processes (Spark/Flink) &#8211; and DuckDB can be used for fast querying and prototyping. The DuckDB team has indicated that a fully stable release will come later in 2025, likely with more features.</p></li></ul><p>Overall, the DuckDB-Iceberg integration exemplifies the openness of Iceberg: even an in-process database can join the party of engines working on the same data. It underscores the trend of <strong>convergence between analytics tools</strong> &#8211; cloud data in Iceberg format can be accessed by cloud data warehouses, big data engines, and now even local analytics libraries. 
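</p><p>As a rough sketch of the workflow described in this section, the DuckDB session below attaches the S3 Tables bucket from the earlier example and runs a filtered aggregate. The namespace <em>my_namespace</em>, the table <em>my_table</em>, and the <em>event_date</em> column are hypothetical placeholders. The secret assumes AWS credentials are available through the standard credential chain, and because the feature is a preview, the exact extension setup may differ between DuckDB versions:</p><pre><code>-- Load the extensions used for Iceberg and AWS access
INSTALL aws;
INSTALL httpfs;
INSTALL iceberg;
LOAD aws;
LOAD httpfs;
LOAD iceberg;

-- Pick up AWS credentials from the environment
-- (assumption: the AWS credential chain is already configured)
CREATE SECRET (
    TYPE s3,
    PROVIDER credential_chain
);

-- Attach the S3 Tables bucket as a DuckDB database (same ARN as the example above)
ATTACH 'arn:aws:s3tables:us-east-1:123456789012:bucket/my-iceberg-bucket' AS mydb (
    TYPE iceberg,
    ENDPOINT_TYPE s3_tables
);

-- List what the catalog exposes, then query one (hypothetical) table
SHOW ALL TABLES;

SELECT event_date, count(*) AS row_count
FROM mydb.my_namespace.my_table
WHERE event_date &gt;= DATE '2025-01-01'
GROUP BY event_date
ORDER BY event_date;</code></pre><p>The session only reads, so Spark or Flink jobs can keep committing new snapshots to the same table while it runs.</p><p>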
This flexibility is exactly what modern data architectures are aiming for.</p><h2><strong>Building a Lakehouse with Iceberg: Cloud and Hybrid Patterns</strong></h2><p>Apache Iceberg is a cornerstone of the <strong>data lakehouse</strong> architecture. A lakehouse typically consists of a unified storage layer (like a data lake on cloud storage or HDFS) with an open table format, and multiple compute engines for different workloads (ETL, ad-hoc SQL, machine learning, etc.). Iceberg enables this by acting as the <strong>table format and governance layer</strong> on the storage, ensuring all engines see consistent data. Let&#8217;s explore some ideal technology combinations for lakehouse and analytics use cases, in both cloud and hybrid environments:</p><p><strong>Storage Layer:</strong> In a modern lakehouse, the storage is often cheap, scalable cloud object storage (e.g. Amazon S3, Google Cloud Storage, Azure Data Lake Storage) or a distributed filesystem (HDFS) for on-prem. Iceberg sits on top of this storage, managing files (typically in Parquet/ORC format) and tracking metadata. <a href="https://atlan.com/know/iceberg/apache-iceberg-101/#:~:text=Note%20that%20Iceberg%20allows%20you,API%20or%20the%20JDBC%20option">The </a><strong><a href="https://atlan.com/know/iceberg/apache-iceberg-101/#:~:text=Note%20that%20Iceberg%20allows%20you,API%20or%20the%20JDBC%20option">Iceberg catalog</a></strong><a href="https://atlan.com/know/iceberg/apache-iceberg-101/#:~:text=Note%20that%20Iceberg%20allows%20you,API%20or%20the%20JDBC%20option"> (backed by Hive Metastore, AWS Glue, a REST service, etc.) serves as the source of truth for table definitions and snapshots</a>.</p><p><strong>Compute Engines:</strong> With Iceberg in place, you can mix and match engines, each specialized for a purpose:</p><ul><li><p><strong>Batch ETL and Processing:</strong> Apache <strong>Spark</strong> is commonly used for heavy batch transformations. 
For example, Spark jobs can read raw data, join and aggregate, and then <strong>write the results into an Iceberg table</strong> (on S3 or HDFS) as the sink. Iceberg&#8217;s format ensures each Spark job&#8217;s output is atomically added as a new snapshot. Other engines like <strong>Flink</strong> or <strong>SQL-based ELT tools</strong> can also produce data into Iceberg. The key is that the <strong>ingest engine can be chosen freely</strong> &#8211; as long as it speaks Iceberg, it can write to the lakehouse.<br></p></li><li><p><strong>Interactive SQL and BI:</strong> For analytics and BI queries, a SQL query engine like <strong>Trino</strong> (Presto) is often layered on top. Trino has a native Iceberg connector, so it can query the tables Spark produced, with full support for partition pruning, filtering, even updates and deletes. Because Iceberg tracks precise data file stats, engines like Trino can perform interactive queries efficiently on huge datasets. Even <strong>cloud data warehouses</strong> like Snowflake or BigQuery can play this role by querying the Iceberg tables (more on that later). The benefit is that <strong>business analysts or tools (Tableau, etc.) can run fast queries on the latest data</strong> in the lake without moving it to a separate warehouse.<br></p></li><li><p><strong>Data Science/ML Access:</strong> For machine learning model training or data exploration, teams might use Python or R environments. Instead of extracting data from a warehouse, they can directly load features or training sets from the Iceberg tables &#8211; either via DuckDB (as discussed) or Spark/PySpark, etc. 
This ensures that everyone (analytics or ML) is working off the <strong>same single source of truth</strong> on the lake.</p></li></ul><p><strong>Cloud-Native Lakehouse Example (AWS):</strong> Imagine a stack on AWS &#8211; data is stored in <strong>Amazon S3</strong>, and Iceberg tables are tracked in <strong>AWS Glue Catalog</strong> (or the new AWS &#8220;S3 Tables&#8221; service). <strong>Apache Spark on EMR</strong> might perform nightly batch ETL, writing data to these Iceberg tables. Analysts then use <strong>Trino (Athena)</strong> to run SQL queries directly on those Iceberg tables for interactive analysis. Because Iceberg is ACID, the Athena users never see partial writes; they only query committed snapshots. If needed, a <strong>Redshift Spectrum</strong> or Snowflake external table could also query the same data in S3. In this setup, S3 + Iceberg acts like the &#8220;data lakehouse&#8221; storage, and multiple query engines coexist on it. This provides warehouse-like capabilities (SQL, transactions) <em>without all data having to reside inside a single proprietary warehouse</em>, offering flexibility and cost advantages.</p><p><strong>Hybrid Scenario:</strong> One of Iceberg&#8217;s strengths is in <strong>hybrid cloud</strong> environments. For instance, consider a company that has on-premises data centers and also uses cloud analytics services. They could keep an Iceberg table&#8217;s data in an on-prem HDFS cluster, but use a <strong>shared catalog</strong> so that cloud engines can also access it. Tools like <strong>Apache Hive or Spark on-prem</strong> might produce the data, but then a cloud service like <strong>Google BigQuery</strong> (via BigLake) or <strong>Snowflake</strong> can query that same table remotely. Because the Iceberg catalog is the authority, even across environments, everyone sees the same schema and snapshots. 
LinkedIn provides a real-world example: <a href="https://www.upsolver.com/blog/iceberg-architecture-examples#:~:text=Professional%20social%20media%20platform%20LinkedIn,its%20Hadoop%20data%20lake%20for">they ingest streams from Kafka and Oracle into Hadoop, writing data into HDFS in ORC format managed by Iceberg</a>. Iceberg&#8217;s metadata gives them snapshot isolation and incremental processing for downstream consumers on that data. This means LinkedIn&#8217;s various pipelines (even outside of Hadoop) can consume the data gradually as new snapshots appear, without confusion or conflict. In general, Iceberg&#8217;s open format and catalog options (Hive, REST, etc.) enable <em>data sharing across on-prem and cloud</em>. A team can migrate or burst to cloud engines step by step, <strong>without migrating the entire dataset at once</strong>, since cloud and on-prem can both operate on the Iceberg table in parallel.</p><p>To summarize these patterns, we present a few typical Iceberg-based stack combinations and their benefits:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vZ0a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vZ0a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png 424w, https://substackcdn.com/image/fetch/$s_!vZ0a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png 848w, 
https://substackcdn.com/image/fetch/$s_!vZ0a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png 1272w, https://substackcdn.com/image/fetch/$s_!vZ0a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vZ0a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png" width="1240" height="1836" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1836,&quot;width&quot;:1240,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:476498,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/164664013?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vZ0a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png 424w, 
https://substackcdn.com/image/fetch/$s_!vZ0a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png 848w, https://substackcdn.com/image/fetch/$s_!vZ0a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png 1272w, https://substackcdn.com/image/fetch/$s_!vZ0a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86ca8090-883b-40c2-bc41-62a5a4522baf_1240x1836.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>In all these scenarios, Iceberg provides the <strong>common table layer</strong> that knits the stack together. Whether it&#8217;s batch or streaming, cloud or on-prem, the <strong>decoupling of storage and compute</strong> that Iceberg enables is a game-changer. Data resides once (in open files), and many tools can use it in their own way &#8211; yet <strong>data governance, schema, and consistency are centrally maintained by Iceberg</strong>. This architecture is a key reason organizations are moving towards Iceberg-based lakehouses as a future-proof, flexible data infrastructure.</p><h2><strong>Streaming Data and Real-Time Analytics with Iceberg</strong></h2><p>Traditional data lakes were often batch-oriented, but Apache Iceberg also supports modern <strong>streaming and real-time data pipelines</strong>. By leveraging Iceberg&#8217;s design, we can build streaming systems that land data into analytical tables continuously, enabling up-to-date analytics without sacrificing consistency. There are a few dimensions to how Iceberg is used in streaming contexts:</p><p><strong>Streaming Ingest (Write) into Iceberg:</strong> Tools like <strong>Apache Flink</strong> and Spark Structured Streaming can write to Iceberg tables in a streaming fashion. Rather than huge batch dumps, data can be <strong>ingested continuously</strong> into the Iceberg table as small batches or micro-streams. For example, Flink has an Iceberg sink connector that will periodically commit new data to the table (each commit becomes a new snapshot). This allows near real-time data to accumulate in the lakehouse. An important advantage of using Iceberg here is that even though data arrives continuously, <strong>each Flink commit is atomic</strong> &#8211; if you query the table, you either see the new batch entirely or not at all. 
Downstream jobs or analysts don&#8217;t see half-written data. This is crucial for <strong>data quality in streaming</strong>. It effectively brings the reliability of a database to a data stream landing in cloud storage.</p><ul><li><p><em>Comparison to Direct Kafka Usage:</em> Often, streaming architectures pipe data into Kafka for real-time and then later batch it into a lake. With Iceberg, some use cases can eliminate the extra queue: you can stream directly into an Iceberg table and treat that as both the source of truth and the serving layer for analytics. Steven Wu (an engineer at Apple) discussed <a href="https://www.infoq.com/presentations/apache-iceberg-streaming/#:~:text=Summary">how an &#8220;Iceberg streaming source&#8221; can power many common stream processing use cases</a>, comparing it to using Kafka as a source. The takeaway is that while Kafka is great for ultra-low latency and pub/sub, Iceberg can serve in scenarios where a few seconds or minutes latency is acceptable in exchange for <strong>far simpler architecture and lower cost</strong> (since everything lands in the analytic store directly). Flink&#8217;s Iceberg integration is designed to allow <strong>incremental commits</strong> to Iceberg, meaning you can get latency in the order of seconds to a minute, yet have all the data neatly stored in Parquet files and immediately queryable by SQL engines.<br></p></li><li><p><strong>Incremental Consumption (Read) from Iceberg:</strong> The flip side is reading Iceberg in a streaming manner. Iceberg&#8217;s table format supports <strong>incremental scan</strong>, where a processing engine can ask &#8220;give me all new data since snapshot X&#8221;. Apache Flink leverages this to create <strong>Iceberg source streams</strong>. In other words, Flink can treat an Iceberg table as if it were a continually updating source &#8211; each new snapshot in the table is like an event that Flink can capture and process further. 
This is a form of <strong>change data capture (CDC)</strong> or incremental ETL. For example, if one job is writing to an Iceberg table, another Flink job could monitor that table and immediately process new rows for downstream systems, achieving an <strong>end-to-end streaming pipeline entirely on Iceberg</strong>. This capability shows how Iceberg isn&#8217;t just for static datasets; it can be part of streaming architectures, acting almost like a message bus with storage.<br></p></li></ul><p><strong>Event Streaming + Iceberg in Tandem:</strong> Many real-world architectures combine event streams (Kafka, Kinesis) with Iceberg. A common pattern is: Kafka for immediate processing and low-latency data (seconds), and Iceberg for durable, queryable storage of events (minutes and beyond). Tools like Flink can read from Kafka, do some transformations, and continuously <strong>upsert into an Iceberg table</strong>. The Iceberg table then becomes a <strong>queryable history</strong> of the stream. This is very useful for <strong>late arrivals and backfills</strong> &#8211; things that are hard to manage in a pure event stream can be naturally handled in Iceberg (since you can always write a correction record or compact data in Iceberg). Netflix&#8217;s use case aligns with this: they built an incremental processing solution where data flows through their stream processing (<a href="https://www.upsolver.com/blog/iceberg-architecture-examples#:~:text=users%2C%20including%20data%20scientists%2C%20content,analysts%20across%20various%20use%20cases">Maestro workflow</a>) into Iceberg, which provides a scalable way to do backfills and guarantee accuracy without reprocessing everything from scratch.</p><p><strong>Real-Time Analytics:</strong> Once streaming data lands in Iceberg, it opens the door for real-time analytics on the lakehouse. For instance, you might have a dashboard running Presto/Trino queries on an Iceberg table that is updated by Flink every few minutes. 
You get &#8220;near-real-time BI&#8221; without a separate real-time database. Iceberg ensures that each query sees a consistent snapshot. If the dashboard is set to refresh every 5 minutes, and Flink commits new data every minute, the viewers will always see a point-in-time consistent view. They might be, say, one minute behind the latest data &#8211; often an acceptable trade-off.</p><p>Another emerging concept is using Iceberg as a <strong>materialized view or serving layer</strong> for streams. Because Iceberg can organize data by time and partition, a well-designed Iceberg table can serve both recent data fast and also provide the historical depth. Companies like Confluent (with Kafka and Flink) talk about leveraging Iceberg to unify streaming and batch views of data. Instead of maintaining both a real-time store and a separate batch store, Iceberg can act as the singular store that is continuously updated (sometimes this approach is called a &#8220;unified log&#8221; or <strong>Lakehouse with streaming</strong>).</p><p><strong>Latency and Considerations:</strong> It&#8217;s important to note that Iceberg-based streaming typically has slightly higher latency than pure event streaming (due to the time to buffer and create file commits). However, it greatly simplifies architectures for <strong>many analytics use cases that don&#8217;t require sub-second latency</strong>. It&#8217;s a more <em>cost-efficient</em> approach for high throughput streams where storing long histories in Kafka would be expensive. By using Iceberg, all that data is immediately available for SQL queries and ML, which is not true for raw event logs.</p><p>In summary, Iceberg extends its usefulness to streaming scenarios by providing a framework to ingest, store, and consume continuous data with the same robustness as batch data. This <strong>convergence of streaming and batch</strong> in the lakehouse is a powerful concept: one platform for both real-time and historical analytics. 
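</p><p>The write and read paths described in this section can be sketched in Flink SQL. This is a minimal illustration rather than a production configuration: the catalog name <em>lake</em>, the metastore URI, the warehouse path, the table names, and the pre-existing Kafka-backed table <em>kafka_page_views</em> are all hypothetical, and the checkpoint and monitor intervals are arbitrary choices:</p><pre><code>-- Register an Iceberg catalog backed by a Hive Metastore (hypothetical URI and path)
CREATE CATALOG lake WITH (
    'type' = 'iceberg',
    'catalog-type' = 'hive',
    'uri' = 'thrift://metastore.example.internal:9083',
    'warehouse' = 's3://example-bucket/warehouse'
);

-- Streaming ingest: the Iceberg sink commits a new snapshot at each completed checkpoint
SET 'execution.runtime-mode' = 'streaming';
SET 'execution.checkpointing.interval' = '60s';

-- Assumes lake.events.page_views already exists and kafka_page_views is a Kafka-backed table
INSERT INTO lake.events.page_views
SELECT user_id, url, event_time
FROM kafka_page_views;

-- Incremental consumption, run as a separate job: keep reading rows from new snapshots
SET 'table.dynamic-table-options.enabled' = 'true';

SELECT *
FROM lake.events.page_views /*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */;</code></pre><p>The checkpoint interval bounds how quickly new data becomes visible, since the sink commits a snapshot when a checkpoint completes, while the monitor interval controls how often the reading job looks for new snapshots.</p><p>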
As tooling (like Flink, Spark Structured Streaming, Kafka Connect, etc.) continues to improve around Iceberg, we can expect more and more &#8220;real-time lakehouse&#8221; deployments in production.</p><h2><strong>Integration with Other Key Technologies and Engines</strong></h2><p>Apache Iceberg&#8217;s success is largely due to its broad ecosystem integration. It is not a standalone system; it works in concert with many engines and platforms. This section looks at how several key technologies integrate with Iceberg and what capabilities each combination offers:</p><h3><strong>Apache Spark</strong></h3><p>Apache Spark was one of the first engines to integrate with Iceberg (indeed, Iceberg was born out of Netflix&#8217;s use of Spark with petabyte datasets). Spark provides an Iceberg connector (available as the iceberg-spark-runtime package for Spark 3.x) that allows Spark SQL and DataFrame APIs to <strong>read and write Iceberg tables</strong> natively. This means you can do spark.read.format("iceberg").load("db.table") to load an Iceberg table into a DataFrame, or write DataFrames out to Iceberg with full support for partitioning, schema, etc. Spark + Iceberg offers a powerful combination:</p><ul><li><p><strong>ETL and Batch Writes:</strong> Spark is commonly used to perform large-scale ETL into Iceberg. It can create or replace entire partitions, or append new data, using Iceberg&#8217;s API to commit changes. Because Iceberg ensures atomic commits, multiple Spark jobs can safely write to different tables (or even the same table in some cases with proper isolation) without interfering.<br></p></li><li><p><strong>SQL Support:</strong> Spark SQL understands Iceberg tables as first-class citizens. One can write SQL to query Iceberg tables (Spark will push down predicates to Iceberg for file pruning). 
Spark 3 even supports <strong>DML operations</strong> on Iceberg tables: you can MERGE INTO, UPDATE or DELETE from an Iceberg table using standard ANSI SQL syntax, and Spark will translate that into Iceberg&#8217;s procedures (creating new data files and delete files under the hood). This brings data warehouse-like data management directly to your data lake.<br></p></li><li><p><strong>Streaming Reads/Writes:</strong> Spark Structured Streaming can read from an Iceberg table as a source (treating new snapshots as increments) and also write to Iceberg as a sink. This isn&#8217;t as real-time as Flink typically, but it allows Spark to be part of streaming pipelines with Iceberg.<br></p></li><li><p><strong>Use in Databricks:</strong> Notably, Databricks (the company behind Delta Lake) has also added support for Iceberg in its platform. This means on a Databricks cluster you could choose to use Iceberg instead of Delta for certain tables. Databricks has even open-sourced some adaptors to ensure their runtime works efficiently with Iceberg. This is a testament to Iceberg&#8217;s popularity &#8211; even a would-be competitor like Databricks is embracing it because customers use it.<br></p></li><li><p><strong>Migration of Hive Tables:</strong> A big use case for Spark + Iceberg is migrating legacy Hive tables into Iceberg format. Spark has utilities to snapshot existing Parquet datasets into Iceberg without data copy. Once under Iceberg, the old Hive table&#8217;s issues (like too many partitions) disappear. Many companies (as noted earlier) have done exactly this to solve Hive metastore bottlenecks.</p></li></ul><p>In short, Spark serves as both a <strong>producer and consumer</strong> in the Iceberg ecosystem. 
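</p><p>A few of the operations listed above can be sketched in Spark SQL. The example assumes a Spark session configured with the Iceberg SQL extensions and an Iceberg catalog named <em>lake</em>; the catalog, table, and column names are hypothetical, and the time travel syntax assumes Spark 3.3 or later:</p><pre><code>-- Upsert a batch of changes into an Iceberg table
MERGE INTO lake.db.customers AS t
USING lake.db.customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN INSERT *;

-- Row-level delete, committed as a new snapshot
DELETE FROM lake.db.customers
WHERE email IS NULL;

-- Time travel: read the table as it was at an earlier point in time
SELECT count(*)
FROM lake.db.customers TIMESTAMP AS OF '2025-01-01 00:00:00';

-- Create an Iceberg table over an existing Hive table without copying its data files
CALL lake.system.snapshot('db.legacy_events', 'db.legacy_events_iceberg');</code></pre><p>In every case Spark does the work and commits the result as a regular Iceberg snapshot. 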
It brings the muscle for heavy data lifting, and Iceberg ensures those outputs are accessible beyond Spark.</p><h3><strong>Trino / Presto</strong></h3><p><strong>Trino</strong> (formerly PrestoSQL) is a distributed SQL query engine often used for interactive analytics and BI dashboards. Trino has a well-developed Iceberg connector which allows it to query (and modify) Iceberg tables as if they were tables in a database. Key points about Trino + Iceberg:</p><ul><li><p><strong>Interactive Analytics:</strong> Trino is designed for low-latency SQL on large data. With Iceberg, Trino can perform interactive queries even on huge datasets because Iceberg&#8217;s metadata helps Trino skip reading a lot of unnecessary data. For example, Trino will use Iceberg&#8217;s partition stats and file stats to only read relevant files for a given query, making scans much faster than if it had to brute-force read all files. One blog noted that <strong>many engines including Trino have done significant work to optimize Iceberg queries</strong>, indicating the performance can be very high.<br></p></li><li><p><strong>DML and Transactions:</strong> Trino&#8217;s Iceberg connector supports not just SELECTs but also INSERT, UPDATE, DELETE, and MERGE operations on Iceberg tables (assuming the Trino cluster has the correct privileges and is configured with an Iceberg catalog). Underneath, each such operation becomes a transactional commit in Iceberg. This means you could use Trino to do data cleaning (e.g. &#8220;DELETE FROM table WHERE bad_field IS NULL&#8221;) and it would actually rewrite or remove data files in the Iceberg table, following Iceberg&#8217;s format. Such capabilities were not possible with older Hive tables in Presto &#8211; it&#8217;s the Iceberg layer that makes it possible to mutate data reliably.<br></p></li><li><p><strong>Multiple Catalogs:</strong> Trino can be configured to use various Iceberg catalogs. 
For example, you might configure one catalog that points to an <strong>HDFS-based warehouse</strong> (with a Hive Metastore), and another that points to an <strong>S3-based Iceberg warehouse</strong> (with Glue). This flexibility allows querying Iceberg tables across environments in one Trino query. Trino doesn&#8217;t care where the data is &#8211; it only needs the Iceberg API and the underlying file access.<br></p></li><li><p><strong>Concurrent Use with Other Engines:</strong> Because Trino is often a read-heavy engine and Spark is write-heavy, a common pattern is Spark produces data into Iceberg, and Trino serves it to analysts. They coordinate through the Iceberg catalog. As soon as Spark commits a snapshot, Trino queries will see the new data on the next snapshot refresh. If a query is mid-flight, it uses the last snapshot; the next query will use the new one. This ensures readers (Trino) and writers (Spark) don&#8217;t conflict &#8211; a form of <strong>multi-engine concurrency</strong> which Iceberg enables by design.</p></li></ul><p>Trino is increasingly used as the <strong>SQL interface for the lakehouse</strong> in many companies, and Iceberg is a natural fit for it. The combination provides a feel of a &#8220;database&#8221; (Trino + Iceberg) that is actually running over open files.</p><p><em>(Note: PrestoDB, the Facebook-led fork of Presto, also has an Iceberg connector. However, most of the community and new features are in Trino, so we focus on Trino.)</em></p><h3><strong>Apache Flink</strong></h3><p>Apache Flink is a powerful stream processing engine, and its <a href="https://alibaba-cloud.medium.com/apache-iceberg-0-11-0-features-and-deep-integration-with-flink-b183a6242894#:~:text=Apache%20Iceberg%200,Apache%20Iceberg%20through%20Flink">integration with Iceberg</a> positions Iceberg as a bridge between streaming data and batch analytics. 
<a href="https://www.decodable.co/blog/kafka-to-iceberg-with-flink#:~:text=Sending%20Data%20to%20Apache%20Iceberg,using%20purely%20open%20source%20tools">Flink&#8217;s Iceberg integration</a> includes:</p><ul><li><p><strong>Iceberg Sink:</strong> Flink can write data into Iceberg tables in real-time. This sink is designed to <strong>batch up small events into parquet files</strong> and commit them on the fly. It uses the Iceberg API to commit snapshots, so each Flink checkpoint can correspond to an Iceberg commit. This allows exactly-once delivery semantics from Flink into Iceberg (with careful configuration). Many folks use this to <strong>ingest data from Kafka into Iceberg</strong> continuously, as mentioned earlier. A nice feature is that Flink can also handle <strong>UPSERT</strong> streams into Iceberg (for tables with primary keys) by writing equality delete files and new rows, enabling slowly-changing data to be stored in Iceberg.<br></p></li><li><p><strong>Iceberg Source:</strong> Flink can treat an Iceberg table as a source of streaming data. Internally, this means Flink will periodically check the table for new snapshots and emit new rows or changes. This effectively makes the Iceberg table like a <strong>change log</strong> that Flink can further process. For example, if you have an Iceberg table that is being updated by some batch jobs, a Flink job could monitor it and index those updates into Elasticsearch, etc. This source is a newer capability (supported since Iceberg 0.11+ and Flink 1.14+); it uses Iceberg&#8217;s snapshot diff feature to only read the deltas.<br></p></li><li><p><strong>Unified Pipeline (Batch + Stream):</strong> Flink is unique in that it can do batch and stream in one. With Iceberg, a Flink job could run in micro-batch mode, reading an Iceberg table, doing some aggregation, and writing the result out to another Iceberg table, all continuously. 
Because both source and sink are Iceberg, the entire pipeline benefits from schema evolution, time travel (for reprocessing old data if needed), and no separate staging storage. This greatly simplifies architectures for complex data transformations that need to run 24/7.<br></p></li><li><p><strong>Integration in Confluent Platform:</strong> Confluent (known for Kafka) has been investing in Flink and mentioned &#8220;Tableflow&#8221; in some contexts &#8211; essentially <a href="https://www.kai-waehner.de/blog/2024/07/13/apache-iceberg-the-open-table-format-for-lakehouse-and-data-streaming/#:~:text=match%20at%20L313%20Snowflake%E2%80%99s%20Polaris%2C,Databricks%E2%80%99%20Unity%2C%20and%20Confluent%E2%80%99s%20Tableflow">using Flink+Iceberg to integrate streaming with table storage</a>. This indicates how streaming vendors see Iceberg as important. By using Flink and Iceberg together, one can implement the concept of <strong>data as a continually updating table</strong> rather than an unbounded event log, which is a more query-friendly format for many users.</p></li></ul><p>In practice, Flink + Iceberg is a popular choice for <em>real-time data lakes</em>. Companies like Apple, for example, have engineers (like Steven Wu in the talk) working on these integrations for their AI/ML platforms. Alibaba contributed heavily to Flink-Iceberg integration as well (for their cloud). If Spark is the king of batch on Iceberg, Flink is becoming the king of streaming on Iceberg.</p><h3><strong>Snowflake</strong></h3><p>Snowflake is a cloud data warehouse that traditionally kept data in its own proprietary storage format. But <a href="https://docs.snowflake.com/en/user-guide/tables-iceberg#:~:text=Data%20storage%C2%B6">Snowflake now supports </a><strong><a href="https://docs.snowflake.com/en/user-guide/tables-iceberg#:~:text=Data%20storage%C2%B6">Apache Iceberg tables</a></strong> in a feature they sometimes call <strong>&#8220;Iceberg Tables&#8221; in Snowflake</strong>. 
This integration is significant because it blurs the line between data warehouse and data lake. Here&#8217;s how it works:</p><ul><li><p><strong>External Iceberg Tables:</strong> Snowflake can query Iceberg tables that live outside of Snowflake. The data (Parquet files and Iceberg metadata) might be in an S3 bucket that Snowflake has access to. Using an <strong>External Volume</strong> (Snowflake object) to connect to the cloud storage, Snowflake can treat an Iceberg table almost like one of its own. The <strong>Iceberg catalog</strong> can be external (like AWS Glue) or Snowflake can be configured to manage the Iceberg metadata itself. In both cases, Snowflake&#8217;s compute engine will read the Parquet files directly and use Iceberg metadata for partition pruning, etc. The benefit here is that <strong>Snowflake users can join Iceberg data with their internal tables</strong> in one query. And importantly, <em>Snowflake does not charge storage for Iceberg tables</em> (since Snowflake isn&#8217;t storing the data) &#8211; you only pay Snowflake for compute when querying, making it an attractive option to bring your own Iceberg data.<br></p></li><li><p><strong>Snowflake-managed Iceberg Tables:</strong> Snowflake also introduced the ability to create a table in Snowflake that is actually stored as Iceberg in your cloud storage. In this scenario, Snowflake itself can act as the Iceberg catalog. So when you run CREATE ICEBERG TABLE ... in Snowflake, it might create the metadata in its own services but write the data files to your S3 bucket. This gives you the best of both worlds &#8211; Snowflake&#8217;s high-performance engine and services like automatic clustering, plus data in an open format that you can also access outside of Snowflake if needed.<br></p></li><li><p><strong>Functionality:</strong> Snowflake supports schema evolution and other Iceberg features on these tables. You can use Snowflake SQL to insert or update Iceberg tables. 
Under the hood Snowflake will generate Parquet files and update the Iceberg metadata (it&#8217;s effectively acting as an Iceberg-compatible writer). Snowflake ensures these operations are transactionally consistent via Iceberg&#8217;s atomic commit design. One limitation is that currently Snowflake&#8217;s Iceberg tables are often <strong>read-only to other engines while Snowflake is writing</strong> unless using an external catalog integration. Snowflake has something called <strong>Snowflake Open Catalog</strong> which can sync Snowflake-managed Iceberg metadata out so that outside engines (like Spark) could see it. This is evolving, but it highlights a key point: Snowflake is treating Iceberg as a <strong>first-class citizen</strong>.<br></p></li><li><p><strong>Cross-cloud and Hybrid:</strong> A really interesting aspect is <strong>cross-cloud querying</strong>. Because Iceberg keeps data in e.g. S3, Snowflake (which runs in AWS, Azure, GCP) can attach to an Iceberg table even if the table is stored in another cloud or region, as long as network access is in place. This decoupling can help avoid silos. For example, a Snowflake user in Azure can query an Iceberg table stored on AWS S3. Traditionally that&#8217;s hard to do without moving data. Iceberg (plus Snowflake&#8217;s external volume concept) makes it feasible.</p></li></ul><p><strong>Why use Snowflake with Iceberg?</strong> One might do this to use Snowflake&#8217;s smooth managed service and powerful optimizer on existing data lake files. It can be a bridge for organizations that have a foot in the open data lake world and a foot in the warehouse world. It also shows the confidence Snowflake has in Iceberg as a stable format. 
Snowflake&#8217;s documentation and updates indicate they are regularly adding new Iceberg features, meaning over time the gap between what Snowflake can do with its internal tables versus Iceberg tables is closing.</p><h3><strong>Google BigQuery</strong></h3><p>Google BigQuery has similarly embraced Iceberg, primarily through its <strong>BigLake</strong> initiative. BigLake is a capability that allows BigQuery (and other Google Cloud services) to treat external data on cloud storage with governance and performance optimizations. Iceberg is one of the formats supported:</p><ul><li><p><strong>BigLake External Tables for Iceberg:</strong> BigQuery can <a href="https://cloud.google.com/bigquery/docs/iceberg-tables#:~:text=BigLake%20external%20tables%20for%20Apache,only%20be%20queried%20using">create an external table pointing to an Iceberg table on Google Cloud Storage (GCS)</a>. This is read-only access &#8211; BigQuery will query the data via the Iceberg metadata, but not write to it. BigQuery applies its security and fine-grained access control on these external tables, integrating with GCP&#8217;s unified controls. The performance is improved over naive external queries because BigQuery can use Iceberg&#8217;s partition pruning and maybe pushdown filters (though BigQuery may still need to spin up some execution to read the Parquet files). 
This allows <strong>federated querying</strong>, where BigQuery SQL can combine data from Iceberg tables and native BigQuery tables.<br></p></li><li><p><strong>BigQuery Iceberg Tables (Managed):</strong> In 2025, <a href="https://www.infoworld.com/article/3808672/google-bigquery-gets-metadata-service-with-iceberg-support.html#:~:text=Google%20Cloud%20is%20adding%20a,down%20complexities%20around%20metadata%20management">Google announced </a><strong><a href="https://www.infoworld.com/article/3808672/google-bigquery-gets-metadata-service-with-iceberg-support.html#:~:text=Google%20Cloud%20is%20adding%20a,down%20complexities%20around%20metadata%20management">BigQuery support for Iceberg Metastore</a></strong><a href="https://www.infoworld.com/article/3808672/google-bigquery-gets-metadata-service-with-iceberg-support.html#:~:text=Google%20Cloud%20is%20adding%20a,down%20complexities%20around%20metadata%20management"> functionality</a>. Essentially, Google introduced a <strong>BigQuery Metastore that is Iceberg-compatible</strong>, aiming to unify metadata across engines. BigQuery can actually <strong>manage Iceberg tables natively</strong> &#8211; sometimes referred to as &#8220;BigQuery managed Iceberg tables&#8221;. These are similar to Snowflake&#8217;s managed approach: data in GCS, but BigQuery can update it and track it. The BigQuery metastore is a fully managed service where you can register Iceberg tables, and it ensures that engines like Spark or Flink can interoperate with BigQuery on the same data. The motivation, as Google described, is to avoid the scenario of multiple metastores (Hive, etc.) with inconsistent definitions when you use multiple engines. With a unified Iceberg metastore, BigQuery wants to let you <strong>query one copy of data with a single schema, whether using BigQuery or Spark or Flink</strong>.<br></p></li><li><p><strong>BigQuery + Dataproc (Spark)</strong>: Google&#8217;s Dataproc is basically managed Spark/Flink/Hive on GCP. 
Dataproc has Iceberg packages too. With the new BigQuery metastore, a Spark job running on Dataproc can use the same Iceberg table as a BigQuery SQL query, and the metadata doesn&#8217;t need duplication &#8211; they both talk to the BigQuery Iceberg catalog. This scenario is a classic lakehouse: Spark might write data, BigQuery reads it, or vice versa, with full consistency. It&#8217;s essentially Google&#8217;s version of the lakehouse using Iceberg under the hood.</p></li></ul><p>The bottom line is Google acknowledges the need for open formats. Rather than only using BigQuery&#8217;s native storage (which is proprietary), they&#8217;re giving customers the option to use Iceberg to <strong>enable multi-engine flexibility</strong>. This is attractive for hybrid scenarios too: e.g., on-prem Spark jobs could produce Iceberg data, which BigQuery in GCP can then analyze, facilitating cloud adoption.</p><h3><strong>Other Engines and Tools</strong></h3><p>Beyond the above, there are other notable integrations:</p><ul><li><p><strong>Apache Hive:</strong> Iceberg has a Hive runtime module that lets Hive read Iceberg tables (Hive treats it kind of like an external table). However, Hive&#8217;s slow development and the rise of newer engines mean this is less commonly used except for legacy compatibility.<br></p></li><li><p><strong>Apache Impala / Cloudera:</strong> <a href="https://www.datagravity.dev/p/databricks-acquires-tabular-what#:~:text=Since%2C%20Apache%20Iceberg%20has%20received,and%20versatility%20in%20the%20industry">Cloudera&#8217;s platform (CDP) has added support for Iceberg in Impala and Hive LLAP</a>. Cloudera, as noted, embraced Iceberg (as well as Hudi) to give its customers modern table format options. So Impala, a traditional MPP SQL-on-Hadoop engine, can query Iceberg tables, use Iceberg&#8217;s metadata for performance, etc. 
This allows Cloudera users (on-prem enterprise Hadoop) to also enjoy Iceberg&#8217;s benefits without bringing in new engines.<br></p></li><li><p><strong>Dremio:</strong> Dremio is a SQL engine and data platform that heavily uses Apache Arrow. Dremio has been a big proponent of Iceberg &#8211; they even created a managed Iceberg catalog service called <strong>Arctic</strong>. Dremio&#8217;s query engine can read Iceberg tables and they integrate features like reflection (caching) with Iceberg metadata. Essentially, Dremio provides a BI-friendly layer over Iceberg and has contributed improvements to the project.<br></p></li><li><p><strong>AWS Athena:</strong> Athena is Amazon&#8217;s Trino-based serverless query service. Athena added support for Iceberg as one of its table formats (in addition to Hive, Hudi, etc.). Athena uses the AWS Glue Data Catalog, which now can store Iceberg table metadata. This means Athena users can create Iceberg tables and query them with Athena&#8217;s SQL. The benefit is more functionality (updates, time travel via Athena console by specifying a snapshot ID, etc.) over plain Hive tables, and better performance on partition-heavy data.<br></p></li><li><p><strong>Redshift Spectrum:</strong> Redshift&#8217;s external querying feature likely can work with Iceberg tables registered in Glue as well, though Amazon more actively pushes Athena or their new S3 Tables approach for Iceberg.<br></p></li><li><p><strong>Query Federation Tools:</strong> There are tools like Starburst Galaxy (Trino SaaS) which provide an out-of-the-box Iceberg catalog and maintenance automation, and open source projects like <strong>Nessie</strong> which serve as an advanced Iceberg catalog with Git-like version control (we&#8217;ll discuss in ecosystem). 
These are part of the integration landscape making Iceberg easier to use in various contexts.</p></li></ul><p>The common theme is <strong>ubiquity</strong>: virtually any tool that deals with analytic data either already supports Iceberg or is in the process of doing so. This broad compatibility is what makes Iceberg so powerful in a modern stack &#8211; you aren&#8217;t limited to one framework&#8217;s capabilities. Each engine integrates in its own way but adheres to the Iceberg spec, allowing a seamless flow of data across them.</p><p></p><h2><strong>Tooling Ecosystem and Adoption in AI/ML Workflows</strong></h2><p>Beyond engines and query platforms, Apache Iceberg benefits from a growing ecosystem of catalogs, management tools, and adoption by forward-looking organizations. This ecosystem is crucial, especially as companies deploy Iceberg in support of AI and machine learning workflows that demand both <strong>flexibility and control</strong> over data.</p><h3><strong>Catalog and Metadata Tools</strong></h3><p>The <strong>Iceberg catalog</strong> is a pluggable component &#8211; it stores the mapping of table names to metadata files and coordinates atomic changes. Several options exist:</p><ul><li><p><strong>Hive Metastore</strong>: Many use Hive Metastore as a basic catalog (especially in on-prem or older environments). It stores the pointer to the Iceberg table&#8217;s latest metadata file. This works but Hive Metastore has scaling limits and lacks advanced features.<br></p></li><li><p><strong>AWS Glue Catalog</strong>: On AWS, Glue is a drop-in replacement for Hive Metastore (fully managed). Glue now supports Iceberg table types. It is commonly used for Iceberg catalogs in AWS (Athena, EMR, etc.). 
Glue is more scalable and integrated with AWS IAM security.<br></p></li><li><p><strong>Nessie</strong>: <em>Project Nessie</em> is an <a href="https://atlan.com/know/iceberg/apache-iceberg-101/#:~:text=Note%20that%20Iceberg%20allows%20you,API%20or%20the%20JDBC%20option">open source metadata service designed for Iceberg</a> (and Delta/Hudi). It introduces <strong>Git-like version control for data</strong>. With Nessie, you can have multiple named branches of an Iceberg table, tag snapshots, and merge changes. This is incredibly useful for ML and data experimentation &#8211; e.g., a data scientist can create a branch of the table, do some test writes or backfills, and if all looks good, merge to main, or discard if not. Nessie essentially serves as a <strong>modern catalog &#8220;control plane&#8221;</strong> for Iceberg. Dremio&#8217;s <strong>Arctic</strong> offering is built on Nessie, providing a cloud service where you get those capabilities plus a nice UI. This kind of tooling is important for AI/ML, where you might want to maintain training datasets in different branches (say &#8220;production data&#8221; vs &#8220;experimentation data&#8221;) and ensure reproducibility.<br></p></li><li><p><strong>Iceberg REST Catalog</strong>: The Iceberg community defined a REST API for catalogs, which allows a lightweight service to host table metadata. Companies like Tabular (founded by Iceberg&#8217;s creators) offer a hosted REST catalog. AWS&#8217;s new S3 Tables service also implements the REST catalog API. The advantage is simplicity and uniform access (the client doesn&#8217;t need a heavy Hive/Glue dependency, just HTTP). DuckDB&#8217;s integration, as mentioned, uses the REST catalog API to attach to AWS Glue or S3 Tables endpoints. 
This trend points to more cloud-native ways to manage Iceberg metadata that fit well with serverless and polyglot environments.<br></p></li><li><p><strong>Polaris/Atlan</strong>: Catalog and governance offerings, such as Snowflake&#8217;s open source Polaris catalog and data catalog vendors like Atlan, are integrating with Iceberg to provide governance, search, and lineage on Iceberg tables. Because Iceberg keeps a history of snapshots, it is fertile ground for building lineage tools and audit trails. We see cloud providers and startups building catalogs that treat Iceberg as the heart of the data lakehouse, adding on governance features like data masking, role-based access, and search over schema/data.</p></li></ul><p>In essence, a <strong>rich catalog layer</strong> is emerging around Iceberg, which is extremely important for AI/ML as these fields demand <strong>data versioning, reproducibility, and experiment tracking</strong>. The fact that Iceberg has built-in versioning and tracks all changes makes it much easier to integrate with experiment tracking systems (for example, one could store the snapshot ID used for a particular ML model training run, which makes it straightforward to reconstruct later exactly what data was used).</p><p></p><h3><strong>Data Ingestion and Transformation Tools</strong></h3><p>Various tools in the data pipeline space have added support for Iceberg:</p><ul><li><p><strong>Kafka Connect</strong>: There are connectors that can sink data from Kafka topics into Iceberg tables (Confluent and others have released such connectors). This automates streaming ingest without needing a full Flink deployment.<br></p></li><li><p><strong>Apache NiFi / StreamSets</strong>: Some ETL tools can now write to Iceberg as a destination, making it easier to onboard data.<br></p></li><li><p><strong>dbt (Data Build Tool)</strong>: dbt, a popular analytics engineering tool, doesn&#8217;t directly interact with file tables, but it can run SQL against engines that do. 
So if one uses Trino or SparkSQL with Iceberg, dbt models can be built on Iceberg tables just like they would on a Snowflake table. This allows analytics teams to use familiar workflows (version-controlled SQL transformations) directly on the lakehouse data.<br></p></li><li><p><strong>Compaction and Maintenance</strong>: Over time, Iceberg tables may need maintenance (expiring old snapshots to free space, compacting small files, etc.). Tools to do this include built-in <strong>Spark actions</strong> (the Iceberg API provides procedures for expiring snapshots, rolling up manifests, etc.), standalone utilities, and features in managed offerings (e.g., Starburst&#8217;s platform can automatically compact Iceberg small files in the background). Proper tooling here is important to keep performance optimal (particularly for streaming ingest use cases that create many small files). As the ecosystem matures, we see more automated solutions to handle this, reducing the manual burden on data engineers.<br></p></li></ul><h3><strong>Adoption by Major Organizations (especially in AI/ML contexts)</strong></h3><p>Apache Iceberg has seen <strong>widespread adoption at some of the largest data-driven companies</strong>, many of whom explicitly mention its use in AI and ML workflows:</p><ul><li><p><strong>Netflix</strong> &#8211; Netflix created Iceberg to solve their big data challenges. It remains core to their data platform. Netflix has an end-to-end pipeline platform (called <strong>Maestro</strong>) and they combined it with Iceberg to handle incremental data processing with guaranteed accuracy. For ML, this means new training data can be backfilled or updated incrementally without confusion. 
Thousands of Netflix data users (analysts, data scientists) rely on Iceberg daily as the storage layer for data ranging from streaming metrics to content personalization data.<br></p></li><li><p><strong>Apple</strong> &#8211; Apple&#8217;s AIML (AI &amp; Machine Learning) data platform team is heavily invested in Iceberg. Apple has spoken about building streaming pipelines with Flink and Iceberg to support things like Siri and iCloud analytics. When a tech giant like Apple with massive ML workloads (think Siri&#8217;s natural language models, personalization across devices, etc.) uses Iceberg, it indicates that Iceberg can handle <em>extreme scale and strict data quality requirements</em>.<br></p></li><li><p><strong>LinkedIn</strong> &#8211; <a href="https://www.upsolver.com/blog/iceberg-architecture-examples#:~:text=Professional%20social%20media%20platform%20LinkedIn,its%20Hadoop%20data%20lake%20for">LinkedIn uses Iceberg in their Hadoop-based data lake</a> to reduce latency and improve reliability of ingestion for their analytical datasets. One specific use: <strong>Gobblin</strong>, LinkedIn&#8217;s ingestion framework, writes to Iceberg to ensure that downstream jobs see a consistent view of new data. By using Iceberg, LinkedIn achieved lower latency from Kafka into HDFS (because they no longer needed to create ad-hoc Hive partitions or many small files). For ML, LinkedIn can incrementally update feature tables and immediately have those features available for model scoring or training in a consistent way.<br></p></li><li><p><strong>Adobe</strong> &#8211; Adobe Analytics and the Adobe Experience Cloud handle <strong>billions of web events and user data points</strong> for personalization and marketing analytics. Adobe adopted Iceberg to manage this firehose of event data and make it queryable for personalization algorithms. 
In practice, Iceberg helped Adobe ensure data consistency at scale and simplified their data engineering for user profiles (which power ML-driven personalization). Adobe has demonstrated how they can replay and time-travel on user event data to test new ML models against historical data, thanks to Iceberg&#8217;s snapshot abilities.<br></p></li><li><p><strong>Airbnb</strong> &#8211; <a href="https://www.upsolver.com/blog/iceberg-architecture-examples#:~:text=One%20of%20the%20biggest%20challenges,this%20was%20a%20time%20sink">Airbnb moved from a heavy Hive-on-HDFS infrastructure to Iceberg on S3</a> to support their data warehousing needs. They faced Hive Metastore overload and slow partitioning, which Iceberg solved. Now, Airbnb analysts and ML pipelines (e.g. price optimization models, search ranking models) operate on Iceberg tables in S3. They have noted significantly reduced engineering overhead once on Iceberg &#8211; no more maintaining two sets of tables for different time granularities, etc., because Iceberg handles it gracefully.<br></p></li><li><p><strong>Stripe</strong> &#8211; Stripe processes financial transaction data and needs strong consistency. Stripe has been cited as a contributor to Iceberg. It&#8217;s likely used in their analytical platform for fraud detection features, which are ML-driven and require combining historical and real-time data under stringent correctness.<br></p></li><li><p><strong>Expedia</strong> &#8211; Expedia contributed to Iceberg as well. 
They likely use it for their travel data platform &#8211; imagine training ML models to recommend hotels or optimize pricing, on top of an Iceberg-backed feature store that contains years of travel booking data combined with fresh user search events.</p><p></p></li><li><p><strong>TikTok (ByteDance)</strong> &#8211; Although not listed in sources above, there was a mention of a <a href="https://www.youtube.com/watch?v=UPjr0qZ0-Do#:~:text=Building%20an%20EB%20Scale%20Feature,serving%20our%20search%2C%20advertising">YouTube talk</a> about <strong>building an exabyte-scale feature store at ByteDance with Iceberg</strong>. ByteDance (TikTok&#8217;s parent) handles enormous volumes of video engagement data and uses ML for recommendation. An Iceberg-based feature store at exabyte scale underscores how well Iceberg can scale and serve ML: it can deliver the throughput needed and maintain data versions for model training at one of the world&#8217;s largest social media data scales.</p></li></ul><p>This is just a sample; the list of companies using Iceberg reads like a who&#8217;s who of data innovation. Importantly, many of these organizations explicitly mention <strong>AI/ML use cases</strong> alongside analytics. The ability to <strong>share one data foundation between AI and BI</strong> is a huge win. Iceberg&#8217;s features like time travel allow ML engineers to train models on yesterday&#8217;s data and then easily compare with models trained on today&#8217;s data. Its schema evolution means adding a new feature column is straightforward and doesn&#8217;t break older pipelines &#8211; vital for iterative ML experimentation. 
Also, consistency guarantees mean that a model never trains on partial data from a broken pipeline; it either gets a full snapshot or none.</p><p><strong>Community and Contributors:</strong> Companies such as Netflix and Apple contribute code; tabular.io (the startup by creators) drives community efforts; Cloudera, Dremio, AWS, and others contribute as well. This broad contributor base (Netflix, Apple, Airbnb, Expedia, LinkedIn, Alibaba, Tencent, AWS, etc.) means Iceberg is not controlled by a single vendor, unlike some alternatives. That has reassured many users and spurred adoption &#8211; they see Iceberg as a truly open project with a long future.</p><h3><strong>AI/ML Workflow Benefits Recap</strong></h3><p>To highlight why Iceberg is great for AI/ML workflows (tying together the tools and adoption):</p><ul><li><p><strong>Feature Store Offload:</strong> Many feature stores (like SageMaker Feature Store, Feast, etc.) use Iceberg as the <strong>offline store</strong>. As mentioned, SageMaker saw up to 100x speedups on feature queries with Iceberg due to file compaction and efficient pruning. This leads to faster model training iterations.</p><p></p></li><li><p><strong>Experiment Reproducibility:</strong> By tagging snapshots or using branches, data scientists can recall exactly what data a model was trained on (fixing the &#8220;versioning the data&#8221; problem in ML). Iceberg&#8217;s snapshot IDs act as data version identifiers.</p><p></p></li><li><p><strong>Data Lineage for ML:</strong> Tools like Nessie allow branching, which can isolate &#8220;experimental data&#8221; from &#8220;production data&#8221; for ML. Also, if a new data processing logic is tested, you can write to a branch, validate model results, then merge.</p><p></p></li><li><p><strong>High Throughput for Big Data:</strong> AI often means big data (e.g., training a deep learning model on a huge dataset). 
Iceberg can feed these pipelines with throughput comparable to raw Parquet, because under the hood it is Parquet/ORC. The overhead of Iceberg&#8217;s metadata is negligible during large sequential reads &#8211; and it may even make things faster by skipping files. Thus ML pipelines (e.g., Spark ML jobs or TensorFlow reading from Petastorm which could use Iceberg) get both performance and reliability.</p><p></p></li><li><p><strong>Unified Batch and Real-Time Features:</strong> An emerging ML challenge is unifying batch and real-time feature engineering. Iceberg can store both historical features and be updated in near-real-time with new feature values (via streaming ingest). So one system can serve for training (which needs full history) and for near-real-time scoring (where new data comes in continually). This unity prevents training/serving skew.</p></li></ul><p>All these benefits are driving organizations with heavy ML investments to adopt Iceberg as a foundational layer. It&#8217;s not just a better &#8220;data table&#8221; &#8211; it&#8217;s enabling new, efficient ways to do data science on lakes.</p><p></p><h2><strong>Architecture and Performance: How Iceberg and Its Tools Work Together</strong></h2><p>Apache Iceberg&#8217;s architecture underpins everything discussed so far, so we conclude with a summary of <em>how</em> Iceberg achieves these capabilities and integrates all the tools:</p><p><strong>Layered Architecture:</strong> Iceberg&#8217;s table format is built in layers &#8211; <strong>catalog, metadata, and data layers</strong> &#8211; which cleanly separate concerns:</p><ul><li><p>The <strong>Catalog layer</strong> stores the table&#8217;s current <em>metadata pointer</em> (and namespace of tables). This is like a lightweight metastore or &#8220;name service.&#8221; When a writer commits a new snapshot, it <em>atomically</em> updates this pointer in the catalog. When a reader opens the table, it retrieves the pointer to find the latest metadata. 
This design allows multiple engines to share the table easily: Spark, Flink, Trino, etc. all talk to the same catalog and thus agree on what the &#8220;current&#8221; snapshot of the table is. The catalog ensures operations like table creation, drop, rename, snapshot switch are atomic and consistent across all clients.</p><p></p></li><li><p>The <strong>Metadata layer</strong> is where Iceberg really innovates for performance. The main object is the <strong>metadata file</strong> (JSON format) which lists table schema, partition spec, and references to snapshot information. Each time data changes, a new metadata file is written. The metadata file contains pointers to one or more <strong>manifest lists</strong> (one per snapshot). A <strong>manifest list</strong> is essentially an index of all manifests that comprise that snapshot. Each <strong>manifest file</strong> in turn contains a list of <strong>data files</strong> and associated stats (min/max values, record count, etc.). This hierarchical design means that to plan a query, an engine can read just a few files: the metadata file (small JSON), one manifest list (small), and then only the manifest files relevant to the query (Iceberg will let the engine filter manifests by partition or other stats before reading them). This avoids ever having to touch the raw file system or list directories. For example, a table with 10,000 data files might be tracked by 10 manifest files (each listing ~1000 files). To find the subset for a query on one partition, maybe only 1 of those manifest files needs to be read &#8211; <em>result: only 1000 file entries are scanned instead of 10,000</em>, a big win. This is how Iceberg achieves <strong>advanced pruning and planning</strong> and makes queries on large data more efficient. 
Iceberg tables can also carry extra aids, such as additional column statistics and bloom filters written into the Parquet data files, that engines can use to skip data even inside files.</p><p></p></li><li><p>The <strong>Data layer</strong> is the actual data files &#8211; usually Parquet, ORC, or Avro files with the table&#8217;s columns. These are stored in the lake (S3, HDFS, etc.). Iceberg does not change the file format; it relies on these existing, highly optimized formats. Thus, once an engine decides to read a particular data file, it uses the normal Parquet/ORC reader &#8211; benefiting from vectorized IO, predicate pushdown, column pruning, compression, etc. This means that Iceberg&#8217;s presence <strong>adds essentially zero overhead to the actual scan</strong> &#8211; if anything, it reduces overhead by eliminating needless file opens. In practice, query performance on an Iceberg table vs. the same data in raw Parquet files is typically equal or better, because Iceberg optimizes the planning phase heavily.<br></p></li></ul><p><strong>ACID and Multi-Engine Operation:</strong> When a writer (say Spark) wants to commit data:</p><ol><li><p>It writes new data files (e.g., Parquet files) to storage.</p></li><li><p>It creates new manifest files that list those data files (and perhaps mark some old ones for deletion if it&#8217;s doing an update).</p></li><li><p>It writes a new manifest list for the snapshot, and a new metadata JSON pointing to that snapshot.</p></li><li><p>Finally, it asks the catalog to atomically update the table&#8217;s pointer to the new metadata file.</p></li></ol><p>This final step is where ACID magic happens: Iceberg uses atomic operations (like Glue&#8217;s transaction, Hive metastore&#8217;s lock, or a lightweight DB transaction in the case of a REST catalog) to ensure only one writer&#8217;s metadata becomes &#8220;current&#8221;. If two jobs try to commit, one will succeed and the other will detect the conflict and can retry (using snapshot conflict detection). 
This optimistic concurrency model is similar to Git &#8211; if there&#8217;s a conflict, you rebase and retry the commit. The outcome is that <strong>multiple writers across different engines can reliably commit to the same table</strong> without corrupting it. This was very hard to do safely in the old Hive world. Iceberg&#8217;s design guarantees <strong>consistency</strong>: readers either see the old snapshot or the new one, never a mix. So, engines like Trino or Flink polling the catalog will just start reading the new snapshot once available. They don&#8217;t need complex lock coordination with Spark &#8211; Iceberg does the heavy lifting via the atomic metadata swap.</p><p>From an <strong>architecture integration</strong> standpoint, this means each engine (Spark, Flink, etc.) just needs to implement the Iceberg API to read/write those metadata and data files. They don&#8217;t implement ACID themselves; they delegate to Iceberg. This made it relatively straightforward to add Iceberg support to engines &#8211; which is why we saw so many integrate it quickly. All engines speak the &#8220;Iceberg language&#8221; in metadata, so they naturally interoperate.</p><p><strong>Performance Considerations:</strong> A few points:</p><ul><li><p><strong>Reducing the &#8220;Small Files&#8221; Problem:</strong> By not exposing file count to users and handling merges via metadata, Iceberg encourages practices like compacting small files and writing larger files, which dramatically improves scan performance on cloud storage. Many engines have built-in actions or queries to compact Iceberg table files (e.g., Iceberg&#8217;s rewrite_data_files procedure in Spark, Trino&#8217;s ALTER TABLE ... EXECUTE optimize, or the equivalent Iceberg API action can rewrite small files into bigger ones). The result is faster reads and fewer I/O operations. 
Organizations have reported order-of-magnitude improvements in certain queries after migrating to Iceberg simply because they got rid of overhead like millions of tiny files or overloaded metastores.<br></p></li><li><p><strong>Reduced Metadata Overhead:</strong> In large datasets, metadata operations (listing directories, opening thousands of file footers) can dominate query time. Iceberg&#8217;s manifest approach shrinks this overhead. For example, Netflix mentioned that with Iceberg, their metadata read time became a small fraction of total query time, whereas previously it was a big chunk (with Hive metastores struggling). As a quantitative example, if a query needs to scan 1 out of 1000 partitions, Iceberg will open maybe a couple of small files to find which data files to read, whereas a Hive approach might have to list and open hundreds of file system directories. This is a <strong>big win in cloud object stores</strong> which have non-trivial latency for list operations.<br></p></li><li><p><strong>Predicate Pushdown and File Skipping:</strong> Iceberg stores min/max statistics for columns at the file level in manifests. So if a query has a filter like WHERE date = '2025-04-18', Iceberg can avoid any manifest (hence data files) that don&#8217;t include that date in their stats. This is on top of Parquet&#8217;s page-level stats &#8211; it&#8217;s a higher-level pruning. Engines like Spark and Trino leverage this for extra efficiency. It&#8217;s like partition pruning on steroids (since it&#8217;s not limited to just partition columns, it can use any column stats present).<br></p></li><li><p><strong>Parallelism and Scale-Out:</strong> Iceberg is designed to work with distributed engines. The planning (reading manifest files) can itself be done in parallel by the engine (many engines will read manifests using multiple threads or tasks). Once the list of data files is known, they&#8217;re distributed among worker nodes to read. 
There is no single choke point in Iceberg &#8211; the metadata is stored distributed (many manifest files) so planning can scale. Some very large tables might have hundreds of manifest files per snapshot; reading them is embarrassingly parallel. This ensures that as your data grows, query planning doesn&#8217;t become a bottleneck.<br></p></li><li><p><strong>Resource Footprint:</strong> The metadata files are typically small JSON and Avro files; memory overhead to hold the file list is moderate and controllable (Iceberg will split manifests to keep each a reasonable size). This means an engine can plan a huge table in constant memory by streaming through manifests. It won&#8217;t blow the driver heap, etc. It&#8217;s a subtle but important architectural point for stability.<br></p></li></ul><p><strong>Ecosystem Synergy:</strong> Because Iceberg handles the hard parts of metadata and atomicity, each tool in the ecosystem can focus on its strength (fast SQL execution, or stream processing, or UI, etc.) and rely on Iceberg for data correctness. This synergy is what we observed:</p><ul><li><p>DuckDB could quickly add remote data querying because Iceberg presented a stable REST interface.<br></p></li><li><p>Snowflake and BigQuery could relatively easily incorporate Iceberg because they just consume the metadata and attach their compute; they didn&#8217;t have to reinvent the wheel for how to track table versions.<br></p></li><li><p>Tools like SageMaker Feature Store gained huge performance boosts by using Iceberg&#8217;s compaction and file tracking instead of a naive &#8220;dump files to S3&#8221; approach.<br></p></li><li><p>Streaming engines and batch engines coordinate implicitly: Flink writes to table -&gt; Spark reads it -&gt; Trino serves it, all orchestrated by Iceberg&#8217;s snapshot mechanism.<br></p></li></ul><p>In effect, Apache Iceberg provides a <strong>common language of data</strong> that all these different systems speak. 
It abstracts the physical data layout and provides a consistent view and set of operations. This is why we often call Iceberg the <strong>data platform</strong> or <strong>data layer</strong> in a modern architecture.</p><p>To conclude, Iceberg&#8217;s architecture delivers <strong>reliability, openness, and performance at scale</strong>. Its careful design (inspired by lessons from big data&#8217;s first decade) removes previous inefficiencies (like the old Hive approach) and replaces them with robust, performant techniques. The result is a virtuous cycle: as more engines adopt Iceberg, it becomes more valuable, and as more organizations use it in novel ways, the community improves it further (for example, upcoming Iceberg spec versions will add new features like row-level column encryption, finer-grained change tracking, etc., driven by real-world needs).</p><p>Apache Iceberg, backed by this solid architecture and thriving ecosystem, truly realizes the vision of a <strong>unified, high-performance data lakehouse</strong>. It empowers organizations to build data platforms that cater to BI analytics and AI/ML alike, in cloud, on-prem, or hybrid settings, all on top of open and vendor-neutral infrastructure. With the integration of engines like DuckDB and the support from cloud giants, Iceberg&#8217;s role in modern data stacks is set only to expand in the coming years &#8211; bringing ever more flexibility and power to data professionals.</p>]]></content:encoded></item><item><title><![CDATA[Sizing Big Data Workloads: Key Numbers To Know]]></title><description><![CDATA[A cheat sheet for scoping big data workloads across compute, storage, and IO.]]></description><link>https://blog.twingdata.com/p/sizing-big-data-workloads-key-numbers</link><guid isPermaLink="false">https://blog.twingdata.com/p/sizing-big-data-workloads-key-numbers</guid><dc:creator><![CDATA[Dan Goldin]]></dc:creator><pubDate>Wed, 21 May 2025 14:58:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/79984a14-c6ee-4f7a-a649-f81f9f5a71d5_1220x2808.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>During the course of our work at <a href="https://www.twingdata.com/">Twing Data</a>, we&#8217;ve handled a wide range of data systems. One involved ingesting hundreds of billions of events through Kafka and writing them to S3. In another project, the emphasis was on real-time latency, but with lower volumes. In another, the core goal was querying terabytes of data stored on S3 in a cost-efficient way.</p><p>We put the following table together as a quick reference when architecting a data workload. The reference instance in each case is a 64 vCPU AWS Graviton instance suited for a typical workload and shows the key metrics and notes the first thing that will cause problems. 
It&#8217;s meant for a quick back of the napkin sizing rather than a full capacity plan.</p><p>This was a balancing act as there are so many dimensions at play: the details of the workload, the gamut of available instance types, each with their own EBS/NVMe configurations. Trying to capture it all in a single table would be impossible. The goal is to treat this as a starting point, then benchmark and test your workloads before committing to an approach.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.datawrapper.de/_/jEsi4/" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C5of!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png 424w, https://substackcdn.com/image/fetch/$s_!C5of!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png 848w, https://substackcdn.com/image/fetch/$s_!C5of!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png 1272w, https://substackcdn.com/image/fetch/$s_!C5of!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C5of!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png" width="1200" height="1126.6483516483515" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1367,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:1148701,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.datawrapper.de/_/jEsi4/&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/164087861?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C5of!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png 424w, https://substackcdn.com/image/fetch/$s_!C5of!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png 848w, https://substackcdn.com/image/fetch/$s_!C5of!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png 1272w, https://substackcdn.com/image/fetch/$s_!C5of!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4fec17-b89e-4836-b2e2-e490bdf36dbc_2640x2478.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><a href="https://www.datawrapper.de/_/jEsi4/">Click for the non-image version</a></figcaption></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[Handling Personal Data Deletion in AdTech Data Systems]]></title><description><![CDATA[We&#8217;re outlining four approaches to data deletion and evaluating tradeoffs between them.]]></description><link>https://blog.twingdata.com/p/handling-personal-data-deletion-in</link><guid isPermaLink="false">https://blog.twingdata.com/p/handling-personal-data-deletion-in</guid><dc:creator><![CDATA[Dan Goldin]]></dc:creator><pubDate>Wed, 14 May 2025 15:47:18 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!Hzyb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In many software engineering disciplines, data can be ephemeral. Say you have a buggy API &#8212; you can deploy a new version and the past is forgotten. But data engineering has a lot more state, and is far more sticky as a result. Once you collect some data, it&#8217;s likely that it will stick around for years and you&#8217;re going to have to support it.</p><p>A common challenge every company has to deal with is personal data. As a consumer, you want to make sure companies are protecting your data, or ideally not even storing it in the first place. Various regions, countries, states, etc. have their own regulations. Europe has GDPR, California has the CCPA, Canada has PIPEDA, Brazil has LGPD, and so on.</p><p>Adtech data tends to not be as sensitive as other kinds of data &#8212; we&#8217;re not dealing with government identities or payment information, after all &#8212; but there is the matter of user browsing data. For example, if you&#8217;re using a browser that supports third-party cookies (like Chrome), your behavior can be tracked from site to site, and companies pay to generate a targeting profile for you. If you&#8217;re using a browser that only supports first-party cookies (like Safari) you can&#8217;t be tracked as easily across sites, but your browsing behavior is still captured within the same site.</p><p>For this reason, the aforementioned regulations often have a requirement around data deletion requests. If you&#8217;re storing this personal data, you have an obligation to delete it when asked. 
The types of data that qualify as personal data vary between companies and industries, but for adtech it&#8217;s typically the user ID, IP address, and user agent &#8212; anything that allows you to identify and track individual users.</p><p>Going back to deletions, in a typical database, the deletion would be a simple &#8220;DELETE FROM TABLE WHERE user_id = &#8216;abc&#8217;&#8221; &#8212; but adtech data isn&#8217;t stored in a simple database. Instead, we have hundreds of billions of records containing some personal data captured each day, usually stored on S3 for cost reasons in a compressed, analytics-optimized format such as Parquet.</p><p>So what do you do when a user wants to delete their data? The cookie that&#8217;s stored in the browser is easy to delete, but what do you do with the trillions of rows scattered across these Parquet files going back for months? You can run the &#8220;DELETE FROM TABLE ..&#8221; command above, but what it will do is read the terabytes/petabytes of data into memory, filter and remove the relevant rows, then write the rest back to S3. It&#8217;s doable, but it&#8217;s slow and incredibly expensive. Think on the order of thousands or tens of thousands of dollars for a single user deletion request. It&#8217;s not obvious how to do it cheaply.</p><h2>Four Approaches To Data Deletion</h2><h3><strong>Don&#8217;t store the data at all</strong></h3><p>What&#8217;s not stored doesn&#8217;t need to be removed. But that comes at a cost: the data is needed to run a variety of analyses. 
Without it, you can no longer count distinct users, calculate trends, or cluster users into advertiser audiences to improve your ad targeting and, hopefully, the return on your ad spend.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hzyb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hzyb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png 424w, https://substackcdn.com/image/fetch/$s_!Hzyb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png 848w, https://substackcdn.com/image/fetch/$s_!Hzyb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png 1272w, https://substackcdn.com/image/fetch/$s_!Hzyb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hzyb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png" width="687" height="316" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:316,&quot;width&quot;:687,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hzyb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png 424w, https://substackcdn.com/image/fetch/$s_!Hzyb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png 848w, https://substackcdn.com/image/fetch/$s_!Hzyb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png 1272w, https://substackcdn.com/image/fetch/$s_!Hzyb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300ad526-bfc8-4093-be06-d5e871d2fb2a_687x316.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Best for:</em> Strict compliance and lowest storage and compute cost.</p><p><em>Tradeoff: </em>The data is not available for analysis.</p><h3><strong>Store the data for a limited time</strong></h3><p>Alternatively, store the data for the minimum time you need, and then purge the rest. 
There&#8217;s still a cost to the removal, but if you only retain this data for a week then your deletion query only needs to run for a week&#8217;s worth of data.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eTJk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea178c20-555f-4d58-9e7e-117e5551753b_678x231.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eTJk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea178c20-555f-4d58-9e7e-117e5551753b_678x231.png 424w, https://substackcdn.com/image/fetch/$s_!eTJk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea178c20-555f-4d58-9e7e-117e5551753b_678x231.png 848w, https://substackcdn.com/image/fetch/$s_!eTJk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea178c20-555f-4d58-9e7e-117e5551753b_678x231.png 1272w, https://substackcdn.com/image/fetch/$s_!eTJk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea178c20-555f-4d58-9e7e-117e5551753b_678x231.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eTJk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea178c20-555f-4d58-9e7e-117e5551753b_678x231.png" width="678" height="231" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea178c20-555f-4d58-9e7e-117e5551753b_678x231.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:231,&quot;width&quot;:678,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eTJk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea178c20-555f-4d58-9e7e-117e5551753b_678x231.png 424w, https://substackcdn.com/image/fetch/$s_!eTJk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea178c20-555f-4d58-9e7e-117e5551753b_678x231.png 848w, https://substackcdn.com/image/fetch/$s_!eTJk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea178c20-555f-4d58-9e7e-117e5551753b_678x231.png 1272w, https://substackcdn.com/image/fetch/$s_!eTJk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea178c20-555f-4d58-9e7e-117e5551753b_678x231.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We can extend the above to be even better by storing the personal data separately. Instead of storing the personal data in the same files as the rest we can split it into multiple file types with their own retention policies. That way the personal data can have a shorter retention window and you avoid having to run the deletion request on the full dataset. 
This allows the non-personal data to still be used for analysis and investigations without incurring the cost of storing the personal data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Msga!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63526687-255b-4326-b7fc-cac6ebebd357_661x286.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Msga!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63526687-255b-4326-b7fc-cac6ebebd357_661x286.png 424w, https://substackcdn.com/image/fetch/$s_!Msga!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63526687-255b-4326-b7fc-cac6ebebd357_661x286.png 848w, https://substackcdn.com/image/fetch/$s_!Msga!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63526687-255b-4326-b7fc-cac6ebebd357_661x286.png 1272w, https://substackcdn.com/image/fetch/$s_!Msga!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63526687-255b-4326-b7fc-cac6ebebd357_661x286.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Msga!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63526687-255b-4326-b7fc-cac6ebebd357_661x286.png" width="661" height="286" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63526687-255b-4326-b7fc-cac6ebebd357_661x286.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:286,&quot;width&quot;:661,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Msga!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63526687-255b-4326-b7fc-cac6ebebd357_661x286.png 424w, https://substackcdn.com/image/fetch/$s_!Msga!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63526687-255b-4326-b7fc-cac6ebebd357_661x286.png 848w, https://substackcdn.com/image/fetch/$s_!Msga!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63526687-255b-4326-b7fc-cac6ebebd357_661x286.png 1272w, https://substackcdn.com/image/fetch/$s_!Msga!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63526687-255b-4326-b7fc-cac6ebebd357_661x286.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Best for: </em>Having some personal data available with low operational costs.</p><p><em>Tradeoff</em>: Deletion still costs money, but the cost and likelihood of needing to run a deletion is low.</p><h3><strong>Store hashed values instead of original values</strong></h3><p>With this approach, the data isn&#8217;t tied back to the original user, but you are still able to run the bulk of the analysis you did originally. You still delete the cookie in the browser, but at that point the link is broken and you can let the data naturally phase out based on your retention policies. The key here is that if a deletion request comes in, you only have to remove it from the browser. 
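</p><p>As a rough illustration of the hashing step, here is a minimal Python sketch. It assumes a plain SHA-256 digest of the cookie id, and the field names (<code>cookie_id</code>, <code>user_token</code>, <code>page</code>) are hypothetical; a real pipeline may well pick a different hash or record layout.</p><pre><code>
```python
import hashlib


def pseudonymize(cookie_id: str) -> str:
    """Return a deterministic SHA-256 hex digest of the cookie id."""
    return hashlib.sha256(cookie_id.encode("utf-8")).hexdigest()


# Hypothetical raw event as it arrives from the browser.
event = {"cookie_id": "abc-123", "page": "/pricing"}

# Only the hashed token is written to long-term storage, never the raw cookie id.
stored = {"user_token": pseudonymize(event["cookie_id"]), "page": event["page"]}
print(stored)
```
</code></pre><p>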
No need to worry about the data stored in S3.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TIyo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TIyo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png 424w, https://substackcdn.com/image/fetch/$s_!TIyo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png 848w, https://substackcdn.com/image/fetch/$s_!TIyo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png 1272w, https://substackcdn.com/image/fetch/$s_!TIyo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TIyo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png" width="684" height="205" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:205,&quot;width&quot;:684,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TIyo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png 424w, https://substackcdn.com/image/fetch/$s_!TIyo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png 848w, https://substackcdn.com/image/fetch/$s_!TIyo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png 1272w, https://substackcdn.com/image/fetch/$s_!TIyo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff05b8631-7231-41bc-84b6-91a81ffe6df6_684x205.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>Best for: </em>Having long-term pseudonymized personal data available for a subset of analyses (counts, distinct counts, various groupings).</p><p><em>Tradeoff</em>: You cannot easily map this data back to the original user id, which limits the ability to associate properties with users &#8212; for example, associating users with a specific advertising 
audience.</p><h3><strong>Use a mapping table</strong></h3><p>The challenge with the one-way hash approach is that it&#8217;s still deterministic. If you know what the user&#8217;s cookie id was and the hashing approach, you can rederive the hashed value. But we can make the association truly random by generating a completely random id and using a mapping table to connect the cookie id to it; the random id is what&#8217;s stored in the S3 logs. When a user submits a deletion request, we purge the mapping entry, breaking the link as in the prior approach and letting the data phase out naturally with the retention policy. Similar to the solution above, we remove it from a much cheaper and quicker data store &#8212; the mapping table &#8212; without having to touch anything on S3. As part of this step, we can also do full anonymization on the fields we don&#8217;t need details for; for example, just mapping user agent to a browser version, and stripping the last octet of the IP address.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5kmZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5kmZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png 424w, https://substackcdn.com/image/fetch/$s_!5kmZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png 848w, 
https://substackcdn.com/image/fetch/$s_!5kmZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png 1272w, https://substackcdn.com/image/fetch/$s_!5kmZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5kmZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png" width="680" height="301" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:301,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5kmZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png 424w, https://substackcdn.com/image/fetch/$s_!5kmZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png 848w, 
https://substackcdn.com/image/fetch/$s_!5kmZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png 1272w, https://substackcdn.com/image/fetch/$s_!5kmZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a3c07b-a936-4037-8b79-e35c72804bea_680x301.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Best for: </em>Being able to do advanced and robust analysis and audience creation.</p><p><em>Tradeoff:</em> This is more 
operationally complex and costly than the others.</p><p>For each of these strategies, it&#8217;s all about evaluating tradeoffs. Product and data science teams typically want as much data as they can get, since it&#8217;s what they use to innovate. On the other hand, legal and compliance teams want to limit the risk. And the data engineering teams are charged with walking that tightrope, all while keeping costs down so they can work on other problems. It&#8217;s no easy feat, but we hope these approaches help you think through the best approach for your team.</p>]]></content:encoded></item><item><title><![CDATA[On the Knife’s Edge in Data and AI: Takeaways from Data Council 2025 ]]></title><description><![CDATA[Last month in Oakland, Data Council 2025 felt like an early unveiling of the next gen, AI&#8209;native data stack.]]></description><link>https://blog.twingdata.com/p/on-the-knifes-edge-in-data-and-ai</link><guid isPermaLink="false">https://blog.twingdata.com/p/on-the-knifes-edge-in-data-and-ai</guid><dc:creator><![CDATA[Adam Kabak]]></dc:creator><pubDate>Fri, 09 May 2025 22:12:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Fcvn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fcvn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Fcvn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Fcvn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Fcvn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Fcvn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fcvn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2405523,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/163241662?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fcvn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Fcvn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Fcvn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Fcvn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49630d2a-a646-427b-aac6-5bf07d14ecc4_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Last month in Oakland, Data&#8239;Council&#8239;2025 felt like an early unveiling of the next gen, AI&#8209;native data stack. Over three packed days, database legends, AI researchers, and bleeding-edge startup founders converged on a single storyline: our monolithic warehouses are being deconstructed into Lego bricks, and those bricks are being animated by self-driving logic. In this new era, sub-second interactivity/responsiveness has gone from luxury to table stakes. Soon, agents won&#8217;t only patrol BI dashboards, but run entire data pipelines, with ironclad observability and self-healing capabilities. For those who missed this awesome conference, here are some of my major takeaways.<br></p><h3>The Future Isn&#8217;t Just Real Time, It&#8217;s Responsive</h3><p>There was a big turnout by companies that provide low-latency data infrastructure. As I talked to their reps about features like Firebolt&#8217;s sub-second queries, ClickHouse&#8217;s focus on real-time analytics, MotherDuck&#8217;s instant SQL previews, and local-first DuckDB in the browser, it became clear that users can now expect interactive, immediate results, whether from dashboards, apps, or AI assistants. And not just over analytics data &#8211; as Aaron Katz (CEO @ ClickHouse) noted in Day 2&#8217;s keynote fireside chat, OLAP will remain core, but OLTP and OLAP are unbundling and blending. While Hybrid Transactional/Analytical Processing databases (HTAP) are not a new concept, advancements in technology and architecture have made them increasingly compelling.</p><p>Still, the perception of latency is relative. 
In one talk, Ethan Brown (Twitch) described how their new internal analytics Slackbot handles masses of natural-language analytics requests from several teams, at all hours of the day and night. These requests used to be handled manually by analysts, becoming an unsustainable drain on resources. Even with 60 seconds of latency from the AI Slackbot, it still <em>feels</em> real time. Teams who were used to waiting hours are thrilled, and now analysts can finally sleep through the night. To hammer this point home, Nikunj Handa (OpenAI) spoke on OpenAI&#8217;s evolving Responses API, citing an observed preference for trading latency for quality as the reason for making the API agentic by default. A single API call triggers an agentic loop that plans, iterates, and uses provided (and preloaded) tools to produce the best possible results.</p><p>Overall, developer platforms can now prioritize asynchronous, near-real-time architectures. Whether it&#8217;s supporting WebSocket query streams, or building UIs that update incrementally (like streaming SQL results), the tools that make data work feel instantaneous will set the new standard. This also means data engineers should get comfortable with event-driven patterns and local caching.</p><p></p><h3>The Semantic Layer is Back: AI and Self Service Demand It</h3><p>In the last decade, the idea of a separate semantic abstraction faded as the industry chased ELT and &#8220;schema&#8239;on&#8239;read&#8221; in data lakes. Sure, tools like Looker had some popularity, but most teams leaned into dbt/Jinja for transformations, treating &#8220;metrics&#8221; as just another SQL recipe.</p><p><em>(As you can imagine, this resurgence of interest has Looker&#8217;s creator, Lloyd Tabb, hyped! 
Infectiously enthusiastic, he was there to show off his new project, Malloy, which takes the semantic layer to the next level by embedding reusable, object&#8209;oriented query definitions.)</em></p><p>I promise I won&#8217;t bring up my company, Twing Data, repeatedly, but it&#8217;s important to support the point. From what we see, even with dbt, data fragmentation happens easily and is extremely difficult to pinpoint and fix. As data consumption diversifies (from humans in a notebook, to AI agents in Slack, and well beyond), the risk of fragmentation increases. So does the need for a unified semantic layer to minimize that risk.</p><p>I saw great presentations from the aforementioned Lloyd Tabb (Malloy) and Michael Driscoll (Rill Data), representing the next-gen semantic layer and metric store, respectively.</p><h4>Semantic Layer: Malloy</h4><p>When every dashboard or AI bot defines &#8220;revenue&#8221; in its own SQL, you get drift. Malloy instead lets you declare &#8220;revenue&#8221; once, with explicit grain and business logic built in, so a report in Looker, a DuckDB notebook, or an LLM&#8209;powered Slackbot all compute the identical number.</p><h4>Metric Store: Rill</h4><p>Bridging OLTP, OLAP, and AI, Rill&#8239;Data&#8217;s SQL&#8209;based metrics layer similarly centralizes KPI definitions, then automatically monitors them for anomalies and powers real&#8209;time dashboards. That same layer can be queried by AI agents or BI UIs, ensuring every consumer sees the same semantics and alert rules, all without custom code per tool.</p><p>Also worth mentioning, and as a segue to the next takeaway, after hearing Julien Le Dem&#8217;s (co-creator of Parquet and Arrow) talk on open and composable data systems, it seems plausible to contain semantic definitions within an Arrow-based in-memory layer, or simply an Iceberg table representing the single source of truth. 
With emerging deep interoperability, these can work with a variety of query engines as needed.</p><p>In this AI-native era, semantic layers/metric stores are the glue that holds together various databases, self-service analytics and autonomous agents. Define your metrics once, govern them centrally, and watch as every front&#8209;end, whether human or machine, speaks the same language.</p><p></p><h3>Interoperability, Modularity, and the Deconstructed Stack</h3><p>Thanks to talks by Julien Le Dem and others, it became clear that the modern data stack is comparable to Lego bricks. Want a new ML feature store? Plug it into your lake via Parquet. Need a faster query engine? Swap it in! Your data is in Iceberg.</p><p>This open source modularity is enabling small teams to do big things. It&#8217;s also shifting value away from proprietary systems/storage formats and towards workflow orchestration and user experience on top of common building blocks. Here are some examples:</p><ul><li><p>DuckDB is being embedded everywhere (Malloy, Hex, and Rill use it under the hood, Rill uses it in-browser for local development, and MotherDuck does cloud DuckDB). DuckDB + Parquet/Arrow is like the SQLite of data.</p></li><li><p>Iceberg, Arrow and Parquet have hit escape velocity. From Python and Spark to BigQuery and Snowflake, these formats tie every engine and language together. Support for Iceberg isn&#8217;t universal quite yet, but cloud providers are racing to bake this in natively. As Julien Le&#8239;Dem noted, &#8220;blob storage is becoming table&#8209;aware.&#8221; With transactions, schema evolution, time travel and catalog services, Iceberg turns S3/GCS into a fully-featured table store.</p></li><li><p>The stack is becoming syntax-agnostic, mixing SQL, Python, and more. With Bauplan, you can define models and pipelines in Python. 
On the analytics side, Hex&#8217;s multi-modal workbooks support a combination of languages in a single workflow.</p></li></ul><p>Apache Arrow was the unequivocal hero of nearly every session on modular data architectures. I learned it underpins data movement for products at every scale. On the larger end, Ganesh from Hex did a medium-depth dive on how its DAG&#8209;based &#8220;tablestore&#8221; persists results as Arrow tables so SQL and Python cells can execute in parallel. Elias from SDF (recently acquired by dbt) showed how Arrow enables the generated query plan to execute against any engine that supports it, so a model validated on Snowflake can be test-run locally in DuckDB, for instance. Earlier in its journey, Okta uses Arrow to transfer data from shards to S3/Iceberg for analytics. And the aforementioned Bauplan also uses Arrow when moving Iceberg tables in and out of a lambda.</p><p>Seizing this new composability comes with a caveat: ripping out a legacy warehouse overnight is probably suicide. Migrating decades of ETL, dashboards, and access controls into Iceberg tables or a DuckDB&#8209;backed semantic layer takes planning and incremental roll&#8209;outs. But for any new initiatives, it certainly seems like the ideal bandwagon to hop on.</p><p></p><h3>Agents All The Way Down: AI Across the Stack</h3><p>The term &#8220;agent&#8221; was everywhere. From building reliable agentic systems, to OpenAI&#8217;s agentic API, to RAG/testing/benchmarking frameworks, to infrastructure applications, the entire community is clearly excited about the potential, and also sober about the challenges.</p><h4>AI as the Data Consumption Layer</h4><p>As a baseline, RAG is the standard for agents. Nearly every AI use case used RAG to ground the model with knowledge. 
The unanimous advice from Tengyu Ma (Voyage AI), Skylar Thomas (<a href="http://cake.ai">Cake.ai</a>), a panel of AI leaders, and several other talks was to skip naive vector search entirely, go straight to hybrid retrieval (embeddings + keywords), and use rerankers. An agent is only as good as its knowledge base, and building that base is a first-class engineering task. Lots of great techniques for data processing/retrieval were discussed. Skylar from <a href="http://cake.ai">Cake.ai</a> gave an excellent (and fast, and dense) talk on RAG at scale, where he suggested beefing up retrieval with LiteRAG and knowledge graphs, recommending more than two types of retrieval across multiple databases. And Tengyu from Voyage AI suggested various ways to process long documents with metadata like neighboring chunk summaries to increase retrieval quality.</p><p>In the spectrum of chains/loops vs. end-to-end models, it seems the trend is moving away from fully-scripted chains of LLM calls toward giving the model more autonomy to plan, use tools, evaluate, and iterate dynamically. For instance, the internal Twitch Slackbot used a selector agent to pick from several relatively deterministic chains. Meanwhile, OpenAI is rolling out the Responses API that allows dynamic loops with tools from a single call. And, somewhere in between, the presentation on Harvey AI, a specialized agentic platform for legal work, demonstrated a delicate balance between humans in the loop and dynamic agents.</p><p>The strongest point of consensus, as ever, was observability and evals. An example of a simple check was to run EXPLAIN on a query before running it within a text-to-SQL system. Deeper, more elaborate systems were a major topic of Skylar&#8217;s talk, like running regression tests against eval datasets, with the idea being that the qualifying metric for &#8220;good&#8221; results must be defined and tested against to measure progress and alert to issues. 
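</p><p>The EXPLAIN check mentioned above could look roughly like the following Python sketch. It uses SQLite from the standard library purely as a stand-in engine; the <code>orders</code> table, the sample queries, and the <code>is_plannable</code> helper are hypothetical, and a production text-to-SQL system would run the equivalent check against its own warehouse.</p><pre><code>
```python
import sqlite3


def is_plannable(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if the engine can compile a plan for the statement."""
    try:
        # EXPLAIN compiles the statement and returns its plan without running it.
        conn.execute("EXPLAIN " + sql).fetchall()
        return True
    except sqlite3.Error:
        # Syntax errors and unknown tables or columns surface here.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

generated_ok = "SELECT SUM(amount) FROM orders"
generated_bad = "SELECT SUM(amount) FROM order_items"  # hypothetical hallucinated table

for candidate in (generated_ok, generated_bad):
    if is_plannable(conn, candidate):
        print("run:", conn.execute(candidate).fetchall())
    else:
        print("rejected:", candidate)
```
</code></pre><p>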
Lastly, human review is an inevitability, especially for new or domain-specific systems.</p><h4>AI as System Architecture</h4><p>Considering the catastrophic risk of mangled data, the idea of agents building and running pipelines might sound like sci&#8209;fi. After all, AI still hallucinates now and then, right? Believe it or not, we&#8217;re closer to solving that than most realize. If you&#8217;ve built a RAG system, you know the power of well-targeted context. The challenge has always been acquiring that context.</p><p>Historically, your data &#8220;context&#8221;&#8239;&#8211; the rich metadata, semantic rules, transformation logic and business definitions that make raw tables meaningful&#8239;&#8211; has been scattered across silos: pipeline orchestration tools like Airflow only capture task dependencies and schedules, while transformation logic lives as standalone SQL files in dbt or custom scripts without any machine&#8209;readable declaration of input/output schemas or business intent; semantic layers in BI platforms such as LookML, Power&#8239;BI datasets or Tableau models define dimensions and measures behind proprietary GUIs with no unified registry; cloud warehouses (Redshift, Snowflake, BigQuery) expose schema and lineage via vendor&#8209;specific system tables and APIs that know nothing of downstream semantics; and any hand&#8209;written documentation &#8211; in Confluence pages, data dictionaries or spreadsheets &#8211; quickly grows stale and disconnected from the code that actually runs. Because all these pieces live in different systems, in different formats, and often behind closed&#8209;source UIs or proprietary APIs, an AI agent can&#8217;t simply &#8220;look up&#8221; the context it needs.</p><p>Today, though, that context can all exist as normal code &#8211; version&#8209;controlled, semantically defined, and instantly retrievable by an AI agent. 
This is because end-to-end data systems are being fully redesigned as composable code artifacts by several cutting-edge companies. At Rill Data, DuckDB and ClickHouse power instant, streaming dashboards, in&#8209;browser, that reflect metrics defined both semantically and as code. Further, Bauplan takes the version&#8209;controlled, data&#8209;as&#8209;code paradigm beyond modeling and metrics to the entire pipeline infrastructure.</p><p>Pair this with self-healing over deep observability, along with the ever-increasing model quality and tool usage abilities, and the future is unmistakable: the tools of tomorrow won&#8217;t just store or query data&#8212;they&#8217;ll reason over it, monitor themselves, and act in real time. Considering the exponential growth we&#8217;ve seen so far, it&#8217;s clear to me that tomorrow is right around the corner.</p><p></p><h3>The Future is What We Make It</h3><p>In his talk, Julien Le Dem reminisced about the past excitement around the potential of native data applications as the next wave of value creation. He pointed out that AI&#8217;s emergence as the new consumption layer tempered that wave. Whether native data apps are swallowed completely by ubiquitously embedded AI or not is too early to tell. But the industry isn&#8217;t holding its breath to find out. The trend towards modularity, a unified semantic layer, hyper-fast databases and responsive front ends all help enable AI to more deeply integrate throughout and across systems.</p><p>Of course, many see AI as a threat &#8211; to their jobs, to humanity even &#8211; but at Data Council 2025, the vibe was infectious, energetic excitement. Perhaps the biggest takeaway of all is that it&#8217;s still so early that no one really knows what AI&#8217;s role in our world will look like, in any of its applications. 
I can certainly see the negative viewpoint, but Data Council 2025 was an affirmation that there&#8217;s nowhere I&#8217;d rather be than on the bleeding edge, helping to shape what this awesome technology becomes.</p>]]></content:encoded></item><item><title><![CDATA[Building for AWS Marketplace]]></title><description><![CDATA[Twing Data is now available in the AWS Marketplace!]]></description><link>https://blog.twingdata.com/p/building-for-aws-marketplace</link><guid isPermaLink="false">https://blog.twingdata.com/p/building-for-aws-marketplace</guid><dc:creator><![CDATA[Simon]]></dc:creator><pubDate>Wed, 30 Apr 2025 18:36:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XbS8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XbS8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png 424w, https://substackcdn.com/image/fetch/$s_!XbS8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png 848w, 
https://substackcdn.com/image/fetch/$s_!XbS8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png 1272w, https://substackcdn.com/image/fetch/$s_!XbS8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XbS8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png" width="1456" height="494" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:494,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:291231,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/162478891?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XbS8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png 424w, 
https://substackcdn.com/image/fetch/$s_!XbS8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png 848w, https://substackcdn.com/image/fetch/$s_!XbS8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png 1272w, https://substackcdn.com/image/fetch/$s_!XbS8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec271317-97ee-44e1-9712-47f573be0d2b_1762x598.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>One of our <a href="https://blog.twingdata.com/p/why-snowflake-native">recent blog posts</a> detailed how and why we built a Snowflake marketplace application, but that wasn&#8217;t the only marketplace we&#8217;ve been building for. Twing Data has also made its way into the <a href="https://aws.amazon.com/marketplace/pp/prodview-rzp5dxlbwjxta">AWS Marketplace</a>! This article outlines our experience with getting this integration set up, the technical approach we took, and the technical challenges we came across.</p><h2>What is the AWS Marketplace?</h2><p>Like the Apple App Store or Google Play Store, the AWS Marketplace is a curated digital marketplace that lets third-party developers sell different types of software and services, such as data sets, development containers, machine learning models, and, in our case, SaaS.</p><p>The SaaS Marketplace operates similarly to other ecosystems that enable third-party developers to integrate payments and authentication through the platform, while increasing product visibility with a page highlighting the features of the product.</p><p>For the buyer, AWS Marketplace offers important advantages, like flexible pricing and contract options that allow for pre-defined terms (with different pricing models) to avoid surprises. There are also options to create private contracts that can be fine-tuned further for specific buyers. Another bonus is that billing is consolidated through AWS. Finally, there&#8217;s a lot of value in knowing that the application has been through AWS&#8217;s curation and quality assurance process, and abides by its terms. Different organizations have different compliance and billing processes, so being vetted by and billed through AWS helps streamline adoption.</p><h2>Twing Data AWS Marketplace App: Technical Details</h2><p>For those interested in the technical implementation of this integration, here&#8217;s an overview of how it works.</p><p>Our SaaS offering is classified as &#8220;Contract with Consumption&#8221;, meaning that AWS facilitates a contract between Twing Data and a buyer, and Twing Data is responsible for tracking usage (specifically, the number of Redshift queries analyzed) and reporting it to AWS so that any overages can be billed. 
A buyer can update their contract to increase their usage caps (but not decrease them).</p><p>The first step is to create the listing, following the official <a href="https://catalog.us-east-1.prod.workshops.aws/workshops/cc168d1f-b5c7-4b68-a2f9-d785d68b2f19/en-US/saas/create-listing/">&#8220;Lab: Create a SaaS listing&#8221; guide</a>.</p><p>The <a href="https://catalog.us-east-1.prod.workshops.aws/workshops/cc168d1f-b5c7-4b68-a2f9-d785d68b2f19/en-US/saas/integration">&#8220;Lab: Integrate your SaaS&#8221; guide</a> was an invaluable technical resource when designing and building this out, but it assumes your application runs on AWS Lambda. In our case, we had to adapt the guide to work with our Remix application (soon to be React Router) and asynchronous analysis (in Python).</p><p>The linked guide is organized into three parts:</p><ol><li><p>Integrate with buyer onboarding</p></li><li><p>Get entitlement updates</p></li><li><p>Get subscription updates and report usage</p></li></ol><p>Let&#8217;s go into some details on each one.</p><h3>Integrate with buyer onboarding</h3><p>The first part of the guide acted as a template for linking the buyer&#8217;s AWS account and their Twing Data account. More directly, AWS redirects the buyer to the landing page URL, which makes a POST request to a new Twing Data app route. This request includes an &#8220;x-amzn-marketplace-token&#8221;, which we can use to look up the buyer&#8217;s AWS account information for later use.</p><p>In classic frontend development fashion, there are a few factors to consider: Does the buyer already have a Twing Data account? Is the buyer logged in? If they&#8217;re not logged in, how do we make this link?</p><p>We answer these questions by using the token to get the AWS information and storing it in our own database. If the user is logged in, we can link their AWS account to their Twing Data account immediately; otherwise, we redirect the user to login/signup and make the link afterward. 
In either case, we also initialize information about their contracts&#8217; limits and usage so far.</p><h3>Get entitlement updates</h3><p>For the Contract with Consumption pricing model, it&#8217;s necessary to listen for events from AWS about contract terms changing, for example when a subscription is created or updated with increased limits.</p><p>The guide walks you through setting up an email subscription for the SNS topic, but the production implementation is effectively left as an exercise for the reader. It can be difficult to debug, so the fewer unknowns there are, the better. We created a new route in our Remix app to handle this event and set the new URL as the subscription endpoint (making sure to use the correct protocol and &#8220;www&#8221; subdomain if required &#8211; otherwise your endpoint won&#8217;t get messages!).</p><p>HTTPS subscriptions need to be &#8220;Confirmed&#8221;, which is a one-time process that tells AWS the endpoint is set up properly. Since it&#8217;s a one-time thing, we simply did the &#8220;Manual confirmation&#8221; that the <a href="https://docs.aws.amazon.com/sns/latest/dg/SendMessageToHttp.confirm.html">SNS documentation</a> describes.</p><p>Ultimately, this endpoint receives the &#8220;entitlement-updated&#8221; notification and uses the provided customer-identifier and product-code to see what entitlements were created or changed. Again, the SaaS guide walks through a manual workflow using the AWS CLI, but it was straightforward to adapt it to use the <a href="https://www.npmjs.com/package/@aws-sdk/client-marketplace-entitlement-service">@aws-sdk/client-marketplace-entitlement-service</a> Node.js package.</p><h3>Get subscription updates</h3><p>The last part of the SaaS guide, &#8220;Report usage to charge the buyer&#8221;, also walks you through creating another subscription, this time to listen to the aws-mp-subscription-notification topic. 
It&#8217;s important enough that I decided to treat it as its own section.</p><p>For the consumption-based pricing models (&#8220;Contract with Consumption&#8221; and &#8220;Subscription (pay-as-you-go)&#8221;), consumption is tracked while the subscription is active, so we need to track not only usage (which we&#8217;ll go through below) but also changes to the subscription status. For this, we created a new route for the aws-mp-subscription-notification topic, and followed a process similar to the one in the previous section to update an AWS user&#8217;s subscription on our side.</p><h3>Report usage</h3><p>Lastly, we need to track usage (number of queries analyzed) so we can report it to AWS for billing.</p><p>Since usage reports are expected to be no more frequent than hourly (&#8220;Timestamps that fall into the same hour are truncated to that hour.&#8221;), the logic for &#8220;tracking&#8221; and &#8220;reporting&#8221; is decoupled: we track the number of queries as they&#8217;re analyzed, and then send consumption reports as needed to make sure they&#8217;re accurate and that none get dropped.</p><h3>End-to-end testing</h3><p>Testing this flow turned out to be a bit cumbersome due to having to &#8220;test in production&#8221;: the seller and buyer accounts are the same during testing, so to test a new contract, for example, the previous one has to be canceled. As of this writing, this can only be done by contacting the AWS Marketplace support team.</p><p>To alleviate this friction, the different flows are covered by our automated test suite, which runs on each pull request to ensure there are no regressions to this critical feature. 
Of course, automated testing is a good practice regardless, but waiting for test contracts to be canceled takes time, so it is practically a requirement for shortening this feedback loop (and, as a bonus, it lightens the AWS Marketplace support team&#8217;s workload).</p><p>Since you&#8217;re &#8220;testing in prod&#8221;, you end up being billed, but the price can be set low (and is said to be refundable):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jkpl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jkpl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png 424w, https://substackcdn.com/image/fetch/$s_!Jkpl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png 848w, https://substackcdn.com/image/fetch/$s_!Jkpl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png 1272w, https://substackcdn.com/image/fetch/$s_!Jkpl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jkpl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png" 
width="1178" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:1178,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jkpl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png 424w, https://substackcdn.com/image/fetch/$s_!Jkpl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png 848w, https://substackcdn.com/image/fetch/$s_!Jkpl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png 1272w, https://substackcdn.com/image/fetch/$s_!Jkpl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9fd5a50-306c-4bb4-b156-64bc510ba37b_1178x622.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Give our App a Try</h2><p>Similar to our case for building a Snowflake Native application, we want to meet our users where they are &#8212; Twing Data already supports Redshift, so the next step is to get that integration set up as easily and securely as possible.</p><p>Being on the AWS Marketplace simplifies procurement, accelerates deployment, and aligns with the compliance, billing, and security standards you already trust. 
If your team is looking for a faster, easier way to integrate Twing Data into your Redshift warehouse, the AWS Marketplace is a direct path forward.</p><h2>Helpful Links</h2><p><a href="https://docs.aws.amazon.com/marketplace/latest/buyerguide/what-is-marketplace.html">What is AWS Marketplace?</a> (AWS article)</p><p><a href="https://www.youtube.com/watch?v=HU_SJTFHcGU">What is AWS Marketplace?</a> (video)</p><p><a href="https://catalog.us-east-1.prod.workshops.aws/workshops/cc168d1f-b5c7-4b68-a2f9-d785d68b2f19/en-US/saas/integration">"Lab: Integrate your SaaS" guide</a> (primary guide referenced above)</p><p><a href="https://aws.amazon.com/marketplace/pp/prodview-rzp5dxlbwjxta">AWS Marketplace - Twing Data for Redshift: Query Analysis &amp; Optimization</a> (our live marketplace listing)</p>]]></content:encoded></item><item><title><![CDATA[Why Snowflake Native]]></title><description><![CDATA[Insights That Led to Twing Pulse]]></description><link>https://blog.twingdata.com/p/why-snowflake-native</link><guid isPermaLink="false">https://blog.twingdata.com/p/why-snowflake-native</guid><dc:creator><![CDATA[Adam Kabak]]></dc:creator><pubDate>Thu, 17 Apr 2025 19:45:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!al24!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!al24!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!al24!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png 424w, https://substackcdn.com/image/fetch/$s_!al24!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png 848w, 
https://substackcdn.com/image/fetch/$s_!al24!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png 1272w, https://substackcdn.com/image/fetch/$s_!al24!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!al24!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png" width="1456" height="687" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:687,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!al24!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png 424w, https://substackcdn.com/image/fetch/$s_!al24!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png 848w, 
https://substackcdn.com/image/fetch/$s_!al24!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png 1272w, https://substackcdn.com/image/fetch/$s_!al24!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365df219-edb3-4c0b-8fe3-969be860e377_1600x755.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In 2007, Apple simultaneously shattered the computer and mobile phone industries with the introduction of the iPhone. 
In 2008, they followed up with the App Store, shattering the software distribution model and catalyzing a new mobile-first economy. Seventeen years later, what started as a mobile revolution has become a full-blown cross-platform app economy, spanning not just B2C but even B2B. With products in categories like CRM (Salesforce), collaboration (Slack, Zoom, etc.), project management (Trello), and analytics (Tableau), decision-makers expect to manage ops from phones, tablets, and even smartwatches.</p><p>Today, the entire app lifecycle lives in the cloud, from development and deployment to data pipelines, payment, compliance, and even user interaction. Snowflake&#8217;s relatively new app marketplace is a refreshingly unique and opinionated twist on the AWS approach that focuses on low-ops integrations for data sharing and represents its expansion from warehousing to the OS of the enterprise data stack.</p><h2>Enter: Snowflake</h2><p>Snowflake launched onto the scene as a product sitting atop one particular cloud vertical, the data warehouse. Now it&#8217;s expanding horizontally with some depth in all the major data categories, like lakehouses, OLAP, data exchanges, BI/ML/AI, and more. 
In tandem with its product development, the user base continues to grow at a steady pace. It&#8217;s no wonder: aside from the novel warehouse UX, the prospect of doing ML, cross-cloud data sharing, etc. with none of the complexity/latency/risks of transferring data to external tools is a great unique value proposition (UVP).</p><p>The Native App Marketplace represents Snowflake&#8217;s expansion from mere networked warehousing to being the OS of the enterprise data stack. A departure from the AWS/GCP approach to apps, its highly opinionated, deeply integrated experience is focused solely on data and AI applications that run securely within the user&#8217;s Snowflake environment.</p><p>For a user, downloading an app and running it on your data with zero configuration is quite an experience. Also, because the app runs entirely within your own Snowflake account, it enables the use of external tools that would otherwise be impossible on tables with sensitive or valuable data. Combined with Snowflake&#8217;s strict permission model, object isolation, and transparent resource usage, this architecture offers a level of control and trust that&#8217;s difficult to match, even with custom, self-hosted solutions.<br><br>For a developer, backend development has a new paradigm with a moderate learning curve (as expected), and the frontend, if used, is extremely limited and opinionated. While frustrating at times, these constraints also focus the product and force simplicity, which isn&#8217;t necessarily a bad thing. On top of that, we get the advantages of increased visibility, streamlined procurement, and the elimination of overhead associated with infra management or custom integrations.</p><p>Finally, while monetization is well integrated, reporting and billing aren&#8217;t quite up to the standard of AWS.<br><br>But who cares? 
As a data system optimization business, the advantages are obvious.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bjbl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bjbl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png 424w, https://substackcdn.com/image/fetch/$s_!Bjbl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png 848w, https://substackcdn.com/image/fetch/$s_!Bjbl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png 1272w, https://substackcdn.com/image/fetch/$s_!Bjbl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bjbl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png" width="944" height="349" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:349,&quot;width&quot;:944,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77506,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/161486791?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bjbl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png 424w, https://substackcdn.com/image/fetch/$s_!Bjbl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png 848w, https://substackcdn.com/image/fetch/$s_!Bjbl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png 1272w, https://substackcdn.com/image/fetch/$s_!Bjbl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02074e5-9c5a-4fb7-bf5c-5404266b9412_944x349.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>The Business Case For Building On Snowflake</h2><h4>Procurement</h4><p>Twing Data&#8217;s core strength is its warehouse-agnostic analysis engine. There are major differences among the three large warehouses we currently support: BigQuery, Snowflake, and Redshift. We&#8217;ve invested deeply, and sometimes painfully, in making our systems flexible to handle each platform&#8217;s data and surface the appropriate insights. This gives us access to a wider customer base, and moves us towards our target of full end-to-end analysis. <br><br>The flip side: depending on the industry or client, an external data connection comes with security/compliance concerns, and can even be a non-starter altogether. In our experience, it frequently leads to months of legal reviews and contract revisions. 
This is just another day for large enterprises, but for a small team like ours, keeping these cases open is a real burden.</p><p>This is simply not an issue if the data never leaves the client&#8217;s system. Native apps run directly in the user&#8217;s warehouse environment. A marketplace install and a quick permissions agreement in the UI replace tens of hours of work by lawyers and engineers. Importantly, creators&#8217; IP is also protected, as the queries the app runs are not exposed to the user.</p><p>Finally, aside from the security and compliance hurdles, budget approval for data tooling is a hard sell. Executives in charge of allocating budgets typically prioritize investments that directly impact sales, and saving on data spend is typically far down the list. Billing for Native Apps goes through the client&#8217;s existing Snowflake spend, so data teams don&#8217;t need to request a separate budget.</p><h4>Visibility and Accessibility</h4><p>The platform power of app stores has been proven time and time again. Users know to look there, and apps want to be there. The Snowflake Marketplace is still in its infancy, with some significant limitations and hurdles on the development side, but the visibility is massive. Look at these numbers:</p><ul><li><p>Snowflake Marketplace: 38 monthly active users per data product</p></li><li><p>AWS Marketplace: 7.8 users per app</p></li></ul><p><em>(based on &#8220;active customers&#8221; who transact on a regular basis. See source below)</em></p><p>Now, consider that AWS hosts a vast range of apps across every imaginable use case. Snowflake, on the other hand, is focused tightly on data and analytics. 
That means a much higher percentage of Snowflake users are in a Native App&#8217;s direct target market.</p><h4>Summary of Market Opportunity</h4><p>While mobile app stores have matured and now grow at 15-20% YoY, B2B marketplaces are earlier in the curve, similar to an earlier phase of B2C adoption, and are growing at 50-100% YoY.</p><p>That&#8217;s great for the marketplaces that capitalize on volume alone, but does that trickle down to the creators? The earlier one is on the platform, the more this is true. It&#8217;s still early days on the Snowflake Marketplace. The advantages are potentially enormous, IF you can capitalize on them.</p><p>One advantage is that it&#8217;s possible to develop a closer relationship with the platform. Snowflake has a &#8220;co-sell with partners&#8221; culture, and is actively looking to showcase Native Apps in its marketplace. Other benefits could be roadmap previews, design partner programs, or inclusion in investor/demo events.</p><p>Another is that there are so few listings currently (~3,000). Being early on a platform means there&#8217;s high visibility and the opportunity for category leadership, or even to define a new category. With over 100,000 monthly active users, that means there are, as previously mentioned, 38 users for every app. In a straight comparison to AWS and GCP (and any other market), these are great numbers. And given the narrow market scope and the frictionless integrations, the likelihood of connecting with users is much higher on Snowflake.</p><h2>Give Our App a Try</h2><p>Procurement struggles that involve hurdles like regulatory compliance keep companies from using industry-leading tools that could potentially help unlock massive value. For the developers of those products, like us, it can be equally stifling.</p><p>So far, Snowflake Native has been a simple way to promote our features to clients natively, while protecting our IP. 
We&#8217;re continually iterating and delivering more sophisticated features as we validate our hypotheses. If you&#8217;re a Snowflake user and want deeper observability that connects how your system is actually used with Snowflake&#8217;s execution and pricing behavior, give Twing Pulse a download. It&#8217;s free, and all we ask is that you send us some feedback. We can then prioritize the highest-impact improvements and feature migrations.</p><p>For an idea of what&#8217;s coming to Twing Pulse, reach out to us for demo links to our full platform at twingdata.com.</p><p>Have you developed for the Snowflake Native Marketplace? What was your experience? Let us know below!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.twingdata.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Twing Data! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Building a Snowflake Native App]]></title><description><![CDATA[Lessons from building Twing Pulse]]></description><link>https://blog.twingdata.com/p/building-a-snowflake-native-app</link><guid isPermaLink="false">https://blog.twingdata.com/p/building-a-snowflake-native-app</guid><dc:creator><![CDATA[Adam Kabak]]></dc:creator><pubDate>Wed, 09 Apr 2025 15:28:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hSLD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hSLD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hSLD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!hSLD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!hSLD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!hSLD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hSLD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hSLD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!hSLD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!hSLD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!hSLD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5f2971-6e8c-40f2-93d3-280433bb4251_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We just published our first Snowflake Native app &#8212; <a 
href="https://app.snowflake.com/marketplace/listing/GZTSZ1DS0HH/twing-data-twing-pulse">Twing Pulse</a> &#8212; to the Snowflake Marketplace! It features a few of the most useful queries from our original much-loved product, Twing Data. Now Snowflake users can access a portion of our warehouse and workload metrics instantly, with none of the friction or security concerns that come with external integrations. It&#8217;s also free for now, so check it out and let us know what you think!<br><br>Fresh off the experience of going from 0 to 1, we thought it would be helpful to condense the mental model, the development process we used, and the submission and review process into a guide for anyone interested in building their own Snowflake Native App.</p><h3>Mental Model</h3><p>Published Native Apps run directly in the consumer&#8217;s account. When the app code is ready to publish, the developer generates or updates an application package based on that code. Then that package is published either privately or to the marketplace. 
Publishing requires three review processes, which we&#8217;ll cover later on.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZELi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZELi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png 424w, https://substackcdn.com/image/fetch/$s_!ZELi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png 848w, https://substackcdn.com/image/fetch/$s_!ZELi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png 1272w, https://substackcdn.com/image/fetch/$s_!ZELi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZELi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png" width="1456" height="860" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:860,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZELi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png 424w, https://substackcdn.com/image/fetch/$s_!ZELi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png 848w, https://substackcdn.com/image/fetch/$s_!ZELi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png 1272w, https://substackcdn.com/image/fetch/$s_!ZELi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd3a313-f052-47bd-9987-b508c7966a16_1600x945.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In development, it&#8217;s possible to create an application for testing directly from the application package, without publishing. 
Here is the flow I have been using:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cTBL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cTBL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png 424w, https://substackcdn.com/image/fetch/$s_!cTBL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png 848w, https://substackcdn.com/image/fetch/$s_!cTBL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png 1272w, https://substackcdn.com/image/fetch/$s_!cTBL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cTBL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png" width="1371" height="379" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:379,&quot;width&quot;:1371,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cTBL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png 424w, https://substackcdn.com/image/fetch/$s_!cTBL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png 848w, https://substackcdn.com/image/fetch/$s_!cTBL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png 1272w, https://substackcdn.com/image/fetch/$s_!cTBL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8169c024-7613-4db0-9dc8-3bec8715c895_1371x379.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3>Initial Setup</h3><p>The file tree below includes everything you need to get a repo started:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kVjA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kVjA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png 424w, https://substackcdn.com/image/fetch/$s_!kVjA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png 848w, 
https://substackcdn.com/image/fetch/$s_!kVjA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png 1272w, https://substackcdn.com/image/fetch/$s_!kVjA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kVjA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png" width="913" height="408" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:408,&quot;width&quot;:913,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kVjA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png 424w, https://substackcdn.com/image/fetch/$s_!kVjA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png 848w, 
https://substackcdn.com/image/fetch/$s_!kVjA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png 1272w, https://substackcdn.com/image/fetch/$s_!kVjA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F254f0f9d-2163-401e-9b06-326ebe61fe66_913x408.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>On the Snowflake side, it goes without saying you&#8217;ll need a Snowflake account and, depending on the type of app you&#8217;re 
building, some data to test with.</p><p>Because testing within Snowflake with data was such an important part of my development workflow, I set up an integration with GitHub from the outset. Here are the queries you&#8217;ll need to set up a Snowflake secret, use it to create an API integration and a Snowflake Git Repository:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7anz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7anz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png 424w, https://substackcdn.com/image/fetch/$s_!7anz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png 848w, https://substackcdn.com/image/fetch/$s_!7anz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png 1272w, https://substackcdn.com/image/fetch/$s_!7anz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7anz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png" width="1055" height="436" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:436,&quot;width&quot;:1055,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7anz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png 424w, https://substackcdn.com/image/fetch/$s_!7anz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png 848w, https://substackcdn.com/image/fetch/$s_!7anz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png 1272w, https://substackcdn.com/image/fetch/$s_!7anz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf5062dd-52b1-4f76-8295-1c7ac107bc1e_1055x436.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Next, pull your repo in from GitHub and verify:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0ssO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0ssO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png 424w, https://substackcdn.com/image/fetch/$s_!0ssO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png 848w, 
https://substackcdn.com/image/fetch/$s_!0ssO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png 1272w, https://substackcdn.com/image/fetch/$s_!0ssO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0ssO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png" width="778" height="185" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:185,&quot;width&quot;:778,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0ssO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png 424w, https://substackcdn.com/image/fetch/$s_!0ssO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png 848w, 
https://substackcdn.com/image/fetch/$s_!0ssO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png 1272w, https://substackcdn.com/image/fetch/$s_!0ssO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ac95a-512e-4cb1-8bba-cf102b18bed8_778x185.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Finally, build your application package and application from the git repo. Make sure to build using the path to your app if not at the root of the repo:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_P_U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_P_U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png 424w, https://substackcdn.com/image/fetch/$s_!_P_U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png 848w, https://substackcdn.com/image/fetch/$s_!_P_U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png 1272w, https://substackcdn.com/image/fetch/$s_!_P_U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_P_U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png" width="959" height="233" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:233,&quot;width&quot;:959,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_P_U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png 424w, https://substackcdn.com/image/fetch/$s_!_P_U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png 848w, https://substackcdn.com/image/fetch/$s_!_P_U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png 1272w, https://substackcdn.com/image/fetch/$s_!_P_U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F777a97e0-8c77-4b33-bbb6-bcf89720b76c_959x233.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><h3>Required Components</h3><p><strong>setup_script.sql</strong>: A script that creates objects 
within the application object that are required by the app. It runs when a consumer installs or upgrades the app, or when the provider creates/upgrades the application object for testing the package.</p><p><strong>manifest.yml</strong>: Contains configuration information and the setup_script.sql path, to be used by the application package to create and manage a Snowflake app.</p><p><strong>environment.yml (required if using Streamlit)</strong>: Lists dependencies and versions used within the app. The channel must be set to Snowflake, and it&#8217;s recommended to specify Streamlit versions here.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LylG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LylG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png 424w, https://substackcdn.com/image/fetch/$s_!LylG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png 848w, https://substackcdn.com/image/fetch/$s_!LylG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png 1272w, https://substackcdn.com/image/fetch/$s_!LylG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LylG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png" width="512" height="92" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:92,&quot;width&quot;:512,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LylG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png 424w, https://substackcdn.com/image/fetch/$s_!LylG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png 848w, https://substackcdn.com/image/fetch/$s_!LylG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png 1272w, https://substackcdn.com/image/fetch/$s_!LylG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206b2f77-98b6-4256-9448-6497459c3fd6_512x92.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Optionally, you might also consider a marketplace.yml file to prevent the user from ending up with an unusable app by validating region 
resources before install/upgrade. It also adds resource requirements to the marketplace listing.</p><p></p><h3>Permission Management</h3><p>&#8220;<em>As per our enforced requirements, all account-level privileges and references listed in the application package manifest file must be requested from the consumer through Snowsight or the Python Permission SDK. The <a href="https://docs.snowflake.com/en/developer-guide/native-apps/requesting-privs#privileges-the-provider-can-request-from-the-consumer">account level privileges</a> listed in the manifest should be requested via the <a href="https://docs.snowflake.com/en/developer-guide/native-apps/requesting-ui#request-privileges-from-the-consumer">Permissions SDK in a Streamlit</a>. You can do this by following these <a href="https://docs.snowflake.com/en/developer-guide/native-apps/requesting-ui#workflow-for-creating-an-interface-to-approve-privileges-and-bind-references">steps</a>. </em>&#8221;</p><ul><li><p>Snowflake Marketplace Operations Rep (via submission review email)</p></li></ul><p>Certain app functionality, such as reading or modifying consumer data, requires the consumer to explicitly grant privileges to the app. These privileges must be both listed in the manifest and requested by the app. The Permissions SDK makes it easy to embed permission requests in the Streamlit UI. Here&#8217;s a function you can call directly within your Streamlit entrypoint file. 
Just be sure to import snowflake.permissions as permissions first.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eyOs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eyOs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png 424w, https://substackcdn.com/image/fetch/$s_!eyOs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png 848w, https://substackcdn.com/image/fetch/$s_!eyOs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png 1272w, https://substackcdn.com/image/fetch/$s_!eyOs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eyOs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png" width="1456" height="684" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:684,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eyOs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png 424w, https://substackcdn.com/image/fetch/$s_!eyOs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png 848w, https://substackcdn.com/image/fetch/$s_!eyOs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png 1272w, https://substackcdn.com/image/fetch/$s_!eyOs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4728f0-d684-49a5-a1d3-e8c9a8804911_1600x752.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p></p><h3>Application Versioning</h3><p>Snowflake Native Apps support versioning through application packages, allowing developers to maintain and upgrade apps safely. Best practices include keeping at most two active versions at a time, using versioned schemas for stateless code and regular schemas for persistent data, and ensuring backward compatibility between versions. Release directives can be used to control which version consumers install, and upgrades should be coordinated carefully to avoid disrupting active users.</p><p></p><h3>Frontend Development: Streamlit</h3><p>While it&#8217;s not technically required for a Native App to have a UI, we opted for one. In Snowflake Native, that means Streamlit. Now, Streamlit is great for what it is, allowing devs to quickly spin up simple frontends for demoing or internal uses. When the goal is to expose a feature-rich product for a marketplace, Streamlit is not the tool I&#8217;d usually reach for. 
Additionally, Snowflake Native overrides or simply does not support many of Streamlit&#8217;s features. Some of these limitations are documented, and some aren&#8217;t. The documented list is linked at the end of the article. It&#8217;s worth reviewing before writing any code, as it will likely change the approach to the application or even determine whether it&#8217;s feasible at all.</p><p>A few of the undocumented issues I encountered:</p><ul><li><p>Background and element styling: Background styling just does not work. I assume this is due to Snowflake&#8217;s own internal light/dark theming. Styling on other elements may or may not stick. Lastly, text color shouldn&#8217;t be touched, because Snowflake&#8217;s light/dark themes already include appropriately contrasting text colors. If you override them, your colors will be static and must be tested on both light and dark backgrounds for visibility.</p></li><li><p>Navigation: Streamlit&#8217;s st.navigation doesn&#8217;t work, so a multipage app must use the older /pages approach. This means nested, folder-based navigation is not possible.</p></li></ul><p></p><h3>Backend Development</h3><p>The current backend for <a href="https://app.snowflake.com/marketplace/listing/GZTSZ1DS0HH/twing-data-twing-pulse">Twing Pulse</a> is a set of queries stored in the app/code_artifacts/Streamlit/libraries directory. 
To leverage these queries to return data, we use Snowpark sessions, which can be created/retrieved like this:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9l6v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9l6v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png 424w, https://substackcdn.com/image/fetch/$s_!9l6v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png 848w, https://substackcdn.com/image/fetch/$s_!9l6v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png 1272w, https://substackcdn.com/image/fetch/$s_!9l6v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9l6v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png" width="1248" height="263" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:263,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9l6v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png 424w, https://substackcdn.com/image/fetch/$s_!9l6v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png 848w, https://substackcdn.com/image/fetch/$s_!9l6v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png 1272w, https://substackcdn.com/image/fetch/$s_!9l6v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8c634d-d02f-48e0-9986-d2efd079198e_1248x263.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The session returned from this function is used throughout the app to run queries and return data in pandas dataframes like this:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Rsk2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rsk2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png 424w, https://substackcdn.com/image/fetch/$s_!Rsk2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png 848w, https://substackcdn.com/image/fetch/$s_!Rsk2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png 1272w, https://substackcdn.com/image/fetch/$s_!Rsk2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rsk2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png" width="1172" height="252" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:252,&quot;width&quot;:1172,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rsk2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png 424w, https://substackcdn.com/image/fetch/$s_!Rsk2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png 848w, https://substackcdn.com/image/fetch/$s_!Rsk2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png 1272w, https://substackcdn.com/image/fetch/$s_!Rsk2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d46d322-7a6a-4a5f-9c77-844e404602b4_1172x252.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The current version of <a href="https://app.snowflake.com/marketplace/listing/GZTSZ1DS0HH/twing-data-twing-pulse">Twing Pulse</a> gets slightly more complex from here with query parsing and recursive traversal logic used for query pattern detection, for example. 
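</p><p>Before going further, the session-and-query pattern shown above can be sketched roughly as follows. This is an illustration rather than the Twing Pulse code: get_session and run_query are hypothetical helper names, and it assumes the app runs inside Streamlit in Snowflake, where Snowpark&#8217;s get_active_session is available. A small stand-in session is included so the sketch also runs outside Snowflake:</p>

```python
def get_session():
    """Return the active Snowpark session (only works inside Snowflake)."""
    # Imported lazily so this module can still be loaded outside Snowflake.
    from snowflake.snowpark.context import get_active_session
    return get_active_session()


def run_query(session, query):
    """Run a SQL string and return whatever to_pandas() yields."""
    # In Snowpark, Session.sql() builds a DataFrame and to_pandas()
    # fetches the result as a pandas DataFrame.
    return session.sql(query).to_pandas()


class _FakeResult:
    """Stand-in for a Snowpark DataFrame, for local illustration only."""

    def __init__(self, rows):
        self._rows = rows

    def to_pandas(self):
        # A real Snowpark DataFrame returns pandas here; this returns a list.
        return self._rows


class FakeSession:
    """Stand-in for a Snowpark session; records the queries it receives."""

    def __init__(self):
        self.queries = []

    def sql(self, query):
        self.queries.append(query)
        return _FakeResult([{"ONE": 1}])


# Outside Snowflake, exercise the helper against the stand-in session.
fake = FakeSession()
print(run_query(fake, "SELECT 1 AS ONE"))
```

<p>Inside the app you would pass get_session() instead of the stand-in. Keeping the session as an explicit argument is a design choice of this sketch; it makes the query helpers easy to exercise without a Snowflake connection.</p><p>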
But much deeper complexity is available, from native user-defined functions (UDFs) and stored procedures, all the way to external functions, which are developed, maintained, stored, and executed outside of Snowflake&#8217;s environment. We&#8217;re early in our journey in Snowflake Native, and our exploration is ongoing. Perhaps as Native is developed further and we continue to explore, we will someday use external functions to natively connect our fully-featured Twing Data product for select clients.</p><p></p><h3>Publishing</h3><p>Once the app&#8217;s ready for the world, it&#8217;s time to publish. If the app is for internal use only, publishing offers systematic versioning and streamlined distribution throughout the org, as well as simpler access control. And, obviously, if it&#8217;s destined for the public marketplace, publishing is required. Public publishing has some additional steps, so I&#8217;ll cover this pathway here.</p><p>Recall from the Mental Model section that the application package, and not the application itself, is what is distributed to consumers. The downstream application used during development exists only for that purpose and has nothing to do with the publishing process. From here we&#8217;ll only be referring to the application package itself.</p><p>The next step is to set package distribution to external:</p><p>&#8216;ALTER APPLICATION PACKAGE &lt;package_name&gt; SET DISTRIBUTION = EXTERNAL;&#8217;</p><p>This will likely prompt you to accept some terms of service. Going forward, with this setting, every time the application package is updated, an automatic security scan kicks off. For our app, the scan typically took anywhere from several seconds to several minutes to complete. The scan status appears in the review_status column of the results from this query:<br>&#8216;SHOW VERSIONS IN APPLICATION PACKAGE &lt;package_name&gt;;&#8217;<br><br>If the current version/patch shows &#8216;REJECTED&#8217;, you have some work to do. 
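</p><p>A minimal sketch of how that check could be scripted in Python rather than read off by eye. It assumes the rows returned by SHOW VERSIONS have been converted to dictionaries with lowercase version, patch and review_status keys (an assumption about the result shape, so check it against your own output), and the helper name rejected_versions is hypothetical:</p>

```python
from typing import Dict, List, Tuple


def rejected_versions(rows: List[Dict[str, object]]) -> List[Tuple[str, int]]:
    """Return (version, patch) pairs whose review_status is REJECTED.

    `rows` is assumed to be the SHOW VERSIONS result as a list of dicts,
    for example [row.as_dict() for row in session.sql(...).collect()]
    when using Snowpark.
    """
    rejected = []
    for row in rows:
        # Normalize the status so 'rejected' and ' REJECTED ' both match.
        status = str(row.get("review_status", "")).strip().upper()
        if status == "REJECTED":
            rejected.append((str(row["version"]), int(row["patch"])))
    return rejected


# Hand-written sample rows for illustration only; not real scan output.
sample_rows = [
    {"version": "V1", "patch": 0, "review_status": "REJECTED"},
    {"version": "V1", "patch": 1, "review_status": "ACCEPTED"},
]
print(rejected_versions(sample_rows))
```

<p>An empty list means no version or patch in the supplied rows is marked as rejected.</p><p>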
Ours failed many times during development, and it was never clear why, or where to find the error logs. Luckily, removing a few lines from the setup_script.sql file corrected the issue for us. Worst case scenario, the docs say that upon failure of the automatic review, the package goes into manual review, which can take days to complete.</p><p>When the review_status column shows &#8216;ACCEPTED&#8217;, you&#8217;re in the clear to proceed.</p><p>Next, you must set default and, if necessary, custom release directives for the app package. A release directive specifies which version of the application is available to consumers. Custom release directives can be assigned per consumer if desired. You&#8217;ll find those queries at the bottom of this useful worksheet screenshot showing all the queries you&#8217;ll need to work through this process:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HNtc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c578185-9777-486e-b046-da1f6395296f_1184x673.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HNtc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c578185-9777-486e-b046-da1f6395296f_1184x673.png 424w, https://substackcdn.com/image/fetch/$s_!HNtc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c578185-9777-486e-b046-da1f6395296f_1184x673.png 848w, https://substackcdn.com/image/fetch/$s_!HNtc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c578185-9777-486e-b046-da1f6395296f_1184x673.png 1272w, 
https://substackcdn.com/image/fetch/$s_!HNtc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c578185-9777-486e-b046-da1f6395296f_1184x673.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HNtc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c578185-9777-486e-b046-da1f6395296f_1184x673.png" width="1184" height="673" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c578185-9777-486e-b046-da1f6395296f_1184x673.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:673,&quot;width&quot;:1184,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HNtc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c578185-9777-486e-b046-da1f6395296f_1184x673.png 424w, https://substackcdn.com/image/fetch/$s_!HNtc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c578185-9777-486e-b046-da1f6395296f_1184x673.png 848w, https://substackcdn.com/image/fetch/$s_!HNtc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c578185-9777-486e-b046-da1f6395296f_1184x673.png 1272w, 
https://substackcdn.com/image/fetch/$s_!HNtc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c578185-9777-486e-b046-da1f6395296f_1184x673.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Every version you publish, as defined by its release directive, must pass the entire review process. In addition to the security scan, this involves an in-depth manual review upon publishing.</p><p>Now that we&#8217;re ready to submit for manual review, go to Data Products -&gt; Provider Studio in the left-hand navbar of the Snowsight UI. 
In the top right you&#8217;ll see a blue button that says &#8216;Create Listing&#8217;. Here you can select whether to publish to the Internal or Public Marketplaces or for another specific consumer. Then name your app and select the type and availability:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bCH8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bCH8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png 424w, https://substackcdn.com/image/fetch/$s_!bCH8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png 848w, https://substackcdn.com/image/fetch/$s_!bCH8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png 1272w, https://substackcdn.com/image/fetch/$s_!bCH8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bCH8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png" width="760" height="484" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:760,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bCH8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png 424w, https://substackcdn.com/image/fetch/$s_!bCH8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png 848w, https://substackcdn.com/image/fetch/$s_!bCH8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png 1272w, https://substackcdn.com/image/fetch/$s_!bCH8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb52ed5b4-ae5b-48f5-9b0a-5427ab47e108_760x484.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Next, follow the prompts, assigning the intended application package from the list of packages with release directives, then filling in the details in the UI as required.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7wXd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7wXd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png 424w, https://substackcdn.com/image/fetch/$s_!7wXd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png 
848w, https://substackcdn.com/image/fetch/$s_!7wXd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png 1272w, https://substackcdn.com/image/fetch/$s_!7wXd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7wXd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png" width="726" height="254" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:254,&quot;width&quot;:726,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7wXd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png 424w, https://substackcdn.com/image/fetch/$s_!7wXd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png 848w, 
https://substackcdn.com/image/fetch/$s_!7wXd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png 1272w, https://substackcdn.com/image/fetch/$s_!7wXd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859be9c6-f900-4e2f-b802-5ddfe4f60fc7_726x254.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>When all required sections and fields are filled in, the publish button at the top will become active. 
Clicking that button turns it yellow and shows a message indicating that the listing is under review.</p><p>Congratulations! The app is submitted!</p><h3>The Review Process</h3><p>The initial feedback took approximately 5 days to arrive. The review made clear that Snowflake takes the process seriously. The issues they surfaced covered code, privileges, and UI. While the message included many directives, it also included questions, which suggested the reviewers recognize that context affects whether a given decision is appropriate.</p><p>Snowflake replies promptly once the process has started, and requires by policy that the publisher do the same. I found the Snowflake contact to be warm, engaging, informative, and collaborative. A great first experience, to be sure!</p><h3>That&#8217;s a Wrap</h3><p>Developing for the Snowflake Native ecosystem makes a whole lot of sense for the right application. For instance, building a product destined to run within the isolated consumer account alleviates the need to worry about infrastructure. On the other hand, its intentionally restrictive nature greatly limits the types of solutions possible, trading flexibility for greater security and governance.</p><p>We have a ways to go in exploring the full capabilities of the framework. While we&#8217;ve found it takes some rethinking and compromising to fit our existing approaches into this new model, we believe Snowflake Native is likely a simple way to natively expose our features to clients while protecting our IP. <a href="https://app.snowflake.com/marketplace/listing/GZTSZ1DS0HH/twing-data-twing-pulse">Twing Pulse</a> in its current state is merely a test. As this thesis is validated, many more of our full product&#8217;s features will be on the way. 
If you&#8217;re a Snowflake user and are interested in deeper observability to connect how your system is actually used with Snowflake&#8217;s execution and pricing behaviors, give <a href="https://app.snowflake.com/marketplace/listing/GZTSZ1DS0HH/twing-data-twing-pulse">Twing Pulse</a> a download. It&#8217;s free! All we ask is that you send some feedback to us so we can better prioritize the highest-impact improvements and feature migrations.</p><p>For an idea of what&#8217;s to come to <a href="https://app.snowflake.com/marketplace/listing/GZTSZ1DS0HH/twing-data-twing-pulse">Twing Pulse</a>, reach out to us for demo links to our full platform at twingdata.com.</p><p>Have you developed for the Snowflake Native Marketplace? What was your experience? Let us know below!</p><h3>Some Useful Docs Sources:</h3><p>Snowflake unsupported Streamlit features: <a href="https://docs.snowflake.com/en/developer-guide/native-apps/adding-streamlit">https://docs.snowflake.com/en/developer-guide/native-apps/adding-streamlit</a></p><p><a href="https://docs.snowflake.com/en/developer-guide/streamlit/limitations#unsupported-streamlit-features">https://docs.snowflake.com/en/developer-guide/streamlit/limitations#unsupported-streamlit-features</a></p><p>Privileges:</p><p><a href="https://docs.snowflake.com/en/user-guide/security-access-control-privileges">https://docs.snowflake.com/en/user-guide/security-access-control-privileges</a></p><p>Functions and Procedures:</p><p><a href="https://docs.snowflake.com/en/developer-guide/extensibility">https://docs.snowflake.com/en/developer-guide/extensibility</a></p><p>External Functions:</p><p><a href="https://docs.snowflake.com/en/sql-reference/external-functions-introduction">https://docs.snowflake.com/en/sql-reference/external-functions-introduction</a></p>]]></content:encoded></item><item><title><![CDATA[Make API Data Engineering Fun with DuckDB]]></title><description><![CDATA[An example of using DuckDB to simplify the often frustrating API to database workflow]]></description><link>https://blog.twingdata.com/p/make-api-data-engineering-fun-with</link><guid isPermaLink="false">https://blog.twingdata.com/p/make-api-data-engineering-fun-with</guid><dc:creator><![CDATA[Dan Goldin]]></dc:creator><pubDate>Fri, 04 Apr 2025 15:12:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HJzX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HJzX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!HJzX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!HJzX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!HJzX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!HJzX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HJzX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1413353,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/160583266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HJzX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!HJzX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!HJzX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!HJzX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89204eed-3c84-44ae-a572-9f736dfa79b2_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These days so much data engineering work is paging through an API, parsing the responses, storing them in a database, and then joining them with other data to build an actually useful product.</p><p>It&#8217;s not glamorous work. In fact, it&#8217;s a grind dealing with the various APIs and figuring out the right incantation of parameters to get the data you want. Not to mention dealing with failures, rate limiting, undocumented &#8220;features,&#8221; and everything else under the sun.</p><p>And after you have the data, you have to parse the results, transform them so they&#8217;re easy to work with, and join them with the rest of the data you have.</p><p>What&#8217;s worked for us is integrating <a href="https://duckdb.org/">DuckDB</a> into our analysis workflows. For those unfamiliar, it&#8217;s akin to running a small data warehouse within your script. It has a standalone mode, but you can also simply &#8220;import&#8221; it into a Python script and interact with it using SQL, just like you would with any database. It&#8217;s basically an OLAP version of SQLite.</p><p>A sample project we&#8217;ve done is to fetch some data from a customer&#8217;s database and then hit the vendor API with those ids to enrich the data. 
This vexing workflow becomes a lot simpler (and maybe even fun) with DuckDB.</p><p>The basic idea is to just dump the API responses as JSON into a directory and then use DuckDB to quickly query the semi-structured data.</p><p>Here&#8217;s the basic folder structure:</p><p>/script.py<br>/data/data_type1/file1.json<br>/data/data_type1/file2.json</p><h2>Steps:</h2><ol><li><p>Pull the ids you want to enrich</p></li></ol><p>This can be from the database or another API. The key thing is that this is just a simple list without the details since those need to be retrieved via a separate API call.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wViq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wViq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png 424w, https://substackcdn.com/image/fetch/$s_!wViq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png 848w, https://substackcdn.com/image/fetch/$s_!wViq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png 1272w, https://substackcdn.com/image/fetch/$s_!wViq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wViq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png" width="467" height="261" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:261,&quot;width&quot;:467,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40522,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/160583266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wViq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png 424w, https://substackcdn.com/image/fetch/$s_!wViq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png 848w, https://substackcdn.com/image/fetch/$s_!wViq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png 1272w, https://substackcdn.com/image/fetch/$s_!wViq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e97ca00-2c96-4987-8f22-efa70241539c_467x261.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><ol start="2"><li><p>Pull the ids you have already fetched from the API</p></li></ol><p>We want to be efficient and not refetch data we already have. 
In this case we can use DuckDB and do something like <code>SELECT DISTINCT id FROM read_json_auto('/data/data_type1/*json');</code> to get the unique set of ids we already have data for.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D8SO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D8SO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png 424w, https://substackcdn.com/image/fetch/$s_!D8SO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png 848w, https://substackcdn.com/image/fetch/$s_!D8SO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png 1272w, https://substackcdn.com/image/fetch/$s_!D8SO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D8SO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png" width="775" height="155" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:155,&quot;width&quot;:775,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29571,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/160583266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D8SO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png 424w, https://substackcdn.com/image/fetch/$s_!D8SO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png 848w, https://substackcdn.com/image/fetch/$s_!D8SO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png 1272w, https://substackcdn.com/image/fetch/$s_!D8SO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db80966-ac63-494b-83a0-c38ad73a0400_775x155.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ol start="3"><li><p>Compare the two and fetch the new data</p></li></ol><p>Generate the set difference between steps 1 and 2 above to identify the new data we need to fetch and then run the script 
to fetch the data from the APIs. The result of this can be dumped into the data folder above as simple JSON without worrying about parsing or any transformations. Just dump it into the folder. For larger tasks you can also parallelize the calls to run through the fetches quicker.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L4fa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L4fa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png 424w, https://substackcdn.com/image/fetch/$s_!L4fa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png 848w, https://substackcdn.com/image/fetch/$s_!L4fa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png 1272w, https://substackcdn.com/image/fetch/$s_!L4fa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L4fa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png" width="468" height="518" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4da84393-3601-4069-a7dc-98218065d96e_468x518.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:518,&quot;width&quot;:468,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85863,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/160583266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L4fa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png 424w, https://substackcdn.com/image/fetch/$s_!L4fa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png 848w, https://substackcdn.com/image/fetch/$s_!L4fa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png 1272w, https://substackcdn.com/image/fetch/$s_!L4fa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da84393-3601-4069-a7dc-98218065d96e_468x518.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><ol start="4"><li><p>Use DuckDB to bulk read the JSON data</p></li></ol><p>Now this is where DuckDB shines. We can use the &#8220;read_json_auto&#8221; function to convert the semi-structured JSON data into a proper table we can query via SQL. If it&#8217;s a small project you can just query it directly. 
Otherwise, you can persist it and make it joinable with other data sets.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZfOY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZfOY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png 424w, https://substackcdn.com/image/fetch/$s_!ZfOY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png 848w, https://substackcdn.com/image/fetch/$s_!ZfOY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png 1272w, https://substackcdn.com/image/fetch/$s_!ZfOY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZfOY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png" width="1146" height="376" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:1146,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91828,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/160583266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZfOY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png 424w, https://substackcdn.com/image/fetch/$s_!ZfOY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png 848w, https://substackcdn.com/image/fetch/$s_!ZfOY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png 1272w, https://substackcdn.com/image/fetch/$s_!ZfOY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12dd72fe-d2fe-491d-b05b-f1223e2e45f0_1146x376.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol start="5"><li><p>Profit</p></li></ol><p>It&#8217;s an incredibly powerful tool and worth considering as part of any data transformation workflow. 
Dealing with APIs, polling, and transformations is never fun &#8211; but DuckDB makes these workflows much better.</p><div><hr></div><p>Full code <a href="https://gist.github.com/dangoldin/f45016fa2684ab2bb15741d2b83cc260">here</a>.</p>]]></content:encoded></item><item><title><![CDATA[Extracts, PDTs, and Embedded DBs: How BI Tools Shape Data Warehouse Economics]]></title><description><![CDATA[Understanding these caching differences can help optimize query performance and control expenses, especially in consumption-based warehouses like Snowflake]]></description><link>https://blog.twingdata.com/p/extracts-pdts-and-embedded-dbs-how</link><guid isPermaLink="false">https://blog.twingdata.com/p/extracts-pdts-and-embedded-dbs-how</guid><dc:creator><![CDATA[Dan Goldin]]></dc:creator><pubDate>Wed, 26 Mar 2025 15:58:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A common trap we&#8217;ve fallen into is treating the cost of the data warehouse as separate from the cost of its associated business intelligence tools. It&#8217;s natural to think of them as separate, but the workings of a BI tool can have a major impact on your data warehouse operations and the associated bill.</p><p>This became clear to us when we spoke to someone who used <a href="https://www.strategysoftware.com/">MicroStrategy</a> (it&#8217;s not just Bitcoin) as their BI tool. It&#8217;s not the hippest tool, but one thing it does well is a concept called &#8220;extracts&#8221; &#8211; a way of caching the underlying data. 
Most enterprise-focused tools offer similar functionality:</p><ul><li><p><a href="https://www.tableau.com/">Tableau</a> and <a href="https://www.microsoft.com/en-us/power-platform/products/power-bi">Power BI</a> (traditional BI apps that offer a desktop client) store some data on the client, which avoids the need to hit a database on every request</p></li><li><p><a href="https://cloud.google.com/looker?hl=en">Looker</a> offers similar functionality via its persistent derived tables (PDTs), but the PDTs are stored within the database rather than on the client</p></li><li><p>Newer tools, such as <a href="https://omni.co/">Omni</a> and <a href="https://www.rilldata.com/">Rill</a>, improve on this functionality by embedding <a href="https://duckdb.org/">DuckDB</a> in their web client, or in a dedicated proxy layer</p></li></ul><p>These sorts of decisions end up affecting the overall cost and performance of the BI system, especially with consumption-based warehouses like Snowflake.</p><p>One dimension to think about here is how intelligent the caching is. The more intelligent the caching, the fewer requests your system will need to make back to the data warehouse &#8211; but there&#8217;s a limit based on the memory available. An aggressive caching strategy will take a superset of dimensions and metrics and store them client-side in case they need to be pulled. This will increase the hit rate of subsequent requests, but require more memory. A less aggressive caching strategy will reduce the memory footprint but also lower the hit rate. 
The ideal solution here finds the optimal balance based on predicted query behavior.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KKfE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KKfE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png 424w, https://substackcdn.com/image/fetch/$s_!KKfE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png 848w, https://substackcdn.com/image/fetch/$s_!KKfE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png 1272w, https://substackcdn.com/image/fetch/$s_!KKfE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KKfE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png" width="694" height="664" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:664,&quot;width&quot;:694,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KKfE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png 424w, https://substackcdn.com/image/fetch/$s_!KKfE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png 848w, https://substackcdn.com/image/fetch/$s_!KKfE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png 1272w, https://substackcdn.com/image/fetch/$s_!KKfE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F455ee022-5aae-4d6f-a43d-2952f4e6b4d2_694x664.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The other dimension is where the extract or cached data is stored. As mentioned, Looker keeps it in the underlying data warehouse, while other tools keep it in an intermediate cache or on the client. 
Tools like <a href="https://www.greybeam.ai/">Greybeam</a> can intercept your queries and route them to an internal cache and compute layer if they deem it will provide a cost-effective performance boost.</p><p>You can imagine an optimal DW/BI system looking like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v-Ql!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v-Ql!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png 424w, https://substackcdn.com/image/fetch/$s_!v-Ql!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png 848w, https://substackcdn.com/image/fetch/$s_!v-Ql!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png 1272w, https://substackcdn.com/image/fetch/$s_!v-Ql!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v-Ql!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png" width="981" height="666" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:666,&quot;width&quot;:981,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v-Ql!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png 424w, https://substackcdn.com/image/fetch/$s_!v-Ql!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png 848w, https://substackcdn.com/image/fetch/$s_!v-Ql!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png 1272w, https://substackcdn.com/image/fetch/$s_!v-Ql!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa670f9ff-7763-4c59-a288-38995e35b7f7_981x666.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Your data warehouse and BI tools aren&#8217;t standalone components. 
The most effective implementations treat the data warehouse and BI tool as a unified system, where each component's strengths amplify the other's performance.</p>]]></content:encoded></item><item><title><![CDATA[Your Data Warehouse is a Hammer — But Not Everything is a Nail]]></title><description><![CDATA[At scale, specialized workloads demand specialized tools beyond a single data warehouse]]></description><link>https://blog.twingdata.com/p/your-data-warehouse-is-a-hammer-but</link><guid isPermaLink="false">https://blog.twingdata.com/p/your-data-warehouse-is-a-hammer-but</guid><dc:creator><![CDATA[Dan Goldin]]></dc:creator><pubDate>Wed, 19 Mar 2025 18:08:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6hQj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6hQj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6hQj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png 424w, https://substackcdn.com/image/fetch/$s_!6hQj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png 848w, 
https://substackcdn.com/image/fetch/$s_!6hQj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png 1272w, https://substackcdn.com/image/fetch/$s_!6hQj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6hQj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png" width="919" height="564" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21d95cc7-780a-426a-9202-de749969a470_919x564.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:564,&quot;width&quot;:919,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82922,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/159428595?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6hQj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png 424w, https://substackcdn.com/image/fetch/$s_!6hQj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png 848w, 
https://substackcdn.com/image/fetch/$s_!6hQj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png 1272w, https://substackcdn.com/image/fetch/$s_!6hQj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21d95cc7-780a-426a-9202-de749969a470_919x564.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Imagine a company that&#8217;s collecting tons of data points per day. 
They start with a traditional OLTP database, but soon realize they need OLAP for their analytics workloads &#8211; so they start using a data warehouse.</p><p>From there, they inevitably build more and more workloads on top of the data warehouse:</p><ul><li><p>The product and application teams are running queries to incorporate reporting data into the customer-facing dashboards.</p></li><li><p>Data science is querying raw events to feed into their models.</p></li><li><p>The analytics team is using dbt to do additional transformation and Looker to build dashboards for the executives and internal teams.</p></li><li><p>The data team is using it to do complex data transformations to support the use cases above.</p></li><li><p>The finance team is extracting data from some arcane Looker report into Power BI.</p></li></ul><p>All teams and stakeholders are using the same data tool &#8211; that&#8217;s the dream, right? Well, in our experience, this setup will only take you so far. At a certain point, you&#8217;ll be better served by picking the right tool(s) for the job.</p><p>Snowflake, or basically any other data warehouse, acts as a pretty good initial data transformation layer &#8211; it does make it dead easy to get up and running with a large data warehouse. But the tradeoff is that it will be expensive unless you finely tune it, and it&#8217;ll be quite complex, since you&#8217;ll likely have multiple stakeholders trying to achieve different ends. It&#8217;s akin to using the same car to cart a family to weekend sports, then to go off-roading in the Sierra Nevada, and then to race in Formula One.</p><p>Additionally, it won&#8217;t give you the same low-level control that another tool would. Spark, for example, tends to be better for a very large data task that you really do need to optimize at a low level, or for working with data scientists who want to bring in a lot more custom logic or custom libraries. 
ClickHouse, on the other hand, is designed for very fast querying, but it&#8217;s more limited: it&#8217;s built less for joins or complex SQL queries than for very fast queries against time series data.</p><p>For the example above, the solution may be to keep Snowflake for the last mile of reports that don&#8217;t have subsecond performance requirements. Then introduce Spark to handle the initial expensive data transformations and data science modeling tasks. And if you&#8217;re feeling good, introduce ClickHouse and a tool such as Rill for the interactive internal analytics.</p><p>There&#8217;s no one-size-fits-all solution in big data, as much as we may want it.</p><p>That&#8217;s why we&#8217;re always reminding clients to be intentional about what they&#8217;re using the data for, in order to implement the stack that works best for their needs. We&#8217;re also excited about tools like Apache Iceberg and the trend to &#8220;bring your own compute&#8221; to the data at hand. Using Iceberg lets you keep the data neutral rather than locked into a single warehouse. Instead, you&#8217;re able to pick whatever warehouse and compute layer you need to do the data processing.</p><p>If you&#8217;re beginning to feel pain from a single-warehouse data stack, start with an audit of workloads, and analyze each function&#8217;s cost-performance dynamics. What are the actual use cases and workloads that your warehouse is powering? For example, maybe data transformation is responsible for 40% of your data budget. Maybe BI accounts for another 20%. Make sure to weigh the costs, in terms of both infrastructure and people, against the relative value of the output. After you have a good sense of that breakdown, you&#8217;ll be better positioned to look into more specialized solutions that can efficiently handle the specific workflows. And that means getting to work with the precision of a scalpel, rather than blindly hammering away.</p><p>Unsure of where to start? 
We&#8217;re here to help &#8211; just hit reply to get in touch.</p>]]></content:encoded></item><item><title><![CDATA[Warehouse Pricing: Snowflake]]></title><description><![CDATA[The real cost of Snowflake and how to reduce it.]]></description><link>https://blog.twingdata.com/p/warehouse-pricing-snowflake-now-with</link><guid isPermaLink="false">https://blog.twingdata.com/p/warehouse-pricing-snowflake-now-with</guid><dc:creator><![CDATA[Simon]]></dc:creator><pubDate>Thu, 13 Mar 2025 16:01:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!92pB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!92pB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!92pB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif 424w, https://substackcdn.com/image/fetch/$s_!92pB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif 848w, https://substackcdn.com/image/fetch/$s_!92pB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif 1272w, 
https://substackcdn.com/image/fetch/$s_!92pB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!92pB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59318,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://simontwing.substack.com/i/158996988?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!92pB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif 424w, https://substackcdn.com/image/fetch/$s_!92pB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif 848w, 
https://substackcdn.com/image/fetch/$s_!92pB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif 1272w, https://substackcdn.com/image/fetch/$s_!92pB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dfe3ead-7a4d-4b58-a5a2-ef8d4045b65f_2670x1780.avif 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image by <a href="https://unsplash.com/@blankerwahnsinn">Fabian Blank</a> on <a 
href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p>[Note: this post corrects a previous version that had a few placeholders instead of links]</p><p>Previously, we explored the pricing models of <a href="https://blog.twingdata.com/p/the-real-cost-of-bigquery">BigQuery</a> and <a href="https://blog.twingdata.com/p/the-real-cost-of-redshift-warehousing">Redshift</a> (and <a href="https://blog.twingdata.com/p/redshift-serverless-pricing">Serverless Redshift</a>). In this article, we'll look at Snowflake's pricing models and optimizations that can be made to both reduce cost (the savings can be significant!) and improve performance.</p><p>Snowflake offers a wide range of features, which give us a lot of options when it comes to reducing cost and improving performance. However, many of these options come with pros and cons that vary from use case to use case, so we&#8217;ll also look at ways to ensure these tradeoffs are well-informed.</p><p>Since these tradeoffs can be complex, a good way to get started is with some proven best practices. We&#8217;ve compiled some below, tried and tested with our client base.</p><h2>Understanding Snowflake&#8217;s pricing model</h2><p>Snowflake&#8217;s flexible pricing model is mainly based on three components: Compute, Storage, and Cloud Services. Snowflake offers contracts with reduced pricing, or &#8220;Capacity pricing&#8221;. Some more advanced features may incur additional costs.</p><p>The prices below are as of March 1, 2025, and more details can be found on Snowflake&#8217;s <a href="https://www.snowflake.com/legal-files/CreditConsumptionTable.pdf">Service Consumption Table</a> (pdf). 
More general pricing information can be found on their <a href="https://www.snowflake.com/en/pricing-options/">Pricing Options page</a>.</p><h3>Compute</h3><p>Snowflake offers a credit-based pricing model in which warehouses of different sizes consume credits at different rates:</p><ul><li><p>X-Small: <strong>1 credit/hour</strong></p></li><li><p>Small: <strong>2 credits/hour</strong></p></li><li><p>Medium: <strong>4 credits/hour</strong></p></li><li><p>Large: <strong>8 credits/hour</strong></p></li><li><p>&#8230; and so on up to 6XL (the credit rate doubles with each size)</p></li></ul><p>The cost per credit varies between about $2 and over $6 depending on the plan, region, and platform (AWS, Azure, and GCP are currently supported).</p><p>To save credits, warehouses can auto-suspend when not in use.</p><h3>Storage</h3><p>Storage is charged separately from Compute and is billed by TB per month. This cost varies by plan and region but ranges between $23 and $46 per TB/month.</p><p>Data is compressed before it is stored, which reduces the actual storage cost.</p><h3>Cloud Services</h3><p>Snowflake's cloud services are priced using a fair-use model. Services such as SSO and query optimization do consume credits, but 
these fees are typically small and are absorbed into compute costs as long as cloud services usage doesn't exceed 10% of compute costs.</p><h3>Tiers</h3><ul><li><p>On-Demand</p><ul><li><p>Pay-as-you-go model</p></li></ul></li><li><p>Capacity</p><ul><li><p>Pre-purchased credits: per-credit rates may be reduced by 30% or more</p></li><li><p>Storage costs can also be reduced</p></li></ul></li></ul><h3>Other Costs</h3><ul><li><p>Charges apply for transferring data to external services and between regions.</p></li><li><p>Snowflake doesn&#8217;t charge an ingress fee, but other services might have their own egress fees.</p></li><li><p>Some other features are not included in the &#8220;fair-use&#8221; model: Snowpipe, Materialized View Maintenance, Search Optimization.</p></li></ul><h2>Realistic pricing breakdown example</h2><p>These rates are reasonable but they can add up quickly. Consider this real-world example: BigDataCo, an Ad Tech company, needs to process real-time ad impressions, clickstream data, and audience segmentation. BigDataCo&#8217;s employees regularly run complex queries on historical data to optimize ad targeting, measure campaign performance, and refine audience segmentation strategies.</p><p><strong>Data Volume:</strong> ~300 TB stored</p><p><strong>Users:</strong> 25 data engineers, 100 analysts, automated workloads</p><p><strong>Compute Needs:</strong> Multiple warehouses (Small, Medium &amp; Large), Snowpipe for streaming data ingestion</p><ul><li><p>Storage</p><ul><li><p>Let&#8217;s say we want to store 300TB (and this doesn&#8217;t change). 
At $23/TB/month, this comes out to <strong>$6,900/month</strong>.</p></li></ul></li><li><p>Compute</p><ul><li><p>Let&#8217;s assume ETL runs a large warehouse for 240 hours/month, analysts run 240 hours&#8217; worth of medium warehouse compute, and automated workloads run small warehouses for a monthly total of 120 hours.</p></li><li><p>Adding this all up, we get (240 * 8) + (240 * 4) + (120 * 2) for a total of 3,120 credits.</p></li><li><p>Billed at $3/credit, this comes out to <strong>$9,360/month</strong>.</p></li></ul></li><li><p>Cloud Services - Data transfer</p><ul><li><p>Snowflake does not provide specific rates, but for this example we can assume it is negligible.</p></li></ul></li><li><p>Total</p><ul><li><p>$6,900 + $9,360 = <strong>$16,260/month</strong></p></li></ul></li></ul><h2>Reducing cost and improving performance</h2><p>There are plenty of ways to optimize warehouse cost and performance and this is by no means an exhaustive list. These are some ways we&#8217;ve reduced cost and improved efficiency for our customers.</p><h3>Capacity Pricing</h3><p>Using Capacity pricing can reduce per-credit costs by up to 30%. If we can accurately predict our spend for the example above, a 30% discount brings compute from $9,360 down to $6,552, which reduces our monthly spend to <strong>$13,452</strong> &#8211; an overall reduction of 17%! At larger spend, Capacity pricing can save on storage costs as well.</p><p>However, accurately predicting future usage (to avoid paying for a larger commitment than needed) requires understanding historical usage, workload patterns, and expected growth.</p><p>Breaking down previous usage patterns can give more accurate results. For example, if we expect our above ETL workloads to increase by 20%, analysis workloads to increase by 10%, and automated workloads to stay constant, we can calculate that our credit usage will increase from 3,120 to 3,600 (assuming the warehouse sizes don&#8217;t change). 
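</p><p>As a rough sketch of that forecast, the arithmetic can be written as a single query. The hours, credit rates, and growth factors (1.20, 1.10, 1.00) are simply the assumptions from the example above; in practice the hours would come from your own usage history rather than constants:</p><pre><code>-- Projected monthly credits per workload: hours * credits/hour * growth factor
select
    240 * 8 * 1.20 as etl_credits,        -- Large warehouse, +20%
    240 * 4 * 1.10 as analyst_credits,    -- Medium warehouse, +10%
    120 * 2 * 1.00 as automated_credits,  -- Small warehouse, flat
    (240 * 8 * 1.20) + (240 * 4 * 1.10) + (120 * 2 * 1.00) as projected_credits;
-- 2,304 + 1,056 + 240 = 3,600 credits</code></pre><p>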
This can give a more accurate prediction than a more holistic approach would.</p><p>After receiving reduced rates, it&#8217;s important to set up regular reviews to compare expected and actual usage and adjust warehouse sizes, schedules, or commitment levels as needed.</p><h4>Optimizations</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O9Dd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O9Dd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png 424w, https://substackcdn.com/image/fetch/$s_!O9Dd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png 848w, https://substackcdn.com/image/fetch/$s_!O9Dd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!O9Dd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O9Dd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png" width="1456" height="972" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:972,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!O9Dd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png 424w, https://substackcdn.com/image/fetch/$s_!O9Dd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png 848w, https://substackcdn.com/image/fetch/$s_!O9Dd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!O9Dd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b5d48c2-ac18-48cb-b406-610c9dae1fa3_1600x1068.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image by <a href="https://unsplash.com/@icons8">Icons8 Team</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p>Snowflake operates on a few principles that are important to consider holistically. Unlike many other databases, Snowflake is highly optimized for analytics and uses a combination of micro-partitions and columnar storage to improve query performance and storage efficiency.</p><p>Micro-partitions are immutable storage units of about 50MB to 500MB (when uncompressed). 
Within each micro-partition, data is stored by column rather than by row: the values of each column are stored together physically, which allows for better compression and improved read performance.</p><p>Understanding how the data is physically organized helps in identifying high-level performance bottlenecks by allowing us to visualize how a query might run: does it need to get data from lots of different columns and partitions (slow) or can we reasonably expect data to be co-located (fast)?</p><h4>Data Ingestion</h4><p>To get the most out of Snowflake&#8217;s compression, be mindful of how data will be fed into Snowflake storage from other sources.</p><p>If you are ingesting data from S3 or other cloud storage, be thoughtful about the file format. We found that zstd Parquet generates both the smallest files (cheap storage) and fastest ingestion (when using vectorized load).</p><p>See which columns are actually used and consider moving less frequently used columns to other tables &#8211; especially if that data has sensitive information that should have different retention rules. This way you can avoid unnecessarily reprocessing data.</p><p>Be careful of array types. They are often convenient to store and to write queries against, but query performance suffers. If there's a way of "exploding" the data to be in rows or lookup tables, you might be better off.</p><p>Lastly, take a close look at platform- and region-specific transfer pricing. Moving data across regions and between different platforms incurs additional costs compared to data moving within the same region and platform.</p><h4>Optimizing Columnar Data Storage</h4><p>Since Snowflake stores data by column, structural decisions go a long way. Here is one high-level improvement that helps keep Snowflake&#8217;s compression space-efficient:</p><p>Replace long strings (e.g. &#8220;United States&#8221;) with a short code or numeric ID and a lookup table (for example, a &#8220;country_id&#8221; value of &#8220;US&#8221; maps to a lookup table &#8220;name&#8221; value of &#8220;United States&#8221;). This helps with query execution speed as well.</p><p>The bottom line is that Snowflake&#8217;s columnar storage benefits most from efficient table structuring. Applying strategies like this one improves both compression efficiency and query performance, thereby reducing the overall cost regardless of plan, region, or whether it&#8217;s prepaid or pay-as-you-go.</p><h4>Warehouse Sizes</h4><p>Warehouses can be set to different sizes depending on the tasks they are responsible for. In our example above, we&#8217;re spending 1,920 credits/month on the Large ETL warehouse. Maybe there's an opportunity to reduce the warehouse size during off-peak times when a large warehouse is over-sized. Alternatively, maybe queries will run slower but still be acceptable given the cost savings.</p><p>Multi-cluster warehouses can scale the number of clusters automatically, but the warehouse size itself is adjusted manually (or programmatically) if more optimization is needed. Adjusting warehouse sizes can be a quick way to save credits, but making a warehouse too small can lead to slow or queued queries, so care should be taken.</p><h4>Data Retention</h4><p>A seemingly obvious improvement is to reduce how much is being stored: this might mean reducing duplication (e.g. with a lookup table), clearing out old data, or moving old data to a cheaper storage service like AWS S3 or Google Cloud Storage (optionally in an open table format such as Iceberg). That data can then be queried from Snowflake without having to import it: Snowflake&#8217;s <a href="https://docs.snowflake.com/en/user-guide/tables-external-intro">&#8220;external tables&#8221; feature</a> keeps only metadata about the external files within Snowflake, while the data itself stays in S3 or elsewhere.</p><p>A well-structured data retention policy balances cost, performance, and accessibility. 
The key is automating archival, using external tables for historical data, and monitoring storage usage to optimize costs.</p><p>These methods are simple on the surface but require careful implementation to ensure there are no side effects and any performance trade-offs are acceptable. Any long-term change like this should be modeled and its costs estimated before implementation.</p><h4>Idle Time</h4><p>There are startup and shutdown costs associated with warehouse usage (each time a warehouse starts, it is billed for a minimum of 60 seconds). Idle warehouses are shut down automatically after a period of inactivity but they can also be shut down manually if they&#8217;re not expected to be used again for some time. One of our clients does exactly this because they have a predictable ETL workload, so turning the warehouse off after their tasks are done is an easy way for them to save on their bill.</p><h4>Query Batching</h4><p>The flip side of managing idle time is batching queries. Since there&#8217;s a hidden cost associated with warehouses starting up and shutting down, running queries together in batches cuts idle time, especially if warehouses are idle frequently without shutting down. It&#8217;s cheaper to run a warehouse at 100% capacity for an hour than at 50% capacity for two hours.</p><p>Similarly, partitions are cached on the warehouse level, so if we have queries that read the same partitions, we can automatically speed queries up by running them on the same warehouse (assuming it&#8217;s not shut down and spun back up in between queries).</p><h4>Query Optimizations</h4><p>Many of the topics covered above (especially in &#8220;Optimizing Columnar Data Storage&#8221;) improve query performance (and therefore reduce credit usage). Snowflake offers some additional features that can speed things up.</p><p><strong><a href="https://docs.snowflake.com/en/user-guide/views-materialized">Materialized views</a></strong> are pre-computed results that speed queries up, but are re-calculated when data changes. 
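</p><p>As a hedged illustration, a materialized view over the ad-impression data from the earlier BigDataCo example might look like the sketch below. The table and column names (ad_impressions, campaign_id, event_ts) are hypothetical, and materialized views require Snowflake&#8217;s Enterprise Edition or higher:</p><pre><code>-- Pre-compute daily impression counts so dashboards avoid rescanning raw events.
-- ad_impressions, campaign_id and event_ts are placeholder names.
create materialized view daily_campaign_impressions as
select
    campaign_id,
    date_trunc('day', event_ts) as event_day,
    count(*) as impressions
from ad_impressions
group by campaign_id, date_trunc('day', event_ts);</code></pre><p>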
There&#8217;s a clear tradeoff between speed and storage cost and cost of re-calculation (or &#8220;refresh cost&#8221;), so this approach is best when there are frequent reads that depend on seldom-changing data. MVs can be thought of as memoization.</p><p><strong><a href="https://docs.snowflake.com/en/user-guide/dynamic-tables-intro">Dynamic tables</a></strong> are a newer feature with some key differences. Instead of refreshing when data changes, they refresh at a given interval. Additionally, instead of storing the query result, dynamic tables store a full snapshot of the data.</p><p>In our work, however, we&#8217;ve found that these features are better on paper than in practice. As always, the best way to get meaningful insights is to test, test, and test.</p><p>For a more in depth comparison, take a look at the breakdown below:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/btq64/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b651d14-840e-42b1-9da1-2104cc262dc5_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:505,&quot;title&quot;:&quot;Dynamic Tables vs Materialized Views&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/btq64/2/" width="730" height="505" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p><strong><a 
href="https://docs.snowflake.com/en/user-guide/tables-clustering-keys#what-is-a-clustering-key">Clustering keys</a></strong> let you control how rows are physically co-located in micro-partitions; for example, clustering by &#8220;date&#8221; improves performance when filtering on date ranges. Without a clustering key, Snowflake simply writes data in the order it arrives, which may have little to do with how it is queried, so this approach is great for large tables where queries scan too many partitions. Data can be re-clustered in the future as data is added, though doing so uses credits, so make sure clustering is overall favorable beforehand!</p><p>As an example, consider a large table (let&#8217;s say it&#8217;s 500TB and contains 5 billion rows) that&#8217;s frequently filtered by order_date. Clustering by order_date can lead to dramatic results:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/TC0N6/3/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d49decb-037b-4b3c-96d9-a55eb9bd7573_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:372,&quot;title&quot;:&quot;With and Without Clustering Keys&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/TC0N6/3/" width="730" height="372" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>That&#8217;s over $90,000, or almost 85%, in annual savings &#8211; more than enough for a few pizza parties! And remember, Capacity pricing can reduce compute costs further. Our friends at <a href="https://select.dev">select.dev</a> have written an <a href="https://select.dev/posts/snowflake-clustering">article dedicated to clustering</a> that has more details on the topic.</p><h2>Summary</h2><p>As you can see, there are a lot of levers to pull, each with its own pros and cons &#8211; as with any decision in software engineering. Historical usage provides invaluable insight into these big decisions, and Twing Data can help you on your quest by letting you visualize historical usage and patterns.</p><p>In our experience, and as mentioned throughout this article, the best way to quantitatively measure any changes is to test and inspect the results. Twing Data would be happy to work with you to identify opportunities to optimize your data warehouse so you can have more pizza parties.</p>]]></content:encoded></item><item><title><![CDATA[The three Snowflake queries big data doesn’t want you to know about]]></title><description><![CDATA[Leverage Snowflake&#8217;s metadata for big cost savings. 
It's all there, you just need to know where to look.]]></description><link>https://blog.twingdata.com/p/the-three-snowflake-queries-big-data</link><guid isPermaLink="false">https://blog.twingdata.com/p/the-three-snowflake-queries-big-data</guid><dc:creator><![CDATA[Dan Goldin]]></dc:creator><pubDate>Thu, 06 Mar 2025 15:19:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GqO3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GqO3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GqO3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!GqO3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!GqO3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!GqO3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!GqO3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2035801,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/158519721?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GqO3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!GqO3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!GqO3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!GqO3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4416f8b5-10a0-4896-8175-1e6b08b6534e_1024x1024.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>We&#8217;ve been talking about some shortcomings of Snowflake recently (see <a href="https://blog.twingdata.com/p/same-query-different-hash-an-exploration">this post</a> about query hashing blind spots), so today we&#8217;re changing tack to give credit where credit is due. Snowflake actually does a great job capturing all sorts of query metadata that you can use to explore your warehouse usage. 
Below, we&#8217;re sharing three queries that have been especially useful in helping us analyze our customers&#8217; data warehouses and identify optimization opportunities.</p><h2>Table size vs Table usage</h2><pre><code>with usage as (
   select
   obj.value:objectId::number as table_id,
   obj.value:objectName::string as table_name,
   APPROX_COUNT_DISTINCT(query_id) as num_queries,
   APPROX_COUNT_DISTINCT(case when array_size(objects_modified) = 0 then query_id end) as num_select_queries,
   min(query_start_time) as first_usage,
   max(query_start_time) as last_usage
   from snowflake.account_usage.access_history,
   table(flatten(base_objects_accessed)) obj
   where query_start_time &gt;= '2025-02-01'
   group by all
)
select at.table_id, at.table_catalog, at.table_schema, at.table_name, at.table_owner, at.table_type, at.is_transient, at.clustering_key,
at.row_count, at.bytes / 1e6 as megabytes, at.created, at.last_altered, at.last_ddl,
at.table_catalog || '.' || at.table_schema || '.' || at.table_name as fully_qualified_name,
u.num_queries, u.num_select_queries, u.first_usage, u.last_usage
from (select *
     from snowflake.account_usage.TABLES
     where deleted is null
) at
left join usage u on at.table_id = u.table_id
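-- Optional sketch, not part of the original query: to list only tables with no
-- recorded access in the window, uncomment the filter below. A table_id missing
-- from the usage CTE means nothing in ACCESS_HISTORY referenced the table since
-- the start date used above.
-- where u.table_id is null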
order by at.bytes desc;</code></pre><p>Snowflake provides a neat view, SNOWFLAKE.ACCOUNT_USAGE.TABLES, that has all sorts of metadata about each table, ranging from size and number of rows to the most recent time both the table structure and contents were modified. What makes this even more powerful is joining it against the SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY view. This lets you see how often each table is actually being queried, and whether it&#8217;s the target of a modifying operation (e.g. UPDATE, INSERT) or only read (e.g. SELECT).</p><p>You will likely discover tables that haven&#8217;t been touched in years and can probably be removed. You will also notice that some tables have a disconnect between the number of times they&#8217;re written and the number of times they&#8217;re read, which makes them another optimization candidate. Finally, you can take the additional step of looking at how the largest tables are used to see if you actually need them to be as big as they are. Two examples here are fields that are rarely queried and the lookback period for time series data: you may have a table that goes back multiple years and discover that no one has written a query looking back more than 6 months.</p><h2>Usage of columns for a single table</h2><pre><code>WITH parsed_access_history AS (
 SELECT
   QUERY_ID,
   query_start_time,
   OBJECT_NAME,
   COLUMN_NAME
 FROM (
   SELECT
     QUERY_ID,
     query_start_time,
     OBJECTS.VALUE:objectName::STRING AS OBJECT_NAME,
     COLUMNS.VALUE:columnName::STRING AS COLUMN_NAME
   FROM
     snowflake.account_usage.access_history,
     TABLE(FLATTEN(INPUT =&gt; BASE_OBJECTS_ACCESSED)) OBJECTS,
     TABLE(FLATTEN(INPUT =&gt; OBJECTS.VALUE:columns)) COLUMNS
   WHERE query_start_time &gt;= '2025-02-01'
 )
)
SELECT
 column_name,
 count(1) as cnt
FROM
 parsed_access_history
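WHERE
 -- Placeholder: use the fully qualified name as it appears in ACCESS_HISTORY (typically upper case).
 OBJECT_NAME = 'DB.SCHEMA.TABLE'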
GROUP BY all;</code></pre><p>Another use of the SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY view is to look at the usage of a table by column. The view exposes both a DIRECT_OBJECTS_ACCESSED column and a BASE_OBJECTS_ACCESSED column, so you can see the objects a query named directly (such as views) as well as the underlying base tables whose data was accessed. You can also join these against the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view on QUERY_ID to pull the query type and the query text to dig in more.</p><p>This is useful for spotting columns with drastically different usage patterns, which can signal that it may be worth creating other versions of the table, especially if they are high cardinality.</p><h2>Inefficient warehouse idle settings</h2><pre><code>WITH warehouse_events AS (
 SELECT *
 FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_EVENTS_HISTORY e
 WHERE event_name IN ('RESUME_WAREHOUSE', 'SUSPEND_WAREHOUSE')
   AND event_state = 'STARTED'
   AND e.warehouse_name = 'WH_NAME'
   AND e.timestamp &gt;= '2025-02-01'
   AND e.timestamp &lt; '2025-03-01'
),
minute_bounds AS (
 SELECT
   MIN(DATE_TRUNC('MINUTE', CAST(timestamp AS timestamp))) as min_ts,
   MAX(DATE_TRUNC('MINUTE', CAST(timestamp AS timestamp))) as max_ts
 FROM warehouse_events
),
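minute_sequence AS (
 -- One row per minute between the first and last event; 50000 rows covers roughly 34 days.
 -- Note that seq4() is not guaranteed to be gap-free, so spot-check the output for missing minutes.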
 SELECT
   DATEADD(minute, seq4(), (SELECT min_ts FROM minute_bounds))::timestamp as minute_start
 FROM TABLE(GENERATOR(ROWCOUNT =&gt; 50000))
 WHERE minute_start &lt;= (SELECT max_ts FROM minute_bounds)
),
active_periods AS (
 SELECT
   warehouse_id,
   warehouse_name,
   event_name,
   CAST(timestamp AS timestamp) as start_time,
   LEAD(CAST(timestamp AS timestamp)) OVER (PARTITION BY warehouse_id ORDER BY timestamp) as end_time
 FROM warehouse_events
),
minute_activity AS (
 SELECT
   m.minute_start,
   DATEADD(minute, 1, m.minute_start) as minute_end,
   a.warehouse_id,
   a.warehouse_name,
   GREATEST(a.start_time, m.minute_start) as period_start,
   LEAST(COALESCE(a.end_time, CURRENT_TIMESTAMP()), DATEADD(minute, 1, m.minute_start)) as period_end
 FROM minute_sequence m
 LEFT JOIN active_periods a
   ON a.event_name = 'RESUME_WAREHOUSE'
   AND m.minute_start &lt; COALESCE(a.end_time, CURRENT_TIMESTAMP())
   AND DATEADD(minute, 1, m.minute_start) &gt; a.start_time
),
base_activity AS (
 SELECT
   COALESCE(warehouse_id, (SELECT warehouse_id FROM warehouse_events LIMIT 1)) as warehouse_id,
   COALESCE(warehouse_name, 'WH_NAME') as warehouse_name,
   minute_start as timestamp,
   period_end &gt; period_start as is_active,
   SUM(COALESCE(
     DATEDIFF('microsecond', period_start, period_end) / 60000000.0,
     0
   )) as pct_active
 FROM minute_activity
 GROUP BY ALL
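 -- Keep only minutes that were partially active, i.e. the warehouse resumed or suspended mid-minute.
 HAVING pct_active &gt; 0 AND pct_active &lt; 1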
),
grouped_activity AS (
 SELECT
   *,
   DATEDIFF('minute', LAG(timestamp) OVER (ORDER BY timestamp), timestamp) != 1 AS new_group
 FROM base_activity
),
numbered_groups AS (
 SELECT
   *,
   SUM(CASE WHEN new_group THEN 1 ELSE 0 END) OVER (ORDER BY timestamp) as group_id
 FROM grouped_activity
),
counts_for_min AS (
 SELECT
   warehouse_id,
   warehouse_name,
   MIN(timestamp) as start_timestamp,
   COUNT(*) as minutes_active,
   AVG(pct_active) as avg_pct_active
 FROM numbered_groups
 GROUP BY warehouse_id, warehouse_name, group_id
 ORDER BY start_timestamp
)
SELECT
 DATE(start_timestamp) as ymd,
 minutes_active,
 COUNT(1) as cnt
FROM counts_for_min
GROUP BY ALL
ORDER BY ymd, minutes_active;</code></pre><p>This one is long, but does something simple: it counts runs of consecutive minutes in which a warehouse has been going through a resume/suspend loop. You probably already know that you can reduce your Snowflake cost by coming up with a more intelligent idle timeout, instead of the default, which at 5 minutes is almost always too high. What you may not know, hidden in the Snowflake docs, is that you are charged for at least 60 seconds each time a warehouse resumes, even if it only runs for 10 seconds. That means if your warehouse goes through multiple resume/suspend sessions during a minute, you will be charged for multiple minutes of activity. For example, three 10-second sessions inside the same minute are billed as 180 seconds rather than 30. The query above is meant to identify these cases where a warehouse resumes, runs a query, then suspends, only to be resumed again within the same minute. This is a sign that the idle timeout may be too low. That doesn&#8217;t mean increasing the timeout will be better, since during other times a lower timeout may be ideal, but it usually warrants another look.</p><p>In an ideal world, you&#8217;d have the following flow for each warehouse session: start the warehouse, run whatever queries you need, then explicitly suspend the warehouse without needing to rely on the auto idle settings. This is only possible in well-defined and controlled workloads. Any sort of BI tool usage likely means you have limited ability to predict usage, so your warehouse will inevitably have periods where it fluctuates between being on and off.</p><p>These are just a few example queries to explore what&#8217;s happening under the hood. Snowflake captures rich metadata, though its dashboards don&#8217;t always make it easily accessible. With some effort, you really can answer an unlimited number of questions; writing the query is the easy part. 
The real challenge is knowing what to ask.</p>]]></content:encoded></item><item><title><![CDATA[The Real Cost of BigQuery]]></title><description><![CDATA[Why Serverless?]]></description><link>https://blog.twingdata.com/p/the-real-cost-of-bigquery</link><guid isPermaLink="false">https://blog.twingdata.com/p/the-real-cost-of-bigquery</guid><dc:creator><![CDATA[Adam Kabak]]></dc:creator><pubDate>Wed, 26 Feb 2025 18:37:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jSBz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jSBz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jSBz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png 424w, https://substackcdn.com/image/fetch/$s_!jSBz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png 848w, https://substackcdn.com/image/fetch/$s_!jSBz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jSBz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jSBz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png" width="495" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:495,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:274257,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/157979636?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jSBz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png 424w, https://substackcdn.com/image/fetch/$s_!jSBz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png 848w, https://substackcdn.com/image/fetch/$s_!jSBz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jSBz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9270e173-8de3-47e2-b312-ab9e5e695f22_495x342.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><h2>Why Serverless? Why BigQuery?</h2><p>In 2017, The Economist published an <a href="https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data">article</a> titled &#8220;The World&#8217;s Most Valuable Resource is No Longer Oil, But Data&#8221;. 
</p><p>Data is both the fuel that drives AI progress and the core business of the largest companies in the world, so it&#8217;s no wonder that modern companies of all sizes, stages and industries implement a data strategy with a bias towards storing as much as possible, just in case. Because the hardware overhead and technical requirements of building and managing these systems the traditional way put them out of reach of small or agile businesses, cloud providers have emerged to fill the gap. These providers leverage their scale to offer data systems affordable and accessible enough for businesses of any size. At the pinnacle of accessibility sits the serverless model, where the provider securely and dynamically distributes resources across accounts based on demand, serving them within managed environments that abstract much or all of the complexity involved. In exchange, the user gets a simple way to reliably store and access their data at relatively transparent pricing.</p><p>Google Cloud&#8217;s implementation, BigQuery (which we&#8217;ll shorthand as BQ), is a shining example of serverless infra. 
Google&#8217;s BQ documentation is robust, and while there will be some information overlap, the goal here is to add real-world usage context.</p><p></p><h2>Getting Into the Guts of BigQuery</h2><p>At a high level, BigQuery (BQ) is a managed system that abstracts and integrates several proprietary elements, as well as many independent Google Cloud Platform (GCP) services. These elements populate, support or interact within BQ&#8217;s two main segments: the Data Plane and the Control Plane.</p><ul><li><p><strong>Data Plane</strong>: Encompasses the storage (Colossus), compute/execution (Dremel), and networking (Jupiter) layers, which contain the logic and resources for storing and reading data as well as query processing. It executes queries by reading, processing and returning data from Colossus via Dremel using allocated slots.</p></li></ul><ul><li><p><strong>Control Plane</strong>: Responsible for metadata tracking, high-level resource allocation, query parsing/validation, and job orchestration.</p></li></ul><p>Here is the 50,000 ft view of the architecture at large:</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!im4t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!im4t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png 424w, https://substackcdn.com/image/fetch/$s_!im4t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png 848w, 
https://substackcdn.com/image/fetch/$s_!im4t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png 1272w, https://substackcdn.com/image/fetch/$s_!im4t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!im4t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png" width="1185" height="652" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:1185,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:680223,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/157979636?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!im4t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png 424w, 
https://substackcdn.com/image/fetch/$s_!im4t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png 848w, https://substackcdn.com/image/fetch/$s_!im4t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png 1272w, https://substackcdn.com/image/fetch/$s_!im4t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c9105d-ac98-40ca-99bd-16af383584ae_1185x652.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It&#8217;s a complex system, but the result is the orchestration of several highly-optimized components working in sequence and in parallel to dynamically handle a wide variety of conditions and applications. As this is a general overview and not a deep dive into one particular component, we&#8217;ll stick to examining two significant elements in the Data Plane at mid-depth.</p><p><strong><a href="https://www.systutorials.com/colossus-successor-to-google-file-system-gfs/">Colossus</a></strong>, the underlying storage layer for several GCP services, including BQ and Google Cloud Storage (GCS), is fundamentally aligned with serverless principles; BQ&#8217;s integration with Colossus stands out by fully managing data-loss prevention with automatic multi-region data replication, at no additional cost. Within Colossus, BQ stores data in <strong><a href="https://cloud.google.com/blog/products/bigquery/inside-capacitor-bigquerys-next-generation-columnar-storage-format">Capacitor</a></strong> format, its proprietary, highly-compressed columnar storage structure. This formatting is also fully managed, with no configuration options available. Additionally, a storage optimizer periodically rewrites files in a format that is more efficient for querying. The two primary manual storage optimizations, partitioning and clustering, which we&#8217;ll touch on later, offer some user control, but even clustering on changing data requires a layer of automated re-clustering that&#8217;s managed internally.</p><p>The pattern of highly abstracted complexity doesn&#8217;t stop at Colossus. 
<strong><a href="https://research.google/pubs/dremel-interactive-analysis-of-web-scale-datasets-2/">Dremel</a></strong>, the query compute engine, rewrites or breaks down executed queries into stages, then dynamically provisions very small resource units, called &#8220;slots&#8221;, to run workload stages in parallel across all existing organizations. The exact algorithm governing resource provisioning is, of course, not publicly available, but generally it&#8217;s based on a range of factors like query complexity, overall system load, and even a query&#8217;s particular execution stage. The goal of optimizing efficiency, and by extension minimizing waste, is balanced against the need to ensure that some level of compute capacity remains available, even for inactive organizations (they call this &#8220;Fair Scheduling&#8221;).</p><p>With so many layers of management systems, the natural question is, &#8220;How well does all this work?&#8221; While a full answer is beyond the scope of this section, it&#8217;s largely an academic question: users have no direct control over these internal mechanisms, aside from workload-level optimizations or switching providers. Given the sheer scale and automation of BigQuery, its performance remains a modern engineering feat, even if there are some inefficiencies.</p><p></p><h2>Breaking Down BigQuery Costs: Storage and Compute</h2><p>While storage and compute resources are largely automated, there are still many ways to configure when and how they&#8217;re used.</p><h3>Storage</h3><p>BigQuery offers two primary storage pricing models, Logical and Physical Bytes Storage Billing (LBSB and PBSB respectively). 
Each model has a half-price long-term version as well, which is automatically set per table after 90 days without modification to the table.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X3aN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X3aN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png 424w, https://substackcdn.com/image/fetch/$s_!X3aN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png 848w, https://substackcdn.com/image/fetch/$s_!X3aN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png 1272w, https://substackcdn.com/image/fetch/$s_!X3aN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X3aN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png" width="530" height="247" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:247,&quot;width&quot;:530,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/157979636?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X3aN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png 424w, https://substackcdn.com/image/fetch/$s_!X3aN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png 848w, https://substackcdn.com/image/fetch/$s_!X3aN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png 1272w, https://substackcdn.com/image/fetch/$s_!X3aN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce704fa1-4c17-40da-8235-f3a1fcb2c7ba_530x247.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Metadata storage has a fairly elaborate pricing model, and for good reason: metadata is stored differently than user data because it&#8217;s accessed frequently by the system and is crucial for warehouse functionality. If metadata storage begins to surpass the 2% + 10 GiB free tier threshold, it likely indicates that there is room to reduce metadata generation (we&#8217;ll touch on optimizations in the next section).</p><p>While not technically part of BigQuery, it&#8217;s also worth mentioning that strategically integrating GCS into BigQuery operations can enable other ways to save money on storage. Aside from handling unstructured and semi-structured data storage, GCS is commonly used for storing archived data at a fraction of the cost. 
Most practically, GCS is ideal for staging tables due to its capacity for large-scale data ingestion, and broad support of data and compression formats. Also, hybrid pipelines can use GCS as the data bridge that connects to other storage systems.</p><h3>Compute</h3><p>Fully separated from storage pricing, compute pricing options range from fully on-demand to reserved capacity, or even a blend of both.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JT-d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JT-d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png 424w, https://substackcdn.com/image/fetch/$s_!JT-d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png 848w, https://substackcdn.com/image/fetch/$s_!JT-d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png 1272w, https://substackcdn.com/image/fetch/$s_!JT-d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JT-d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png" 
width="625" height="517" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:517,&quot;width&quot;:625,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71659,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.twingdata.com/i/157979636?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JT-d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png 424w, https://substackcdn.com/image/fetch/$s_!JT-d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png 848w, https://substackcdn.com/image/fetch/$s_!JT-d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png 1272w, https://substackcdn.com/image/fetch/$s_!JT-d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd302c6ed-a16d-419a-af74-156505ee5cd4_625x517.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>On demand: You&#8217;re charged based on the amount of data scanned. There is no minimum usage, and the first 1 TiB per month is free. Fair scheduling reasonably ensures that each query has equal access to available slots, whether the org is highly active or mostly inactive. On-demand usage is limited to 100 concurrent queries and/or 2,000 slots at a time (per project per org), with two caveats: the slot limit is subject to availability, and if even more slots are available, they might be used to push through small workloads (Google calls this transient burst capability).</p></li></ul><ul><li><p>Capacity based: In this scheme, charging is based on the number of slots &#8220;reserved&#8221;. 
This reservation can be a static amount, or a min/max range can be set to enable autoscaling for highly dynamic or unpredictable requirements. An optional 1- or 3-year commitment is available for the Enterprise and Enterprise Plus tiers, offering discounts in exchange for continuous billing on a fixed amount of persistent compute capacity. While this ensures resource availability at all times, any autoscale compute used on top of the committed slots is charged at the undiscounted rate. At higher volumes and commitment levels, it&#8217;s also possible to negotiate additional discounts. Finally, the pay-as-you-go option, available on all capacity-based tiers, is similar to on-demand in that charges only occur while resources are in use. One caveat here is that the configured minimum of reserved slots is always the floor, no matter how small the workload. Another is that resources made available for a workload stay up for 60 seconds after use. With certain usage cadences, this is worth investigating.</p></li></ul><p>These pricing options can be pieced together like Lego bricks to fit your needs. Slots can be reserved per team, project, folder, or even workload type. Scaling on reserved slots can use capacity-based autoscaling or on-demand slots.</p><p>Finally, it&#8217;s worth mentioning that when these capacity-based &#8220;editions&#8221; were introduced a few years ago, Google discontinued the flat-rate + flex slots model, which allowed a more deliberate or scheduled approach to reserved resource scaling. Now, even capacity-based scaling is 100% automated, leaning further away from traditional provisioning models and into a fully managed experience.</p><p></p><h2>Managing a Managed System</h2><p>The fully managed services within BigQuery are highly nuanced and opinionated. Understanding how they work and what they prefer is crucial for optimizing usage. 
Levers like query composition, partitioning/clustering tables, metadata management and more make massive impacts on efficiency and costs. As usual, the official documentation covers optimizations fairly well, but we&#8217;ve found there&#8217;s so much nuance that even talented teams can overlook major points. I&#8217;ll drill into one peculiarity not addressed by BigQuery&#8217;s query optimizer.</p><p>Consider the following query and execution graph:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FHtZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FHtZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png 424w, https://substackcdn.com/image/fetch/$s_!FHtZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png 848w, https://substackcdn.com/image/fetch/$s_!FHtZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png 1272w, https://substackcdn.com/image/fetch/$s_!FHtZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!FHtZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png" width="496" height="315" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:315,&quot;width&quot;:496,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FHtZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png 424w, https://substackcdn.com/image/fetch/$s_!FHtZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png 848w, https://substackcdn.com/image/fetch/$s_!FHtZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png 1272w, https://substackcdn.com/image/fetch/$s_!FHtZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec04798-d565-48e4-b3c3-a58417a1d197_496x315.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Since cte1 is referenced twice, BigQuery executed the CTE both times. 
Under the hood, BQ tends to treat non-recursive CTEs by inlining them in the final query plan as part of the plan creation process.</p><p>A possible optimization is to materialize the data that will be used multiple times:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Of9D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febae5032-cb4f-46be-8d66-8191bb60039a_549x429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Of9D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febae5032-cb4f-46be-8d66-8191bb60039a_549x429.png 424w, https://substackcdn.com/image/fetch/$s_!Of9D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febae5032-cb4f-46be-8d66-8191bb60039a_549x429.png 848w, https://substackcdn.com/image/fetch/$s_!Of9D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febae5032-cb4f-46be-8d66-8191bb60039a_549x429.png 1272w, https://substackcdn.com/image/fetch/$s_!Of9D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febae5032-cb4f-46be-8d66-8191bb60039a_549x429.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Of9D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febae5032-cb4f-46be-8d66-8191bb60039a_549x429.png" width="549" height="429" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebae5032-cb4f-46be-8d66-8191bb60039a_549x429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:429,&quot;width&quot;:549,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Of9D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febae5032-cb4f-46be-8d66-8191bb60039a_549x429.png 424w, https://substackcdn.com/image/fetch/$s_!Of9D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febae5032-cb4f-46be-8d66-8191bb60039a_549x429.png 848w, https://substackcdn.com/image/fetch/$s_!Of9D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febae5032-cb4f-46be-8d66-8191bb60039a_549x429.png 1272w, https://substackcdn.com/image/fetch/$s_!Of9D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febae5032-cb4f-46be-8d66-8191bb60039a_549x429.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Of course, this scenario is a bit contrived. And in this simple example, while slot seconds are saved, billed bytes are equivalent. That means we get increased efficiency, but with on-demand pricing the cost doesn&#8217;t change. With increased CTE complexity, though, all metrics, including bytes processed and billed, can be dramatically affected in expected and unexpected ways. It might appear obvious that caching partial data as an initial step reduces compute and data scanned, but keeping the entire query consolidated can help the optimizer in other ways that offset the CTE reruns.</p><p>The high-level takeaway is that it might be worth testing significant workloads that fit this profile with temp tables.</p><p>But wait, there&#8217;s more! Temp tables in BigQuery are session-based. As long as the session is live, any temp tables created within it persist. In the console, a session in this context really only means per tab, per single or multi-statement execution. 
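</p><p>For readers who can&#8217;t see the screenshots, the two query shapes compared above can be sketched roughly as follows. This is an illustration under loud assumptions: it runs on Python&#8217;s built-in sqlite3 only so that it is runnable anywhere, the <code>events</code> table and its columns are invented, and SQLite says nothing about how BigQuery plans or bills a query. In BigQuery, the second shape would be a multi-statement query that begins with <code>CREATE TEMP TABLE</code>.</p>

```python
import sqlite3

# Hypothetical data: (user_id, amount) rows standing in for a large fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10), (1, 5), (2, 7), (3, 20), (3, 1)],
)

# Shape 1: cte1 is referenced twice, so an engine that inlines CTEs
# may compute the aggregation once per reference.
CTE_QUERY = """
WITH cte1 AS (
    SELECT user_id, SUM(amount) AS total
    FROM events
    GROUP BY user_id
)
SELECT a.user_id, a.total
FROM cte1 AS a
JOIN (SELECT MAX(total) AS max_total FROM cte1) AS b
  ON a.total = b.max_total
"""

# Shape 2: materialize the shared result once, then reference the temp table.
MATERIALIZE = """
CREATE TEMP TABLE cte1_materialized AS
SELECT user_id, SUM(amount) AS total
FROM events
GROUP BY user_id
"""
TEMP_TABLE_QUERY = """
SELECT a.user_id, a.total
FROM cte1_materialized AS a
JOIN (SELECT MAX(total) AS max_total FROM cte1_materialized) AS b
  ON a.total = b.max_total
"""

cte_rows = conn.execute(CTE_QUERY).fetchall()
conn.execute(MATERIALIZE)
temp_rows = conn.execute(TEMP_TABLE_QUERY).fetchall()
print(cte_rows, temp_rows)
```

<p>Both shapes return the same rows. What differs in BigQuery is how often the shared aggregation runs, which is why comparing slot time and bytes billed for each version on a real workload is the meaningful test.</p><p>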
But the real value of sessions is in API usage, through what Google creatively calls BigQuery Sessions. A session can be created during one API call, then referenced in subsequent API calls so that they run in the same session. That means an early API call can generate temp tables and later calls can use them. Essentially, it amounts to a caching layer for larger datasets, where intermediate results remain in BigQuery instead of overloading front or back ends.</p><p>A good example is a dashboard with multiple levels of drill-downs, where complex aggregations are done at each level. Within a session, a temp table can store the data for each level, so the user can navigate down, back out, then down again with no need to wait for redundant heavy computation or rely on client-side storage.</p><p>As I said before, temp tables persist for the duration of the session. A session, if not manually terminated, will deactivate automatically after 24 hours of inactivity or 7 days after creation. Depending on the application, this can have a major impact on a system&#8217;s performance, cost efficiency, and user experience.</p><p></p><h2>The TL;DR on BigQuery</h2><p>BigQuery and its surrounding ecosystem form an incredible complex of intricate, interlocking systems operating at enormous scale. Every piece contains many layers of automated management based on unknowable conditions. With almost no barriers, it really just works.</p><p>Having said that&#8230; (Curb, anyone?), working <em>with</em> the system instead of <em>against</em> it requires a relatively deep understanding of proprietary behaviors. In a traditional warehouse, user management means managing cluster resources, tuning query execution, and provisioning capacity. 
In BigQuery, control shifts from explicit resource management to understanding the inner workings of BigQuery&#8217;s execution model, slot allocation, caching mechanisms, and query optimizations.</p><p><em>We put together a similar guide for Redshift last month; you can find it <a href="https://blog.twingdata.com/p/the-real-cost-of-redshift-warehousing">here</a>.</em></p><p>At Twing Data, we specialize in saving clients time and money on their data systems. By analyzing warehouse metadata, our stack-agnostic product builds end-to-end system models based on how they&#8217;re actually used, surfacing underused resources, overlooked workload inefficiencies, suboptimal warehouse scaling, and much more.</p><p><strong>Curious if there&#8217;s room to optimize your data spend?</strong> We&#8217;d love to take a look and share what we find, no strings attached.</p><p>Stay tuned for a deeper dive into optimizing within BigQuery.</p>
]]></content:encoded></item><item><title><![CDATA[Trust But Verify: Evaluating Security When Choosing Data Infrastructure Tools]]></title><description><![CDATA[A data leader's guide to maintaining security when working with third-party vendors]]></description><link>https://blog.twingdata.com/p/trust-but-verify-evaluating-security</link><guid isPermaLink="false">https://blog.twingdata.com/p/trust-but-verify-evaluating-security</guid><dc:creator><![CDATA[Simon]]></dc:creator><pubDate>Tue, 18 Feb 2025 20:09:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fGvo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fGvo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fGvo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp 424w, 
https://substackcdn.com/image/fetch/$s_!fGvo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp 848w, https://substackcdn.com/image/fetch/$s_!fGvo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp 1272w, https://substackcdn.com/image/fetch/$s_!fGvo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fGvo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp" width="660" height="484" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:660,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89348,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fGvo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp 424w, 
https://substackcdn.com/image/fetch/$s_!fGvo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp 848w, https://substackcdn.com/image/fetch/$s_!fGvo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp 1272w, https://substackcdn.com/image/fetch/$s_!fGvo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e1aa74-f369-4d7d-99ae-329fd7874468_660x484.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">How data is stored on a server, as envisioned by Hackers (1995) Dir. Iain Softley &#169;Metro-Goldwyn-Mayer, Inc.</figcaption></figure></div><p>Data warehouse optimization tools connect to platforms such as Snowflake, Amazon Redshift, and Google BigQuery to analyze usage and improve performance and cost. Since these tools interface with sensitive enterprise environments, robust security practices are critical.</p><p>Below, we&#8217;ll dive into security aspects that you should consider when choosing warehouse optimization and observability tools for your organization. More generally, these guidelines can apply to other tools that access your infrastructure or data.</p><h2>Data Access</h2><p>How can you be confident in the safety of the tool that's analyzing your warehouse? That's a tough question. 
By knowing what it can access and by what means, you can begin to form an answer.</p><h3>Least Privilege</h3><p>An important aspect of security is the &#8220;principle of least privilege&#8221;, or making sure that different services (especially tools that aren&#8217;t developed internally) can only access what they need to be effective. In a data mart setup, that can be <em>a lot</em> of critical information, so it&#8217;s important to avoid exposing anything that doesn&#8217;t need to be exposed.</p><p>Limiting an integration to a dedicated user is a good first step. Giving this user minimal permissions, such as read-only, metadata-only access, further reduces risk, and the user&#8217;s access can be revoked with minimal disruption if needed.</p><p>For more complex integrations, it may be necessary to have role-based access controls (RBAC) to manage who has access to what. For example, someone inspecting the metadata may not be the same person setting up billing or data connections, but maybe they should still be able to see what connections are set up. Some tools go even further and support data-specific privileges, limiting who can access different parts of the (meta)data and of the optimization and observability results themselves.</p><p>Twing Data takes this principle seriously by supporting secure connections to your warehouses, limiting access to read-only metadata by design, and following best practices internally.</p><h3>API &amp; Integration Security</h3><p>Different vendors offer different, and usually multiple, ways to connect to their systems, such as passwords, key-pair authentication, gateways, or dedicated service accounts.</p><p>It&#8217;s important to remember that encryption is only as strong as its implementation, and to verify that a more secure method actually is more secure. For example, simply switching from HTTP to HTTPS doesn&#8217;t guarantee security if certificate validation is improperly handled, which leaves room for man-in-the-middle attacks. 
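</p><p>As a rough illustration of that gap, here is a minimal sketch using only Python&#8217;s standard <code>ssl</code> module. The function names are made up for this example; the point is that both contexts encrypt traffic, but only one checks who is on the other end.</p>

```python
import ssl


def strict_tls_context():
    """Client TLS context that verifies the server certificate and hostname."""
    context = ssl.create_default_context()
    # These are already the defaults; setting them makes the intent explicit.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def insecure_tls_context():
    """What to watch out for: encryption without any identity checks."""
    context = ssl.create_default_context()
    # check_hostname has to be disabled before CERT_NONE is accepted.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


strict = strict_tls_context()
insecure = insecure_tls_context()
print(strict.verify_mode, insecure.verify_mode)
```

<p>When evaluating a vendor, it is fair to ask which of these two configurations their connectors resemble, and whether certificate verification can ever be switched off.</p><p>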
Similarly, using OAuth 2.0 for API authentication is a step up from basic authentication, but failing to properly manage token expiration, refresh, and revocation could expose your integration to token hijacking or replay attacks. Always evaluate the entire security chain, not just the algorithm or protocol. We&#8217;ll talk more about encryption below.</p><p>A quick internet search can also reveal any vulnerabilities that were exposed in the past for a given vendor or user.</p><h2>Data Management</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ip30!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ip30!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ip30!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ip30!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ip30!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ip30!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg" width="2531" height="2125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2125,&quot;width&quot;:2531,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1460523,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ip30!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ip30!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ip30!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ip30!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d5eef-d334-461b-bad0-9f679eab8e79_2531x2125.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@mauriciogutierreztello">Mauricio Guti&#233;rrez</a> on <a href="https://unsplash.com/">Unsplash</a></figcaption></figure></div><p>Whether it&#8217;s just metadata like the queries being run or the data they return, whoever is analyzing the data needs to ingest and store it somewhere for analysis, and possibly to surface it back to their users. This section analyzes which aspects may require a closer look.</p><h3>Data Retention</h3><p>Since the vendor may have access to business-critical data (or metadata) it&#8217;s important to consider their data retention policy. 
What counts as a &#8220;good&#8221; policy will vary from business to business and use case to use case, but typically you don&#8217;t want someone to hold on to your data &#8220;forever&#8221; (especially not past the point of canceling the service). Long-term data retention can introduce unnecessary risks, such as unauthorized access, data breaches, or compliance violations. On the flip side, too short a retention period may yield less useful insights.</p><p>When vetting vendors, ask about their data retention and deletion policies up front and negotiate shorter retention periods or periodic purges, especially for sensitive information. It&#8217;s also a good idea to have an offboarding plan that verifies that the third party has deleted your data after your integration has ended.</p><p>While some vendors don&#8217;t share this policy, Twing Data retains 90 days of metadata and persists the resulting analysis for the duration of service. This can be adjusted on a company-by-company basis if requested.</p><h3>Data Isolation</h3><p>There are also different methods for segmenting or isolating data and access. Separating customers&#8217; data from one another can be done by setting up separate environments for each customer, ensuring that risk from one customer isn&#8217;t carried over to another. This can be especially important for services with privileges that extend beyond read-only, such as those that automatically adjust the size, clustering, or resources of your warehouses, or services that rely on data beyond metadata.</p><p>Twing Data can safely grant low-level access to our users by only exposing our analyses through company-specific accounts and views, ensuring that users (or threat actors) aren&#8217;t able to view anything they&#8217;re not supposed to.</p><h3>Data Anonymization</h3><p>Besides the tactics listed above, a vendor can, and should, anonymize sensitive data &#8212; even metadata. 
Even seemingly innocuous queries can contain personal information that an external tool has no need to see. Consider a query such as &#8220;WHERE username = &#8216;celebrity@personalemail.com&#8217; AND phonenumber = &#8216;(555) 867-5309&#8217;&#8221; being exposed. A leaked query such as this not only exposes a high-profile individual&#8217;s email address but also ties it to their phone number.</p><p>Twing Data offers to redact query text at ingestion, replacing identifiers and literals with a constant placeholder token. This way, we can still analyze query structures (e.g. to identify patterns or heavy queries) without storing any actual identifiers or literals. Keep in mind, though, that some analyses may not be as useful if we can&#8217;t tell what distinguishes one execution of a query pattern from another. We&#8217;re always looking to strike the right balance.</p><h3>Encryption (Data at Rest &amp; In Transit)</h3><p>Nowadays, it&#8217;s safe to assume that any enterprise integration is going to use a secure protocol, but it&#8217;s still good to make sure. Any person or server that is accessing your data (metadata or otherwise) should be doing so securely. If not &#8212; run!</p><p>The (meta)data itself should also be stored securely. Twing Data leverages Google BigQuery, which <a href="https://cloud.google.com/bigquery/docs/encryption-at-rest">encrypts data at rest</a> and adds access controls and auditing.</p><h2>App Architecture</h2><p>The &#8220;cloud&#8221;, as we know, is generally &#8220;someone else&#8217;s servers&#8221;. It enables distributing and scaling applications across the world while optimizing cost and speed.</p><p>However, as infrastructure evolves and becomes more complex, there are more points of failure, and more visibility and failsafes are needed. 
Let&#8217;s take a look at what that means for warehouse optimization tools.</p><h3>Cloud Infrastructure Security</h3><p>Similar to data isolation, containerization or virtual machine (VM) isolation can ease security concerns, especially for tools with more than read-only access. A setup like this may spin up VMs per tenant, either on-premises or managed by another provider like AWS or GCP.</p><p>Some providers, like fly.io (which Twing Data uses for hosting our application), are easy to scale globally while maintaining security practices. Twing Data&#8217;s analyses run on demand and on-premises, interfacing securely with our databases.</p><h3>Logging and Monitoring</h3><p>For any organization, it&#8217;s important to log events to detect, prevent, or alert on security issues or outages. For big data, an outage can mean inaccurate insights and lost revenue, so alerting and monitoring are crucial. Similarly, monitoring and alerting help catch bugs or security vulnerabilities sooner rather than later (with additional context about what happened and why), making them quicker to resolve.</p><p>For example, if an API integration suddenly starts returning a high volume of 500 errors, alerting can notify the team before customers are impacted. In a security context, logs showing a spike in failed authentication attempts could indicate a brute-force attack, allowing for a proactive response. Additionally, monitoring data pipeline performance can catch bottlenecks &#8211; like a slow ETL job &#8211; that might otherwise cause reporting delays or incomplete/inaccurate analytics.</p><p>Some services (like fly.io) offer built-in logging, monitoring, and alerting (e.g. 
Grafana and Sentry integrations) that make it fast, easy, and secure to set up the necessary infrastructure.</p><h2>Company Practices</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6e7X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6e7X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6e7X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6e7X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6e7X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6e7X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg" width="1456" height="973" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:973,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3807988,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6e7X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6e7X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6e7X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6e7X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106605b4-569b-4617-bc5e-e3d6751f04e7_3680x2458.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@chrumo">Piotr Chrobot</a> on <a href="https://unsplash.com/">Unsplash</a></figcaption></figure></div><p></p><p>Lastly, company practices can give additional insight into what to expect when working with them. Do they take security seriously as part of their culture, or just to check some boxes? If a problem occurs, how likely are they to fix it within a given timeframe?</p><h3>Compliance &amp; Certification</h3><p>Some certifications (SOC 2 Type I) evaluate security at a single point in time, and others (SOC 2 Type II) evaluate security over a period of time. These evaluations are based on many factors such as access controls, monitoring &amp; logging, incident response, data encryption &amp; integrity, secure development practices, and others.</p><p>Seeing that a vendor has these certifications gives a clear indication of their security practices. 
It&#8217;s something Twing Data is working towards!</p><h3>Incident Response &amp; Recovery</h3><p>A company&#8217;s processes for detecting, responding to, and recovering from security incidents are just as important as its processes to avoid and alert on such incidents. After all, an alert is only valuable if it cuts through the noise and prompts action &#8211; otherwise, it&#8217;s just another overlooked warning.</p><p>SOC 2 certification requires documenting the related processes thoroughly and testing them at least annually. Of course, companies without the certification can still have a plan in place.</p><p>These processes cover responsibilities and actions that would be performed internally (e.g. patching a vulnerability) and externally (e.g. notifying affected users) with clear steps for identifying, containing, mitigating, and resolving incidents. In some cases, a service level agreement (SLA) between the vendor and customer may dictate time-to-resolution requirements, ensuring accountability.</p><p>For example, if a critical data breach occurs, an engineering team might immediately isolate the affected systems while security and legal teams work together to draft and send a breach notification within the SLA-defined timeframe.</p><h3>Secure Development Practices</h3><p>Cutting-edge companies should embed security throughout the development lifecycle. Increasingly, security and code quality checks are happening earlier in the development lifecycle. 
Tools and practices like static code analysis, linting, code reviews, and automated security scanning help catch issues before they even reach a staging environment.</p><p>Security-minded SaaS providers also conduct regular automated and manual penetration testing on their application and infrastructure, and any critical findings are fixed as part of their vulnerability management process.</p><p>By embedding security holistically within the development workflow, companies can better safeguard sensitive data and reduce the overall cost and risk of fixing vulnerabilities later.</p><h3>Third-Party Risk Management</h3><p>Third-party risks typically encompass technical topics such as security breaches and downtime, but can include more qualitative risks like reputational risk.</p><p>Using reputable third parties like GCP and fly.io that have their own certifications is one way to mitigate the risk that comes with outsourcing to external service providers, especially when it comes to some of the topics above around data and cloud security &#8212; items that are most commonly outsourced (since maintaining your own hardware comes with its own high costs and risks).</p><p>Vendors can also be classified based on risk &#8212; a database vendor is higher risk than an office supplier, for example, so the office supplier likely doesn&#8217;t need to be reassessed as frequently.</p><h2>Choosing the right data vendor for your business</h2><p>As I researched and wrote this article, the main takeaway became clear: it&#8217;s essential to choose a vendor that analyzes your company&#8217;s unique and mission-critical data effectively and securely. 
I hope these findings help you mitigate that risk by knowing what to look for and avoid when digging into vendors&#8217; security policies.</p>]]></content:encoded></item><item><title><![CDATA[Redshift Serverless Pricing]]></title><description><![CDATA[A Comprehensive Overview]]></description><link>https://blog.twingdata.com/p/redshift-serverless-pricing</link><guid isPermaLink="false">https://blog.twingdata.com/p/redshift-serverless-pricing</guid><dc:creator><![CDATA[Adam Kabak]]></dc:creator><pubDate>Tue, 11 Feb 2025 21:31:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-MFe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-MFe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-MFe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-MFe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-MFe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-MFe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-MFe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg" width="912" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:912,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Inigo Montoya I Do Not Think That Word Means What You Think It M | REDSHIFT? SERVERLESS? They keep using that word. 
I do not think they mean what you think they mean | image tagged in inigo montoya i do not think that word means what you think it m | made w/ Imgflip meme maker&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Inigo Montoya I Do Not Think That Word Means What You Think It M | REDSHIFT? SERVERLESS? They keep using that word. I do not think they mean what you think they mean | image tagged in inigo montoya i do not think that word means what you think it m | made w/ Imgflip meme maker" title="Inigo Montoya I Do Not Think That Word Means What You Think It M | REDSHIFT? SERVERLESS? They keep using that word. I do not think they mean what you think they mean | image tagged in inigo montoya i do not think that word means what you think it m | made w/ Imgflip meme maker" srcset="https://substackcdn.com/image/fetch/$s_!-MFe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-MFe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-MFe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-MFe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F222bc425-a7fe-4c7d-baa3-e8ec61f2d310_912x500.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>To continue the Redshift pricing topic introduced in my previous blog, <a href="https://blog.twingdata.com/p/the-real-cost-of-redshift-warehousing?utm_source=activity_item">The Real Cost of Redshift Warehousing</a>, I want to zoom in on Redshift Serverless. 
This product can be a major convenience for the right user, but it&#8217;s vitally important to understand the documented, and undocumented, details that govern their pricing policies in order to evaluate if it&#8217;s the right choice for your company.</p><p>In case you haven&#8217;t read my previous blog or are otherwise unfamiliar with Redshift Serverless, let me quickly outline the basics. The term &#8220;serverless&#8221; means that you don&#8217;t need to manage the servers&#8212;the provider handles all the heavy lifting behind the scenes (even though, of course, servers are still there). In a serverless data warehouse, the infrastructure is fully managed for you, scaling automatically with usage so that you only pay for the compute you actually use, with storage priced separately. 
Redshift&#8217;s take on a serverless system follows this model, though with some high-impact nuances that are the focus of this post.</p><h2>Limitations of Redshift Serverless: Looking Under the Hood</h2><p>When considering whether or not Redshift Serverless is right for you, it&#8217;s essential to understand what it actually is in contrast to a true serverless system like BigQuery, and how it can more accurately be thought of as a version of Redshift&#8217;s typical provisioned cluster. To establish a frame of reference, let me explain a bit about both truly serverless systems and the Redshift provisioned cluster:</p><ul><li><p>As an example of a true serverless system, BigQuery, Google&#8217;s cloud data warehouse offering, was built from the ground up to provide all the hallmarks of this class: a shared compute pool dynamically allocates resources to tenants as needed, charged only for use, and storage is handled completely independently, billed per MiB per second. To be sure, BigQuery is a feature-rich solution with many configurations and an option for provisioning compute as well, so I am simplifying here.</p></li></ul><ul><li><p>A provisioned cluster is Redshift&#8217;s standard system option, in which the user defines the cluster&#8217;s attributes, like node size/type; its execution behavior, like Workload Management (WLM) or Short Query Acceleration (SQA); and its scaling boundaries, as in Concurrency Scaling and Elastic/Manual Resize. Storage either falls within Redshift Managed Storage (RMS) or uses external S3 and is accessed at a small premium via Spectrum. RMS capacity is linked to cluster composition, as different node sizes/types offer varying max RMS capacity per node. In this way, while storage is priced separately, it is still tied to compute. 
A provisioned cluster offers the cheapest per-unit compute pricing, but charges accrue as long as it&#8217;s running, even when idle, and it can require an enormous amount of configuration, monitoring, and management across both workloads and storage to achieve reasonable cost and efficiency.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!it9k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!it9k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png 424w, https://substackcdn.com/image/fetch/$s_!it9k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png 848w, https://substackcdn.com/image/fetch/$s_!it9k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png 1272w, https://substackcdn.com/image/fetch/$s_!it9k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!it9k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png" width="1456" height="1498" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1498,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:699959,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!it9k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png 424w, https://substackcdn.com/image/fetch/$s_!it9k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png 848w, https://substackcdn.com/image/fetch/$s_!it9k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png 1272w, https://substackcdn.com/image/fetch/$s_!it9k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5581bea9-b04b-4f7b-b104-cea3bcc2fe89_1472x1514.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR: A Redshift cluster is an isolated instance with relative compute and storage limits managed at the individual level and rented by the hour, while a traditional serverless system decouples storage and compute, scaling them independently and pricing on demand.</strong></p><p>With these differences in mind, let&#8217;s now examine Redshift Serverless. While traditional serverless leverages multi-tenant compute infrastructure, Redshift Serverless, under the hood, is actually running on clusters of nodes identical to those that can be configured and run manually. Rather than accessing a shared, always-on system of clusters for compute, Redshift&#8217;s version merely manages an isolated system on the user&#8217;s behalf and prices compute in Redshift Processing Units (RPU). It&#8217;s important to understand this for a few reasons:</p><ol><li><p>RMS storage has limits. 
Serverless storage limits are addressed in exactly 1 bullet point in the docs, and if you miss it, you could reasonably be under the impression that storage scales limitlessly and automatically, with the only consequence being higher Redshift Managed Storage (RMS) costs. This is not true. RMS has a maximum capacity based on the node types and counts in the cluster in use. The minimum RPU setting essentially defines the amount of available RMS. If data volume exceeds that maximum capacity, the minimum RPUs used for any workload will be whatever number can handle the needed amount of storage. The way to work around this is to manually place a portion of the data in external S3. While Spectrum is required to scan external S3 data, its use is included in the per-RPU pricing.</p></li></ol><ol start="2"><li><p>&#8220;On-demand compute&#8221; is not always available on demand. Even when no compute resources are being used, there are ongoing costs for things like hardware maintenance, power, and cooling. With an on-demand pricing model, customers only pay when compute is in use, but the host still has to handle these ongoing costs. In a truly serverless system, a multi-tenant architecture manages this by balancing resource usage across many customers, which helps reduce waste. Amazon&#8217;s version of serverless boasts on-demand pricing, yet manages resources at the level of individual accounts. There&#8217;s no mechanism to pass unused resources from one account to another. Do they simply eat the cost of idle resources for their inactive serverless customers? Absolutely &#8230;not. After 1 hour of idling, the instance fully deactivates. When it&#8217;s inevitably needed again, the entire cluster has to reactivate first, which, in our experience, can take up to around 10 seconds. If the speed of that workload was business-critical, RIP. 
Since the one-hour deactivation threshold isn&#8217;t adjustable, your only options are to schedule important queries within an hour of another query or to send periodic pings to keep the system active.</p></li></ol><p></p><h2>Redshift Serverless Minimum Pricing Nuances: The Risk of Blind Trust</h2><p><strong>Minimum Charge Policy:</strong> &#8220;You pay for the workloads you run in RPU-hours on a per-second basis, with a 60-second minimum charge.&#8221;</p><p>On first pass the policy seems aggressive but pretty straightforward. A natural follow-up question, &#8220;How many RPUs factor into that minimum?&#8221;, has an innocent answer: your base (minimum) RPUs. Even if we stop here and assume that&#8217;s all there is to it, it&#8217;s clear that without purposefully managing your workloads around the minimum charge policy, your warehouse bill will likely be many times what it should be.</p><p>We&#8217;ve indeed found Redshift&#8217;s minimum pricing policies to have an outsized impact on our clients&#8217; warehouse costs. And not just because the policy is completely overlooked. The docs are actually deceptive on this topic. What constitutes &#8220;&#8230;workloads you run&#8230;&#8221; is actually a combination of the queries you execute and system queries related to the environment and metadata tracking processes. Also included are mysterious seconds that I haven&#8217;t been able to account for. 
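</p><p>To make the quoted minimum concrete, here is a minimal Python sketch of the arithmetic, assuming the 60-second minimum is applied to a single isolated workload billed at base capacity. The function names, the base of 8 RPUs, and the $0.375 per RPU-hour rate are illustrative assumptions, not figures from AWS or from our testing; substitute the rate for your region.</p>

```python
# Hypothetical illustration of a per-workload minimum charge.
# PRICE_PER_RPU_HOUR is an assumed placeholder, not a quoted AWS rate.
PRICE_PER_RPU_HOUR = 0.375
# Taken from the minimum charge policy quoted above.
MINIMUM_BILLED_SECONDS = 60


def billed_cost(run_seconds: float, base_rpus: int,
                price_per_rpu_hour: float = PRICE_PER_RPU_HOUR) -> float:
    """Cost of one isolated workload, assuming the 60-second minimum applies to it."""
    billed_seconds = max(run_seconds, MINIMUM_BILLED_SECONDS)
    rpu_hours = base_rpus * billed_seconds / 3600
    return rpu_hours * price_per_rpu_hour


def overcharge_factor(run_seconds: float) -> float:
    """Ratio of billed seconds to the seconds the workload actually ran."""
    return max(run_seconds, MINIMUM_BILLED_SECONDS) / run_seconds


if __name__ == "__main__":
    # Assumed scenario: a 2-second query on an 8-RPU base capacity.
    print(billed_cost(2, 8))
    print(overcharge_factor(2))
```

<p>Under these assumptions, a 2-second query at 8 RPUs is billed exactly like a 60-second one, so the billed time is 30 times the time actually used.</p><p>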
Consider the following query and results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!auhs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!auhs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png 424w, https://substackcdn.com/image/fetch/$s_!auhs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png 848w, https://substackcdn.com/image/fetch/$s_!auhs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!auhs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!auhs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png" width="810" height="1016" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1016,&quot;width&quot;:810,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:150311,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!auhs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png 424w, https://substackcdn.com/image/fetch/$s_!auhs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png 848w, https://substackcdn.com/image/fetch/$s_!auhs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!auhs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8718e7ac-4b24-488a-ab0b-68f9b47c1ee5_810x1016.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!plYI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!plYI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png 424w, https://substackcdn.com/image/fetch/$s_!plYI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png 848w, 
https://substackcdn.com/image/fetch/$s_!plYI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png 1272w, https://substackcdn.com/image/fetch/$s_!plYI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!plYI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png" width="1456" height="517" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:517,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!plYI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png 424w, https://substackcdn.com/image/fetch/$s_!plYI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png 848w, 
https://substackcdn.com/image/fetch/$s_!plYI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png 1272w, https://substackcdn.com/image/fetch/$s_!plYI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5b6e54-733b-428f-a2de-424269eb925a_1600x568.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Since this is rolled up by minute, you might assume that query_run_seconds should be the same as total_used_seconds. 
It isn&#8217;t, and the discrepancy is large: in one minute, 22 seconds of query run time became 177 seconds used. There&#8217;s more: in each of these minutes I actually ran between 0 and 2 queries. Queries logged to sys_query_history include those which ran automatically as part of background processes, which may or may not be directly associated with the workload you ran. To understand what&#8217;s happening, let&#8217;s examine the queries logged for minute 15:46:00:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OTHB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4022e9-c313-470f-9854-764d8473c991_1600x384.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OTHB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4022e9-c313-470f-9854-764d8473c991_1600x384.png 424w, https://substackcdn.com/image/fetch/$s_!OTHB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4022e9-c313-470f-9854-764d8473c991_1600x384.png 848w, https://substackcdn.com/image/fetch/$s_!OTHB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4022e9-c313-470f-9854-764d8473c991_1600x384.png 1272w, https://substackcdn.com/image/fetch/$s_!OTHB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4022e9-c313-470f-9854-764d8473c991_1600x384.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!OTHB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4022e9-c313-470f-9854-764d8473c991_1600x384.png" width="1456" height="349" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf4022e9-c313-470f-9854-764d8473c991_1600x384.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:349,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OTHB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4022e9-c313-470f-9854-764d8473c991_1600x384.png 424w, https://substackcdn.com/image/fetch/$s_!OTHB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4022e9-c313-470f-9854-764d8473c991_1600x384.png 848w, https://substackcdn.com/image/fetch/$s_!OTHB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4022e9-c313-470f-9854-764d8473c991_1600x384.png 1272w, https://substackcdn.com/image/fetch/$s_!OTHB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4022e9-c313-470f-9854-764d8473c991_1600x384.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The highlighted queries are the ones I wrote. 
The last one wasn&#8217;t even run in that minute, but was added to 15:46:00 later. Everything else was often-redundant metadata fetching or tracking/validation behavior related to the environment or systems. Three queries became 16. At least they&#8217;re fast. But given that the total query_run_seconds for all 16 queries is 3 seconds while the calculated total_used_seconds is 24, we can guess that there are some major associated, undocumented processes that accrue to sys_serverless_usage.compute_seconds. Here&#8217;s an even more surprising example from a few minutes earlier:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QZ7p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QZ7p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png 424w, https://substackcdn.com/image/fetch/$s_!QZ7p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png 848w, https://substackcdn.com/image/fetch/$s_!QZ7p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png 1272w, https://substackcdn.com/image/fetch/$s_!QZ7p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!QZ7p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png" width="1456" height="128" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:128,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QZ7p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png 424w, https://substackcdn.com/image/fetch/$s_!QZ7p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png 848w, https://substackcdn.com/image/fetch/$s_!QZ7p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png 1272w, https://substackcdn.com/image/fetch/$s_!QZ7p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad2f2727-de72-4617-b2d0-ff5379396371_1600x141.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>I ran 1 query during this minute, and it maybe took a second or two. 
Redshift recorded 3 runs of the same query, one taking 22 seconds! Then, the total ~22 query_run_seconds recorded for the entire minute translated to 177 seconds used. A two-second query run became 177 seconds &#8220;used&#8221;!</p><p>For this experiment I used the smallest base capacity possible, and didn&#8217;t dare go over the minimum charges per minute. In the real world this sort of behavior can lead to massive overspend. I wish I were able to wrap this up with a recommended implementation, and perhaps we&#8217;ll discover one and write a blog post about it. The conclusion so far is that this behavior must be monitored. At a certain scale of overcharge, managing your own cluster might be worth considering.</p><p></p><h2>Who Benefits From This Setup?</h2><p>As a rule, abstracting away lower-level complexity introduces various types of costs that are difficult to track and sometimes impossible to correct. Serverless systems have some degree of this issue by nature, prioritizing simplicity and efficiency at a premium price. I would not put Redshift Serverless in this category, though. While it nails the premium price tag, the simplicity and efficiency are not included to the same degree. Where true serverless systems are built from the ground up to benefit from the scale of distributed compute and separate storage, Redshift&#8217;s version is merely a repackaging of the same isolated system offering, with some abstracted optionality based on an arbitrary, derivative metric, the RPU. No shared, always-on compute pool, no care-free limitless storage, but you will get a black-box management layer that partially obscures how it operates.</p><p>On the positive side, if your stack already includes other AWS products, Redshift Serverless does earn simplicity points, as AWS products integrate nicely together. 
And if you&#8217;re otherwise a prime candidate for a serverless warehouse (i.e., you lack monitoring knowledge or resources, or have uneven or uncertain workloads) it might still be the best option for you. Just make sure to deliberately set usage limits and base capacity (default is 128 RPU, which may be too much).</p><p>At Twing Data, we use our warehouse-agnostic query parsing and analysis system to help our clients surface and diagnose a variety of inefficiencies, anomalous behaviors, and antipatterns across the data stack. We are always pushing to better understand our clients&#8217; stacks. If you have insights into how to best use or understand Redshift Serverless, please reach out, or, better yet, leave a comment here for others to find!</p>]]></content:encoded></item>
<item><title><![CDATA[10,000 database tables isn’t cause for celebration]]></title><description><![CDATA[It&#8217;s a sign that your data team doesn&#8217;t have enough business context]]></description><link>https://blog.twingdata.com/p/10000-database-tables-isnt-cause</link><guid isPermaLink="false">https://blog.twingdata.com/p/10000-database-tables-isnt-cause</guid><dc:creator><![CDATA[Dan Goldin]]></dc:creator><pubDate>Tue, 04 Feb 2025 19:07:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kfwm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kfwm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kfwm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!kfwm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kfwm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kfwm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kfwm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg" width="500" height="616" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59864,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kfwm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!kfwm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kfwm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kfwm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde45dcd3-1cdd-49b9-9a0e-51ca575a20ba_500x616.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>Teams are storing petabytes of data, yet stakeholders don&#8217;t trust the reporting, or even the data points underlying it. Unfortunately, this is a common occurrence when data teams are the recipients of an endless queue of isolated requests, to be completed without further interrogation.</p><p>Our best advice on making data teams strategic partners, rather than simply receivers of requirements, might also be counterintuitive: start manually. To understand why, let&#8217;s start with the biggest challenges we&#8217;ve come across.</p><h2><strong>Siloed data teams</strong></h2><p>We see this often in our work: data teams are too far removed from the product, without strong product management. One team will produce a data point, and then it falls to the data engineering team to transform it into a table or a report that can subsequently inform business decisions. But the feedback loop between getting the data and making it actionable is often broken.</p><p><a href="https://roundup.getdbt.com/p/ep-47-ramps-8-billion-data-strategy">Ramp</a>, recognizing this challenge, has implemented a successful approach that embeds data skills within the business-aligned teams. They allow more contributors to the code base, enabling the organization to tighten the feedback loop between data insights and the business decisions that follow from them.</p><h2><strong>Requirements without context</strong></h2><p>Data teams are often on the receiving end of requests from other teams, and are forced to support a growing list of requirements without having the necessary context to understand the business case. 
As a result, teams can&#8217;t make reasonable tradeoffs, and end up developing Frankenstein models that become more costly to maintain.</p><p>We&#8217;ve seen cases where teams are proud that they have thousands of dbt models, but we think teams should be aiming for the opposite: doing more with less. The benefit is obvious &#8211; by simplifying your data, you&#8217;ll have less of it to manage. Queries will run faster, insights will be generated faster, and you&#8217;ll be able to move faster since you&#8217;re dealing with less overhead.</p><h2><strong>Our approach: Start manually</strong></h2><p>The best advice we can give is to validate the need and the solution manually at the outset. Data storage is much cheaper than compute &#8211; so rely on the fact that you can always rederive the necessary data, rather than automating the end-to-end flow. By starting manually, you will validate that there is a recurring need and only then should you focus on productionalizing the process.</p><p>As Kent Beck famously said, &#8220;Make it work, make it right, make it fast.&#8221; Here&#8217;s how to put that into action:</p><ul><li><p>Make it work: Validate the need manually.</p></li><li><p>Make it right: Productionalize and design for maintainability and scalability, and see how it fits in with your other data flows.</p></li><li><p>Make it fast: Optimize, reduce costs, and reduce latency after you have real validation that the output is being used.</p></li></ul><p>Despite there being obvious wins in optimizing the data warehouse, the real win comes from getting crystal clear on the requirements and having a deeper understanding of the tradeoffs.</p><p>Unlike other software engineering disciplines, data has much more inertia: it&#8217;s difficult to remove something once it&#8217;s built. 
If you haven&#8217;t gone through the exercise of validating the need first, it may be a lot easier to not build the thing in the first place &#8212; especially if there&#8217;s a chance it balloons into 10,000 tables.</p>]]></content:encoded></item><item><title><![CDATA[The Real Cost of Redshift Warehousing]]></title><description><![CDATA[Insights From Clients and Hands-on Testing]]></description><link>https://blog.twingdata.com/p/the-real-cost-of-redshift-warehousing</link><guid isPermaLink="false">https://blog.twingdata.com/p/the-real-cost-of-redshift-warehousing</guid><dc:creator><![CDATA[Adam Kabak]]></dc:creator><pubDate>Tue, 28 Jan 2025 18:23:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5wx5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5wx5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5wx5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png 424w, https://substackcdn.com/image/fetch/$s_!5wx5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png 848w, 
https://substackcdn.com/image/fetch/$s_!5wx5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png 1272w, https://substackcdn.com/image/fetch/$s_!5wx5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5wx5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png" width="1016" height="510" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:510,&quot;width&quot;:1016,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1049770,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5wx5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png 424w, https://substackcdn.com/image/fetch/$s_!5wx5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png 848w, 
https://substackcdn.com/image/fetch/$s_!5wx5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png 1272w, https://substackcdn.com/image/fetch/$s_!5wx5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03199995-7eb8-4212-9b99-f76d4640c9ce_1016x510.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>At Twing Data, we take a holistic view of our clients&#8217; data stacks. 
This requires the ability to understand, debug, and diagnose many data systems. The data warehouse is a major focus, as it can determine whether a data system meets or falls short of a company&#8217;s needs in terms of performance and budget. The pricing models of these modern warehouse solutions are nuanced, and a misconfiguration can go unnoticed indefinitely.</p><p>In this three-part series, we&#8217;ll take a focused look at three major data warehouse solutions: Redshift, BigQuery, and Snowflake. In part one, we&#8217;ll focus on <strong>Amazon Redshift</strong>. As Amazon's flagship warehouse offering, Redshift introduced many to the possibilities of cloud-based data warehousing, but its pricing model and configuration options demand careful consideration for optimal performance and cost management.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.twingdata.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Twing Data! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h4><strong>Understanding Redshift's Pricing Model</strong></h4><p>Redshift offers a configurable pricing structure that can be as powerful as it is complex. Pricing is divided into two main components: <strong>storage</strong> and <strong>compute</strong>, with options that cater to both smaller datasets and large-scale enterprise use cases. 
Three main system options are available:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HXzD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HXzD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png 424w, https://substackcdn.com/image/fetch/$s_!HXzD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png 848w, https://substackcdn.com/image/fetch/$s_!HXzD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png 1272w, https://substackcdn.com/image/fetch/$s_!HXzD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HXzD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png" width="641" height="330" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/adfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:641,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49300,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HXzD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png 424w, https://substackcdn.com/image/fetch/$s_!HXzD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png 848w, https://substackcdn.com/image/fetch/$s_!HXzD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png 1272w, https://substackcdn.com/image/fetch/$s_!HXzD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadfba8d4-c97e-467b-b440-beeb73eae53c_641x330.png 1456w" sizes="100vw"></picture></div></a></figure></div><p><strong>DC2 Node Cluster</strong></p><p>Storage and compute are coupled, leading to less flexibility in system configuration and higher overhead for resizing.</p><p>The case for DC2 nodes is increasingly hard to make, as reserved RA3 instances and serverless systems better suit modern data warehousing needs (i.e., high variability in compute with independently scalable storage).</p><p><strong>RA3 Node Cluster</strong></p><p>The most flexible option for both regular and variable workloads, this model offers scaling by adjusting node size (elastic resizing), cluster size (node count), and number of concurrent clusters.</p><p><strong>Redshift Serverless</strong></p><p>This is a fully-managed data system, available on demand, that automatically scales to accommodate current workloads. With no scaling or cluster uptime to manage, Serverless offers the shortest path to reliable results. 
While compute costs are incurred only while a workload is active, that flexibility comes at a premium that is not readily apparent in the docs. Lack of scaling control and the minimum charge policy (see <strong>Bonus: Death by 1000 Minimum Charges</strong> below) can result in significantly higher compute costs overall. It&#8217;s best for companies that:</p><ul><li><p>Are uncertain about the cadence of their workloads</p></li><li><p>Have irregular workloads</p></li><li><p>Lack the peoplepower, technical acuity, or desire to manage an RA3 system</p></li></ul><p>Coupled with the above configurations, AWS offers the option to reserve capacity for Redshift, committing to usage for a set period (e.g., 1 or 3 years) in exchange for a discounted rate. The discount scales with the length and value of the commitment, making it an attractive option for many organizations. Reserved capacity applies to various Redshift configurations, including RA3 nodes and serverless compute. However, this approach requires accurate forecasting of usage levels to maximize cost savings. For most growing companies, reserving capacity can help reduce the resources needed for actively managing cluster uptime, while also incentivizing efficient workload distribution to optimize utilization.</p><div><hr></div><h4><strong>Inherent Challenges with RA3</strong></h4><p>With great flexibility comes great complexity, and oversights can be costly. Redshift RA3 clusters are billed based on node size and uptime. Whether a cluster is at max capacity or sitting idle, AWS charges for that time. How much? That depends on the node/cluster size, as well as the number of concurrent clusters running. All options remaining equal, this essentially hinges on how fast you need data at times of peak usage. Even when factoring in the discount for reserving nodes, a persistently-running cluster with irregular heavy usage can be expensive if not closely managed. 
Redshift does offer a few automated or potentially-automated features to help &#8211; but be aware of a few gotchas with each:</p><p><strong>Short Query Acceleration (SQA) - automated</strong></p><ul><li><p>Queries are classified as short or long based on a predicted runtime threshold. Short queries are prioritized, ensuring the maximum number of workloads are completed in a given timeframe.</p></li></ul><p><strong>SQA Gotchas:</strong></p><ul><li><p>Prioritizing one group of queries means deprioritizing others. When execution time does not factor into the priority equation, this could lead to longer queries being perpetually deprioritized (i.e., starvation).</p></li><li><p>The threshold is dynamic by default: Redshift adjusts it using machine learning on real usage. You might need to experiment with fixed values.</p></li></ul><p><strong>Auto Workload Management (Auto WLM) - automated</strong></p><ul><li><p>A general workload management system that dynamically adjusts resource allocations within the cluster depending on current query load.</p></li></ul><p><strong>Auto WLM Gotchas:</strong></p><ul><li><p>Not great for predictable, static workloads where fine-tuning resources is possible.</p></li><li><p>Requires an additional layer of management in Query Monitoring Rules (QMR), which govern the thresholds used by Auto WLM.</p></li><li><p>Critical queries that require guaranteed resources are likely hard to target using QMR.</p></li></ul><p><strong>Concurrency Scaling - automated</strong></p><ul><li><p>An additional temporary cluster is automatically provisioned to handle the overflow of queued queries during peak usage. When the overflow is handled, it shuts down. 
Redshift offers the first hour of concurrency scaling free, per day per cluster (this should be plenty to cover peak usage times).</p></li></ul><p><strong>Concurrency Gotchas:</strong></p><ul><li><p>If the free hour is exceeded, the concurrency scaling cluster is subject to the same per-second charge, with a 60-second minimum charge for each activation.</p></li></ul><p>The general theme, as ever, is that the more complexity is hidden, the greater the potential sacrifice in speed and/or cost. Even concurrency scaling, which can be completely free if the system is well-optimized, can end up doubling compute costs if it&#8217;s not. Either way, tight monitoring is essential; we&#8217;ve seen situations with our customers where concurrency scaling simply stopped working without any changes to the data warehouse and with complete silence from AWS.</p><p>In addition to concurrency scaling, other scaling options to be aware of involve changing the node type, size, or count within the main cluster. They come in two flavors, each to be used in specific contexts:</p><p><strong>Elastic/Classic Resize - schedulable, automation possible</strong></p><ul><li><p>Features used for adjusting node type, size, or count. Out of the box, these are either triggered manually or scheduled, but, using the API, the operation can be integrated with monitoring for full automation. Elastic is faster but more limited in the types of scaling possible. If elastic doesn&#8217;t serve your purpose, classic can handle more complex changes.</p></li></ul><p>Resizing could be a blog topic on its own and is beyond the scope of this post. A docs link will have to suffice: <a href="https://docs.aws.amazon.com/redshift/latest/mgmt/resizing-cluster.html">https://docs.aws.amazon.com/redshift/latest/mgmt/resizing-cluster.html</a></p><div><hr></div><h4><strong>Saving Data, Losing Dollars</strong></h4><p>Up to now, we haven&#8217;t touched on details around storage management. 
If you&#8217;re already sold on serverless, you&#8217;re in luck - storage scales 100% independently of compute and is charged per GB/month. If RA3 is more your flavor, and you have or expect to have significant amounts of data, get ready for more to manage!</p><p>RA3 nodes of different sizes have corresponding compute and storage stats. Redshift Managed Storage (RMS) capacity ranges from 1-128 TB depending on the node size. Due to the hard limits on RMS capacity, both compute and storage needs must be factored into the decision around node size/count. While storage is technically billed separately (per GB/month), these limits effectively create a partial coupling of compute and storage billing. With the size and number of nodes in a cluster maxed out, storage limits can reach petabyte scale. But as nodes are added and sized according to max RMS needs, uptime costs increase. Given that modern companies collect more and more data, it&#8217;s no wonder solutions that completely decouple compute and storage, like Redshift Serverless or Snowflake, are so popular.</p><p>The only independent scaling solution for RA3 cluster users is to store certain data in S3 and access it as needed using Redshift Spectrum. While there are cost and efficiency sacrifices compared to RMS, they are typically small if handled correctly. 
Here are some suggestions:</p><ul><li><p>Avoid data transfer fees by storing data in the same region as your cluster</p></li><li><p>Experiment with format/compression types to determine the most efficient</p></li><li><p>Partition data by frequently-filtered columns</p></li></ul><p>At Twing Data, we test your flows with various file/compression formats as part of our system optimization service.</p><div><hr></div><h4><strong>Bonus: Death by 1000 Minimum Charges</strong></h4><p>Consider two scenarios, in which two companies with unique needs choose their systems wisely:</p><p><strong>Scenario 1:</strong> Out of uncertainty in data system requirements, you&#8217;ve decided not to reserve your new company&#8217;s RA3 instance. But you&#8217;ve compensated by painstakingly studying, testing and optimizing its configs, and have seen significant cost savings and efficiency gains.</p><p><strong>Scenario 2:</strong> Given your growing company&#8217;s sizable task backlog and fast-changing data needs, you decide to forgo the complexity of managing a cluster and use Redshift Serverless, which has allowed your small team to build a data system quickly while still focusing on business-critical items.</p><p>As in these scenarios, at Twing Data we frequently see talented teams making sound decisions validated by positive outcomes. Yet they&#8217;re still likely overpaying on their data warehouse. One reason we frequently see is Redshift&#8217;s minimum charge policy, a nasty little detail that deserves much more consideration than it usually gets. Non-reserved RA3 clusters and Serverless options both have similar per-workload and per-activation minimum charge policies. 
To manage around them, each requires a unique approach.</p><h4><em>Policy: 60-second minimum per activation, and charged per second of uptime after that.</em></h4><p><strong>RA3:</strong> Since charges are based on cluster uptime, if your instance is not reserved, it is likely because you plan on managing cluster uptime based on less-frequent, periodic access. The risk here is triggering deactivation/reactivation within a minute, causing multiple 1-minute charges in that time. If possible, designate downtimes explicitly and orchestrate workloads accordingly.</p><p><strong>Serverless:</strong> Charges are based on resources in use only. These systems manage uptime automatically, shutting down after 1 hour of idle time. The docs are very unclear about how minimum charges apply to serverless instances, so we&#8217;ll clarify based on our experience:</p><ul><li><p>The 60-second minimum charge applies to every minute in which a workload runs, billed at the base capacity of the instance.</p></li><li><p>The per-second charge applies to the number of seconds within the minute during which a given number of RPUs (Redshift Processing Units) was in use.</p></li><li><p>You are charged for at least the full billable minutes at the selected base (minimum) capacity of your instance for each minute any workload is running. It&#8217;s possible to exceed the minimum charge if capacity is scaled to meet higher demand within that minute.</p></li></ul><p>Due to this policy, serverless systems are best suited for data systems that see irregular usage, as the heavy-handed minimum charge policy is offset by the lack of compute overhead when the system is idle.</p><p>Another thing to note about the serverless system is that 1 hour of idle time will trigger a deactivation event. If your company frequently goes more than 1 hour without a new workload, the serverless instance will time out. In our experience, when the system reactivates for the next workload, it introduces up to 10 seconds of latency. 
Remember, idle time is uncharged in a serverless setup, so deactivation offers no cost benefit, only a latency penalty. We&#8217;ve seen companies avoid instance deactivation by scheduling a simple query, such as &#8220;SELECT 1;&#8221; every 55 minutes or so; paying for an occasional minute of base capacity is a small price for avoiding latency issues.</p><div><hr></div><h4><strong>Takeaways</strong></h4><p>Amazon Redshift is a capable and versatile data warehouse, with options for a broad range of requirements. If you already use other AWS services, the benefits of Redshift, such as the Aurora/Redshift zero-ETL integration, are significant. However, the complexity involved in selecting, monitoring and maintaining an optimal system is high, with the potential for massive unnecessary expense. In the next part of this series, we&#8217;ll explore <strong>Google BigQuery</strong>, which takes a completely different approach to pricing and resource allocation.</p><p>For more on optimizing Redshift, we&#8217;ll soon release a breakdown of sort/dist keys and a comparison with the partitioning and clustering columns used by other modern cloud warehouses. Keep an eye out!</p><p>At Twing Data, we understand that no two businesses are alike, and neither are their data warehouse needs. With our expertise in query metadata analysis, organizations gain visibility into their Redshift usage patterns and the knowledge needed to make informed decisions. Reach out to learn how we can help you uncover hidden costs and turn insights into actionable savings.</p>]]></content:encoded></item><item><title><![CDATA[Exploring Redshift’s Approach to Query Hashing]]></title><description><![CDATA[We dug into Redshift's query hashing to see how it stacked up against Snowflake's approach. Spoiler alert: Redshift came out ahead.]]></description><link>https://blog.twingdata.com/p/exploring-redshifts-approach-to-query</link><guid isPermaLink="false">https://blog.twingdata.com/p/exploring-redshifts-approach-to-query</guid><dc:creator><![CDATA[Dan Goldin]]></dc:creator><pubDate>Thu, 23 Jan 2025 18:50:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7f641fd5-1967-41c3-8efd-d8705a0b7a6c_410x262.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few weeks ago, <a href="https://blog.twingdata.com/p/same-query-different-hash-an-exploration">we looked at examples</a> where Snowflake&#8217;s query_parameterized_hash failed to identify simple variants of a query pattern as the same query. This week, we did the same exercise for Redshift, expecting roughly the same behavior &#8211; but we were pleasantly surprised. 
Redshift&#8217;s query hashing is much more thoughtful than Snowflake&#8217;s: it&#8217;s clear they actually parse and analyze the query plan in order to generate the query hash.</p><p>There&#8217;s a hint that they take a more thoughtful approach in their <a href="https://docs.aws.amazon.com/redshift/latest/dg/SYS_QUERY_HISTORY.html">docs</a>, where they mention that &#8220;select * from table&#8221; and &#8220;select id from table&#8221; will generate the same hash if that table has a single column. That wouldn&#8217;t be possible from looking solely at the query text without a deeper understanding of the table structure.</p><p>Now on to the test cases:</p><h3>Single-value comparisons</h3><pre><code>SELECT a from (select 1 as id, 'a' as a) b where id in (2);
SELECT a from (select 1 as id, 'a' as a) b where id in (1);
SELECT a from (select 1 as id, 'a' as a) b where id = 2;
SELECT a from (select 1 as id, 'a' as a) b where id = 1;</code></pre><p>In this case, Redshift correctly recognized that these were the same query, unlike Snowflake, which only correctly handled the strict equality condition. But as soon as we expanded it to &#8220;where id in (1, 2),&#8221; Redshift generated a different query hash.</p><h3>Table aliasing</h3><pre><code>SELECT name from (select 1 as id, 'a' as a, 'Test' as name) where id = 1;
SELECT d.name from (select 1 as id, 'a' as a, 'Test' as name) as d where d.id = 1;
SELECT c.name from (select 1 as id, 'a' as a, 'Test' as name) as c where c.id = 1;
SELECT b.name from (select 1 as id, 'a' as a, 'Test' as name) as b where b.id = 1;
SELECT name from (select 1 as id, 'a' as a, 'Test' as name) b where id = 1;</code></pre><p>Redshift detected that these were all the same table and gave each the same hash. Snowflake, on the other hand, generated different query hashes for each.</p><h3>Table qualification</h3><pre><code>SELECT name from customers where customer_id = 1;
SELECT name from public.customers where customer_id = 1;
SELECT name from testing.public.customers where customer_id = 1;</code></pre><p>Unlike Snowflake, Redshift detected that these queries were all using the same table and gave each the same hash.</p><h3>Column aliasing</h3><pre><code>select name as other_name from (select 1 as id, 'a' as a, 'Test' as name) where id = 1;
select name from (select 1 as id, 'a' as a, 'Test' as name) where id = 1;</code></pre><p>In this case, Redshift generated different query hashes despite the structure of the query being identical.</p><h3>Column order</h3><pre><code>select city, name from (select 1 as id, 'a' as city, 'Test' as name) where id = 1;
select name, city from (select 1 as id, 'a' as city, 'Test' as name) where id = 1;</code></pre><p>Similarly, Redshift didn&#8217;t generate the same query hash here, despite the same data being returned.</p><h3>CTE order</h3><pre><code>with d as (select name from (select 1 as id, 'a' as a, 'Test' as name) where id = 2),
     c as (select name from (select 1 as id, 'a' as a, 'Test' as name) where id = 1)
     select c.name, d.name from c join d on 1=1;

with c as (select name from (select 1 as id, 'a' as a, 'Test' as name) where id = 1),
     d as (select name from (select 1 as id, 'a' as a, 'Test' as name) where id = 2)
     select c.name, d.name from c join d on 1=1;</code></pre><p>Once again, Redshift had trouble with the order.</p><p>Overall, Redshift did a better job of parsing than Snowflake, correctly matching the cases where the differences were in the table structure (aliasing), but it didn&#8217;t do as well with column renaming or when the clauses were reordered.</p><p>Here&#8217;s a quick side-by-side comparison of Redshift vs Snowflake for all the test cases.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CcsG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CcsG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png 424w, https://substackcdn.com/image/fetch/$s_!CcsG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png 848w, https://substackcdn.com/image/fetch/$s_!CcsG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png 1272w, https://substackcdn.com/image/fetch/$s_!CcsG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CcsG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png" width="908" height="410" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:410,&quot;width&quot;:908,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57039,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CcsG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png 424w, https://substackcdn.com/image/fetch/$s_!CcsG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png 848w, https://substackcdn.com/image/fetch/$s_!CcsG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png 1272w, https://substackcdn.com/image/fetch/$s_!CcsG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fc42c08-af57-4a5c-b7b2-779664c6457c_908x410.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[Start Tracking Your Unit Costs to Avoid Surprise Cloud Bills]]></title><description><![CDATA[Two key practices for managing cloud infrastructure costs, with real examples from building and scaling an advertising exchange]]></description><link>https://blog.twingdata.com/p/start-tracking-your-unit-costs-to</link><guid isPermaLink="false">https://blog.twingdata.com/p/start-tracking-your-unit-costs-to</guid><dc:creator><![CDATA[Dan Goldin]]></dc:creator><pubDate>Thu, 26 Dec 2024 23:00:45 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!Aa14!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Aa14!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Aa14!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png 424w, https://substackcdn.com/image/fetch/$s_!Aa14!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png 848w, https://substackcdn.com/image/fetch/$s_!Aa14!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!Aa14!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Aa14!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png" width="1400" height="800" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d60de923-83a5-4840-a300-0304928ef257_1400x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2022767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Aa14!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png 424w, https://substackcdn.com/image/fetch/$s_!Aa14!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png 848w, https://substackcdn.com/image/fetch/$s_!Aa14!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png 1272w, https://substackcdn.com/image/fetch/$s_!Aa14!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd60de923-83a5-4840-a300-0304928ef257_1400x800.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The current crop of cloud infrastructure is really good at a couple of things: getting you up and running quickly, and getting you to spend a lot of money.</p><p>It&#8217;s a great boost to productivity when you can get your AWS or Snowflake instance humming in minutes, but it&#8217;s easy to forget that there are actual computers &#8211; with nebulous pricing and processes &#8211; behind the scenes. The cost of managing data and infrastructure has left many a finance team scratching their heads.</p><p>There are horror stories of surprise bills (a <a href="https://aws.plainenglish.io/how-i-tackled-a-shocking-90-000-aws-bill-and-what-you-can-learn-from-it-e93446bf9b9d">$90,000 AWS bill</a>, an <a href="https://www.reddit.com/r/snowflake/comments/x2t8qn/over_8k_in_accidental_snowflake_charges/">$8,000 Snowflake charge</a>) that were dramatically higher than people expected. 
It&#8217;s easy to blame the data engineer or team who incurred the cost, but part of the issue is the design of the tools themselves. They&#8217;re not incentivized to safeguard your organization against wasteful spending, but there are a couple of best practices you can implement to prevent it.</p><p>There are two things you should do to get ahead of any unpleasant surprises:</p><ol><li><p><strong>Tag everything</strong>. What is the environment of this resource? What is the service or application that is consuming the resource? If it&#8217;s a query, which API or users are triggering the request?</p></li><li><p><strong>Track your unit costs</strong>. Most services you run should be correlated to some unit of your business. If you run an ecommerce business, your unit may be orders or customers. If it&#8217;s video streaming, your costs are going to be correlated with hours streamed and active users. If it&#8217;s healthtech, it might be a function of patient records or generated reports.</p></li></ol><p>The key is to understand what the primary driver of your service costs is and start tracking the unit costs over time. As things get more complex, you&#8217;ll also need different unit costs for different services and systems. The ultimate goal is to get to apples-to-apples comparisons on a recurring basis, to see if there are any significant changes. If anything looks off, you can then use the tagged costs to dig into those items.</p><p>At TripleLift, we refined a cost tracking system over multiple years, and it became increasingly automated. Since we were running an ad exchange, we knew that the majority of our infrastructure costs were a function of the number of bid requests we sent when running an ad auction. Our data pipeline costs, on the other hand, were correlated with the number of auctions run. 
By breaking these down and looking at trends over time, we could track patterns and dig into anomalies.</p><p>For example, when we introduced a new type of data job or report, we would see an increase in unit costs for our data pipeline. On the other hand, when we optimized our Spark clusters to use Spot instances, we saw a decrease in unit costs. Without this visibility, it would have been incredibly difficult to understand what was happening.</p><p>The benefits of this sort of tracking compound over time. You can start accurately forecasting your infrastructure costs. You&#8217;ll instill a culture of discipline and efficiency within your organization. And, no promises, but you may even score some hard-won brownie points with your finance team.</p><p>Here&#8217;s the <a href="https://www.linkedin.com/feed/update/urn:li:activity:7275248036342034433?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7275248036342034433%2C7275255780893757441%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287275255780893757441%2Curn%3Ali%3Aactivity%3A7275248036342034433%29">LinkedIn thread</a> that inspired this post.</p>]]></content:encoded></item></channel></rss>