Twing Data

Twing Data

Share this post

Twing Data
Twing Data
Handling Personal Data Deletion in AdTech Data Systems
User's avatar
Discover more from Twing Data
Newsletter from Twing Data which will contain product and feature announcements as well as thoughts on the data world.
Already have an account? Sign in

Handling Personal Data Deletion in AdTech Data Systems

We’re outlining four approaches to data deletion and evaluating tradeoffs between them.

Dan Goldin's avatar
Dan Goldin
May 14, 2025
1

Share this post

Twing Data
Twing Data
Handling Personal Data Deletion in AdTech Data Systems
Share

In many software engineering disciplines, data can be ephemeral. Say you have a buggy API — you can deploy a new version and the past is forgotten. But data engineering has a lot more state, and is far more sticky as a result. Once you collect some data, it’s likely that it will stick around for years and you’re going to have to support it.

A common challenge every company has to deal with is personal data. As a consumer, you want to make sure companies are protecting your data, or ideally not even storing it in the first place. Various regions, countries, states, etc have their own regulations. Europe has GDPR, California has the CCPA, Canada has PIPEDA, Brazil has LGPD, and so on.

Adtech data tends to not be as sensitive as others — we’re not dealing with government identities or payment information, after all — but there is the matter of user browsing data. For example, if you’re using a browser that supports third party cookies (like Chrome), your behavior can be tracked from site to site, and companies pay to generate a targeting profile for you. If you’re using a browser that only supports first party cookies (like Safari) you can’t be tracked as easily across sites, but your browsing behavior is still captured within the same site.

For this reason, the aforementioned regulations often have a requirement around data deletion requests. If you’re storing this personal data, you have an obligation to delete it when asked. The types of data that qualifies as personal data varies between companies and industries, but for adtech it’s typically the user id, IP address, user agent — anything that allows you to identify and track individual users.

Going back to deletions, in a typical database, the deletion would be a simple “DELETE FROM TABLE WHERE user_id = ‘abc’” — but adtech data isn’t stored in a simple database. Instead, we have hundreds of billions of records that have some personal data captured each day and for cost reasons usually stored on S3 in some compressed and analytics-optimized format, such as Parquet.

So what do you do when a user wants to delete their data? The cookie that’s stored in the browser is easy to delete, but what do you do with the trillions of rows scattered across these Parquet files going back for months? You can run the “DELETE FROM TABLE .. “ command above, but what it will do is read the terabytes/petabytes of data into memory, filter and remove the relevant rows, then write it back to S3. It’s doable, but it’s slow and incredibly expensive. Think on the order of thousands or tens of thousands of dollars for a single user deletion request. It’s not obvious how to do it.

Four Approaches To Data Deletion

Don’t store the data at all

What’s not stored doesn’t need to be removed. But that comes at a cost: the data is needed to run a variety of analyses. Being able to count the distinct numbers of users, calculate trends, or use the data to cluster users to create these advertiser audiences in order to improve your ad targeting and hopefully improve the return of your ad spend.

Best for: Strict compliance and lowest storage and compute cost.

Tradeoff: The data is not available for analysis.

Store the data for a limited time

Alternatively, store the data for the minimum time you need, and then purge the rest. There’s still a cost to the removal, but if you only retain this data for a week then your deletion query only needs to run for a week’s worth of data.

We can extend the above to be even better by storing the personal data separately. Instead of storing the personal data in the same files as the rest we can split it into multiple file types with their own retention policies. That way the personal data can have a shorter retention window and you avoid having to run the deletion request on the full dataset. This allows the non-personal data to still be used for analysis and investigations without incurring the cost of storing the personal data.

Best for: Having some personal data available with low operational costs.

Tradeoff: Deletion still costs money, but the cost and likelihood of needing to run a deletion is low.

Store hashed values instead of original values

With this approach, the data isn’t tied back to the original user, but you are still able to run the bulk of the analysis you did originally. You still delete the cookie in the browser, but at that point the link is broken and you can let the data naturally phase out based on your retention policies. The key here is that if a deletion request comes in, you only have to remove it from the browser. No need to worry about the data stored in S3.

Best for: Having long term pseudonymized personal data available for a subset of analyses (counts, discount counts, various grouping).

Tradeoff: You cannot easily map this data back to the original user id, which limits the ability to associate properties with users — for example, associating users with a specific advertising audience.

Use a mapping table

The challenge with the one way hash approach is that it’s still deterministic. If you know what the user’s cookie id was and the hashing approach, you can rederive the hashed value. But we can make the association truly random by generating a completely random id, then using a mapping table to connect the cookie id to a random id that will be what’s stored in the S3 logs. When a user submits a deletion request, we purge the mapping entry, breaking the link similar to the prior approach, also letting it phase out naturally with the retention policy. Similar to the solution above, we remove it from a much cheaper and quicker data store — the mapping table — without having to touch anything on S3. As part of this step, we can also do full anonymization on the fields we don’t need details for; for example, just mapping user agent to a browser version, and stripping the last octet of the IP address.

Best for: Being able to do advanced and robust analysis and audience creation.

Tradeoff: This is more operationally complex and costly than the others.

For each of these strategies, it’s all about evaluating tradeoffs. Product and data science teams typically want as much data as they can get, since it’s what they use to innovate. On the other hand, legal and compliance teams want to limit the risk. And the data engineering teams are charged with walking that tightrope, all while keeping costs down so they can work on other problems. It’s no easy feat, but we hope these approaches help you think through the best approach for your team.


Subscribe to Twing Data

Launched 2 years ago
Newsletter from Twing Data which will contain product and feature announcements as well as thoughts on the data world.
Adam Kabak's avatar
1 Like
1

Share this post

Twing Data
Twing Data
Handling Personal Data Deletion in AdTech Data Systems
Share

Discussion about this post

User's avatar
Building an open data pipeline in 2024
Using Iceberg allows us to pick the optimal "big data" compute environment for the specific requirements we have. There's no need to limit yourself to a…
Apr 26, 2024 • 
Dan Goldin
10

Share this post

Twing Data
Twing Data
Building an open data pipeline in 2024
Identify unused columns in Snowflake and other data warehouses
Identify unused columns in your data warehouse to reduce cost and improve performance. We provide two ways - one using a Snowflake query and the other a…
Mar 14, 2024 • 
Dan Goldin
1

Share this post

Twing Data
Twing Data
Identify unused columns in Snowflake and other data warehouses
Embrace the Differences Between Development and Production Environments for Data Engineering
Rather than try to align development and production environments for data engineer we should instead move to a world where SQL is the universal…
Apr 11, 2024 • 
Dan Goldin
1

Share this post

Twing Data
Twing Data
Embrace the Differences Between Development and Production Environments for Data Engineering

Ready for more?

© 2025 Twing Data, Inc
Privacy ∙ Terms ∙ Collection notice
Start writingGet the app
Substack is the home for great culture

Share

Create your profile

User's avatar

Only paid subscribers can comment on this post

Already a paid subscriber? Sign in

Check your email

For your security, we need to re-authenticate you.

Click the link we sent to , or click here to sign in.