Tactics for handling high cardinality, high dimensionality, timeseries data
A survey of a few different approaches for large scale data challenges, with examples from the adtech industry.
One challenge faced by many companies is how to handle high cardinality, high dimensional, time series data. In this post, we will survey a few different approaches to address this issue, using the advertising industry as an example since that’s what we’re familiar with, although these methods can be applied to other industries as well.
A bit about AdTech
In adtech, it is essential to track various metrics related to the ad auction, the rendering, and the user interaction. To capture as much information as possible about these events, we gather data from different sources. This includes user device information (browser, operating system), website details (domain, page URL, content classification), user-derived information (IP address for geographic data and unique identifier), and ad-specific information (brand, creative, campaign). As a result, a large number of dimensions are tracked every time an advertising auction is executed and an ad is served, which leads to hundreds of billions of events per day.
Most dimensions, like browser and operating system, have low cardinality, but some, such as postal code and creative identifier, have extremely high cardinality. The real challenge, however, lies in the combinations of these dimensions. For instance, if you have X creatives running in Y postal codes, that alone produces X * Y combinations. Combine that with other dimensions, especially independent ones, and the number of possible rows explodes. The time dimension adds a whole other level of complexity: imagine doing the above by day, and then aggregating across weeks and months to run time period comparisons.
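To make the combinatorial explosion concrete, here is a toy calculation with hypothetical per-dimension cardinalities (the numbers are purely illustrative, not from any real dataset):

```python
# Illustrative only: hypothetical per-dimension cardinalities.
cardinalities = {
    "browser": 10,
    "operating_system": 8,
    "postal_code": 40_000,
    "creative_id": 50_000,
}

combos = 1
for dim, n in cardinalities.items():
    combos *= n

print(f"worst-case distinct combinations: {combos:,}")  # 160,000,000,000
```

Not every combination actually occurs, but the worst case grows multiplicatively with every independent dimension you add, and then again with every time bucket you keep.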
So, how can you handle these challenges? The following are some approaches I’ve seen to address this issue, but I would also love to hear any other suggestions you may have. As is often the case in the real world, there is no one solution, and you’ll often end up with a portfolio of approaches based on your unique requirements.
Store everything
The simplest approach is to just store everything and deal with the associated cost. When you’re small, it’s not a huge problem, and it’s nice to be able to dump everything you want into a single place and have it immediately available for analysis and reporting. Unfortunately, this doesn’t scale well: while loading the data stays easy, you’ll quickly discover that storage costs become significant and queries become either slow or expensive.
The nice thing is that you can start exploring a variety of data warehouses to see which gives you the best tradeoff between cost and performance. There are the traditional data warehouses such as Redshift, Snowflake, and BigQuery, but it’s also worth checking out Druid and ClickHouse, which are purpose-built for this kind of workload but come with their own challenges around hosting and updating historical data.
Storage optimization
Building on the previous approach, separating storage and compute can help optimize overall efficiency. If you’re thoughtful about how you store the data, both in the formats you use (Parquet, Iceberg) and in how it’s partitioned (by date, customer, etc.), you can get pretty far. This approach proves particularly useful when combined with other techniques, like rolling up.
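As a minimal sketch of what that looks like in practice, here is partitioned Parquet output with pandas; the column names and output path are hypothetical:

```python
import pandas as pd

# Hypothetical event data with a date and customer column.
events = pd.DataFrame({
    "event_date": ["2023-05-01", "2023-05-01", "2023-05-02"],
    "customer_id": ["acme", "globex", "acme"],
    "creative_id": ["c1", "c2", "c3"],
    "impressions": [120, 45, 300],
})

# Write columnar Parquet, partitioned by date and customer, so queries that
# filter on those columns only scan the relevant directories.
events.to_parquet(
    "events_parquet",  # output directory (assumed path)
    engine="pyarrow",
    partition_cols=["event_date", "customer_id"],
)
```

The same idea carries over to whichever query engine sits on top: partition pruning means a query for one customer on one day touches a tiny fraction of the stored bytes.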
Roll up
The first step in optimizing large data volumes is often to aggregate the data to reduce cardinality, instead of storing everything. With this approach, you analyze your usage patterns and determine how to consolidate the data. One way is to eliminate infrequently used or unused dimensions entirely, while another is to roll them up to a higher level.
Using the adtech example mentioned earlier, if we observe that no one looks at postal code beyond 30 days, we can create monthly aggregates that exclude the postal code dimension. The same analysis can be done for every dimension to see what’s useful to keep and what can be removed, or simply made accessible in a less optimized way. Time itself is a dimension that gets rolled up into less granular buckets: initially, the data may have a precise timestamp, but as it ages, it can be rolled up to the hour, day, week, month, and year levels.
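A sketch of such a monthly roll-up with pandas, assuming a raw events frame with these hypothetical columns; postal_code is dropped from the aggregate because, in this example, nobody queries it beyond 30 days:

```python
import pandas as pd

raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-04-03 10:15", "2023-04-21 18:02", "2023-05-07 09:44",
    ]),
    "creative_id": ["c1", "c1", "c2"],
    "postal_code": ["94105", "10001", "94105"],
    "impressions": [10, 7, 12],
    "clicks": [1, 0, 2],
})

monthly = (
    raw
    # Truncate each timestamp to the first day of its month.
    .assign(month=raw["timestamp"].dt.to_period("M").dt.to_timestamp())
    # postal_code is intentionally omitted from the grouping keys.
    .groupby(["month", "creative_id"], as_index=False)
    .agg(impressions=("impressions", "sum"), clicks=("clicks", "sum"))
)
print(monthly)
```

In a real pipeline the same pattern would run as a scheduled job (or a SQL materialization) that writes the aggregate alongside, or in place of, the aging raw data.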
Cardinality Consolidation
Most dimensions have a long tail, but we often care only about the head (e.g., major cities, highly populated countries, top domains, biggest advertising campaigns). Taken together, the tail is significant, but its individual rows are hard to analyze, parse, and reason about. You can take advantage of this by grouping tail values into an “Other” category when aggregating the data. One approach is to keep the values that together account for 80% of the volume and group the remainder into “Other.” It sounds simple, but calculating the cutoff is tricky: each day will have slightly different values along the boundary, which leads to messy data. One way to handle that is to compute the cutoff less frequently and over a longer time period. Also note that if you do go down this road, you’ll inevitably need a way to exclude specific values from being masked, even if they fall in the tail.
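Here is a sketch of the 80% volume cutoff, assuming a frame of per-domain volumes; always_keep is a hypothetical allowlist of tail values that must never be masked. The cutoff would be computed over a longer window (say, 30 days) and the resulting keep-set reused when aggregating daily data, so the boundary stays stable:

```python
import pandas as pd

volumes = pd.DataFrame({
    "domain": ["news.example", "video.example", "blog.example", "tiny.example"],
    "events": [500_000, 300_000, 150_000, 50_000],
})
always_keep = {"tiny.example"}  # hypothetical allowlist, never masked

# Rank by volume and keep values whose cumulative share is within 80%.
ranked = volumes.sort_values("events", ascending=False)
share = ranked["events"].cumsum() / ranked["events"].sum()
head = set(ranked.loc[share <= 0.80, "domain"]) | always_keep

# Map everything outside the head (and allowlist) to "Other".
volumes["domain_rollup"] = volumes["domain"].where(
    volumes["domain"].isin(head), other="Other"
)
print(volumes)
```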
Sampling
The final approach I would like to discuss is data sampling. The simplest way to deal with high data volumes is to not capture all of it. If feasible for your business, you can implement naive data sampling: collecting 1% of your analytics data can still provide confidence in the results, provided you have large enough volumes. Alternatively, you may find that not all data holds the same value, allowing for more aggressive sampling of lower-value data. The idea is to collect a percentage of events, such as 1%, and treat each collected event as representing 100 events. This makes fine-grained slicing and dicing less accurate but still yields accurate aggregate results. The tradeoff, however, is that intentionally collecting less data may not be suitable if you frequently need to conduct in-depth analysis of rare events.
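A minimal sketch of deterministic 1% sampling, assuming each event carries some stable identifier (here a hypothetical auction_id); every kept event gets a weight of 100 so downstream aggregates still estimate full-population totals:

```python
import hashlib

SAMPLE_RATE = 0.01

def keep_event(auction_id: str) -> bool:
    # Hash the id into [0, 1) so a given event is always kept or always dropped.
    digest = hashlib.sha256(auction_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

# Hypothetical stream of events.
events = [{"auction_id": f"a-{i}", "impressions": 1} for i in range(100_000)]

sampled = [
    {**e, "weight": 1 / SAMPLE_RATE}  # each kept event represents ~100 events
    for e in events
    if keep_event(e["auction_id"])
]

# Aggregates are computed on weighted values, not raw counts.
estimated_impressions = sum(e["impressions"] * e["weight"] for e in sampled)
print(len(sampled), estimated_impressions)
```

Hashing an identifier rather than rolling a random number per event keeps the sample consistent across reprocessing, and lets you vary the rate by event type if some data is worth sampling more aggressively.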
I hope these techniques have offered a brief overview of the available options, but the real value comes from combining them. A mature approach uses all of the above to strike the right balance between cost, performance, and accuracy: intelligent sampling for a subset of your event data, plus various roll-up processes that aggregate data across different dimensions and time intervals. Simultaneously, retaining raw data for a certain period allows for the rare yet important deep-dive investigation. The optimal mix will be unique to each business, but I hope the above can act as a good starting point.