Sizing Big Data Workloads: Key Numbers To Know
A cheat sheet for scoping big data workloads across compute, storage, and IO.
In our work at Twing Data, we’ve handled a wide range of data systems. One involved ingesting hundreds of billions of events through Kafka and writing them to S3; another emphasized real-time latency at lower volumes; a third focused on cost-efficiently querying terabytes of data stored on S3.
We put the following table together as a quick reference for architecting a data workload. The reference in each case is a 64 vCPU AWS Graviton instance suited to a typical workload; the table shows the key metrics and notes the first thing that will cause problems. It’s meant for quick back-of-the-napkin sizing rather than a full capacity plan.
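To make the “first thing that will cause problems” idea concrete, here is a minimal sketch of that kind of back-of-the-napkin check. All capacity and workload numbers below are illustrative assumptions, not figures from the table or from AWS specs; swap in the numbers for your own instance type and workload.

```python
# Hypothetical back-of-the-napkin bottleneck check.
# Every constant here is an assumption for illustration only.

# Assumed per-instance capacity (roughly a large 64 vCPU instance class)
VCPUS = 64
NETWORK_GBPS = 25                  # assumed network bandwidth, Gbit/s
DISK_MBPS = 1_000                  # assumed sustained disk throughput, MB/s
EVENTS_PER_VCPU_PER_SEC = 50_000   # assumed per-core processing rate

def first_bottleneck(events_per_sec: float, bytes_per_event: float) -> str:
    """Return which resource saturates first for a given workload."""
    mb_per_sec = events_per_sec * bytes_per_event / 1e6
    utilization = {
        "cpu": events_per_sec / (VCPUS * EVENTS_PER_VCPU_PER_SEC),
        "network": (mb_per_sec * 8 / 1_000) / NETWORK_GBPS,  # MB/s -> Gbit/s
        "disk": mb_per_sec / DISK_MBPS,
    }
    resource, load = max(utilization.items(), key=lambda kv: kv[1])
    return f"{resource} saturates first at {load:.0%} of capacity"

# Assumed workload: 2M events/sec at ~1 KB per event
print(first_bottleneck(events_per_sec=2_000_000, bytes_per_event=1_000))
```

For this hypothetical workload the disk comes out at roughly 200% of capacity, so you would shard across instances, add NVMe, or batch writes before worrying about CPU or network. The point is the shape of the check, not the specific numbers.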
Putting this together was a balancing act: there are many dimensions at play, from the details of the workload to the gamut of available instance types, each with its own EBS/NVMe configurations. Capturing it all in a single table would be impossible. Treat this as a starting point, then benchmark and test your workloads before committing to an approach.