
Building And Operating a Pretty Big Storage System Called S3


The first hard drive was built by Jacob Rabinow, who was a researcher for the predecessor of the National Institute of Standards and Technology (NIST).

Rabinow was an expert in magnets and mechanical engineering, and he’d been asked to build a machine to do magnetic storage on flat sheets of media, almost like pages in a book. He decided that idea was too complex and inefficient, so, stealing the idea of a spinning disk from record players, he built an array of spinning magnetic disks that could be read by a single head.

The first commercially available hard disk appeared 7 years later in 1956, when IBM introduced the 350 disk storage unit, as part of the 305 RAMAC computer system.

Today, consumer devices are effectively all solid-state, and a large amount of enterprise storage is similarly switching to SSDs. Jim Gray predicted this direction in 2006, when he very presciently said: “Tape is Dead. Disk is Tape. Flash is Disk. RAM Locality is King.”

Hard disks don’t fill the role of general storage media that they used to because they are big (physically and in terms of bytes), slow, and relatively fragile pieces of media. For almost every common storage application, flash is superior.

Imagine a hard drive head as a 747 flying over a grassy field at 75 miles per hour. The air gap between the bottom of the plane and the top of the grass is two sheets of paper. Now, if we measure bits on the disk as blades of grass, the track width would be 4.6 blades of grass wide and the bit length would be one blade of grass. As the plane flew over the grass it would count blades of grass and only miss one blade for every 25 thousand times the plane circled the Earth.

Since the RAMAC, capacity has improved 7.2M times over, while the physical drive has gotten 5,000x smaller. It’s 6 billion times cheaper per byte in inflation-adjusted dollars. But despite all that, seek times — the time it takes to perform a random access to a specific piece of data on the drive — have only gotten 150x better. Why? Because they’re mechanical.

If we do a bad job of managing heat (the concentration of access load on individual drives), then we end up focusing a disproportionate number of requests on a single drive, and we create hotspots because of the limited I/O that’s available from that single disk.

Hotspots are small numbers of overloaded drives in a system that end up getting bogged down, resulting in poor overall performance for requests dependent on those drives. When you get a hot spot, things don’t fall over, but requests queue up and the customer experience is poor. Unbalanced load stalls requests that are waiting on busy drives; those stalls propagate up through layers of the software storage stack, they get amplified by dependent I/Os for metadata lookups or erasure coding, and they result in a very small proportion of higher-latency requests, or “stragglers.”

In other words, hotspots at individual hard disks create tail latency, and ultimately, if you don’t stay on top of them, they grow to eventually impact all request latency.
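To make that amplification concrete, here is a minimal back-of-the-envelope sketch (my own illustration, not S3 code): if each drive a request touches is independently overloaded with probability p, a request that fans out to n drives stalls with probability 1 − (1 − p)^n.

```python
# A minimal sketch (not S3 code): per-drive stalls amplify into request-level
# tail latency once a single request depends on several drives, e.g. for
# metadata lookups or reading erasure-coded shards.

def p_request_stalled(p_drive_slow: float, drives_per_request: int) -> float:
    """Probability that at least one drive the request touches is a hotspot."""
    return 1.0 - (1.0 - p_drive_slow) ** drives_per_request

# Even if only 1% of drives are overloaded, a request that fans out to
# 9 drives waits on a hotspot almost 9% of the time.
for fan_out in (1, 5, 9):
    print(f"fan-out {fan_out}: {p_request_stalled(0.01, fan_out):.3f}")
```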

The more workloads we run on S3, the more that individual requests to objects become decorrelated from one another.

Individual storage workloads tend to be really bursty; in fact, most storage workloads are completely idle most of the time and then experience sudden load peaks when data is accessed. That peak demand is much higher than the mean. But as we aggregate millions of workloads, a really, really cool thing happens: the aggregate demand smooths and becomes far more predictable.
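As a hedged illustration of that smoothing (simulated numbers, not S3 measurements), the sketch below aggregates many independent, mostly-idle workloads and compares the peak-to-mean ratio of one workload against that of the combined demand.

```python
# A hedged illustration with simulated numbers (not S3 measurements):
# aggregating many independent, bursty workloads shrinks the peak-to-mean
# ratio of the combined demand.
import random

def bursty_workload(steps, burst_prob=0.02, burst_size=100.0):
    """Mostly idle; occasionally issues a burst of requests."""
    return [burst_size if random.random() < burst_prob else 0.0 for _ in range(steps)]

def peak_to_mean(series):
    mean = sum(series) / len(series)
    return max(series) / mean if mean else float("inf")

random.seed(42)
steps, n_workloads = 5_000, 1_000
workloads = [bursty_workload(steps) for _ in range(n_workloads)]
aggregate = [sum(w[t] for w in workloads) for t in range(steps)]

print("single workload peak/mean:", round(peak_to_mean(workloads[0]), 1))  # roughly 50x
print("aggregate demand peak/mean:", round(peak_to_mean(aggregate), 1))    # roughly 2x
```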

We obviously don’t want to pay a replication overhead for all of the data that we store, so in S3 we also make use of erasure coding. For example, we use an algorithm such as Reed-Solomon to split our object into a set of k “identity” shards. Then we generate an additional set of m parity shards. As long as any k of the (k + m) total shards remain available, we can read the object. This approach lets us reduce capacity overhead while surviving the same number of failures.
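To make the capacity math concrete, here is a small sketch with illustrative shard counts (the specific k and m are assumptions, not S3’s published parameters): with k identity shards and m parity shards, the on-disk overhead is (k + m) / k, and the object survives the loss of any m shards.

```python
# A sketch of the capacity math behind erasure coding; the shard counts are
# illustrative assumptions, not S3's published layout. With k identity shards
# and m parity shards, any k of the k + m shards reconstruct the object, so up
# to m shards can be lost.

def storage_overhead(k: int, m: int) -> float:
    """Bytes stored on disk per byte of object data."""
    return (k + m) / k

# Triple replication tolerates 2 lost copies at a 3.0x footprint.
print("3-way replication: overhead 3.0x, tolerates 2 losses")

# Illustrative erasure-coding parameters: 5 identity shards + 4 parity shards.
k, m = 5, 4
print(f"{k}+{m} erasure coding: overhead {storage_overhead(k, m):.1f}x, tolerates {m} losses")
```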

One of the core tenets at Amazon is that we want engineers and teams to fail fast, and safely. We want them to always have the confidence to move quickly as builders, while still remaining completely obsessed with delivering highly durable storage. One strategy we use to help with this in S3 is a process called “durability reviews.” It’s a human mechanism that’s not in the statistical 11 9s model, but it’s every bit as important.

From a verification perspective, we built a simplified model of ShardStore’s logic (also in Rust) and checked it into the same repository alongside the real production ShardStore implementation. This model dropped all the complexity of the actual on-disk storage layers and hard drives, and instead acted as a compact but executable specification. It wound up being about 1% of the size of the real system, but allowed us to perform testing at a level that would have been completely impractical to do against a hard drive with 120 available IOPS. We even managed to publish a paper about this work at SOSP.
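The sketch below is a rough Python analogue of that reference-model idea (the real ShardStore and its model are written in Rust, and every name here is invented for illustration): a toy implementation is replayed against a trivially simple in-memory specification on random operation sequences, and any divergence in observable behaviour is flagged.

```python
# A rough Python analogue of the reference-model approach (the real ShardStore
# and its model are written in Rust; every name below is invented for
# illustration). A toy implementation is replayed against a trivially simple
# in-memory specification, and any divergence in observable behaviour fails.
import random

class ReferenceModel:
    """Executable specification: a shard store behaves like a key-value map."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

class ToyShardStore:
    """Stand-in for the real implementation: an append-only log plus an index."""
    def __init__(self):
        self.log = []      # append-only record of written values
        self.index = {}    # key -> position of its latest value in the log
    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append(value)
    def get(self, key):
        pos = self.index.get(key)
        return self.log[pos] if pos is not None else None

random.seed(0)
model, impl = ReferenceModel(), ToyShardStore()
for _ in range(10_000):
    key = random.randrange(16)
    if random.random() < 0.5:
        value = random.randrange(1_000)
        model.put(key, value)
        impl.put(key, value)
    else:
        assert impl.get(key) == model.get(key), f"divergence on key {key}"
print("toy implementation matched the reference model on this run")
```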

The really cool bit of this work wasn’t anything to do with either designing ShardStore or using formal verification tricks. It was that we managed to kind of “industrialize” verification, taking really cool but kind of research-y techniques for program correctness and getting them into code where normal engineers who don’t have PhDs in formal verification can contribute to maintaining the specification, and that we could continue to apply our tools with every single commit to the software.


Date
November 30, 2023