A Decade of Major Cache Incidents at Twitter
Another is that there’s a cottage industry of former directors / VPs who tell self-aggrandizing stories about all the great things they did, stories that, to put it mildly, frequently distort the truth (there’s nothing stopping ICs from doing this either, but the most widely spread false stories we see tend to come from people on the management track). In both cases, there’s a kind of Gresham’s law of stories in play, where incorrect stories tend to win out over correct stories.
Every single incident so far has at least mentioned cache. In fact, for a long time, cache was probably the #1 cause of taking the site down.
In my first six months, every time I restarted a cache server, it was a SEV-0 by today’s standards. On a good day, you might have 95% Success Rate (SR) [for external requests to the site] if I restarted one cache …
Conceptually, a cache server is a high-throughput, low-latency RPC server plus a library that manages the data: memory and/or disk storage plus key-value indices.
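To make the two halves concrete, here's a minimal sketch in Go of a memcached-like server: a toy line-based get/set protocol over TCP (the protocol, port, and replies are invented for illustration; this is not Twitter's code). The socket handling and parsing stand in for the RPC side; the mutex-guarded map stands in for the data-management library.

```go
package main

import (
	"bufio"
	"net"
	"strings"
	"sync"
)

var (
	mu    sync.RWMutex
	store = map[string]string{}
)

func handle(conn net.Conn) {
	defer conn.Close()
	r := bufio.NewReader(conn)
	for {
		// RPC side, kernel half: read() from the socket.
		line, err := r.ReadString('\n')
		if err != nil {
			return
		}
		fields := strings.Fields(line)
		var reply string
		switch {
		case len(fields) == 3 && fields[0] == "set":
			// Data-management side: one hash-table insert.
			mu.Lock()
			store[fields[1]] = fields[2]
			mu.Unlock()
			reply = "STORED\r\n"
		case len(fields) == 2 && fields[0] == "get":
			// Data-management side: one hash-table lookup.
			mu.RLock()
			v, ok := store[fields[1]]
			mu.RUnlock()
			if ok {
				reply = "VALUE " + v + "\r\n"
			} else {
				reply = "NOT_FOUND\r\n"
			}
		default:
			reply = "ERROR\r\n"
		}
		// RPC side, kernel half again: write() back to the socket.
		if _, err := conn.Write([]byte(reply)); err != nil {
			return
		}
	}
}

func main() {
	// Hypothetical port, chosen only for this sketch.
	ln, err := net.Listen("tcp", "127.0.0.1:11311")
	if err != nil {
		panic(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go handle(conn)
	}
}
```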
For in-memory caches, the data management side should easily outpace the RPC side (a naive in-memory key-value library should be able to hit millions of QPS per core, whereas a naive RPC server that doesn’t use userspace networking, batching and/or pipelining, etc., will struggle to reach a tenth of that).
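As a rough sanity check on the data-management half, the sketch below times lookups against a plain Go map on one core. The numbers are illustrative only (they depend on hardware, key sizes, and access pattern, and a real cache's slab and index code differs from a language-level map); the point is just that even a naive key-value structure turns over millions of operations per second, far more than a naive RPC stack will push.

```go
// Single-core microbenchmark of a naive in-memory key-value store,
// ignoring networking entirely.
package main

import (
	"fmt"
	"strconv"
	"time"
)

func main() {
	const nKeys = 1 << 20 // ~1M small keys, resident in RAM

	store := make(map[string][]byte, nKeys)
	keys := make([]string, nKeys)
	for i := 0; i < nKeys; i++ {
		k := "key:" + strconv.Itoa(i)
		keys[i] = k
		store[k] = []byte("value")
	}

	const nOps = 10_000_000
	hits := 0
	start := time.Now()
	for i := 0; i < nOps; i++ {
		if _, ok := store[keys[i%nKeys]]; ok {
			hits++
		}
	}
	elapsed := time.Since(start)

	fmt.Printf("%d gets in %v (~%.1fM ops/sec on one core), hits=%d\n",
		nOps, elapsed, float64(nOps)/elapsed.Seconds()/1e6, hits)
}
```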
Compared to most workloads, cache is more sensitive to performance anomalies below it in the stack (kernel, firmware, hardware, etc.) because it tends to have relatively high-volume and low-latency SLOs (the point of cache is that it’s fast) and because, barring things like userspace networking, it spends a lot of its time in the kernel (~80% as a ballpark for Twitter memcached running normal kernel networking).
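One rough way to check that user/kernel split for a running cache process on Linux is to sample utime (user) and stime (kernel) from /proc/&lt;pid&gt;/stat, as in the sketch below. This is generic Linux accounting, not Twitter tooling; the ~80% ballpark above is this kind of split.

```go
// Print the approximate share of a process's CPU time spent in the kernel,
// from the utime/stime fields of /proc/<pid>/stat (Linux only).
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuTicks returns (utime, stime) in clock ticks for the given pid.
func cpuTicks(pid int) (uint64, uint64, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, 0, err
	}
	// The comm field (2) can contain spaces, so parse from the closing ')'.
	// After it, fields are space-separated starting at field 3 (state),
	// which puts utime (field 14) at index 11 and stime (field 15) at 12.
	rest := strings.Fields(string(data)[strings.LastIndex(string(data), ")")+1:])
	utime, err := strconv.ParseUint(rest[11], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	stime, err := strconv.ParseUint(rest[12], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	return utime, stime, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: kernelshare <pid>")
		os.Exit(1)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		panic(err)
	}
	u, s, err := cpuTicks(pid)
	if err != nil {
		panic(err)
	}
	if u+s == 0 {
		fmt.Println("no CPU time accounted yet")
		return
	}
	fmt.Printf("utime=%d stime=%d ticks, ~%.0f%% of CPU time in the kernel\n",
		u, s, 100*float64(s)/float64(u+s))
}
```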
Also, because cache servers often run a small number of threads, cache is relatively sensitive to being starved by other workloads sharing the same underlying resources (CPU, memory, disk, etc.). The high-volume and low-latency SLOs also worsen the positive feedback loops that lead to a “death spiral”, a classic distributed systems failure mode.
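To see why those tight SLOs make the feedback loop so vicious, here's a toy discrete-time simulation (all numbers are invented, not a model of Twitter's systems): a few seconds of CPU starvation cut capacity, queueing delay blows past the client timeout, timed-out clients resend on top of normal demand, and the node never digs out even after the starvation ends, partly because it keeps burning capacity on requests whose callers have already given up.

```go
// Toy simulation of a retry-driven "death spiral" at a single cache node.
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		demand   = 90_000.0  // steady client demand, requests/sec
		capacity = 100_000.0 // healthy serving capacity, requests/sec
		timeout  = 0.05      // client timeout, seconds of queueing delay
		steps    = 15        // one-second ticks
	)

	backlog, retries := 0.0, 0.0
	for t := 0; t < steps; t++ {
		capNow := capacity
		if t >= 3 && t < 6 {
			capNow = 40_000 // ticks 3-5: starved of CPU by a co-located workload
		}

		offered := demand + retries + backlog
		served := math.Min(offered, capNow)
		backlog = offered - served

		// Crude retry model: everything queued past the timeout is resent.
		if backlog/capNow > timeout {
			retries = backlog
		} else {
			retries = 0
		}

		fmt.Printf("t=%2d cap=%7.0f offered=%9.0f backlog=%9.0f retries=%9.0f\n",
			t, capNow, offered, backlog, retries)
	}
}
```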