Reducing Logging Cost by Two Orders of Magnitude Using Clp

Long, long ago, the amount of data our systems output to logs was small enough that we were able to retain all of the log files. This allowed our engineers to freely analyze the logs, say for troubleshooting our systems or improving applications

But as Uber’s business grew rapidly, the amount of data being logged increased dramatically. And so we were forced to discard log files after just a short period of time, given the prohibitive cost of retaining them

In aggregate, CLP achieves a 169x compression ratio on our log data, saving storage, memory, and disk/network bandwidth at every level. As a result, we can now retain all logs at a fraction of the cost, without throwing away any insights, and the compressed logs can be efficiently searched without decompression

On a busy day, our Spark cluster alone can generate up to 200TB of logs (at the default INFO verbosity level).

At Uber, we rely on making data-driven decisions at every level. For this, we have built a large-scale big data platform that runs over 250,000 Spark analytics jobs per day, where each job could consist of hundreds of thousands of executors, processing over a hundred petabytes of analytical data.

Unfortunately, as our log size continued to grow, things got worse: our SSDs started to burn out prematurely. Log files are typically written a few messages at a time, spread out over the duration of the job. On our memory-constrained systems, this causes many small writes over a long period of time, which in turn causes SSD write-amplification. As a mitigation, we reduced the logging level from INFO to WARN, but this is contrary to what our users want!

Data scientists want to examine old Spark logs to examine and improve the quality of their machine learning applications. As a result, our users would frequently ask our infrastructure team to increase the log retention period from three days (the period at the time) to at least a month.

General purpose compressors are not ideal for our purposes for a few reasons. First, they do not exploit the unique characteristics of logs, resulting in a suboptimal compression ratio. Compared to other textual data, logs are much more repetitive

CLP compresses the logs and allows search directly on the compressed logs. CLP parses the unstructured logs and turns them into a structured table, where each message is a row and the different components including the timestamp, log type, and variables, are different columns. CLP deduplicates those highly repetitive columns using dictionaries and only a dictionary index is stored in the table instead of the raw string. CLP finally compresses this table using Zstandard, but in a column-oriented manner, significantly increasing its capability to find duplication. As a result, CLP has an unprecedented compression ratio (significantly exceeding the capabilities of Gzip or Zstandard alone), while allowing users to search the compressed logs without decompression back to plain-text, as it’s enabled by using the dictionaries as indexes

Logs are primarily written using Log4j, a logging library used by Spark and many other big data platforms

Log4j allows developers to use custom logic to persist log messages, in the form of a Log4j appender

Date

October 7, 2022

Up next

2022-10-14 Recently been thinking about the concept of “heartbeats” in relationships. It seems like, for me, intimate relationships have historically involved

Previously

2022-10-07 I recently went to a small town in upstate New York. I was there to see a play. The play was written by a close friend, and it was the first time