Example Generated Summary From Recent Highlights 2024-06-07

Input Highlights

These highlights come from Brendan Gregg’s Systems Performance.

These are all the highlights used in generating the summary; they come from reading between 2024-06-04 and 2024-06-07.
  • Off-CPU analysis is the study of threads that are not currently running on a CPU: This state is called off-CPU
  • It includes all the reasons that threads block: disk I/O, network I/O, lock contention, explicit sleeps, scheduler preemption, etc. The analysis of these reasons and the performance issues they cause typically involves a wide variety of tools.
  • Off-CPU profiling can be performed in different ways, including:
    ◦ Sampling: Collecting timer-based samples of threads that are off-CPU, or simply all threads (called wallclock sampling).
    ◦ Scheduler tracing: Instrumenting the kernel CPU scheduler to time the duration that threads are off-CPU, and recording these times with the off-CPU stack trace. Stack traces do not change when a thread is off-CPU (because it is not running to change it), so the stack trace only needs to be read once for each blocking event.
    ◦ Application instrumentation: Some applications have built-in instrumentation for commonly blocking code paths, such as disk I/O.
  • The first two approaches are preferable as they work for all applications and can see all off-CPU events; however, they come with major overhead. Sampling at 49 Hertz should cost negligible overhead on, say, an 8-CPU system, but off-CPU sampling must sample the pool of threads rather than the pool of CPUs. The same system may have 10,000 threads, most of which are idle, so sampling them increases the overhead by 1,000x (imagine CPU-profiling a 10,000-CPU system). Scheduler tracing can also cost significant overhead, as the same system may have 100,000 scheduler events or more per second.
  • Apart from scheduler events, syscall events are another useful target for studying applications.
  • System calls (syscalls) can be instrumented for the study of resource-based performance issues. The intent is to find out where syscall time is spent, including the type of syscall and the reason it is called.
  • New process tracing: By tracing the execve(2) syscall you can log new process execution, and analyze issues of short-lived processes. See the execsnoop(8) tool
  • I/O profiling: Tracing read(2)/write(2)/send(2)/recv(2) and their variants, and studying their I/O sizes, flags, and code paths, will help you identify issues of suboptimal I/O, such as a large number of small I/O. See the bpftrace tool
  • Kernel time analysis: When systems show a high amount of kernel CPU time, often reported as “%sys,” instrumenting syscalls can locate the cause.
  • Syscalls are a well-documented API (man pages), making them an easy event source to study. They are also called synchronously with the application, which means that collecting stack traces from syscalls will show the application code path responsible. Such stack traces can be visualized as a flame graph.
  • the USE method checks the utilization, saturation, and errors of all hardware resources. Many application performance issues may be solved this way, by showing that a resource has become a bottleneck.
  • This is a list of nine thread states I’ve chosen to give better starting points for analysis than the two earlier states (on-CPU and off-CPU):
    ◦ User: On-CPU in user mode
    ◦ Kernel: On-CPU in kernel mode
    ◦ Runnable: And off-CPU waiting for a turn on-CPU
    ◦ Swapping (anonymous paging): Runnable, but blocked for anonymous page-ins
    ◦ Disk I/O: Waiting for block device I/O: reads/writes, data/text page-ins
    ◦ Net I/O: Waiting for network device I/O: socket reads/writes
    ◦ Sleeping: A voluntary sleep
    ◦ Lock: Waiting to acquire a synchronization lock (waiting on someone else)
    ◦ Idle: Waiting for work
  • Performance for an application request is improved by reducing the time in every state except idle. Other things being equal, this would mean that application requests have lower latency, and the application can handle more load.
  • In a distributed environment, an application may be composed of services that run on separate systems. While each service can be studied as though it is its own mini-application, it is also necessary to study the distributed application as a whole. This requires new methodologies and tools, and is commonly performed using distributed tracing.
  • Distributed tracing involves logging information on each service request and then later combining this information for study.
  • A challenge with distributed tracing is the amount of log data generated: multiple entries for every application request. One solution is to perform head-based sampling where at the start (“head”) of the request, a decision is made whether to sample (“trace”) it: for example, to trace one in every ten thousand requests. This is sufficient to analyze the performance of the bulk of the requests, but it may make the analysis of intermittent errors or outliers difficult due to limited data. Some distributed tracers are tail-based, where all events are first captured and then a decision is made as to what to keep, perhaps based on latency and errors.
  • perf: CPU profiling, CPU flame graphs, syscall tracing
  • profile: CPU profiling using timed sampling
  • offcputime: Off-CPU profiling using scheduler tracing
  • strace: Syscall tracing
  • execsnoop: New process tracing
  • syscount: Syscall counting
  • bpftrace: Signal tracing, I/O profiling, lock analysis
  • perf(1) is the standard Linux profiler, a multi-tool with many uses.
  • perf record -F 49 -a -g -- sleep 30
  • profile(8) is a timer-based CPU profiler from BCC (Chapter 15). It uses BPF to reduce overhead by aggregating stack traces in kernel context, and only passes unique stacks and their counts to user space.
  • offcputime(8) is a BCC and bpftrace tool (Chapter 15) to summarize time spent by threads blocked and off-CPU, showing stack traces to explain why.
  • offcputime(8) is the counterpart to profile(8): between them, they show the entire time spent by threads on the system.

Generated Summary

# I. Off-CPU Analysis  
## A. Off-CPU analysis refers to the practice of analyzing the performance of threads or processes when they are not actively executing on a CPU, which is known as the off-CPU state. Threads or processes can enter the off-CPU state due to various reasons, such as disk I/O operations, network communication, lock contention (waiting to acquire a lock), explicit sleep or wait calls in the code, or the operating system's scheduler deciding to run another thread or process instead.  
## B. There are several approaches to profiling off-CPU behavior:  
1. Sampling involves periodically checking which threads are in the off-CPU state and collecting stack traces, which show the code path that led to the off-CPU state. While sampling can provide valuable insights, it has a cost: off-CPU sampling must cover the pool of threads rather than the pool of CPUs, so a system with thousands of mostly idle threads can see the sampling overhead grow by orders of magnitude (roughly 1,000x in the 10,000-thread example above), which can affect the application's performance.  
2. Scheduler tracing involves instrumenting the kernel's CPU scheduler to record when threads go off-CPU and for how long they remain in that state. This approach also involves recording off-CPU stack traces to understand the code paths that led to the off-CPU state. However, scheduler tracing can have significant overhead due to the high frequency of scheduler events that need to be recorded.  
3. Application instrumentation involves adding instrumentation code within the application itself to track common blocking operations, such as disk I/O operations. While this approach can provide detailed information about specific blocking operations, sampling and scheduler tracing are generally preferred because they work for all applications and can capture all off-CPU events, regardless of the cause.  
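
As an illustrative sketch of the scheduler-tracing approach, BCC's offcputime(8) (covered in the tools section below) records how long threads stay blocked and the stack traces that led there. Exact flags and install paths vary by distribution and BCC version, and flamegraph.pl comes from the separate FlameGraph repository (assumed here to be on the PATH):

```sh
# Off-CPU profiling via scheduler tracing: summarize blocked time with stacks
# for 10 seconds, system-wide (BCC offcputime).
sudo offcputime 10

# Folded output (-f) can be rendered as an off-CPU flame graph.
sudo offcputime -f 10 > out.offcpu.folded
flamegraph.pl --title "Off-CPU Time" < out.offcpu.folded > offcpu.svg
```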

# II. System Call (syscall) Analysis  
## A. Syscall instrumentation is a technique used for analyzing resource-based performance issues by identifying where time is spent within system calls, which are the interface between user-level applications and the operating system kernel. System call analysis can help in identifying performance bottlenecks related to file operations, network operations, and other kernel-level operations. There are several types of syscall analysis:  
1. New process tracing involves tracking when new processes are created by the application, which can be useful for understanding the application's behavior and identifying potential performance issues related to process creation.  
2. I/O profiling involves analyzing the time spent in I/O operations, such as reading from or writing to files or network sockets, which can help identify bottlenecks related to I/O operations.  
3. Kernel time analysis involves understanding the causes of high kernel CPU usage, which can help identify performance issues related to the operating system kernel itself.  
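
A hedged sketch of what these three kinds of syscall analysis can look like on Linux; tool availability and flags vary by version, and the PID shown is hypothetical:

```sh
# New process tracing: log execve(2) calls to catch short-lived processes (BCC execsnoop).
sudo execsnoop

# I/O profiling: histogram of read(2) return sizes, to spot large numbers of small I/O
# (bpftrace one-liner; requires bpftrace and syscall tracepoint support).
sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @bytes = hist(args->ret); }'

# Kernel time analysis: when %sys is high, find which syscalls dominate.
sudo syscount -d 10      # count syscalls system-wide for 10 seconds (BCC)
sudo strace -c -p 1234   # per-syscall counts and time for one (hypothetical) PID; higher overhead
```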

## B. Syscall analysis offers several advantages:  
1. System calls have well-documented APIs (man pages), which makes the analysis easier to understand and interpret.  
2. Syscall analysis is synchronous with the application, meaning that it provides the application code path context (stack traces) that led to the syscall, which can aid in understanding the root cause of performance issues.  
3. The data collected from syscall analysis can be represented using flame graphs, which are a visualization technique that makes it easier to understand where time is being spent in the application and the kernel.  
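
For example, because syscalls happen synchronously in the application's context, perf(1) can record stack traces at a syscall tracepoint and the result can be folded into a flame graph. The stackcollapse-perf.pl and flamegraph.pl scripts are from the separate FlameGraph repository and are assumed here to be on the PATH:

```sh
# Record stack traces at each write(2) entry, system-wide, for 10 seconds.
sudo perf record -e syscalls:sys_enter_write -a -g -- sleep 10

# Fold the stacks and render a flame graph of the code paths issuing the syscall.
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > write-syscalls.svg
```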

# III. USE Method  
## A. The USE method is a systematic approach to identifying and diagnosing performance issues by checking the utilization, saturation, and errors of hardware resources. This method helps in identifying resource bottlenecks that may be causing application performance issues by analyzing various hardware resources, such as CPU, memory, disk, and network.  
## B. By checking the utilization (how much of a resource is being used), saturation (the degree to which a resource has extra work queued that it cannot yet service), and errors (any errors or failures related to a resource), the USE method provides a structured way to identify resource bottlenecks that may be causing performance issues in an application.  
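
As a minimal sketch, a first pass at a USE-style check on Linux can lean on standard utilities; the commands below only cover CPU and disk, and exact columns differ across versions:

```sh
# Utilization: per-CPU usage breakdown.
mpstat -P ALL 1

# Saturation: the run-queue length in the "r" column is a rough CPU saturation signal.
vmstat 1

# Disk utilization (%util) and queueing/wait columns as disk saturation indicators.
iostat -xz 1

# Errors: recent kernel error messages that may point to failing hardware or drivers.
dmesg --level=err | tail
```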

# IV. Thread States  
## A. Instead of viewing threads as simply being in an on-CPU or off-CPU state, a more detailed analysis can be performed by considering nine different thread states: user, kernel, runnable, swapping, disk I/O, network I/O, sleeping, lock, and idle. These thread states provide a more granular view of what a thread is doing at any given time, which can aid in performance analysis and optimization efforts.  
## B. By reducing the time spent by threads in states other than idle, an application's request latency and throughput can be improved. Identifying and optimizing the code paths that lead to threads spending excessive time in states like disk I/O, network I/O, or lock contention can result in significant performance improvements.  
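
Some of these states can be glimpsed with standard tools; for instance, ps(1) shows per-thread scheduler state and the kernel wait channel, which hints at disk I/O, locks, or sleeps, though mapping threads to all nine states generally needs tracing tools such as offcputime(8). A rough, illustrative look:

```sh
# Per-thread (-L) state (STAT: R = running/runnable, D = uninterruptible sleep such as
# disk I/O, S = sleeping) plus the kernel wait channel (WCHAN).
ps -eLo pid,tid,stat,wchan:32,comm | head -20
```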

# V. Distributed Tracing  
## A. Distributed tracing is a technique used for studying the performance of distributed applications, which are applications made up of multiple services running on different machines or processes. Distributed tracing involves logging information about each service request as it flows through the distributed system, and then combining the logs from different services to obtain an end-to-end view of the request's lifecycle.  
## B. Implementing distributed tracing can be challenging due to the large volume of log data that can be generated, especially in high-traffic systems. To address this challenge, several techniques can be employed:  
1. Head-based sampling makes the decision at the start ("head") of a request whether to trace it, for example tracing one in every ten thousand requests, which greatly reduces the volume of trace data while still covering the bulk of requests.  
2. Tail-based sampling captures all events first and decides later which traces to keep, for example based on latency or errors, which makes it better suited to analyzing intermittent errors or performance outliers.  
3. There is therefore a trade-off between reducing the volume of trace data and retaining enough detail to analyze rare errors or outliers effectively.  

# VI. Tools  
## A. perf is a standard Linux profiler that includes multiple tools for performance analysis:  
1. The CPU profiling tool can be used to profile CPU usage and generate CPU flame graphs, which are a visualization technique that helps identify where time is being spent in the application.  
2. The syscall tracing tool can be used to trace system calls made by an application, which can help identify performance issues related to I/O operations or kernel-level bottlenecks.  
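
An illustrative perf(1) workflow for these two uses; the flame-graph scripts come from the separate FlameGraph repository, and the PID is hypothetical:

```sh
# CPU profiling: sample all CPUs at 49 Hertz with stack traces for 30 seconds,
# then fold the stacks and render a CPU flame graph.
sudo perf record -F 49 -a -g -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > cpu-flame.svg

# Syscall tracing: a strace-like (but lower-overhead) live view of one process.
sudo perf trace -p 1234
```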

## B. BCC (BPF Compiler Collection) and bpftrace are collections of tools based on eBPF (extended Berkeley Packet Filter) technology, which allows for efficient kernel tracing with minimal overhead:  
1. The profile tool is a timer-based CPU profiler that can be used to profile CPU usage and generate CPU flame graphs.  
2. The offcputime tool can summarize the time spent by threads in the off-CPU state, along with the corresponding stack traces, which can help identify the code paths that lead to threads being blocked or waiting.  
3. The strace tool provides system call tracing similar to perf's syscall tracing; note that strace is the traditional standalone Linux syscall tracer rather than a BCC or bpftrace tool, and it typically carries higher overhead.  
4. The execsnoop tool can be used to trace the creation of new processes, which can be useful for understanding the application's behavior and identifying potential performance issues related to process creation.  
5. The syscount tool can be used to count the occurrences of specific system calls, which can help identify hot spots or bottlenecks related to particular kernel-level operations.  
6. The bpftrace tool is a powerful scripting tool that can be used for tracing a wide range of events, including signals, I/O operations, lock contention, and more, which can aid in performance analysis and debugging.  
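
The profile and bpftrace tools were not shown in the earlier examples; these are hedged sample invocations, and flags may differ across BCC/bpftrace versions:

```sh
# profile: timer-based CPU profiling at 49 Hertz for 10 seconds (BCC).
sudo profile -F 49 10

# bpftrace one-liner: count signals generated, by sending process and signal number.
sudo bpftrace -e 'tracepoint:signal:signal_generate { @[comm, args->sig] = count(); }'
```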

Date
June 7, 2024