A Basic Exploration of Linux Perf Tools

I’ve been reading Brendan Gregg’s Systems Performance recently. It’s a very thorough accounting of OS-level performance concerns, with a practical focus on Linux. At various points, it discusses common tools for performance analysis on Linux, like perf, strace, syscount, etc.

It’s good info and all, but I found myself wanting to actually get my hands dirty and run the programs on a real system. Fortunately, I have a little Linux VPS running a Postgres database for some side projects. So, I decided to see if I could run these things and learn something, really anything, about these tools and how they could help me understand process behavior.

Part 1: Installing Everything

The first step to running something like perf/strace/syscount/etc is installing them. My mental model of this process was “ssh into the machine and run some apt install commands.” This mostly worked. I ran the following commands (found by Googling around):

apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r)  # installs perf  
apt install strace  # installs strace  
apt install bpfcc-tools  # installs syscount  

Then, I rebooted the machine because the output of one of those commands said a reboot was needed. My side projects can handle a couple minutes of downtime, so this was no big deal.

I was able to verify that everything was installed correctly via which on the appropriate commands:

$ which perf  # prints /usr/bin/perf  
$ which strace  # prints /usr/bin/strace  
$ which syscount-bpfcc  # prints /usr/sbin/syscount-bpfcc  
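
For good measure, asking each tool for its version confirms it actually runs, not just that a binary exists on the PATH. (I believe the /usr/bin/perf from linux-tools-common is a wrapper that errors out if the kernel-specific linux-tools package is missing, so this catches that case too.)

$ perf --version     # prints the perf version  
$ strace -V          # prints the strace version  
$ syscount-bpfcc -h  # prints the tool's usage/help text  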

Thankfully, not too many bumps here.

Part 2: Using the Tools

Syscount

I kicked off a batch processing job that uses the database pretty aggressively, and then tried to see what the tools said about it.

First, I used syscount since that seemed like the easiest thing to do (on this machine the command is installed as syscount-bpfcc). Here’s the output from running it:

$ syscount-bpfcc  
# I let it run for ~5s, then Ctrl-C to get these results  
SYSCALL                   COUNT  
pread64                   33651  
fadvise64                 15043  
epoll_wait                 2144  
read                       2034  
lseek                      1909  
sendto                     1870  
write                      1442  
kill                       1175  
recvfrom                    683  
futex                       593  

The results are pretty unsurprising. I’m executing lots of queries that ask for significant swaths of a particular table, so the large number of pread64 calls (which read from a file descriptor at a given offset) makes a lot of sense. On the other hand, I was pretty unfamiliar with fadvise64. Evidently, it’s a system call that lets a user process tell the kernel what kind of access pattern it expects for a given file descriptor, which gives the kernel a chance to prefetch or otherwise prepare so the data is ready when the process asks for it. I’m a little surprised there are this many calls to fadvise64, though: that’s roughly one for every two pread64 calls. It seems to imply that the process is either changing its read strategy constantly, or just issuing a bunch of redundant advice. Tough to say, but an interesting thing.

The rest of the count data is fairly straightforward. Things that I noted:

  • lseek seems pretty rare relative to the number of read calls, which seems like it could indicate that we are doing a good job reading data sequentially vs jumping around in a given file descriptor.
  • sendto is more common than recvfrom, which makes sense: for most queries, the bytes coming in (the query text) are far fewer than the bytes going back out (the results), so I’d expect more send traffic than receive traffic.
  • kill shows up quite a few times. My first thought was that Postgres spawns a process per request and kills it when done, but kill(2) really just sends a signal, so it could also be backends signaling each other rather than anything being terminated.
  • futex is here, meaning there might be some amount of lock contention (hard to say if it’s a lot or a little from this data).
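
If I wanted to chase either of the open questions above (are the fadvise64 calls wasted work? is futex real contention?), syscount has flags that should help: it can filter to a single PID and report time spent per syscall rather than just counts. A sketch of what I’d try, with the Postgres backend PID and duration as placeholders:

$ syscount-bpfcc -p <postgres PID> -d 10      # counts for just that process, over 10 seconds  
$ syscount-bpfcc -p <postgres PID> -d 10 -L   # same, but report total time spent per syscall  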

Perf

Ok, let’s do perf. Googling around, it looks like I want to attach perf to my Postgres database while it’s running. So, I need a thread ID for one of the threads being used by postgres. I pulled that using ps -L ax | fgrep postgres and then kicked off perf:

$ perf stat -t 24424 sleep 10  

 Performance counter stats for thread id '25301':  

           2789.15 msec task-clock                #    0.279 CPUs utilized          
              7770      context-switches          #    2.786 K/sec                  
                15      cpu-migrations            #    5.378 /sec                   
               251      page-faults               #   89.992 /sec                   
        4790990155      cycles                    #    1.718 GHz                    
          45983405      stalled-cycles-frontend   #    0.96% frontend cycles idle   
         144309824      stalled-cycles-backend    #    3.01% backend cycles idle    
        6824593854      instructions              #    1.42  insn per cycle         
                                                  #    0.02  stalled cycles per insn  
        1231079739      branches                  #  441.382 M/sec                  
            249247      branch-misses             #    0.02% of all branches        

      10.001295735 seconds time elapsed  

This is probably a big wall of stuff on the blog, but it’s nicely formatted in my terminal. Not going for style points here, so it is what it is.

Ultimately, not super sure what to make of these numbers. The only one that kind of stands out is the page-faults one. Seems like 251 might be kind of high? Tough to say without more experience/context. It’s also worth noting that this counter lumps together minor faults (resolved from memory, relatively cheap) and major faults (the expensive case, where the process blocks while the kernel reads the page in from disk).
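
One way to tell whether these are the expensive kind would be to count minor and major faults separately; perf exposes both as software events. A sketch of what I’d run next time, with the thread ID as a placeholder:

$ perf stat -e page-faults,minor-faults,major-faults -t <postgres TID> sleep 10  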

With that said, my recollection from the book is that perf is super useful for generating flame graphs that show where an application is spending time. I don’t have the time/ambition right now to build one of those, but that would be a good future experiment. My goal for now was to spend an hour exploring and writing about it, and my time is almost up.
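
For future me: my understanding is that the usual recipe is to sample stacks with perf record and feed the result through Brendan Gregg’s FlameGraph scripts. Roughly (untested on this box; the PID, sample rate, and duration are placeholders):

$ perf record -F 99 -g -p <postgres PID> -- sleep 30   # sample stacks at 99 Hz for 30s  
$ perf script > out.perf  
$ git clone https://github.com/brendangregg/FlameGraph  
$ ./FlameGraph/stackcollapse-perf.pl out.perf | ./FlameGraph/flamegraph.pl > flame.svg  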

Strace

Last, but not least, strace. This tool prints every system call a process makes, as it happens. I only briefly explored this one, but the output was interesting.

First, I spent a bit of time running ps -L ax | fgrep postgres to get a PID for an actively running query. Then I attached strace using:

$ strace -p [postgres PID]  

The output was a firehose of system calls as they happened. I didn’t bother to route it to a file, so it just spewed all over the terminal. Not the best (I sketch a tidier invocation at the end of this section). One thing that was cool was that I could pretty clearly see the pattern of system calls that showed up in syscount above (lots of pread64, lots of fadvise64). Here’s a sample:

fadvise64(86, 980475904, 8192, POSIX_FADV_WILLNEED) = 0  
pread64(86, "\313\1\0\0\250A\314\274\0\0\4\0,\0\270\3\0 \4 \0\0\0\0x\232\20\v\320\224P\v"..., 8192, 980475904) = 8192  
fadvise64(86, 980484096, 8192, POSIX_FADV_WILLNEED) = 0  
pread64(86, "\313\1\0\0h4\315\274\0\0\4\0,\0(\3\0 \4 \0\0\0\08\232\220\vp\224\220\v"..., 8192, 980484096) = 8192  
fadvise64(86, 980492288, 8192, POSIX_FADV_WILLNEED) = 0  
pread64(86, "\313\1\0\0\260=\316\274\0\0\4\0,\0\30\3\0 \4 \0\0\0\0000\232\240\v\210\224P\v"..., 8192, 980492288) = 8192  
fadvise64(86, 980500480, 8192, POSIX_FADV_WILLNEED) = 0  
pread64(86, "\313\1\0\0\330\20\317\274\0\0\5\0000\0\10\3\0 \4 \0\0\0\0\360\224P\vH\217P\v"..., 8192, 980500480) = 8192  
fadvise64(86, 980508672, 8192, POSIX_FADV_WILLNEED) = 0  

It’s cool to see the input/output values. Looks like Postgres is almost always passing the POSIX_FADV_WILLNEED flag to fadvise64. According to the man page, this flag means “the specified data will be accessed in the near future.” Cool to get this level of visibility into the system.

Another cool thing came out of the input/output values. Stuff like this would show up:

recvfrom(9, "This listing is more expensive t"..., 8192, 0, NULL, NULL) = 1420  

And that string is straight out of my application!
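
Side note: strace truncates printed strings to 32 characters by default, which is why the text gets cut off mid-word. If I wanted to see more of the query traffic, the -s flag raises that limit; something like:

$ strace -p <postgres PID> -e trace=recvfrom -s 200  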

I could also see the size of I/O operations:

write(2, "\0\0\367\17\207\4\0\0\20units_compared_availabl"..., 4096) = 4096  

4 KB is the standard page size on Linux, so that write lines up with exactly one page. Cool to see it show up here.
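
Since the raw output really was a firehose, next time I’d write it to a file and narrow it to the calls I care about, or just ask strace for a summary table (which ends up looking a lot like syscount). A sketch, with the PID as a placeholder:

$ strace -p <postgres PID> -o pg.trace -e trace=pread64,fadvise64 -T   # log only these calls, with time spent in each  
$ strace -c -p <postgres PID>   # run for a bit, then Ctrl-C for a per-syscall count/time summary  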

Conclusions

Alright, that’s all I have time for now. Glad I was able to get some basic practice with these tools. Honestly, pretty surprised how easy it was to get set up and get (admittedly small) insight into how my database does its thing. Good stuff.


June 8, 2024