A Basic Exploration of Linux Perf Tools
I’ve been reading Brendan Gregg’s Systems Performance recently. It is a very thorough accounting of OS-level performance concerns, with a practical focus on Linux. At various points, it discusses common programs that do perf analysis on Linux, like perf, strace, syscount, etc.
It’s good info and all, but I found myself wanting to actually get my hands dirty and run the programs on a real system. Fortunately, I have a little Linux VPS running a Postgres database for some side projects. So, I decided to see if I could run these things and learn something, really anything, about these tools and how they could help me understand process behavior.
Part 1: Installing Everything
The first step to running something like perf/strace/syscount/etc is installing them. My mental model of this process is “ssh into the machine, and run some apt install commands.” This mostly worked. I ran the following commands (found by Googling around):
apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r) # installs perf
apt install strace # installs strace
apt install bpfcc-tools # installs syscount
Then, I rebooted the machine because the output of one of the installs said that was needed. My side projects can handle a couple minutes of downtime, so this was nbd.
I was able to verify that everything was installed correctly via which on the appropriate commands:
$ which perf # prints /usr/bin/perf
$ which strace # prints /usr/bin/strace
$ which syscount-bpfcc # prints /usr/sbin/syscount-bpfcc
Thankfully, not too many bumps here.
Part 2: Using the Tools
Syscount
I kicked off a batch processing job that uses the database pretty aggressively, and then tried to see what the tools said about it.
First, I used syscount since that seemed like the easiest thing to do (it’s installed as syscount-bpfcc by the bpfcc-tools package). Here’s the output:
$ syscount-bpfcc
# I let it run for ~5s, then Ctrl-C to get these results
SYSCALL COUNT
pread64 33651
fadvise64 15043
epoll_wait 2144
read 2034
lseek 1909
sendto 1870
write 1442
kill 1175
recvfrom 683
futex 593
The results are pretty unsurprising. I’m executing lots of queries that ask for significant swaths of a particular table, so the large number of pread64 calls (which read from a file descriptor at a given offset) makes a lot of sense. On the other hand, I was pretty unfamiliar with fadvise64. Evidently, it’s a system call that lets a user process tell the kernel what kind of read pattern it expects to use for a given file descriptor, which gives the kernel a chance to make performance-focused optimizations (like prefetching) so the process gets what it needs when it needs it. I’m a little surprised that there are this many calls to fadvise64, though: the counts work out to roughly one fadvise64 for every two pread64 calls, which seems to imply the process is either changing its read strategy constantly or sending a bunch of redundant advice. Tough to say, but an interesting thing.
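If I wanted to dig into that later, one low-effort idea (untested here) would be to ask strace to just tally those two calls for a single backend; the -c flag prints a summary table of counts and time spent when you detach. The PID below is a placeholder for an active Postgres backend:
# hypothetical follow-up: count pread64/fadvise64 for one backend, then Ctrl-C for the summary
$ strace -c -e trace=pread64,fadvise64 -p <postgres backend PID>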
The rest of the count data is fairly straightforward. Things that I noted:
- lseek seems pretty rare relative to the number of read calls, which could indicate that we are doing a good job reading data sequentially rather than jumping around within a given file descriptor.
- sendto is more common than recvfrom, which makes sense: for most queries, the bytes the server receives (the query text) are far fewer than the bytes it sends back (the results).
- kill shows up quite a few times. Since kill(2) is really just “send a signal,” my guess is this is Postgres processes signaling each other rather than anything literally being killed per request.
- futex is here, meaning there might be some amount of lock contention (hard to say if it’s a lot or a little from this data alone; the latency sketch after this list would be one way to check).
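On that last point, syscount can also report time spent per syscall rather than just counts, which would show whether those futex calls actually add up to meaningful wait time. I haven’t run this variant yet, so treat it as a sketch; the PID is a placeholder, and the flags are per the bcc docs (worth double-checking with --help on your version):
# count and total latency per syscall for one backend, over 10 seconds
$ syscount-bpfcc -L -p <postgres backend PID> -d 10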
Perf
Ok, let’s do perf. Googling around, it looks like I want to attach perf to my Postgres database while it’s running. So, I need a thread ID for one of the threads being used by postgres. I pulled that using ps -L ax | fgrep postgres and then kicked off perf:
$ perf stat -t 24424 sleep 10
Performance counter stats for thread id '25301':
2789.15 msec task-clock # 0.279 CPUs utilized
7770 context-switches # 2.786 K/sec
15 cpu-migrations # 5.378 /sec
251 page-faults # 89.992 /sec
4790990155 cycles # 1.718 GHz
45983405 stalled-cycles-frontend # 0.96% frontend cycles idle
144309824 stalled-cycles-backend # 3.01% backend cycles idle
6824593854 instructions # 1.42 insn per cycle
# 0.02 stalled cycles per insn
1231079739 branches # 441.382 M/sec
249247 branch-misses # 0.02% of all branches
10.001295735 seconds time elapsed
Probably a big wall of stuff on the blog, but it’s nicely formatted in my terminal. Not going for style points here, so, it is what it is.
Ultimately, not super sure what to make of these numbers. The only one that kind of stands out is the page-faults one. Seems like 251 might be kind of high? Tough to say without more experience/context. One nuance worth noting: not all page faults are equally bad. A minor fault (the page is already in memory and just needs to be mapped into the process) is pretty cheap, while a major fault blocks the process while the kernel fetches the page from disk. The page-faults counter above lumps both together.
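If I wanted to split that out, perf can count the two kinds of faults separately as software events. A quick sketch (the thread ID is a placeholder, and I haven’t run this exact command):
$ perf stat -e page-faults,minor-faults,major-faults -t <postgres thread ID> sleep 10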
With that said, my recollection from the book is that perf is super useful for generating flame graphs that show where an application is spending time. I don’t have the time/ambition right now to build one of those, but that would be a good future experiment. My goal for now was to spend an hour exploring and writing about it, and my time is almost up.
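For future me, the usual recipe (per Brendan Gregg’s FlameGraph repo) looks roughly like this. I haven’t run it on this box, so treat it as a sketch rather than something verified here:
# sample stacks from one Postgres process at 99 Hz for 30 seconds
$ perf record -F 99 -g -p <postgres PID> -- sleep 30
# fold the stacks and render an SVG with the FlameGraph scripts (cloned separately)
$ perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg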
Strace
Last, but not least, strace. This tool will show all the system calls as they happen. I only briefly explored this one, but the output was interesting.
First, I spent a bit of time running ps -L ax | fgrep postgres to get a PID for an actively running query. Then I attached strace using:
$ strace -p [postgres PID]
The output was a firehose of system calls as they happened. I didn’t bother to route it to a file, so it just spewed all over the terminal. Not the best (I’ve sketched a tidier invocation at the end of this section). One thing that was cool was that I could pretty clearly see the pattern of system calls that syscount showed above (lots of pread64, lots of fadvise64). Here’s a sample:
fadvise64(86, 980475904, 8192, POSIX_FADV_WILLNEED) = 0
pread64(86, "\313\1\0\0\250A\314\274\0\0\4\0,\0\270\3\0 \4 \0\0\0\0x\232\20\v\320\224P\v"..., 8192, 980475904) = 8192
fadvise64(86, 980484096, 8192, POSIX_FADV_WILLNEED) = 0
pread64(86, "\313\1\0\0h4\315\274\0\0\4\0,\0(\3\0 \4 \0\0\0\08\232\220\vp\224\220\v"..., 8192, 980484096) = 8192
fadvise64(86, 980492288, 8192, POSIX_FADV_WILLNEED) = 0
pread64(86, "\313\1\0\0\260=\316\274\0\0\4\0,\0\30\3\0 \4 \0\0\0\0000\232\240\v\210\224P\v"..., 8192, 980492288) = 8192
fadvise64(86, 980500480, 8192, POSIX_FADV_WILLNEED) = 0
pread64(86, "\313\1\0\0\330\20\317\274\0\0\5\0000\0\10\3\0 \4 \0\0\0\0\360\224P\vH\217P\v"..., 8192, 980500480) = 8192
fadvise64(86, 980508672, 8192, POSIX_FADV_WILLNEED) = 0
It’s cool to see the input/output values. Looks like postgres is almost always sending the POSIX_FADV_WILLNEED flag to the OS in fadvise64. According to the online man page, this flag means “The specified data will be accessed in the near future.” Cool to get this level of visibility into the system.
Another cool thing came out of the input/output values. Stuff like this would show up:
recvfrom(9, "This listing is more expensive t"..., 8192, 0, NULL, NULL) = 1420
And that string is straight out of my application!
I could also see the size of I/O operations:
write(2, "\0\0\367\17\207\4\0\0\20units_compared_availabl"..., 4096) = 4096
4KB, iirc, is the standard page size in Linux. Cool to see it show up here.
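As promised above, here’s roughly what a tidier strace invocation could look like: log to a file, follow child processes, timestamp each line, and only trace the calls I actually care about. The PID and output path are placeholders, and I haven’t run this exact command, but the flags are all standard strace options:
# follow forks (-f), microsecond timestamps (-tt), filter syscalls (-e), write to a file (-o)
$ strace -f -tt -e trace=pread64,fadvise64,sendto,recvfrom -o /tmp/postgres.strace -p <postgres PID>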
Conclusions
Alright, that’s all I have time for now. Glad I was able to get some basic practice with these tools. Honestly, pretty surprised how easy it was to get set up and get (admittedly small) insight into how my database does its thing. Good stuff.