My most recent post was a story about a successful performance investigation. Unfortunately, that happy post was intentionally simplified for the audience of my company's blog. The reality is that I frequently find performance investigations on Linux frustrating due to the lack of adequate tooling. In this blog post, I document some recent pain points and then speculate that such issues are endemic to Linux.
Some pain points
Compare how we answer the following questions [1] (in production) in Linux and FreeBSD/Solaris:
Why am I going off CPU?
- FreeBSD/Solaris:
# dtrace -n 'sched:::off-cpu /pid==$target/ { self->ts = timestamp; } sched:::on-cpu /self->ts/ { @[stack(), ustack()] = quantize(timestamp - self->ts); self->ts = 0; }' -p $PID
- Linux:
# perf record -g -e 'sched:sched_switch' -e 'sched:sched_stat_sleep' -e 'sched:sched_stat_blocked' -p $PID
Then post-process with perf inject -s (if you have it) or my 200-line stackcollapse-perf-sched.awk script. Unfortunately, tracing all scheduler events is very high overhead in perf, and the lack of in-kernel aggregation means that events will probably be dropped if load is high. Even more alarmingly, there appear to be bugs in perf that prevent it from reliably getting consistent traces (even with large trace buffers), causing it to produce empty perf.data files with error messages of the form:
0x952790 [0x736d]: failed to process type: 3410
Since this failure is not deterministic, re-executing perf record will eventually succeed; however, by then it can sometimes be hard to catch the pathology.
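For reference, here is a rough sketch of the two post-processing routes side by side. The output file name perf.data.inj is just one I picked, the -s/--sched-stat mode of perf inject is the piece your build may be missing, and the second line assumes my awk script reads perf script text on stdin:
# perf inject -s -i perf.data -o perf.data.inj && perf report -i perf.data.inj --stdio
# perf script -i perf.data | awk -f stackcollapse-perf-sched.awk > off-cpu.folded
The folded output from the second route can then be fed through flamegraph.pl to get an off-CPU flame graph.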
Why am I calling malloc?
- FreeBSD/Solaris:
# dtrace -n 'pid$target::malloc:entry { @[ustack()] = sum(arg0); }' -p $PID
- Linux: It is possible you could write a gdb script, although I'd be scared to do this in production. Another alternative is to use the uprobe interface:
# perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc 'size=%di'
# perf record -e probe_libc:malloc -g --pid $PID
Aside from the now-familiar caveat about the overhead of not having in-kernel aggregation, I've found this interface to be particularly brittle. Even scarier, Brendan Gregg also warns about issues that would make it unsuitable for production environments (I haven't personally seen this when using perf yet): he "frequently hit issues where the target process either crashes or enters an endless spin loop".
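Once the probes have been recorded, the aggregation still has to happen offline. A rough sketch of what I do next, assuming the probe name probe_libc:malloc that perf probe printed above:
# perf report --stdio -n
# perf probe --del probe_libc:malloc
perf report folds the samples by call chain, which approximates the by-ustack() aggregation dtrace performs in the kernel, just after the fact and at higher cost; the second command cleans up the probe when you are done.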
What files am I reading?
- FreeBSD/Solaris:
# dtrace -n 'syscall::read:entry { @[fds[arg0].fi_pathname] = sum(arg2); }' -p $PID
- Linux: strace used to be the traditional solution, although it has high overhead. With
# perf trace -e read -p $PID
you can now save the data and post-process it (although I must repeat the caveat about the overhead of not having in-kernel aggregation).
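If you do fall back to strace, a throwaway per-fd summary can be hacked together in the shell. This is very much a sketch: it leans on strace's human-readable output format, it silently drops reads that get split across resumed lines under -f, and you still have to map the fd numbers back to pathnames via /proc while the process is alive:
# timeout 10 strace -f -e trace=read -p $PID 2>&1 | awk '/read\(/ { fd = $0; sub(/^.*read\(/, "", fd); sub(/,.*/, "", fd); n = $NF; if (n ~ /^[0-9]+$/) bytes[fd] += n } END { for (fd in bytes) print "fd " fd ": " bytes[fd] " bytes" }'
# ls -l /proc/$PID/fd
It is exactly the kind of fragile post-processing that the in-kernel fds[arg0].fi_pathname aggregation above makes unnecessary.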
The problem with stack traces
Almost all of the questions I have are of the form "Why is my application X?", and are answered by looking at userspace stack traces. Unfortunately, this is almost impossible to do in my Linux distribution. Ubuntu compiles libc with -fomit-frame-pointer (the gcc default for -O2 and above), which stymies perf's ability to walk stacks that go through the most commonly used system library. Worse, it is not clear that the perf on Ubuntu properly supports walking dwarf unwind information. I've used --call-graph=dwarf only moderately successfully, as it appears to lose frames compared to the stacks in gdb, etc.
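To see the frame-pointer half of this concretely, here is a quick experiment; it is only a sketch (fp.c and add are throwaway names, and the exact disassembly varies by gcc version and architecture):
# printf 'int add(int a, int b) { return a + b; }\n' > fp.c
# gcc -O2 -c fp.c && objdump -d fp.o | grep -A4 '<add>:'
# gcc -O2 -fno-omit-frame-pointer -c fp.c && objdump -d fp.o | grep -A4 '<add>:'
With plain -O2 there is no push %rbp / mov %rsp,%rbp prologue for perf's frame-pointer walker to follow; adding -fno-omit-frame-pointer brings the prologue back at the cost of one register.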
When Linux applications omit frame pointers, they are eschewing future visibility for the sake of an immediate micro-optimization. Some applications choose a compromise of merely omitting frame pointers for leaf functions (causing the second-to-last frame of every stack to be omitted). In either case, aggressive function inlining causes the resulting stack trace to be an approximation of the logical stack trace. As a stark contrast, the Illumos kernel forces frame pointers and prohibits small functions from being automatically inlined in order to guarantee precise stack traces. Since the most convenient place to add dynamic tracepoints via dtrace is at function call boundaries, this also ensures that there will be a plethora of options if it ever becomes necessary to debug some new component.
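On an Illumos or Solaris box, those function-boundary options are directly visible: fbt publishes an entry and a return probe for nearly every kernel function. As a small illustration (the zfs module and dbuf_read are just convenient examples; any module and function pair works the same way):
# dtrace -l -n 'fbt:zfs::entry' | head
# dtrace -n 'fbt:zfs:dbuf_read:entry { @[stack()] = count(); }'
The first command lists the ready-made entry probes for one module; the second aggregates the kernel stacks leading into one of those functions until you interrupt it with Ctrl-C.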
Introspection in Linux
It seems to me that the Linux community does not support the same level of kernel/system introspection as Solaris, and I think this issue is cultural, not technical. Take a look at a famous quote [2] from Linus himself in 2000 on why he eschews the use of a kernel debugger:
I don't think kernel development should be "easy". ... I do not think that extra visibility into the system is necessarily a good thing.
In Linux, the kernel is a complicated, mysterious system that is only intended to be fully understood by the authors, so tools like kernel debuggers are not necessary.
Compare to a recent quote from an equally opinionated former Solaris kernel engineer, Bryan Cantrill, on why it is important to have observability into Docker containers:
Don't just reboot your PC, goddammit, debug it! Come on, you're an educated person, right? Or at least you want to act like one around other educated people!
In that talk, Bryan emphatically calls for greater observability into a Linux subsystem so that complicated problems in production can be diagnosed and fixed. He condemns the engineering anti-pattern of only adding observability after something has gone wrong, pointing out that this puts the engineer in the position of trying to reproduce a potentially transient issue. Flexible dynamic tracing tools with the power to deeply introspect the kernel, like dtrace, are necessary to observe and debug such issues in production.
A question-driven methodology
The thing about dtrace is that it is so powerful and so flexible that it encourages a question-driven methodology rather than a tool-driven methodology. Rather than merely trying to infer problems from some sar utility, we ask simple and specific questions to root-cause an issue. The common tools (iostat, top, sar, etc.) are useful as a means to prompt these questions but are rarely useful on their own. When Brendan Gregg advocates for the USE method or the TSA method, he is providing frameworks for converting these sar metrics into useful questions.
I have a story that illustrates this point quite well. A couple of years ago, I did a mock technical interview with Adam Leventhal in which he walked me through a real performance investigation he had previously done. For the interview, he described a faulty server to me and asked me to debug it using only a "magical oracle" that could answer any question about the system -- over the course of the interview, he explained how he had used dtrace to answer every question we had put to the "oracle". By looking beyond the simple sar metrics and asking focused questions, we were able to "resolve" in an hour an issue that would have been nigh impossible to understand otherwise.
Linux supports some dynamic tracing tools that enable these kinds of investigations, but (at least until recently) they have remained second-class citizens. perf and ftrace feel brittle and limited compared to the deep visibility of dtrace. In contrast, dynamic tracing is so valuable to Solaris that the kernel is built specifically with dtrace in mind. High-fidelity dynamic tracing of production systems is too important to be approximated by separate tools; it is baked into every layer of the system.
What can we do about this?
The question-driven methodology that I've learned from my mentors in the Solaris community has proven challenging to carry over as I work more within my Linux-based company. I'm slowly learning to use perf to answer my questions, and I'm slowly changing the culture at my current company. Some highlights:
- We compile with -fno-omit-frame-pointer, which increases the fidelity of the CPU stacks we collect from perf.
- We're slowly cutting back on the amount of unnecessary function inlining, and we have simple tools in place to detect egregious examples of it.
- We have hooks to increase debuggability (via recording JIT-ed symbols for perf, etc.) that can be dynamically enabled at run time for ad-hoc investigations. Previously, they required a server restart.
- Almost all engineers at the company use flame graphs to do performance investigations (the basic pipeline is sketched after this list). Flame graphs quickly answer the first question of "where is my CPU time being spent?", and it has become uncommon to see a performance regression bug filed without an attached flame graph.
- The most recent iteration of our performance tests was actively benchmarked with flame graphs as the tests were written. Through this process, we caught and corrected several tests that were exercising the wrong path.
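For reference, the flame graph pipeline behind those last two bullets is the standard perf plus FlameGraph one; a minimal sketch, assuming Brendan Gregg's stackcollapse-perf.pl and flamegraph.pl are on the PATH and sampling CPU stacks at 99 Hz for 30 seconds:
# perf record -F 99 -g -p $PID -- sleep 30
# perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg
Sampling at 99 Hz rather than 100 Hz avoids running in lockstep with timer-driven work, and the resulting stacks are only as trustworthy as the frame pointers discussed above.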
[1] I acknowledge that these questions are particularly painful to answer with perf, but I am not nearly as frustrated by the awkwardness of doing so with perf as I am scared of the instability in perf when answering them. I want my system introspection tool to feel rock solid, and I don't quite have that confidence in perf.
[2] I acknowledge that both quotes are taken only slightly out of context to get the most exciting blurb, but I believe I have honestly captured the sentiment of their authors in my analysis.