Introduction to Using Profiling Tools

Performance tools

Frequently, we need to identify the slow portions of our programs so we can improve performance. A number of tools are available to profile programs and identify how much time is spent where. The most common of these sample the program periodically, so using them typically involves one phase spent recording data and a later phase for analyzing it. We will use two common tools to analyze a simple program: Google pprof and Linux perf.

Google pprof

Google pprof is a tool available as part of the Google perftools package. It is used with libprofiler, a sampling-based profiler that is linked into your binary. There are three steps to using pprof: linking libprofiler into the binary, generating profile output, and analyzing the output. The following links a binary with libprofiler:

% gcc main.c -lprofiler

For any binary linked with libprofiler, setting the environment variable CPUPROFILE enables profiling and specifies the output file. The following command runs ./a.out and writes profiling data to out.prof:

% CPUPROFILE=out.prof ./a.out

We can now analyze this file using pprof. Below, we output the sample counts for all the functions in a.out. The columns are: samples in the function itself, that count as a percentage, the running cumulative percentage, samples in the function plus its callees, and that count as a percentage:

% pprof --text ./a.out out.prof
... <snip> ...
Total: 311 samples
  144  46.3%  46.3%      144  46.3% bar
   95  30.5%  76.8%       95  30.5% foo
   72  23.2% 100.0%      311 100.0% baz
    0   0.0% 100.0%      311 100.0% __libc_start_main
    0   0.0% 100.0%      311 100.0% _start
    0   0.0% 100.0%      311 100.0% main

See the full pprof documentation for more details.

Linux perf

On Linux, the perf system is a powerful tool for analyzing program and system performance. It provides some nice abstractions over tracking hardware counters on different CPUs, and it defines a number of events to be tracked and recorded. Run perf list to see the events available on your system.

To use perf, you run:

% perf stat ./a.out
 Performance counter stats for './a.out':

    3121.725439 task-clock                #    0.997 CPUs utilized          
             11 context-switches          #    0.000 M/sec                  
              7 CPU-migrations            #    0.000 M/sec                  
            308 page-faults               #    0.000 M/sec                  
  9,121,960,506 cycles                    #    2.922 GHz                     [83.32%]
  5,213,187,548 stalled-cycles-frontend   #   57.15% frontend cycles idle    [83.32%]
    292,952,401 stalled-cycles-backend    #    3.21% backend  cycles idle    [66.68%]
  5,215,556,086 instructions              #    0.57  insns per cycle        
                                          #    1.00  stalled cycles per insn [83.35%]
  1,303,060,483 branches                  #  417.417 M/sec                   [83.35%]
         66,559 branch-misses             #    0.01% of all branches         [83.33%]

    3.132028707 seconds time elapsed

In addition to perf stat, there are quite a few other ways to use perf. Run

% perf

to see a list of the commands (you might want to look into perf record and perf annotate).

For an example of perf being used in real life, see this excellent analysis of a string comparison bottleneck in git gc.

Our Investigation

try_perf.c is a C program that counts the number of even values in an array of random numbers. It runs 8 threads that all increment a global counter every time they see an even number. We compile it with -lprofiler so we can generate profiling output to examine.

% gcc try_perf.c -g -lprofiler -lpthread
% CPUPROFILE=a.out.prof ./a.out --num_threads=8

We run pprof and get the source code annotated with the number of samples that hit each line during the trace (result below trimmed for brevity). The first column counts samples in that line itself; the second also includes time spent in its callees.

% pprof --list=thread_scan a.out a.out.prof
 ... <snip> ...
  .      .   60: void* thread_scan(void* void_arg) {
  .      .   61:   // TODO(awreece) Copy locally so dont interfere with each other.
  .      .   62:   thread_arg_t* args = (thread_arg_t*) void_arg;
  .      .   63:   size_t i;
  .      .   64: 
303    323   65:   for (i = 0; i < args->size; i++) {
  6     10   66:     uint32_t val = args->input[i];
  6     15   67:     if (val % 2 == 0) {
  9    300   68:       __sync_fetch_and_add(args->num_evens, 1);
  .      .   69:     }
  .      .   70:   }
  .      .   71: }

The output above is actually misleading: if you look at the assembly (shown below), the instruction immediately after the atomic instruction (the addq $0x1,-0x8(%rbp) after the lock addq $0x1,(%rax)) gets excess hits, which are attributed to the for loop when they should probably count towards the atomic instruction. Sampling profilers often exhibit this kind of skid: the reported instruction pointer can land a few instructions after the one that was actually executing.

% pprof --disas=thread_scan a.out a.out.prof
 ... <snip> ...
  9    300    68: __sync_fetch_and_add(args->num_evens, 1);
  4      5      4008a4: mov    -0x10(%rbp),%rax
  1      5      4008a8: mov    0x10(%rax),%rax
  4    290      4008ac: lock addq $0x1,(%rax)
303    320    65: for (i = 0; i < arg->size; i++) {
286    287      4008b1: addq   $0x1,-0x8(%rbp)
  1      2      4008b6: mov    -0x10(%rbp),%rax

Hrm. Why are we spending a lot of time in lock addq $0x1,(%rax)?

To understand this, we will use perf. Run:

% perf stat ./a.out
 Performance counter stats for './a.out':

    5793.307952 task-clock                #    2.157 CPUs utilized          
            589 context-switches          #    0.000 M/sec                  
             11 CPU-migrations            #    0.000 M/sec                  
          1,974 page-faults               #    0.000 M/sec                  
 16,378,904,731 cycles                    #    2.827 GHz                     [83.37%]
 10,407,719,950 stalled-cycles-frontend   #   63.54% frontend cycles idle    [83.38%]
  8,213,634,448 stalled-cycles-backend    #   50.15% backend  cycles idle    [66.65%]
 12,070,323,273 instructions              #    0.74  insns per cycle        
                                          #    0.86  stalled cycles per insn [83.32%]
  2,428,236,441 branches                  #  419.145 M/sec                   [83.31%]
     67,558,697 branch-misses             #    2.78% of all branches         [83.35%]

    2.685598183 seconds time elapsed

Wow, that's a lot of stalled cycles! The 8 threads are sharing the same counter, generating a lot of coherence traffic between their caches. We modify the program so each thread uses its own counter, and then aggregate at the end (this also means we no longer need the atomic instruction).

size_t counts[nthreads];
size_t num_evens = 0;

for (i = 0; i < nthreads; i++) {
  counts[i] = 0;
  args[i].num_evens = &counts[i];
  args[i].input = &inarray[i * (size / nthreads)];
  args[i].size = size / nthreads;
  pthread_create(&threads[i], NULL, thread_scan, &args[i]);
}
for (i = 0; i < nthreads; i++) {
  pthread_join(threads[i], NULL);
  num_evens += counts[i];
}

But that didn't seem to help at all! We still spend most of our time on the increment, even though we aren't using an atomic instruction:

% pprof --list=thread_scan a.out out.prof
... <snip> ...
  .      .   60: void* thread_scan(void* void_arg) {
  .      .   61:   // TODO(awreece) Copy locally so dont interfere with each other.
  .      .   62:   thread_arg_t* args = (thread_arg_t*) void_arg;
  .      .   63:   size_t i;
  .      .   64: 
 22     44   65:   for (i = 0; i < args->size; i++) {
 14     25   66:     uint32_t val = args->input[i];
 12     33   67:     if (val % 2 == 0) {
157    308   68:       *(args->num_evens) += 1;
  .      .   69:     }
  .      .   70:   }
  .      .   71: }

Why could this be? Let's run perf stat again and see:

% perf stat ./a.out
 Performance counter stats for './a.out':

      4372.474270 task-clock                #    1.882 CPUs utilized          
              385 context-switches          #    0.000 M/sec                  
                9 CPU-migrations            #    0.000 M/sec                  
            1,135 page-faults               #    0.000 M/sec                  
   12,411,517,583 cycles                    #    2.839 GHz                     [83.26%]
    6,270,257,100 stalled-cycles-frontend   #   50.52% frontend cycles idle    [83.33%]
    4,291,405,838 stalled-cycles-backend    #   34.58% backend  cycles idle    [66.78%]
   12,306,996,386 instructions              #    0.99  insns per cycle        
                                            #    0.51  stalled cycles per insn [83.39%]
    2,420,224,187 branches                  #  553.514 M/sec                   [83.40%]
       69,182,448 branch-misses             #    2.86% of all branches         [83.30%]

      2.323372370 seconds time elapsed

What is going on now? We still have a lot of stalled cycles, even though each thread has its own counter. Look at how those counters are declared:

size_t counts[nthreads];

Oh, they all live in one contiguous array, so they land on the same cache line - we're experiencing false sharing: each thread's write invalidates the line in every other core's cache, even though no two threads touch the same counter. Let's use a counter in each thread's own stack frame instead:

void* thread_scan(void* void_arg) {
  thread_arg_t* args = (thread_arg_t*) void_arg;
  size_t i;
  size_t num_evens = 0;

  for (i = 0; i < args->size; i++) {
    uint32_t val = args->input[i];
    if (val % 2 == 0) {
      num_evens++;
    }
  }
  // Smuggle the count out through the thread's exit value.
  return (void*) num_evens;
}

... <snip> ...

for (i = 0; i < nthreads; i++) {
  void* count;
  pthread_join(threads[i], &count);
  num_evens += (size_t) count;
}

And then look at the profile:

% pprof --list=thread_scan a.out out.prof
... <snip> ...
  .      .   60: void* thread_scan(void* void_arg) {
  .      .   61:   // TODO(awreece) Copy locally so dont interfere with each other.
  .      .   62:   thread_arg_t* args = (thread_arg_t*) void_arg;
  .      .   63:   size_t i;
  .      .   64:   size_t num_evens;
  .      .   65: 
144    292   66:   for (i = 0; i < args->size; i++) {
 14     25   67:     uint32_t val = args->input[i];
 12     33   68:     if (val % 2 == 0) {
 13     16   69:       num_evens++;
  .      .   70:     }
  .      .   71:   }
  4      8   72:   return num_evens;
  .      .   73: }

Good, our increment doesn't dominate the function anymore. We look at perf stat and see:

% perf stat ./a.out
 Performance counter stats for './a.out':

    2977.781539 task-clock                #    1.472 CPUs utilized          
            177 context-switches          #    0.000 M/sec                  
             12 CPU-migrations            #    0.000 M/sec                  
          3,506 page-faults               #    0.001 M/sec                  
  8,523,367,658 cycles                    #    2.862 GHz                     [83.32%]
  2,057,253,537 stalled-cycles-frontend   #   24.14% frontend cycles idle    [83.26%]
    919,272,160 stalled-cycles-backend    #   10.79% backend  cycles idle    [66.70%]
 12,067,358,492 instructions              #    1.42  insns per cycle        
                                          #    0.17  stalled cycles per insn [83.42%]
  2,454,951,795 branches                  #  824.423 M/sec                   [83.42%]
     67,544,262 branch-misses             #    2.75% of all branches         [83.42%]

    2.022988074 seconds time elapsed

Ah, perfect! About 30% faster than our original solution (2.69 seconds down to 2.02) and significantly fewer stalled cycles.
