Analysis of a Parallel Memory Allocator

Background

Problem

Many modern programs use dynamic memory allocation heavily. At the same time, programs are increasingly multithreaded in order to take advantage of increasingly parallel processors. Unfortunately, this trend conflicts with the fact that most current programs share a single heap across all threads. Consequently, research into parallel memory allocators is topical and important.

Solution?

The simplest way to ensure correctness in a multithreaded memory allocator is to use a global lock around the heap. Unfortunately, this has severe performance consequences and is almost never adopted by modern memory allocators. Instead, modern memory allocators tend to adopt some form of the following three techniques:

  • They partition the heap into logical arenas or chunks, each responsible for a large portion of the heap. This reduces contention on the global heap and its data structures.
  • They use fine-grained locking on individual slabs or slab classes.
  • They use thread-local caches to provide a fast path that requires no locks (a minimal sketch of this fast path follows the list).
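
To make the thread-local fast path concrete, here is a minimal sketch of how such a cache might sit in front of a locked shared arena. The names (tcache_t, fast_alloc, arena_refill) and the stubbed slow path are illustrative assumptions, not code from any of the allocators discussed below.

```c
/*
 * Minimal sketch of a thread-local fast path backed by a locked shared arena.
 * Names and the stubbed slow path are illustrative, not from a real allocator.
 */
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_SIZE_CLASSES 128            /* 8-byte classes up to 1024 bytes */

typedef struct free_block {
    struct free_block *next;
} free_block_t;

/* Per-thread cache: one free list per size class, touched without any lock. */
typedef struct {
    free_block_t *bins[NUM_SIZE_CLASSES];
} tcache_t;

static __thread tcache_t tcache;        /* GCC/Clang thread-local storage */
static pthread_mutex_t arena_lock = PTHREAD_MUTEX_INITIALIZER;

/* Slow path: take the shared lock and get a block from the arena (stubbed). */
static void *arena_refill(size_t cls) {
    pthread_mutex_lock(&arena_lock);
    void *block = malloc((cls + 1) * 8);        /* stand-in for real slab logic */
    pthread_mutex_unlock(&arena_lock);
    return block;
}

void *fast_alloc(size_t size) {
    if (size == 0 || size > NUM_SIZE_CLASSES * 8)
        return NULL;                            /* large requests not handled here */
    size_t cls = (size - 1) / 8;                /* map size to an 8-byte class */
    free_block_t *blk = tcache.bins[cls];
    if (blk != NULL) {                          /* fast path: no lock taken */
        tcache.bins[cls] = blk->next;
        return blk;
    }
    return arena_refill(cls);                   /* slow path: contended lock */
}

void fast_free(void *ptr, size_t size) {
    size_t cls = (size - 1) / 8;                /* caller passes the original size (>= 1) */
    free_block_t *blk = ptr;
    blk->next = tcache.bins[cls];               /* push onto the local free list */
    tcache.bins[cls] = blk;
}
```

The key point is that the common case touches only thread-local state; the shared lock is taken only when the local free list for a size class is empty.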

Modern memory allocators

As I understand it, the most popular modern parallel mallocs are jemalloc, tcmalloc, ptmalloc, concur, nedmalloc, and hoard. Oracle has done some investigation of these, and I have taken a look at the internals of jemalloc, tcmalloc, concur, and hoard. In summary:

  • tcmalloc uses a global slab allocator with thread-local caches to avoid contention
  • hoard uses different arenas and assigns superblocks to threads to avoid contention
  • jemalloc uses different arenas and thread-local caches to avoid contention, and uses red-black trees and an optimized slab allocator to avoid fragmentation
  • concur uses different arenas and fine-grained locking on size classes to avoid contention

One interesting characteristic of many of these memory allocators is that they tend to request memory from the system in chunks of roughly 1 to 4 MB. Consequently, they tend to carry an overhead of up to 2 to 4 MB per arena. Most of them justify this overhead by pointing out that 2 MB of overhead is minimal when the total application footprint can exceed 1 GB (in an application such as Firefox), and that it is acceptable for an application to use an extra 2 MB of heap when modern computers routinely have several gigabytes of RAM.
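
For reference, reserving such a chunk from the OS typically looks something like the sketch below; the 2 MB figure and the chunk_alloc name are illustrative, not taken from any particular allocator.

```c
/* Sketch of reserving a multi-megabyte chunk from the OS with mmap.
 * The 2 MB chunk size is illustrative; real allocators use roughly 1-4 MB. */
#include <stddef.h>
#include <sys/mman.h>

#define CHUNK_SIZE (2u * 1024 * 1024)           /* 2 MB per arena chunk */

static void *chunk_alloc(void) {
    void *chunk = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (chunk == MAP_FAILED) ? NULL : chunk;
}
```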

Another interesting characteristic of these memory allocators is that they rarely coalesce individual blocks (a few do, but it is not the common case). Instead, they use slab allocators and assume that allocation requests tend to be of very similar sizes. In general, this follows the same pattern of tolerating a moderate amount of memory overhead in exchange for performance.
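
A rough sketch of why coalescing becomes unnecessary: a slab is carved into equal-sized blocks up front, and a freed block is simply pushed back onto the slab's free list for reuse by the same size class. The field and function names below are illustrative.

```c
/* Sketch of a slab of equal-sized blocks: freed blocks are pushed back onto a
 * per-slab free list instead of being coalesced with their neighbors. */
#include <stddef.h>
#include <stdint.h>

typedef struct slab {
    size_t block_size;      /* every block in this slab has the same size */
    void  *free_list;       /* singly linked list threaded through free blocks */
    struct slab *next;
} slab_t;

/* Carve a fresh region into fixed-size blocks and link them into the free list. */
static void slab_init(slab_t *slab, void *region, size_t region_size, size_t block_size) {
    slab->block_size = block_size;
    slab->free_list  = NULL;
    slab->next       = NULL;
    uint8_t *base = region;
    for (size_t off = 0; off + block_size <= region_size; off += block_size) {
        *(void **)(base + off) = slab->free_list;   /* store the next pointer in the block */
        slab->free_list = base + off;
    }
}

static void *slab_alloc(slab_t *slab) {
    void *block = slab->free_list;
    if (block != NULL)
        slab->free_list = *(void **)block;          /* pop the head block */
    return block;
}

static void slab_free(slab_t *slab, void *block) {
    *(void **)block = slab->free_list;              /* push back; no coalescing needed */
    slab->free_list = block;
}
```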

Approach

A simple modern memory allocator

In order to investigate and analyze the performance of a modern memory allocator, I wrote a simplified memory allocator, ar_malloc, that uses many of the modern optimizations. ar_malloc is based quite heavily on jemalloc but makes some simplifications. In order to keep the work manageable, ar_malloc assumes that allocation requests are smaller than 1024 bytes. In addition, it uses slabs of a fixed size and never returns memory to the system (jemalloc, by contrast, uses variable-sized slabs to reduce memory overhead).
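
As a rough sketch of this layout, the structures might look like the following; the constants and field names are illustrative simplifications, not ar_malloc's actual definitions.

```c
/* Rough sketch of the layout described above; constants and names are
 * illustrative, not ar_malloc's actual definitions. */
#include <pthread.h>
#include <stddef.h>

#define MAX_ALLOC_SIZE   1024                 /* ar_malloc assumes small requests (< ~1 KB) */
#define NUM_SIZE_CLASSES (MAX_ALLOC_SIZE / 8) /* 8-byte size classes */
#define SLAB_SIZE        (64 * 1024)          /* fixed-size slabs, never returned to the OS */
#define NUM_ARENAS       8                    /* a configurable number of independent arenas */

typedef struct slab {
    void *free_list;                          /* free blocks of one size class */
    struct slab *next;
} slab_t;

typedef struct arena {
    pthread_mutex_t lock;                     /* protects only this arena */
    slab_t *slabs[NUM_SIZE_CLASSES];          /* one slab list per size class */
} arena_t;

static arena_t arenas[NUM_ARENAS];            /* threads are spread across these */
```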

Testing a memory allocator

In order to test ar_malloc, I constructed a test framework (based on a test in the tcmalloc codebase) that spawns several threads, each of which repeatedly decides at random either to allocate a block of random size or to free a previously allocated block. This does not simulate the effect of actually using the blocks and is not a realistic workload, but it is still a useful basis for investigation. I ran this test on a 16-core shared-memory system and reinitialized malloc for each run to reduce the variance in run time. A sketch of the stress loop is shown below.
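
This sketch shows the shape of the stress loop; the constants, the 16-thread count, and names such as stress_thread are illustrative, and the real harness differs in detail.

```c
/* Sketch of the stress test: each thread keeps a small pool of live blocks and
 * randomly either allocates a random-sized block into it or frees one from it. */
#include <pthread.h>
#include <stdlib.h>

#define OPS_PER_THREAD 1000000
#define POOL_SIZE      4096
#define MAX_SIZE       1024

static void *stress_thread(void *arg) {
    void *pool[POOL_SIZE] = {0};
    unsigned int seed = (unsigned int)(size_t)arg;     /* per-thread RNG seed */
    for (long i = 0; i < OPS_PER_THREAD; i++) {
        int slot = rand_r(&seed) % POOL_SIZE;
        if (pool[slot] != NULL) {                      /* randomly free ... */
            free(pool[slot]);
            pool[slot] = NULL;
        } else {                                       /* ... or allocate a random size */
            size_t size = 8 + rand_r(&seed) % (MAX_SIZE - 7);
            pool[slot] = malloc(size);
        }
    }
    for (int s = 0; s < POOL_SIZE; s++)                /* release anything still live */
        free(pool[s]);
    return NULL;
}

int main(void) {
    enum { NTHREADS = 16 };
    pthread_t tids[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tids[t], NULL, stress_thread, (void *)(size_t)(t + 1));
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tids[t], NULL);
    return 0;
}
```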

Results

Comparison of ar_malloc to other solutions

I compared the performance of ar_malloc, ar_malloc with a global lock, and the libc malloc on the test described in the previous section.

Run time vs Number of threads
This is a chart of test run time vs number of threads for a globally locked malloc, ar_malloc, and libc malloc. As you can see, the global-lock solution performs dramatically worse than the other two.

This is a chart of test run time vs number of threads for ar_malloc and libc malloc. As you can see, ar_malloc is about 3 times faster than libc even for single-threaded execution.
This is a chart of test speedup vs number of threads for ar_malloc and libc malloc. As you can see, ar_malloc exhibits linear speedup that scales cleanly with the number of threads, whereas libc scales only to about 8 threads.

Comparison of different configurations

I examined several different configurations of ar_malloc, focusing specifically on the number of arenas and its effect on performance.

This is a chart of run time vs number of threads for different configurations of ar_malloc. As you can see, there appear to be two curves. We will call the lower one the "no contention" curve and the upper one the "contention" curve. The performance of a given configuration moves from the "no contention" curve to the "contention" curve once the number of threads exceeds the number of arenas.
This is a chart of speedup vs number of threads for different configurations of ar_malloc. As before, there are two curves: the "no contention" line and the "contention" line. Again, the speedup of a given configuration moves from the "no contention" line to the "contention" line once the number of threads exceeds the number of arenas. It is important to note that the speedup is still mostly linear even when the number of arenas is far less than the number of threads.
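
A plausible explanation for where the curves split is the thread-to-arena mapping: if threads are hashed onto arenas roughly as in the sketch below, arenas start being shared, and therefore contended, exactly when the thread count exceeds the arena count. The mapping shown is an illustrative assumption, not ar_malloc's actual code.

```c
/* Illustrative thread-to-arena mapping: with more threads than arenas, some
 * threads must share an arena, and its per-arena lock becomes contended. */
#include <pthread.h>

#define NUM_ARENAS 8                     /* one of the configurations examined above */

typedef struct arena {
    pthread_mutex_t lock;                /* contended only when an arena is shared */
    /* ... per-size-class slab lists elided ... */
} arena_t;

static arena_t arenas[NUM_ARENAS];

static arena_t *arena_for_thread(void) {
    /* pthread_t is an integral type on Linux, so this cast works there. */
    unsigned long id = (unsigned long)pthread_self();
    return &arenas[id % NUM_ARENAS];
}
```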

Conclusion

Over the course of this project, I have demonstrated that it is feasible to write a modern parallel memory allocator that performs quite favorably on random workloads. ar_malloc makes many simplifying assumptions, but it is just over 2000 lines of code, outperforms libc malloc by a factor of 3, and demonstrates linear speedup that scales well with the number of threads.

Further Investigation

There are several routes for further investigation in parallel memory allocators.

The existing test framework allocates random sizes distributed uniformly in the range [8, 1024]. This almost certainly does not simulate realistic memory allocation patterns. An interesting further exploration would be to use ar_malloc with real programs (either via static linking or LD_PRELOAD), or to investigate the actual allocation-size distribution of a typical program. A sketch of the LD_PRELOAD approach is shown below.
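
The LD_PRELOAD route would use an interposition shim along the lines of the sketch below. The ar_malloc entry-point names (ar_malloc_alloc, ar_malloc_free) are assumptions, and a real shim would also need to cover calloc, realloc, and related functions.

```c
/* Sketch of an LD_PRELOAD shim that routes malloc/free to ar_malloc.
 * ar_malloc_alloc / ar_malloc_free are assumed names for ar_malloc's entry points.
 * Build:  gcc -shared -fPIC -o libar_preload.so preload.c ar_malloc.o
 * Run:    LD_PRELOAD=./libar_preload.so ./some_program */
#include <stddef.h>

void *ar_malloc_alloc(size_t size);      /* assumed ar_malloc entry points */
void  ar_malloc_free(void *ptr);

void *malloc(size_t size) { return ar_malloc_alloc(size); }
void  free(void *ptr)     { ar_malloc_free(ptr); }
```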

This investigation only examined the effect of varying the number of arenas. A further exploration could examine the effect of thread-local caches and fine-grained locking on the performance of ar_malloc.
