Table of contents
Open Table of contents
- Context
- The Heap: Where malloc Lives
- Free Lists: The Simplest Allocator
- Bins: Speeding Up the Search
- The Buddy System
- The Multi-Threaded Problem
- tcmalloc: Thread-Caching Malloc
- jemalloc: Arena-Based Allocation
- Comparing the Allocators
- A Complete malloc/free Example
- Fragmentation: The Eternal Enemy
- Debugging Allocator Issues
- References
Context
Every time you write malloc(64) in C or new Object() in Java (which eventually calls a native allocator), something has to find 64 free bytes somewhere in your process’s address space and hand them back. When you call free(), those bytes need to become reusable again. This sounds simple, but doing it fast, with low fragmentation, across many threads is one of the classic hard problems in systems programming.
The operating system gives your process memory in large chunks (pages, typically 4 KB). The allocator’s job is to carve those pages into the small, variable-sized pieces your program actually needs.
Your program calls malloc()/free()
|
v
+--------------------+
| User-space |
| Allocator | <-- this is what we're studying
| (glibc, tcmalloc,|
| jemalloc, etc.) |
+--------+-----------+
|
| brk() / mmap()
v
+--------------------+
| OS Kernel |
| (virtual memory) |
+--------+-----------+
|
v
+--------------------+
| Physical RAM |
+--------------------+
The allocator sits between your code and the kernel. It batches kernel calls (which are expensive) and manages a heap — a region of virtual memory that grows as needed.
The Heap: Where malloc Lives
When your process starts, the kernel sets up a memory layout like this:
High addresses
+------------------+
| Stack | grows downward
| | |
| v |
| |
| (unmapped) |
| |
| ^ |
| | |
| Heap | grows upward (via brk/sbrk)
+------------------+ <-- program break
| BSS (zeros) |
| Data (globals) |
| Text (code) |
+------------------+
Low addresses
The heap starts just above the program’s static data. The system call brk() moves the “program break” upward to grow the heap. Modern allocators also use mmap() to request memory from arbitrary virtual addresses, especially for large allocations.
Here is a simplified version of how glibc’s malloc requests memory from the kernel:
// Simplified — real glibc is much more complex
void *request_memory(size_t size) {
if (size >= MMAP_THRESHOLD) { // typically 128 KB
// Large allocation: use mmap directly
return mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
// Small allocation: extend the heap
void *old_break = sbrk(0);
if (sbrk(size) == (void *)-1) return NULL;
return old_break;
}
Small allocations extend the heap; large ones get their own mmap region that can be returned to the OS independently. This dual strategy avoids a common problem: a single large allocation at the top of the heap preventing the break from being lowered, even after everything below it is freed.
Free Lists: The Simplest Allocator
The most fundamental data structure in an allocator is the free list — a linked list of available memory blocks. When you malloc, the allocator walks the list looking for a block that fits. When you free, the block goes back on the list.
Free list (singly linked):
head
|
v
+--------+ +--------+ +--------+
| 64 B |---->| 128 B |---->| 32 B |----> NULL
| (free) | | (free) | | (free) |
+--------+ +--------+ +--------+
malloc(48) scans the list:
64 B >= 48? Yes -> use this block
Each free block stores its size and a pointer to the next free block inside the free memory itself — since the memory is unused, we can repurpose it:
typedef struct block_header {
size_t size;
struct block_header *next;
int free; // 1 = free, 0 = allocated
} block_header;
When malloc(n) is called, the allocator searches for a free block with size >= n. There are three classic strategies:
- First fit: Return the first block that is large enough. Fast, but leads to fragmentation at the start of the list.
- Best fit: Scan the entire list and return the smallest block that fits. Less waste per allocation, but slower ( scan) and creates many tiny unusable fragments.
- Next fit: Like first fit, but start scanning from where the last search left off. Spreads allocations more evenly.
Splitting and Coalescing
If the allocator finds a 256-byte block but you only need 48 bytes, it splits the block:
Before split:
+--------------------+
| 256 B (free) |
+--------------------+
After malloc(48):
+----------+---------+
| 48 B | 200 B | (8 bytes lost to header overhead)
| (used) | (free) |
+----------+---------+
When you free a block, the allocator should coalesce (merge) it with adjacent free blocks to prevent fragmentation:
Before free(B):
+--------+--------+--------+
| A | B | C |
| (free) | (used) | (free) |
+--------+--------+--------+
After free(B) + coalescing:
+---------------------------+
| A + B + C |
| (free, merged) |
+---------------------------+
Without coalescing, you would end up with many small free blocks that can’t satisfy larger requests even though the total free memory is sufficient. This is called external fragmentation.
The glibc implementation in malloc/malloc.c uses a boundary-tag technique: each chunk stores its size at both the beginning and end, so the allocator can find the previous chunk in and merge if it’s free.
Bins: Speeding Up the Search
Scanning a single linked list is too slow for a real allocator. glibc’s ptmalloc2 organizes free blocks into bins by size:
Bins array (simplified):
Index Size range Data structure
----- ---------------- ----------------
[0] unsorted doubly linked list (recently freed)
[1] 16 bytes doubly linked list
[2] 24 bytes doubly linked list
... ... ...
[63] 512 bytes doubly linked list
[64] 512 - 576 B sorted tree (red-black or skip list)
... ... ...
[126] >= 1 MB sorted tree
"Small bins" (exact size, fast lookup)
"Large bins" (size ranges, tree-sorted)
- Fast bins: Very small sizes (16–80 bytes on 64-bit). Singly linked, LIFO. No coalescing — speed over space.
- Small bins: Exact sizes up to 512 bytes. Doubly linked, FIFO. Coalescing on free.
- Large bins: Size ranges above 512 bytes. Sorted so best-fit is efficient.
- Unsorted bin: A staging area for recently freed chunks. The allocator checks here first and sorts chunks into the correct bin lazily.
This binning turns the common case — allocating a small, popular size — into an operation: look up the bin by size, pop the first block.
The Buddy System
An alternative to free lists is the buddy allocator, used by the Linux kernel for page-level allocation and by some user-space allocators. The idea: all blocks are powers of two, and every block has a “buddy” at a known address.
Start with a 1024-byte region.
Request 128 bytes:
Step 1: Split 1024 -> two 512-byte buddies
+-------------------+-------------------+
| 512 B | 512 B |
+-------------------+-------------------+
Step 2: Split left 512 -> two 256-byte buddies
+---------+---------+-------------------+
| 256 B | 256 B | 512 B |
+---------+---------+-------------------+
Step 3: Split left 256 -> two 128-byte buddies
+----+----+---------+-------------------+
|128 |128 | 256 B | 512 B |
|used|free| (free) | (free) |
+----+----+---------+-------------------+
To free a block, check if its buddy is also free. If so, merge them back into the parent, and recursively check the parent’s buddy:
Free the 128-byte block:
+----+----+---------+-------------------+
|free|free| 256 B | 512 B |
+----+----+---------+-------------------+
|
v (buddy is free, merge)
+---------+---------+-------------------+
| 256 B | 256 B | 512 B |
+---------+---------+-------------------+
|
v (buddy is free, merge)
+-------------------+-------------------+
| 512 B | 512 B |
+-------------------+-------------------+
|
v (buddy is free, merge)
+---------------------------------------+
| 1024 B |
+---------------------------------------+
The key insight: a block’s buddy address can be computed with a single XOR. If a block of size starts at address , its buddy is at . This makes the “is my buddy free?” check very fast.
The Linux kernel’s page allocator (mm/page_alloc.c in the kernel source) uses exactly this scheme with orders 0 through 10 (4 KB to 4 MB). You can see the current state in /proc/buddyinfo:
$ cat /proc/buddyinfo
Node 0, zone Normal 4096 2048 1024 512 256 128 64 32 16 8 4
# ^ ^ ^ ...
# 4KB 8KB 16KB page counts at each order
Trade-off: The buddy system has zero external fragmentation (blocks always merge cleanly) but suffers from internal fragmentation — a 65-byte request wastes 63 bytes because it rounds up to 128.
The Multi-Threaded Problem
A single free list with a single lock becomes a bottleneck when dozens of threads all call malloc and free simultaneously. Every thread contends for the same lock:
Thread 1 --> malloc(32) --+
Thread 2 --> malloc(64) --+--> [ LOCK ] --> single heap --> [ UNLOCK ]
Thread 3 --> free(ptr) --+
Thread 4 --> malloc(16) --+
All 4 threads serialize on the lock. Throughput collapses.
This is the central problem that modern allocators (tcmalloc, jemalloc, mimalloc) solve.
tcmalloc: Thread-Caching Malloc
Google’s tcmalloc (Thread-Caching Malloc) uses a two-level design: each thread gets its own cache of small objects, and a shared central heap handles the rest.
+--------------------------------------------------+
| Process |
| |
| Thread 1 Thread 2 Thread 3|
| +-----------+ +-----------+ +------+|
| |Thread | |Thread | |Thread||
| |Cache | |Cache | |Cache ||
| | | | | | ||
| |free lists | |free lists | |free ||
| |by size | |by size | |lists ||
| |class | |class | | ||
| +-----+-----+ +-----+-----+ +--+---+|
| | | | |
| | refill/return | | |
| v v v |
| +--------------------------------------------+ |
| | Central Free Lists | |
| | (one per size class) | |
| | protected by per-class locks | |
| +--------------------+-----------------------+ |
| | |
| | when empty |
| v |
| +--------------------------------------------+ |
| | Page Heap | |
| | (spans of contiguous pages) | |
| | single lock | |
| +--------------------------------------------+ |
+--------------------------------------------------+
Size Classes
tcmalloc rounds up every allocation to a size class. Instead of handling arbitrary sizes, it uses a fixed set (e.g., 8, 16, 32, 48, 64, 80, 96, 128, … up to 256 KB). This eliminates fragmentation within a size class and makes free-list indexing trivial.
From the tcmalloc source (tcmalloc/size_classes.cc):
Request size Size class Wasted bytes
----------- ---------- ------------
1 - 8 8 0 - 7
9 - 16 16 0 - 7
17 - 32 32 0 - 15
33 - 48 48 0 - 15
49 - 64 64 0 - 15
... ... ...
The Thread Cache
Each thread maintains a per-size-class free list in thread-local storage:
// Conceptual structure (simplified from tcmalloc source)
struct ThreadCache {
FreeList list[kNumClasses]; // one free list per size class
size_t size; // total bytes in this cache
};
void *malloc(size_t size) {
int cl = SizeToClass(size); // O(1) lookup
ThreadCache *tc = GetThreadCache(); // thread-local
if (tc->list[cl].empty()) {
tc->FetchFromCentral(cl); // refill from central
}
return tc->list[cl].Pop(); // no lock needed!
}
void free(void *ptr) {
int cl = GetSizeClass(ptr); // from page metadata
ThreadCache *tc = GetThreadCache();
tc->list[cl].Push(ptr); // no lock needed!
if (tc->size > kMaxSize) {
tc->ReturnToCentral(); // give back excess
}
}
The critical insight: most malloc/free pairs never touch a lock. The thread cache absorbs the fast path entirely. Only when the cache runs empty or gets too large does the thread interact with the shared central heap — and even then, it transfers objects in batches to amortize the lock cost.
The Page Heap
For allocations larger than 256 KB (or when the central lists need more pages), tcmalloc goes to the page heap. This is a buddy-like structure that manages runs of contiguous pages called spans:
Page Heap (simplified):
free_lists[1] --> [1 page] --> [1 page] --> NULL
free_lists[2] --> [2 pages] --> NULL
free_lists[3] --> [3 pages] --> NULL
...
free_lists[128] --> [128 pages] --> NULL
large_spans --> sorted set of spans > 128 pages
A span can be “carved” into objects of a single size class. The span metadata records which size class it serves, so free() can determine the size class from just the pointer (look up the page, find the span, read the size class).
jemalloc: Arena-Based Allocation
Facebook’s jemalloc (used by FreeBSD libc, Redis, and Rust’s default allocator for a long time) takes a different approach to concurrency: instead of per-thread caches, it uses multiple arenas.
+--------------------------------------------------+
| Process |
| |
| Thread 1 Thread 2 Thread 3 |
| | | | |
| | assigned | assigned | assigned |
| v v v |
| +--------+ +--------+ +--------+ |
| | Arena | | Arena | | Arena | |
| | 0 | | 1 | | 0 | |
| +--------+ +--------+ +--------+ |
| |
| Typically: ncpus * 4 arenas |
| Threads round-robin across arenas |
+--------------------------------------------------+
Inside each arena:
+------------------------------------------+
| Arena |
| |
| Thread Cache (tcache) |
| +------------------------------------+ |
| | small bins: [8][16][32]...[14336] | |
| +------------------------------------+ |
| |
| Small allocations (< 14 KB): |
| +------------------------------------+ |
| | Slabs (runs of pages, divided | |
| | into equal-size regions) | |
| +------------------------------------+ |
| |
| Large allocations (>= 14 KB): |
| +------------------------------------+ |
| | Dedicated page runs | |
| +------------------------------------+ |
+------------------------------------------+
Slabs
jemalloc divides small allocations into size classes (similar to tcmalloc) and carves pages into slabs — contiguous runs of same-sized slots:
Slab for 64-byte objects (one 4 KB page = 63 slots + metadata):
+------+------+------+------+------+------+---+------+
| 64 B | 64 B | 64 B | 64 B | 64 B | 64 B |...| 64 B |
| obj | obj | FREE | obj | FREE | obj | | FREE |
+------+------+------+------+------+------+---+------+
^ ^ ^
| | |
+-- tracked by a bitmap ----------+
Bitmap: 1 1 0 1 0 1 ... 0
^ ^
used free
Instead of linked lists inside free objects, jemalloc uses a bitmap to track which slots are occupied. This has better cache behavior — the bitmap fits in a cache line or two, while a linked list scatters pointers across the slab.
Thread Caches (tcache)
jemalloc also has per-thread caches (confusingly also called “tcache”), but they work with the arena system. Each tcache holds a small magazine of objects per size class. The flow is:
- Fast path: Pop from tcache (no lock).
- Slow path: Refill tcache from the arena’s slab (arena lock).
- Slower path: Allocate a new slab from the arena’s page runs.
- Slowest path:
mmapnew pages from the OS.
Dirty Page Purging
One unique feature of jemalloc is its approach to returning memory to the OS. Instead of immediately munmap-ing freed pages, jemalloc marks them as “dirty” and periodically purges them using madvise(MADV_DONTNEED):
Page states in jemalloc:
+----------+ free() +----------+ decay timer +-----------+
| Active | -----------> | Dirty | ----------------> | Muzzy |
| (in use) | | (freed, | | (purged, |
+----------+ | still | | but page |
^ | resident| | tables |
| | in RAM) | | remain) |
| +----------+ +-----------+
| | |
+--- reused by malloc ----+ |
+--- reused by malloc -----------------------------------+
This decay-based approach avoids the overhead of frequent mmap/munmap calls while still allowing the OS to reclaim physical pages when memory is tight. The decay rate is configurable (default: 10 seconds for dirty, 10 seconds for muzzy).
Comparing the Allocators
Feature glibc ptmalloc2 tcmalloc jemalloc
------------------- ----------------- ---------------- ----------------
Concurrency model Per-thread arenas Thread cache + Multiple arenas
(limited count) central lists + thread caches
Small alloc speed Medium Very fast Fast
(arena lock) (lock-free path) (lock-free path)
Fragmentation Higher Lower Lowest
(aging free lists) (size classes) (slabs + bitmap)
Memory overhead Low Medium Medium
(minimal metadata) (per-thread $) (bitmap + runs)
Large alloc mmap above 128KB mmap above 256KB mmap above
strategy a few MB
Used by Linux default Google, Go(old) FreeBSD, Redis,
Chromium Firefox, Rust
A Complete malloc/free Example
Let’s trace through what happens in tcmalloc when you allocate and free 48 bytes:
malloc(48)
--------
1. Round up to size class: 48 bytes -> class 5 (48 B)
2. Check thread cache for class 5
+-- Not empty? Pop from free list. Done. (common case)
+-- Empty? Go to step 3.
3. Fetch a batch from central free list (lock per class)
+-- Not empty? Move batch to thread cache. Pop one. Done.
+-- Empty? Go to step 4.
4. Allocate a span from page heap (global lock)
+-- Find a span of appropriate size
+-- Carve it into 48-byte objects
+-- Move them to central free list
+-- Transfer batch to thread cache
+-- Pop one. Done.
free(ptr)
--------
1. Look up page -> find span -> find size class: 48 B, class 5
2. Push onto thread cache free list for class 5. Done. (common case)
+-- Thread cache too large? Go to step 3.
3. Return a batch to central free list (lock per class)
+-- Span entirely free? Return span to page heap.
Steps 1-2 for both malloc and free are the hot path — they involve zero locks and zero system calls. This is why tcmalloc achieves millions of allocations per second per thread.
Fragmentation: The Eternal Enemy
Even with all these techniques, fragmentation remains a challenge. There are two types:
External fragmentation:
Free memory exists but is scattered in non-contiguous blocks.
+------+----+------+----+------+----+------+
| used |FREE| used |FREE| used |FREE| used |
| 32 B |16B | 64 B |16B | 32 B |16B | 32 B |
+------+----+------+----+------+----+------+
Total free: 48 bytes. But malloc(48) fails because
no single contiguous block is 48 bytes.
Internal fragmentation:
Allocated block is larger than requested.
malloc(33) returns a 48-byte block (size class rounding).
+---------------------------------------------+
| 33 bytes used | 15 bytes wasted (padding) |
+---------------------------------------------+
Modern allocators accept some internal fragmentation (typically 10-15% waste from size-class rounding) as the price for fast, scalable allocation. The key metric is RSS (Resident Set Size) — how much physical memory the process actually uses. A good allocator minimizes RSS over the lifetime of the program, not just at any single point.
Debugging Allocator Issues
When things go wrong, these tools help:
/proc/self/maps— Shows all memory mappings (heap, mmap regions, shared libraries). Look for unexpected growth.mallinfo()/malloc_stats()— glibc functions that print heap statistics.MALLOC_CONF— jemalloc’s environment variable for runtime configuration and statistics (MALLOC_CONF=stats_print:true).tcmalloc::MallocExtension— tcmalloc’s C++ API for heap profiling and statistics.- Valgrind / AddressSanitizer — Detect use-after-free, double-free, buffer overflows, and leaks. ASan instruments allocator calls at compile time and catches errors with ~2x overhead.
# Compile with AddressSanitizer
gcc -fsanitize=address -g myprogram.c -o myprogram
./myprogram
# ASan reports:
# ==12345==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000000010
References
- glibc malloc implementation
malloc/malloc.c - Doug Lea’s malloc design doc
- tcmalloc design doc
- tcmalloc source
tcmalloc/ - jemalloc source and docs repo
- jemalloc: A Scalable Concurrent malloc Implementation for FreeBSD paper
- Linux kernel buddy allocator
mm/page_alloc.c - Hoard: A Scalable Memory Allocator for Multithreaded Applications paper
- mimalloc: Free List Sharding in Action paper