System Design - How CPU Branch Prediction Works

Open Table of contents

Context
Static Prediction: Simple Rules
Dynamic Prediction: Learning from History
- The Two-Bit Saturating Counter
- Branch History Table (BHT)
Two-Level Adaptive Prediction
- The Idea: Correlate with Recent History
- gshare: The Practical Compromise
The Branch Target Buffer (BTB)
Tournament Predictors: Best of Both Worlds
TAGE: The State of the Art
Real-World Impact: A Famous Example
Spectre: When Prediction Becomes a Vulnerability
Writing Prediction-Friendly Code
Summary
References

Context

Modern CPUs don’t execute one instruction at a time. They use a technique called pipelining — breaking each instruction into stages (fetch, decode, execute, write-back) and overlapping them like an assembly line. A 15-stage pipeline (common in modern x86 cores) can have 15 instructions in flight simultaneously.

This works beautifully for sequential code. But what happens at a branch — an if, for, while, or switch?

    Instruction stream:

    0x100:  cmp  rax, 10
    0x104:  jle  0x120       <-- branch: go to 0x120, or fall through to 0x108?
    0x108:  mov  rbx, 1      <-- "not taken" path
    0x10C:  ...
      ...
    0x120:  mov  rbx, 2      <-- "taken" path
    0x124:  ...

The CPU has already fetched instructions beyond the branch before it even knows the comparison result. If it fetched the wrong path, it must flush the pipeline — throw away all that speculative work and start over from the correct path. On a 15-stage pipeline, that’s potentially 15 wasted cycles.

  Pipeline stages for a branch instruction:

  Clock cycle:   1    2    3    4    5    6    7    8   ...  15
                 +----+----+----+----+----+----+----+----+---+----+
  branch instr:  | F  | D  | E  | ... (result known here)        |
                 +----+----+----+----+----+----+----+----+---+----+
  next instr 1:  |    | F  | D  | E  | <-- wasted if wrong path  |
  next instr 2:  |    |    | F  | D  | <-- wasted if wrong path  |
  next instr 3:  |    |    |    | F  | <-- wasted if wrong path  |
                 +----+----+----+----+----+----+----+----+---+----+

  If mispredicted: flush pipeline, restart from correct target
  Penalty: ~15 cycles on modern CPUs (varies by microarchitecture)

Branch prediction is the CPU’s solution: guess which way the branch will go, and start executing that path speculatively. If the guess is right, the pipeline keeps flowing without interruption. If wrong, pay the flush penalty.

Modern predictors achieve 95–99% accuracy on typical workloads. The difference between 95% and 99% is enormous: in a loop that iterates a million times, 95% accuracy means 50,000 pipeline flushes vs. 10,000.

Static Prediction: Simple Rules

The simplest prediction strategies require no history at all:

Always predict “not taken” — assume the branch falls through. This is what the earliest pipelined processors (like MIPS R2000) did. It works well for error-checking branches (if (error) goto handle_error) since errors are rare.

Backward taken, forward not taken (BTFNT) — assume backward branches (loops) are taken and forward branches are not. Loop branches jump backward to the loop head, so this heuristic captures the common case. Intel’s Pentium used BTFNT as a fallback.

  Static prediction heuristics:

  +------------------+     branch target < branch address?
  |  Branch at 0x200 |     (i.e., jumping backward?)
  +--------+---------+
           |
     +-----+-----+
     |           |
     v           v
   YES          NO
  (backward)   (forward)
     |           |
     v           v
  Predict     Predict
  TAKEN       NOT TAKEN
  (loop       (error check,
   back)       unlikely path)

Static prediction gives roughly 60–70% accuracy. Not bad, but not good enough for modern out-of-order processors that suffer heavily from mispredictions.

Dynamic Prediction: Learning from History

Dynamic predictors observe what a branch did in the past and use that to predict its future. The key insight: branch behavior is often repetitive. A loop branch is taken N−1 times then not-taken once. A data-dependent branch may follow a pattern tied to the data structure it’s iterating.

The Two-Bit Saturating Counter

The foundational building block is a two-bit saturating counter — a tiny state machine with four states:

                    Two-Bit Saturating Counter

          taken              taken              taken
    +---+       +---+              +---+              +---+
    |   v       |   v              |   v              |   v
  +-------+   +-------+        +---------+        +---------+
  |  00   |   |  01   |        |   10    |        |   11    |
  |Strongly|   |Weakly |        | Weakly  |        |Strongly |
  |  Not   |-->|  Not  |------->| Taken   |------->| Taken   |
  | Taken  |   | Taken |        |         |        |         |
  +-------+   +-------+        +---------+        +---------+
    ^   |       ^   |              ^   |              ^   |
    +---+       +---+              +---+              +---+
  not taken   not taken          not taken          not taken

  Prediction:  NOT TAKEN          |           TAKEN
               (states 00, 01)    |    (states 10, 11)

Why two bits instead of one? A one-bit counter flips prediction after every misprediction. Consider a loop that iterates 100 times: it’s taken 99 times, then not-taken once at the exit. A one-bit predictor would mispredict twice per loop iteration — once at the exit, and once re-entering the loop. A two-bit counter only mispredicts once, at the exit, because the single “not taken” only moves it from “strongly taken” to “weakly taken.”

Branch History Table (BHT)

A Branch History Table (also called a Pattern History Table) is an array of two-bit counters, indexed by some bits of the branch’s program counter (PC):

  Branch History Table (BHT)
  Indexed by lower bits of branch PC

  Branch PC: 0x7FFA0104
                     ||||
                     vvvv
  Index = lower N bits of PC (e.g., bits [9:2])

  +-------+-------------------+
  | Index | 2-bit counter     |
  +-------+-------------------+
  |  000  |  11 (ST)          |
  |  001  |  01 (WN)          |
  |  010  |  10 (WT)          |  <-- this entry
  |  011  |  11 (ST)          |
  |  ...  |  ...              |
  |  255  |  00 (SN)          |
  +-------+-------------------+

  ST = Strongly Taken,  WT = Weakly Taken
  WN = Weakly Not-Taken, SN = Strongly Not-Taken

When the CPU encounters a branch:

Hash the PC to get a table index.
Read the counter: if >= 2 (states 10 or 11), predict taken; otherwise not-taken.
After the branch resolves, update the counter: increment on taken, decrement on not-taken (saturating at 0 and 3).

Problem: different branches can alias to the same table entry, polluting each other’s counters. This is handled with larger tables and better hashing (XOR with global history).

Two-Level Adaptive Prediction

Simple BHTs can’t capture patterns. Consider:

for (int i = 0; i < 4; i++) { ... }

The loop branch follows the pattern TTTN (taken, taken, taken, not-taken) repeating forever. A two-bit counter will always mispredict the “not-taken” at the exit. A two-level predictor can learn this pattern perfectly.

The Idea: Correlate with Recent History

A two-level predictor uses a Branch History Register (BHR) — a shift register that records the last K outcomes (taken/not-taken) of recent branches:

  Two-Level Adaptive Predictor

  Step 1: Record recent branch outcomes in a shift register

  Branch History Register (BHR): K bits wide
  +---+---+---+---+---+---+---+---+
  | T | T | N | T | T | T | N | T |   (T=taken, N=not-taken)
  +---+---+---+---+---+---+---+---+
    ^                           ^
    newest                    oldest

  Step 2: Use BHR to index into a Pattern History Table (PHT)

  BHR value (e.g., 8 bits) = index into PHT
                |
                v
  +-------+-------------------+
  | Index | 2-bit counter     |
  +-------+-------------------+
  | 0x00  |  ...              |
  | ...   |  ...              |
  | 0xAD  |  11 (predict T)   |  <-- BHR = 10101101
  | ...   |  ...              |
  | 0xFF  |  ...              |
  +-------+-------------------+

  Each unique history pattern maps to its own counter.
  The TTTN loop gets its own PHT entry that learns "after TTT, predict N."

This is the GAs (global history, all branches, single PHT) predictor described by Yeh and Patt in 1991. Variants include:

GAg: Global history, one shared PHT (simple, prone to aliasing)
GAp: Global history, per-address PHTs (less aliasing, more hardware)
PAg: Per-address history registers, one shared PHT
PAp: Per-address everything (most accurate, most expensive)

gshare: The Practical Compromise

Scott McFarling’s gshare (1993) is elegantly simple and still forms the basis of modern predictors. It XORs the global branch history with the branch PC to index a single table of two-bit counters:

  gshare predictor

  Global History Register (GHR):  10110101  (last 8 branch outcomes)
  Branch PC (lower bits):         01001010

  XOR them together:
    10110101
  ^ 01001010
  ----------
    11111111  = index into PHT

  +-------+-------------------+
  | Index | 2-bit counter     |
  +-------+-------------------+
  | ...   |                   |
  | 0xFF  |  10 (predict T)   |  <-- use this counter
  | ...   |                   |
  +-------+-------------------+

  Why XOR? It spreads entries across the table, reducing aliasing
  between branches that share the same PC bits but have different
  histories.

gshare captures both per-branch patterns (via PC bits) and inter-branch correlations (via global history). For example, if branch B is almost always taken after branch A is not-taken, gshare learns this because the GHR encodes A’s outcome when B is encountered.

The Branch Target Buffer (BTB)

Predicting taken vs. not-taken is only half the problem. For taken branches, the CPU also needs to know where to fetch from — the target address. The Branch Target Buffer is a cache that maps branch PCs to their targets:

  Branch Target Buffer (BTB)

  +------------------+------------------+------+
  |  Branch PC (tag) |  Target Address  | Type |
  +------------------+------------------+------+
  |  0x7FFA0104      |  0x7FFA0080      | Cond |
  |  0x7FFA0200      |  0x7FFA1000      | Call |
  |  0x7FFA0300      |  (varies)        | Ret  |  <-- indirect
  +------------------+------------------+------+

  Lookup happens in parallel with instruction fetch:
  - If hit and predicted taken: fetch from target address next cycle
  - If miss: assume not a branch, fetch sequentially

For indirect branches (function pointers, virtual calls, switch via jump tables), the target changes at runtime. Modern CPUs use an Indirect Target Array — essentially a BTB indexed by history — to predict these.

For return instructions, CPUs maintain a Return Address Stack (RAS) — a small hardware stack that pushes the return address on every call and pops on every ret. This is nearly perfect for well-behaved call/return pairs.

Tournament Predictors: Best of Both Worlds

Different predictors excel at different branch patterns. A tournament predictor (also called a hybrid predictor) runs multiple predictors in parallel and uses a chooser (another array of two-bit counters) to select which one to trust for each branch:

  Tournament Predictor (e.g., Alpha 21264)

  +-------------------+       +-------------------+
  |  Local Predictor  |       | Global Predictor  |
  | (per-branch       |       | (gshare-style,    |
  |  history + PHT)   |       |  inter-branch     |
  |                   |       |  correlations)    |
  +--------+----------+       +---------+---------+
           |                            |
           v                            v
      prediction L                 prediction G
           |                            |
           +----------+    +------------+
                      |    |
                      v    v
              +------------------+
              |    Chooser       |
              | (2-bit counters  |
              |  per branch)     |
              +--------+---------+
                       |
                       v
                Final prediction

  Chooser update rule:
  - If L correct and G wrong: decrement (favor L)
  - If G correct and L wrong: increment (favor G)
  - If both right or both wrong: no change

The Alpha 21264 processor (1998) famously used this design with a 1K-entry local predictor and a 4K-entry global predictor, achieving over 95% accuracy.

TAGE: The State of the Art

Modern high-performance CPUs (Intel since Haswell, AMD since Zen) use TAGE (TAgged GEometric history length) predictors, proposed by André Seznec in 2006.

The key insight: different branches need different amounts of history. A tight loop needs only 4 bits of history, while a branch that depends on a distant earlier branch might need 200+ bits. TAGE uses multiple tables with geometrically increasing history lengths:

  TAGE Predictor Structure

  History lengths: 0, 4, 8, 16, 32, 64, 128, 256 (geometric series)

  +----------+   +----------+   +----------+   +----------+   +----------+
  |  Base    |   | Table 1  |   | Table 2  |   | Table 3  |   | Table 4  |
  | Predictor|   | hist=4   |   | hist=8   |   | hist=16  |   | hist=32  |  ...
  | (bimodal)|   |          |   |          |   |          |   |          |
  +----+-----+   +----+-----+   +----+-----+   +----+-----+   +----+-----+
       |              |              |              |              |
       v              v              v              v              v
    pred_0         pred_1         pred_2         pred_3         pred_4
       |              |              |              |              |
       +-------+------+------+------+------+------+
               |
               v
        Use the prediction from the LONGEST matching
        history table (highest confidence)

  Each tagged table entry:
  +------+--------+----------+--------+
  | tag  | ctr    | useful   | (pad)  |
  | 10b  | 3b     | 2b       |        |
  +------+--------+----------+--------+

  tag:    partial tag to detect aliasing
  ctr:    prediction counter (like 2-bit, but 3-bit)
  useful: how valuable this entry is (for replacement)

TAGE achieves ~96% accuracy on SPEC benchmarks and over 99% on many real workloads. Its tagged design means aliasing is detected (tag mismatch = no prediction from that table), and the geometric history lengths mean it naturally adapts to each branch’s needs.

Real-World Impact: A Famous Example

The most cited branch prediction example comes from a Stack Overflow answer that went viral. Consider sorting an array and then branching on each element:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

int main() {
    const int SIZE = 32768;
    int data[SIZE];

    srand(time(NULL));
    for (int i = 0; i < SIZE; i++)
        data[i] = rand() % 256;

    // Try with sorted data (predictable branches)
    // vs unsorted data (unpredictable branches)
    // qsort(data, SIZE, sizeof(int), compare);

    long long sum = 0;
    for (int t = 0; t < 100000; t++) {
        for (int i = 0; i < SIZE; i++) {
            if (data[i] >= 128)  // <-- this branch
                sum += data[i];
        }
    }
    printf("sum = %lld\n", sum);
    return 0;
}

Results on a typical x86 machine:

  Sorted data:    ~5 seconds    (branch is predictable: NNNN...NTTT...T)
  Unsorted data:  ~15 seconds   (branch is random: TNTNNTTN...)

  3x slowdown entirely from branch mispredictions!

With sorted data, the branch follows a pattern: all “not taken” for values < 128, then all “taken” for values >= 128. The predictor learns this in a few iterations and achieves near-perfect accuracy.

With unsorted data, each branch outcome is essentially random (50/50). No predictor can do better than ~50% accuracy on truly random data. The result: roughly half the branches mispredict, causing pipeline flushes.

The branchless alternative avoids the problem entirely:

// Branchless: no branch to predict
int t = (data[i] - 128) >> 31;  // t = 0 if data[i] >= 128, -1 otherwise
sum += ~t & data[i];             // add data[i] if t == 0

Spectre: When Prediction Becomes a Vulnerability

Branch prediction’s speculative nature led to Spectre (2018), one of the most significant hardware vulnerabilities ever discovered. The attack exploits a subtle property: even though mispredicted instructions are “rolled back,” they leave traces in the CPU cache.

  Spectre Attack (variant 1, bounds check bypass):

  // Attacker trains the predictor: x is always in-bounds
  for (many iterations) { victim_function(valid_x); }

  // Then attacker calls with out-of-bounds x
  victim_function(malicious_x);

  void victim_function(size_t x) {
      if (x < array1_size) {             // <-- predictor says TAKEN
          // Speculatively executes with malicious_x:
          y = array2[array1[x] * 256];   // loads secret, caches array2 line
      }
  }

  // After rollback: array2 cache state reveals the secret byte
  // Attacker probes array2 with timing to extract array1[malicious_x]

The fix isn’t to disable prediction (that would destroy performance). Instead, mitigations include:

Retpolines: Replace indirect branches with a construct that “traps” the predictor into an infinite loop, preventing speculative execution down attacker-controlled paths.
IBRS/STIBP: Microcode updates that restrict indirect branch prediction across privilege boundaries.
Compiler barriers: lfence instructions that serialize execution at critical points.

Writing Prediction-Friendly Code

While compilers and hardware handle most cases, there are patterns that help:

Sort data before branching on it — creates long runs of same-direction branches.
Use branchless alternatives — cmov, bitwise tricks, ?: that compilers turn into conditional moves.

Use __builtin_expect — hints the compiler to lay out code so the common path is fall-through:

if (__builtin_expect(error_condition, 0)) {
    handle_error();  // compiler puts this out-of-line
}

Avoid data-dependent branches in hot loops — replace if (x > threshold) with arithmetic equivalents when possible.

Profile with perf stat — check actual misprediction rates:

perf stat -e branches,branch-misses ./my_program
# Output shows total branches and misprediction rate

Summary

  Evolution of Branch Prediction

  Era         Technique             Accuracy    Used In
  ---------   -------------------   --------    ------------------
  1980s       Static (BTFNT)        60-70%      MIPS, SPARC
  Early 90s   2-bit BHT             80-85%      i486, Pentium
  Mid 90s     Two-level / gshare    90-93%      Pentium Pro, Alpha
  Late 90s    Tournament            95-97%      Alpha 21264, P4
  2000s+      TAGE                  96-99%      Haswell+, Zen+
  2020s       Perceptron hybrids    97-99.5%    Apple M-series

Branch prediction is one of those invisible mechanisms that makes modern computing fast. A 15-stage pipeline running at 5 GHz can execute billions of instructions per second — but only if the predictor keeps the pipeline fed with correct-path instructions. The 50+ years of research in this area have produced remarkably effective solutions, turning what could be a crippling performance bottleneck into a nearly transparent optimization.

References

J. E. Smith, “A study of branch prediction strategies,” ISCA 1981 — the foundational paper on two-bit counters.
T.-Y. Yeh and Y. N. Patt, “Two-Level Adaptive Training Branch Prediction,” MICRO 1991 — introduced two-level adaptive prediction.
S. McFarling, “Combining Branch Predictors,” DEC WRL Technical Note TN-36, 1993 — introduced gshare and hybrid/tournament predictors.
A. Seznec, “A 256 Kbits L-TAGE branch predictor,” JILP 2006 — the TAGE predictor used in modern CPUs.
P. Kocher et al., “Spectre Attacks: Exploiting Speculative Execution,” IEEE S&P 2019 — the Spectre vulnerability paper.
Intel 64 and IA-32 Architectures Optimization Reference Manual, Chapter 3 — Intel’s official guidance on branch prediction behavior.
Agner Fog, “The microarchitecture of Intel, AMD, and VIA CPUs” — detailed reverse-engineering of branch predictor implementations.