Skip to content
JZLeetCode
Go back

System Design - How eBPF Works

Table of contents

Open Table of contents

Context

Imagine you want to add a feature to the Linux kernel — say, counting how many packets arrive on a network interface, or tracing every time a specific system call is invoked. Traditionally you had two choices:

  1. Modify the kernel source — recompile, reboot, wait. Any bug you introduce can crash the entire machine.
  2. Write a loadable kernel module (LKM) — faster iteration than a full rebuild, but still runs with full kernel privileges. A single off-by-one error can panic the system.

Both approaches are slow, dangerous, and require deep kernel expertise. What if we could run safe, sandboxed programs inside the kernel at near-native speed, without rebooting or risking stability?

That is exactly what eBPF (extended Berkeley Packet Filter) provides.

Brief History

Architecture Overview

Here is the end-to-end flow from user-space source code to execution inside the kernel:

                         eBPF Architecture

  User Space                              Kernel Space
  ----------                              ------------

  +----------------+
  | BPF C program  |   (restricted C)
  | (hello.bpf.c)  |
  +-------+--------+
          |
          | clang -target bpf -O2
          v
  +----------------+
  | BPF bytecode   |   (ELF object file)
  | (hello.bpf.o)  |
  +-------+--------+
          |
          | bpf() syscall (BPF_PROG_LOAD)
          v
  +-------+--------+        +------------------+
  |    Loader      | -----> |    Verifier      |
  | (libbpf/cilium)|        | (safety checks)  |
  +----------------+        +--------+---------+
                                     |
                              pass?  | yes
                                     v
                            +--------+---------+
                            |   JIT Compiler   |
                            | (bytecode->x86)  |
                            +--------+---------+
                                     |
                                     v
                            +--------+---------+
                            |  Attach to Hook  |
                            +------------------+
                                     |
            +------------+-----------+----------+----------+
            |            |           |          |          |
            v            v           v          v          v
        +-------+   +--------+  +------+  +--------+  +-------+
        |kprobe |   |trace-  |  | XDP  |  |  TC    |  |cgroup |
        |       |   |point   |  |      |  |(traffic|  |       |
        +-------+   +--------+  +------+  | ctrl)  |  +-------+
                                           +--------+

  Hook points: where BPF programs execute in kernel context

Key components:

The BPF Instruction Set

eBPF defines a RISC-like instruction set with 64-bit registers:

  Registers
  ---------
  R0        return value from helpers / program exit code
  R1-R5     function arguments (caller-saved)
  R6-R9     callee-saved registers
  R10       read-only frame pointer (stack base)

  Stack: 512 bytes (fixed, per program invocation)

Each instruction is 64 bits wide (8 bytes):

  Bit 63        Bit 32  Bit 31    Bit 16  Bit 15  Bit 12  Bit 11  Bit 8   Bit 7     Bit 0
  +-------------+-------+---------+-------+-------+-------+-------+-------+-----------+
  |  immediate  |       | offset  |       |  src  |       |  dst  |       |  opcode   |
  |  (32 bits)  |       |(16 bits)|       |(4 bit)|       |(4 bit)|       | (8 bits)  |
  +-------------+-------+---------+-------+-------+-------+-------+-------+-----------+

  Layout (struct bpf_insn):
    __u8  code;        // opcode
    __u8  dst_reg:4;   // destination register
    __u8  src_reg:4;   // source register
    __s16 off;         // signed offset
    __s32 imm;         // signed immediate

A Simple Bytecode Example

Consider this tiny BPF program that returns the value 42:

int return_42() {
    return 42;
}

The compiled bytecode (two instructions):

  Instruction 0:  mov64 R0, 42       // BPF_ALU64 | BPF_MOV | BPF_K
                                      // opcode=0xb7, dst=R0, imm=42
  Instruction 1:  exit                // BPF_JMP | BPF_EXIT
                                      // opcode=0x95

The instruction set is defined in include/uapi/linux/bpf.h.

The Verifier

The verifier is the gatekeeper that ensures no BPF program can crash or compromise the kernel. It runs before the program executes — this is purely static analysis at load time.

                     Verifier Flow

  +------------------+
  | BPF bytecode in  |
  +--------+---------+
           |
           v
  +--------+---------+
  | 1. CFG analysis  |   Build control-flow graph
  |    (DAG check)   |   Reject if back-edges found (no loops*)
  +--------+---------+
           |
           v
  +--------+---------+
  | 2. Walk every    |   Explore all paths through the program
  |    path          |   Track register state at each instruction
  +--------+---------+
           |
           v
  +--------+---------+
  | 3. Type/bounds   |   - Is R1 a valid pointer or scalar?
  |    checking      |   - Is memory access within bounds?
  |                  |   - Are map lookups NULL-checked?
  +--------+---------+
           |
           v
  +--------+---------+
  | 4. Stack depth   |   Max 512 bytes, no overflow
  |    check         |
  +--------+---------+
           |
           v
  +--------+---------+     +--------+
  | 5. Complexity    | --> | REJECT |  if instruction count > 1M
  |    limit         |     +--------+  or states exceed limit
  +--------+---------+
           |
           | all checks pass
           v
  +--------+---------+
  |   ACCEPT         |
  +------------------+

  * Since Linux 5.3, bounded loops are allowed if the verifier
    can prove termination (e.g., for-loops with known bounds).

Key safety properties enforced:

The verifier tracks a state for each register (type, min/max value, alignment) as it simulates execution along every possible path. If any path leads to an unsafe state, the program is rejected.

Source: kernel/bpf/verifier.c — this is one of the most complex files in the kernel (~20,000+ lines).

JIT Compilation

After the verifier approves a program, the kernel can JIT compile (Just-In-Time) the BPF bytecode into native machine instructions. This eliminates the overhead of interpreting bytecode at runtime.

  BPF bytecode                    x86-64 native code
  -------------                   -------------------
  mov64 R0, 42                    mov rax, 42
  exit                            ret

  (simplified; actual JIT handles calling conventions,
   prologue/epilogue, and register mapping)

Performance impact:

ModeOverhead vs native
Interpreter~1.5-2x slower
JIT compiled~1.0-1.1x (near native)

The JIT is enabled by default on modern kernels (net.core.bpf_jit_enable = 1). Each architecture has its own JIT backend:

The JIT maps BPF registers to hardware registers. On x86-64 for example:

  BPF Register    x86-64 Register
  ------------    ---------------
  R0              rax
  R1              rdi
  R2              rsi
  R3              rdx
  R4              rcx
  R5              r8
  R6              rbx
  R7              r13
  R8              r14
  R9              r15
  R10 (fp)        rbp

BPF Maps

BPF programs execute in kernel context and are event-driven — they run, do their work, and return. But what if you need to accumulate data across invocations (e.g., counting packets) or share data between the BPF program and user space?

BPF Maps solve this. They are key-value data structures that live in kernel memory and are accessible from both BPF programs (in kernel) and user-space applications (via the bpf() syscall).

  User Space                         Kernel Space
  ----------                         ------------

  +------------------+               +------------------+
  | User application |               | BPF program      |
  | (Python/Go/C)   |               | (runs at hook)   |
  +--------+---------+               +--------+---------+
           |                                  |
           | bpf(BPF_MAP_LOOKUP_ELEM)         | bpf_map_lookup_elem()
           | bpf(BPF_MAP_UPDATE_ELEM)         | bpf_map_update_elem()
           |                                  |
           v                                  v
           +----------------------------------+
           |           BPF Map               |
           |   (lives in kernel memory)      |
           |                                 |
           |   key (bytes) --> value (bytes)  |
           +---------------------------------+

Common Map Types

Map TypeDescription
BPF_MAP_TYPE_HASHGeneral-purpose hash table
BPF_MAP_TYPE_ARRAYFixed-size array, O(1) lookup by index
BPF_MAP_TYPE_RINGBUFEfficient single-producer ring buffer
BPF_MAP_TYPE_LRU_HASHHash table with LRU eviction
BPF_MAP_TYPE_PERCPU_HASHPer-CPU hash (no locking needed)
BPF_MAP_TYPE_PERCPU_ARRAYPer-CPU array
BPF_MAP_TYPE_PERF_EVENT_ARRAYFor streaming events to user space

Maps are created with specified key size, value size, and max entries. The kernel manages memory allocation and concurrency (using RCU or per-CPU copies depending on map type).

Source: kernel/bpf/hashtab.c, kernel/bpf/arraymap.c, kernel/bpf/ringbuf.c.

Helper Functions

BPF programs run in a restricted environment — they cannot call arbitrary kernel functions. Instead, they call a fixed set of helper functions exposed by the kernel. These are the BPF program’s API to interact with the outside world.

Each helper has a well-defined prototype and is called using a stable function ID:

// From include/uapi/linux/bpf.h (simplified)
enum bpf_func_id {
    BPF_FUNC_map_lookup_elem     = 1,
    BPF_FUNC_map_update_elem     = 2,
    BPF_FUNC_map_delete_elem     = 3,
    BPF_FUNC_probe_read          = 4,
    BPF_FUNC_ktime_get_ns        = 5,
    BPF_FUNC_get_current_pid_tgid = 14,
    BPF_FUNC_get_current_comm    = 16,
    BPF_FUNC_perf_event_output   = 25,
    BPF_FUNC_ringbuf_output      = 130,
    // ... hundreds more
};

Key Helpers

HelperPurpose
bpf_map_lookup_elem(map, key)Look up a value in a map by key
bpf_map_update_elem(map, key, val, flags)Insert or update a map entry
bpf_probe_read(dst, size, src)Safely read kernel memory into BPF stack
bpf_probe_read_user(dst, size, src)Safely read user-space memory
bpf_get_current_pid_tgid()Get current process PID and TGID
bpf_get_current_comm(buf, size)Get current process name (comm)
bpf_ktime_get_ns()Get monotonic clock in nanoseconds
bpf_perf_event_output(ctx, map, flags, data, size)Send event to user space
bpf_ringbuf_output(ringbuf, data, size, flags)Write to ring buffer
bpf_trace_printk(fmt, ...)Debug print to /sys/kernel/debug/tracing/trace_pipe

The verifier checks that each helper call matches the expected argument types (e.g., argument 1 must be a map pointer, argument 2 must point to a memory region of at least key_size bytes).

Source: kernel/bpf/helpers.c, net/core/filter.c (networking helpers).

Program Types and Attach Points

Not all BPF programs are the same. The program type determines what context the program receives, which helpers it can call, and where it can attach.

  +---------------------+-------------------------------------------+
  | Program Type        | Attach Point / Use Case                   |
  +---------------------+-------------------------------------------+
  | BPF_PROG_TYPE_XDP   | Network device ingress (earliest hook).   |
  |                     | Decisions: pass, drop, redirect, tx.      |
  +---------------------+-------------------------------------------+
  | BPF_PROG_TYPE_      | Classifier hook in Traffic Control.       |
  | SCHED_CLS (TC)      | Runs after XDP; can modify packets.       |
  +---------------------+-------------------------------------------+
  | BPF_PROG_TYPE_      | Attach to kprobes (any kernel function    |
  | KPROBE              | entry/exit). Used for tracing.            |
  +---------------------+-------------------------------------------+
  | BPF_PROG_TYPE_      | Attach to static tracepoints defined in   |
  | TRACEPOINT          | the kernel (e.g., syscall enter/exit).    |
  +---------------------+-------------------------------------------+
  | BPF_PROG_TYPE_      | Attach to socket operations. Filter or    |
  | SOCKET_FILTER       | observe packets on a socket.              |
  +---------------------+-------------------------------------------+
  | BPF_PROG_TYPE_      | Attach to cgroup hooks. Control resource   |
  | CGROUP_SKB          | access, network policy per container.     |
  +---------------------+-------------------------------------------+
  | BPF_PROG_TYPE_      | Attach to perf events (CPU cycles, cache  |
  | PERF_EVENT          | misses, etc.) for profiling.              |
  +---------------------+-------------------------------------------+
  | BPF_PROG_TYPE_      | Attach to Linux Security Module hooks.    |
  | LSM                 | Implement custom security policies.       |
  +---------------------+-------------------------------------------+

Each program type receives a context struct as its first argument (R1). For example:

A Complete Example: Tracing Syscalls

Let’s walk through a complete BPF program that counts system calls per process (by PID) using a hash map.

Step 1: The BPF Program (kernel side)

// syscall_count.bpf.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Define a hash map: key = PID (u32), value = count (u64)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u32);
    __type(value, __u64);
} syscall_counts SEC(".maps");

// This program attaches to the raw_syscalls:sys_enter tracepoint.
// It fires on EVERY system call made by any process.
SEC("tracepoint/raw_syscalls/sys_enter")
int count_syscalls(void *ctx)
{
    // Get the current process's PID (lower 32 bits of pid_tgid)
    __u32 pid = bpf_get_current_pid_tgid() >> 32;

    // Look up existing count for this PID
    __u64 *count = bpf_map_lookup_elem(&syscall_counts, &pid);

    if (count) {
        // PID already seen: increment
        __sync_fetch_and_add(count, 1);
    } else {
        // First syscall from this PID: initialize to 1
        __u64 init_val = 1;
        bpf_map_update_elem(&syscall_counts, &pid, &init_val, BPF_ANY);
    }

    return 0;
}

char LICENSE[] SEC("license") = "GPL";

Key points:

Step 2: Compile to BPF bytecode

clang -target bpf -O2 -g -c syscall_count.bpf.c -o syscall_count.bpf.o

This produces an ELF object file containing BPF bytecode in the appropriate sections.

Step 3: User-space loader (reads results)

// loader.c (simplified, using libbpf)
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    struct bpf_object *obj;
    int prog_fd, map_fd;

    // Open and load the BPF object file
    obj = bpf_object__open_file("syscall_count.bpf.o", NULL);
    bpf_object__load(obj);

    // Find and attach the program
    struct bpf_program *prog = bpf_object__find_program_by_name(obj, "count_syscalls");
    struct bpf_link *link = bpf_program__attach(prog);

    // Get the map file descriptor
    map_fd = bpf_object__find_map_fd_by_name(obj, "syscall_counts");

    // Every 2 seconds, dump the top syscall counts
    while (1) {
        sleep(2);
        __u32 key, next_key;
        __u64 value;

        printf("\n--- Syscall counts by PID ---\n");
        key = 0;
        while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) {
            bpf_map_lookup_elem(map_fd, &next_key, &value);
            printf("  PID %u: %llu syscalls\n", next_key, value);
            key = next_key;
        }
    }

    bpf_link__destroy(link);
    bpf_object__close(obj);
    return 0;
}

Step 4: What happens at runtime

  +-------------------+                    +-------------------+
  | Any process       |                    | loader.c          |
  | (e.g., ls, cat)   |                    | (user space)      |
  +--------+----------+                    +--------+----------+
           |                                        |
           | syscall (open, read, write...)         |
           v                                        |
  +--------+------------------------------------------+---------+
  |                    KERNEL                                    |
  |                                                             |
  |  tracepoint: raw_syscalls/sys_enter fires                   |
  |       |                                                     |
  |       v                                                     |
  |  +----+------------------+          +-----------+           |
  |  | count_syscalls (BPF)  | -------> | hash map  | <---------+
  |  | get PID, increment    |  update  | pid->count|   lookup   |
  |  +-----------------------+          +-----------+           |
  +-------------------------------------------------------------+

Real-World Use Cases

eBPF has become foundational infrastructure in modern Linux deployments:

Networking: Cilium

Cilium replaces traditional iptables-based networking in Kubernetes with eBPF programs attached at XDP and TC hooks. Benefits:

Load Balancing: Katran

Katran (Meta) is an XDP-based L4 load balancer that handles millions of packets per second per core. By processing packets at the XDP hook (before the network stack allocates sk_buff), it achieves exceptional throughput with minimal CPU overhead.

Observability: bcc and bpftrace

bcc provides Python/Lua frontends for writing BPF tracing tools. bpftrace is a higher-level tracing language (like DTrace for Linux):

# Count syscalls by process name (one-liner)
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Histogram of read() latencies
bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
             kretprobe:vfs_read /@start[tid]/ {
               @us = hist((nsecs - @start[tid]) / 1000);
               delete(@start[tid]);
             }'

Security: Falco and Tetragon

Falco (CNCF) uses eBPF to monitor system calls in real time and detect anomalous behavior (e.g., a container spawning an unexpected shell, sensitive file access). Tetragon (Cilium) provides kernel-level security observability and enforcement using LSM and kprobe BPF programs.

Profiling: Continuous profiling

Tools like Pyroscope and Parca use BPF_PROG_TYPE_PERF_EVENT programs to collect stack traces with minimal overhead — enabling always-on CPU profiling in production.

Summary

  +-------+    compile     +---------+   load    +----------+
  | C src | ------------> | bytecode | -------> | verifier |
  +-------+    clang/LLVM +---------+   bpf()  +----+-----+
                                                     |
                                                pass | reject
                                                     v
                              +----------+     +-----+-----+
                              |  native  | <-- |    JIT    |
                              |   code   |     +-----------+
                              +----+-----+
                                   |
                                   v
                              attach to hook
                              (kprobe, XDP, tracepoint, ...)
                                   |
                                   v
                              +----+-----+
                              | BPF maps | <--- user space reads/writes
                              +----------+

eBPF gives you a safe, fast, and flexible way to extend kernel behavior without modifying kernel source or loading risky modules. The verifier ensures safety, the JIT ensures performance, and maps provide the communication channel between kernel and user space. This combination has made eBPF one of the most important Linux innovations of the past decade.

References

Share this post on:

Previous Post
LeetCode 647 Palindromic Substrings
Next Post
LeetCode 199 Binary Tree Right Side View