Table of contents
Open Table of contents
Context
Imagine you want to add a feature to the Linux kernel — say, counting how many packets arrive on a network interface, or tracing every time a specific system call is invoked. Traditionally you had two choices:
- Modify the kernel source — recompile, reboot, wait. Any bug you introduce can crash the entire machine.
- Write a loadable kernel module (LKM) — faster iteration than a full rebuild, but still runs with full kernel privileges. A single off-by-one error can panic the system.
Both approaches are slow, dangerous, and require deep kernel expertise. What if we could run safe, sandboxed programs inside the kernel at near-native speed, without rebooting or risking stability?
That is exactly what eBPF (extended Berkeley Packet Filter) provides.
Brief History
- 1992 — Steven McCanne and Van Jacobson created BPF (Berkeley Packet Filter) for efficient packet capture in BSD. It introduced a small virtual machine inside the kernel with a minimal instruction set. Linux adopted it for
tcpdumpand socket filters. - 2014 — Alexei Starovoitov rewrote BPF from scratch in the Linux kernel (merged in Linux 3.15-3.18). The new version — eBPF — expanded the register set to 64-bit, added maps for persistent state, introduced a verifier for safety, and allowed attachment to many more kernel hook points beyond networking.
- Today — eBPF is used for observability (bpftrace, Pixie), networking (Cilium, Katran), and security (Falco, Tetragon) in production at scale.
Architecture Overview
Here is the end-to-end flow from user-space source code to execution inside the kernel:
eBPF Architecture
User Space Kernel Space
---------- ------------
+----------------+
| BPF C program | (restricted C)
| (hello.bpf.c) |
+-------+--------+
|
| clang -target bpf -O2
v
+----------------+
| BPF bytecode | (ELF object file)
| (hello.bpf.o) |
+-------+--------+
|
| bpf() syscall (BPF_PROG_LOAD)
v
+-------+--------+ +------------------+
| Loader | -----> | Verifier |
| (libbpf/cilium)| | (safety checks) |
+----------------+ +--------+---------+
|
pass? | yes
v
+--------+---------+
| JIT Compiler |
| (bytecode->x86) |
+--------+---------+
|
v
+--------+---------+
| Attach to Hook |
+------------------+
|
+------------+-----------+----------+----------+
| | | | |
v v v v v
+-------+ +--------+ +------+ +--------+ +-------+
|kprobe | |trace- | | XDP | | TC | |cgroup |
| | |point | | | |(traffic| | |
+-------+ +--------+ +------+ | ctrl) | +-------+
+--------+
Hook points: where BPF programs execute in kernel context
Key components:
- Compiler (clang/LLVM) — compiles restricted C to BPF bytecode targeting the BPF instruction set.
- Loader — uses the
bpf()system call to submit bytecode to the kernel. - Verifier — statically analyzes the program to guarantee safety (no crashes, no infinite loops, bounded memory access).
- JIT compiler — translates verified bytecode to native machine instructions (x86, ARM, etc.) for near-native execution speed.
- Hook points — locations in the kernel where BPF programs can attach and execute.
The BPF Instruction Set
eBPF defines a RISC-like instruction set with 64-bit registers:
Registers
---------
R0 return value from helpers / program exit code
R1-R5 function arguments (caller-saved)
R6-R9 callee-saved registers
R10 read-only frame pointer (stack base)
Stack: 512 bytes (fixed, per program invocation)
Each instruction is 64 bits wide (8 bytes):
Bit 63 Bit 32 Bit 31 Bit 16 Bit 15 Bit 12 Bit 11 Bit 8 Bit 7 Bit 0
+-------------+-------+---------+-------+-------+-------+-------+-------+-----------+
| immediate | | offset | | src | | dst | | opcode |
| (32 bits) | |(16 bits)| |(4 bit)| |(4 bit)| | (8 bits) |
+-------------+-------+---------+-------+-------+-------+-------+-------+-----------+
Layout (struct bpf_insn):
__u8 code; // opcode
__u8 dst_reg:4; // destination register
__u8 src_reg:4; // source register
__s16 off; // signed offset
__s32 imm; // signed immediate
A Simple Bytecode Example
Consider this tiny BPF program that returns the value 42:
int return_42() {
return 42;
}
The compiled bytecode (two instructions):
Instruction 0: mov64 R0, 42 // BPF_ALU64 | BPF_MOV | BPF_K
// opcode=0xb7, dst=R0, imm=42
Instruction 1: exit // BPF_JMP | BPF_EXIT
// opcode=0x95
The instruction set is defined in include/uapi/linux/bpf.h.
The Verifier
The verifier is the gatekeeper that ensures no BPF program can crash or compromise the kernel. It runs before the program executes — this is purely static analysis at load time.
Verifier Flow
+------------------+
| BPF bytecode in |
+--------+---------+
|
v
+--------+---------+
| 1. CFG analysis | Build control-flow graph
| (DAG check) | Reject if back-edges found (no loops*)
+--------+---------+
|
v
+--------+---------+
| 2. Walk every | Explore all paths through the program
| path | Track register state at each instruction
+--------+---------+
|
v
+--------+---------+
| 3. Type/bounds | - Is R1 a valid pointer or scalar?
| checking | - Is memory access within bounds?
| | - Are map lookups NULL-checked?
+--------+---------+
|
v
+--------+---------+
| 4. Stack depth | Max 512 bytes, no overflow
| check |
+--------+---------+
|
v
+--------+---------+ +--------+
| 5. Complexity | --> | REJECT | if instruction count > 1M
| limit | +--------+ or states exceed limit
+--------+---------+
|
| all checks pass
v
+--------+---------+
| ACCEPT |
+------------------+
* Since Linux 5.3, bounded loops are allowed if the verifier
can prove termination (e.g., for-loops with known bounds).
Key safety properties enforced:
- No unbounded loops — guarantees termination.
- No out-of-bounds memory access — every pointer dereference is bounds-checked.
- No reading uninitialized memory — registers and stack must be written before read.
- Pointer arithmetic restrictions — you cannot cast arbitrary integers to pointers.
- NULL checks after map lookups —
bpf_map_lookup_elem()can return NULL; you must check.
The verifier tracks a state for each register (type, min/max value, alignment) as it simulates execution along every possible path. If any path leads to an unsafe state, the program is rejected.
Source: kernel/bpf/verifier.c — this is one of the most complex files in the kernel (~20,000+ lines).
JIT Compilation
After the verifier approves a program, the kernel can JIT compile (Just-In-Time) the BPF bytecode into native machine instructions. This eliminates the overhead of interpreting bytecode at runtime.
BPF bytecode x86-64 native code
------------- -------------------
mov64 R0, 42 mov rax, 42
exit ret
(simplified; actual JIT handles calling conventions,
prologue/epilogue, and register mapping)
Performance impact:
| Mode | Overhead vs native |
|---|---|
| Interpreter | ~1.5-2x slower |
| JIT compiled | ~1.0-1.1x (near native) |
The JIT is enabled by default on modern kernels (net.core.bpf_jit_enable = 1). Each architecture has its own JIT backend:
- x86-64:
arch/x86/net/bpf_jit_comp.c - ARM64:
arch/arm64/net/bpf_jit_comp.c
The JIT maps BPF registers to hardware registers. On x86-64 for example:
BPF Register x86-64 Register
------------ ---------------
R0 rax
R1 rdi
R2 rsi
R3 rdx
R4 rcx
R5 r8
R6 rbx
R7 r13
R8 r14
R9 r15
R10 (fp) rbp
BPF Maps
BPF programs execute in kernel context and are event-driven — they run, do their work, and return. But what if you need to accumulate data across invocations (e.g., counting packets) or share data between the BPF program and user space?
BPF Maps solve this. They are key-value data structures that live in kernel memory and are accessible from both BPF programs (in kernel) and user-space applications (via the bpf() syscall).
User Space Kernel Space
---------- ------------
+------------------+ +------------------+
| User application | | BPF program |
| (Python/Go/C) | | (runs at hook) |
+--------+---------+ +--------+---------+
| |
| bpf(BPF_MAP_LOOKUP_ELEM) | bpf_map_lookup_elem()
| bpf(BPF_MAP_UPDATE_ELEM) | bpf_map_update_elem()
| |
v v
+----------------------------------+
| BPF Map |
| (lives in kernel memory) |
| |
| key (bytes) --> value (bytes) |
+---------------------------------+
Common Map Types
| Map Type | Description |
|---|---|
BPF_MAP_TYPE_HASH | General-purpose hash table |
BPF_MAP_TYPE_ARRAY | Fixed-size array, O(1) lookup by index |
BPF_MAP_TYPE_RINGBUF | Efficient single-producer ring buffer |
BPF_MAP_TYPE_LRU_HASH | Hash table with LRU eviction |
BPF_MAP_TYPE_PERCPU_HASH | Per-CPU hash (no locking needed) |
BPF_MAP_TYPE_PERCPU_ARRAY | Per-CPU array |
BPF_MAP_TYPE_PERF_EVENT_ARRAY | For streaming events to user space |
Maps are created with specified key size, value size, and max entries. The kernel manages memory allocation and concurrency (using RCU or per-CPU copies depending on map type).
Source: kernel/bpf/hashtab.c, kernel/bpf/arraymap.c, kernel/bpf/ringbuf.c.
Helper Functions
BPF programs run in a restricted environment — they cannot call arbitrary kernel functions. Instead, they call a fixed set of helper functions exposed by the kernel. These are the BPF program’s API to interact with the outside world.
Each helper has a well-defined prototype and is called using a stable function ID:
// From include/uapi/linux/bpf.h (simplified)
enum bpf_func_id {
BPF_FUNC_map_lookup_elem = 1,
BPF_FUNC_map_update_elem = 2,
BPF_FUNC_map_delete_elem = 3,
BPF_FUNC_probe_read = 4,
BPF_FUNC_ktime_get_ns = 5,
BPF_FUNC_get_current_pid_tgid = 14,
BPF_FUNC_get_current_comm = 16,
BPF_FUNC_perf_event_output = 25,
BPF_FUNC_ringbuf_output = 130,
// ... hundreds more
};
Key Helpers
| Helper | Purpose |
|---|---|
bpf_map_lookup_elem(map, key) | Look up a value in a map by key |
bpf_map_update_elem(map, key, val, flags) | Insert or update a map entry |
bpf_probe_read(dst, size, src) | Safely read kernel memory into BPF stack |
bpf_probe_read_user(dst, size, src) | Safely read user-space memory |
bpf_get_current_pid_tgid() | Get current process PID and TGID |
bpf_get_current_comm(buf, size) | Get current process name (comm) |
bpf_ktime_get_ns() | Get monotonic clock in nanoseconds |
bpf_perf_event_output(ctx, map, flags, data, size) | Send event to user space |
bpf_ringbuf_output(ringbuf, data, size, flags) | Write to ring buffer |
bpf_trace_printk(fmt, ...) | Debug print to /sys/kernel/debug/tracing/trace_pipe |
The verifier checks that each helper call matches the expected argument types (e.g., argument 1 must be a map pointer, argument 2 must point to a memory region of at least key_size bytes).
Source: kernel/bpf/helpers.c, net/core/filter.c (networking helpers).
Program Types and Attach Points
Not all BPF programs are the same. The program type determines what context the program receives, which helpers it can call, and where it can attach.
+---------------------+-------------------------------------------+
| Program Type | Attach Point / Use Case |
+---------------------+-------------------------------------------+
| BPF_PROG_TYPE_XDP | Network device ingress (earliest hook). |
| | Decisions: pass, drop, redirect, tx. |
+---------------------+-------------------------------------------+
| BPF_PROG_TYPE_ | Classifier hook in Traffic Control. |
| SCHED_CLS (TC) | Runs after XDP; can modify packets. |
+---------------------+-------------------------------------------+
| BPF_PROG_TYPE_ | Attach to kprobes (any kernel function |
| KPROBE | entry/exit). Used for tracing. |
+---------------------+-------------------------------------------+
| BPF_PROG_TYPE_ | Attach to static tracepoints defined in |
| TRACEPOINT | the kernel (e.g., syscall enter/exit). |
+---------------------+-------------------------------------------+
| BPF_PROG_TYPE_ | Attach to socket operations. Filter or |
| SOCKET_FILTER | observe packets on a socket. |
+---------------------+-------------------------------------------+
| BPF_PROG_TYPE_ | Attach to cgroup hooks. Control resource |
| CGROUP_SKB | access, network policy per container. |
+---------------------+-------------------------------------------+
| BPF_PROG_TYPE_ | Attach to perf events (CPU cycles, cache |
| PERF_EVENT | misses, etc.) for profiling. |
+---------------------+-------------------------------------------+
| BPF_PROG_TYPE_ | Attach to Linux Security Module hooks. |
| LSM | Implement custom security policies. |
+---------------------+-------------------------------------------+
Each program type receives a context struct as its first argument (R1). For example:
- XDP programs get
struct xdp_md *ctx(packet data pointers). - Kprobe programs get
struct pt_regs *ctx(CPU register state). - Tracepoint programs get the tracepoint-specific struct.
A Complete Example: Tracing Syscalls
Let’s walk through a complete BPF program that counts system calls per process (by PID) using a hash map.
Step 1: The BPF Program (kernel side)
// syscall_count.bpf.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
// Define a hash map: key = PID (u32), value = count (u64)
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 10240);
__type(key, __u32);
__type(value, __u64);
} syscall_counts SEC(".maps");
// This program attaches to the raw_syscalls:sys_enter tracepoint.
// It fires on EVERY system call made by any process.
SEC("tracepoint/raw_syscalls/sys_enter")
int count_syscalls(void *ctx)
{
// Get the current process's PID (lower 32 bits of pid_tgid)
__u32 pid = bpf_get_current_pid_tgid() >> 32;
// Look up existing count for this PID
__u64 *count = bpf_map_lookup_elem(&syscall_counts, &pid);
if (count) {
// PID already seen: increment
__sync_fetch_and_add(count, 1);
} else {
// First syscall from this PID: initialize to 1
__u64 init_val = 1;
bpf_map_update_elem(&syscall_counts, &pid, &init_val, BPF_ANY);
}
return 0;
}
char LICENSE[] SEC("license") = "GPL";
Key points:
SEC(".maps")tells the loader this is a map definition.SEC("tracepoint/raw_syscalls/sys_enter")specifies the attach point.bpf_get_current_pid_tgid()returns a 64-bit value: upper 32 bits = TGID (what user space calls PID), lower 32 bits = kernel TID.- We must NULL-check the result of
bpf_map_lookup_elem— the verifier enforces this.
Step 2: Compile to BPF bytecode
clang -target bpf -O2 -g -c syscall_count.bpf.c -o syscall_count.bpf.o
This produces an ELF object file containing BPF bytecode in the appropriate sections.
Step 3: User-space loader (reads results)
// loader.c (simplified, using libbpf)
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
#include <stdio.h>
#include <unistd.h>
int main() {
struct bpf_object *obj;
int prog_fd, map_fd;
// Open and load the BPF object file
obj = bpf_object__open_file("syscall_count.bpf.o", NULL);
bpf_object__load(obj);
// Find and attach the program
struct bpf_program *prog = bpf_object__find_program_by_name(obj, "count_syscalls");
struct bpf_link *link = bpf_program__attach(prog);
// Get the map file descriptor
map_fd = bpf_object__find_map_fd_by_name(obj, "syscall_counts");
// Every 2 seconds, dump the top syscall counts
while (1) {
sleep(2);
__u32 key, next_key;
__u64 value;
printf("\n--- Syscall counts by PID ---\n");
key = 0;
while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) {
bpf_map_lookup_elem(map_fd, &next_key, &value);
printf(" PID %u: %llu syscalls\n", next_key, value);
key = next_key;
}
}
bpf_link__destroy(link);
bpf_object__close(obj);
return 0;
}
Step 4: What happens at runtime
+-------------------+ +-------------------+
| Any process | | loader.c |
| (e.g., ls, cat) | | (user space) |
+--------+----------+ +--------+----------+
| |
| syscall (open, read, write...) |
v |
+--------+------------------------------------------+---------+
| KERNEL |
| |
| tracepoint: raw_syscalls/sys_enter fires |
| | |
| v |
| +----+------------------+ +-----------+ |
| | count_syscalls (BPF) | -------> | hash map | <---------+
| | get PID, increment | update | pid->count| lookup |
| +-----------------------+ +-----------+ |
+-------------------------------------------------------------+
Real-World Use Cases
eBPF has become foundational infrastructure in modern Linux deployments:
Networking: Cilium
Cilium replaces traditional iptables-based networking in Kubernetes with eBPF programs attached at XDP and TC hooks. Benefits:
- O(1) packet processing (BPF hash maps) vs. O(n) iptables rule chains.
- Load balancing (service mesh) without sidecar proxies.
- Network policy enforcement at the kernel level.
Load Balancing: Katran
Katran (Meta) is an XDP-based L4 load balancer that handles millions of packets per second per core. By processing packets at the XDP hook (before the network stack allocates sk_buff), it achieves exceptional throughput with minimal CPU overhead.
Observability: bcc and bpftrace
bcc provides Python/Lua frontends for writing BPF tracing tools. bpftrace is a higher-level tracing language (like DTrace for Linux):
# Count syscalls by process name (one-liner)
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Histogram of read() latencies
bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
@us = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
Security: Falco and Tetragon
Falco (CNCF) uses eBPF to monitor system calls in real time and detect anomalous behavior (e.g., a container spawning an unexpected shell, sensitive file access). Tetragon (Cilium) provides kernel-level security observability and enforcement using LSM and kprobe BPF programs.
Profiling: Continuous profiling
Tools like Pyroscope and Parca use BPF_PROG_TYPE_PERF_EVENT programs to collect stack traces with minimal overhead — enabling always-on CPU profiling in production.
Summary
+-------+ compile +---------+ load +----------+
| C src | ------------> | bytecode | -------> | verifier |
+-------+ clang/LLVM +---------+ bpf() +----+-----+
|
pass | reject
v
+----------+ +-----+-----+
| native | <-- | JIT |
| code | +-----------+
+----+-----+
|
v
attach to hook
(kprobe, XDP, tracepoint, ...)
|
v
+----+-----+
| BPF maps | <--- user space reads/writes
+----------+
eBPF gives you a safe, fast, and flexible way to extend kernel behavior without modifying kernel source or loading risky modules. The verifier ensures safety, the JIT ensures performance, and maps provide the communication channel between kernel and user space. This combination has made eBPF one of the most important Linux innovations of the past decade.
References
- eBPF official documentation — comprehensive introduction and reference.
- BPF verifier source — the full verifier implementation.
- BPF helpers source — core helper functions.
- BPF hash map source — hash map implementation.
- x86 JIT compiler — x86-64 JIT backend.
- BPF instruction set — instruction definitions and program types.
- Cilium documentation — eBPF-based Kubernetes networking.
- bpftrace reference guide — high-level tracing language.
- BPF Performance Tools (Brendan Gregg) — the definitive book on BPF for observability.