Table of contents
Open Table of contents
- Context
- The Three System Calls
- Internal Architecture
- How Events Flow: The Callback Mechanism
- epoll_wait: Harvesting Events
- Level-Triggered vs. Edge-Triggered
- EPOLLONESHOT: Single-Fire Mode
- Real-World Usage: How nginx Uses epoll
- Comparison: select vs. poll vs. epoll
- Cross-Platform Alternatives
- References
Context
Imagine you are building a chat server. Thousands of users are connected simultaneously, but at any given moment only a handful are actually sending messages. Your server needs to figure out which connections have data ready to read — without wasting CPU cycles checking every single one.
This is the I/O multiplexing problem: how does a single thread efficiently monitor many file descriptors (sockets, pipes, files) and react only when something happens?
Unix has offered solutions to this problem for decades: select (1983) and poll (1997). But both have a fundamental scalability issue — they require the kernel to scan every file descriptor on each call, even if only one is ready. With 10,000 connections, that’s 10,000 checks per call, most returning “nothing happened.”
The Scalability Problem with select/poll
Call: poll(fds, 10000, timeout)
Kernel must check:
fd[0] -> not ready
fd[1] -> not ready
fd[2] -> READY! <-- 1 out of 10,000
fd[3] -> not ready
...
fd[9999] -> not ready
Result: O(n) scan for O(1) events
Next call: scan all 10,000 again
In 2002, Davide Libenzi introduced epoll into the Linux kernel (version 2.5.44). The key insight: instead of scanning all file descriptors on every call, let the kernel track which descriptors are ready and hand you only those. This turns the cost from O(n) per call to O(ready), where “ready” is typically a tiny fraction of the total.
Today, epoll is the backbone of virtually every high-performance Linux network server: nginx, Redis, Node.js (via libuv), Go’s runtime netpoller, and Java’s NIO.
The Three System Calls
epoll’s API is remarkably simple — just three system calls defined in fs/eventpoll.c:
#include <sys/epoll.h>
// 1. Create an epoll instance
int epoll_create1(int flags);
// 2. Add, modify, or remove file descriptors
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
// 3. Wait for events
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
Here is how they work together in a typical server:
// Create the epoll instance
int epfd = epoll_create1(0);
// Tell epoll to watch the listening socket for incoming connections
struct epoll_event ev;
ev.events = EPOLLIN; // interested in "readable" events
ev.data.fd = listen_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
// Event loop
struct epoll_event ready[64];
while (1) {
int n = epoll_wait(epfd, ready, 64, -1); // block until something happens
for (int i = 0; i < n; i++) {
if (ready[i].data.fd == listen_fd) {
// New connection — accept it and add to epoll
int client = accept(listen_fd, ...);
ev.events = EPOLLIN;
ev.data.fd = client;
epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);
} else {
// Existing connection has data — read it
handle_client(ready[i].data.fd);
}
}
}
The critical difference from select/poll: you register interest once with epoll_ctl, and then each epoll_wait call returns only the ready descriptors. No repeated scanning.
Internal Architecture
Inside the kernel, an epoll instance (struct eventpoll) maintains two core data structures:
+-----------------------------------------------+
| struct eventpoll |
| |
| +------------------+ +------------------+ |
| | Red-Black Tree | | Ready List | |
| | (rbr) | | (rdllist) | |
| | | | | |
| | All monitored | | FDs that have | |
| | file descriptors| | pending events | |
| | | | | |
| | Keyed by (fd, | | Doubly-linked | |
| | file pointer) | | list for O(1) | |
| | | | insertion | |
| | O(log n) lookup | | | |
| +------------------+ +------------------+ |
| |
| lock: spinlock + mutex |
| wq: wait queue (sleeping epoll_wait callers) |
+-----------------------------------------------+
Red-black tree (rbr): Stores every file descriptor registered via epoll_ctl. Keyed by the combination of file descriptor number and file pointer, so lookups, insertions, and deletions are O(log n). When you call epoll_ctl(EPOLL_CTL_ADD), the kernel creates an epitem node and inserts it here.
Ready list (rdllist): A doubly-linked list of epitem nodes that currently have events. When epoll_wait is called, the kernel simply drains this list and copies events to userspace — no scanning required.
The relevant kernel structures, simplified:
struct eventpoll {
spinlock_t lock;
struct mutex mtx;
wait_queue_head_t wq; // processes blocked in epoll_wait
struct rb_root_cached rbr; // red-black tree of all monitored fds
struct list_head rdllist; // ready file descriptors
};
struct epitem {
struct rb_node rbn; // node in the red-black tree
struct list_head rdllink; // link in the ready list
struct eventpoll *ep; // back pointer to the epoll instance
struct epoll_filefd ffd; // the file descriptor being monitored
struct epoll_event event; // events the user is interested in
};
How Events Flow: The Callback Mechanism
The magic of epoll’s efficiency lies in a callback-driven design. Instead of the kernel scanning file descriptors, each file descriptor notifies the epoll instance when its state changes.
How an event reaches epoll_wait
1. Network card receives packet
|
v
2. Kernel network stack processes packet,
places data in socket receive buffer
|
v
3. Socket calls its wait queue callback
--> ep_poll_callback()
|
v
4. ep_poll_callback:
- Adds the epitem to rdllist (ready list)
- Wakes up any process sleeping in epoll_wait
|
v
5. epoll_wait wakes up, finds items on rdllist,
copies events to userspace buffer
|
v
6. Returns to application with ready fds
When epoll_ctl(EPOLL_CTL_ADD) is called, the kernel does two things:
- Inserts an
epiteminto the red-black tree. - Registers a callback function (
ep_poll_callback) on the file descriptor’s wait queue.
From fs/eventpoll.c, the callback:
static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode,
int sync, void *key)
{
struct epitem *epi = ep_item_from_wait(wait);
struct eventpoll *ep = epi->ep;
__poll_t pollflags = key_to_poll(key);
spin_lock_irqsave(&ep->lock, flags);
// Add this item to the ready list if not already there
if (!ep_is_linked(epi))
list_add_tail(&epi->rdllink, &ep->rdllist);
// Wake up anyone blocked in epoll_wait
if (waitqueue_active(&ep->wq))
wake_up_locked(&ep->wq);
spin_unlock_irqrestore(&ep->lock, flags);
return 1;
}
This is what makes epoll O(ready) instead of O(total): the kernel never scans. It just checks the ready list.
epoll_wait: Harvesting Events
When your application calls epoll_wait, the kernel function ep_poll runs:
ep_poll() flow
+---------------------------+
| Is rdllist empty? |
+------------+--------------+
|
+-------+--------+
| |
v v
YES NO
| |
v v
+----------+ +--------------+
| Sleep on | | Proceed to |
| wq until | | ep_send_ |
| timeout | | events() |
| or wake | +--------------+
+----------+ |
| v
v +------------------+
(woken by | Walk rdllist, |
callback) | copy events to |
| | userspace buffer |
+--------->| up to maxevents |
+------------------+
|
v
Return count of
ready events
The core of event delivery is ep_send_events_proc, which iterates the ready list:
// Simplified from ep_send_events_proc
list_for_each_entry_safe(epi, tmp, &ep->rdllist, rdllink) {
// Remove from ready list
list_del_init(&epi->rdllink);
// Poll the fd to get current events
revents = ep_item_poll(epi, &pt, 1);
if (!revents)
continue;
// Copy to userspace
if (__put_user(revents, &uevent->events) ||
__put_user(epi->event.data, &uevent->data))
return error;
eventcnt++;
// Level-triggered: re-add to ready list if events persist
if (!(epi->event.events & EPOLLET))
list_add_tail(&epi->rdllink, &ep->rdllist);
}
Notice the last if statement — this is where level-triggered and edge-triggered behavior diverge.
Level-Triggered vs. Edge-Triggered
epoll supports two notification modes, and understanding the difference is critical for building correct servers.
Level-Triggered (LT) Edge-Triggered (ET)
Default mode Set with EPOLLET flag
"Tell me whenever data "Tell me once when data
is available" becomes available"
+--------+--------+--------+ +--------+--------+--------+
| wait | wait | wait | | wait | wait | wait |
+---+----+---+----+---+----+ +---+----+---+----+---+----+
| | | | | |
v v v v v v
READY READY READY READY (silent) (silent)
| | | |
v v v v
read read read read
100 B 100 B 100 B ALL data
(partial)(partial)(done) (must drain)
Socket has 300 bytes. Socket has 300 bytes.
LT fires on every ET fires once. If you
epoll_wait as long as only read 100 bytes,
buffer is non-empty. epoll_wait won't fire
again until NEW data
arrives. The remaining
200 bytes sit unread.
Level-triggered (default): After epoll_wait returns a ready fd, if you don’t read all the data, the fd stays on the ready list. The next epoll_wait will report it again. This is forgiving — partial reads don’t lose data.
Edge-triggered (EPOLLET): The callback fires only on state transitions (e.g., buffer goes from empty to non-empty). Once notified, you must read until EAGAIN to drain the buffer completely. If you stop early, no further notification comes until the next new data arrives.
Why use edge-triggered? It reduces the number of epoll_wait wakeups under high throughput. A busy socket that constantly has data would wake a level-triggered loop on every call, but an edge-triggered loop wakes only on transitions. The trade-off: your application code must be more careful.
// Edge-triggered read loop: must drain completely
ev.events = EPOLLIN | EPOLLET;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
// When epoll_wait reports this fd:
while (1) {
ssize_t n = read(fd, buf, sizeof(buf));
if (n == -1) {
if (errno == EAGAIN)
break; // buffer fully drained, safe to return to epoll_wait
handle_error();
}
if (n == 0) {
close(fd); // peer closed connection
break;
}
process(buf, n);
}
EPOLLONESHOT: Single-Fire Mode
There is a third mode worth knowing: EPOLLONESHOT. After an event fires, the fd is automatically disabled — no more events until you explicitly re-arm it with epoll_ctl(EPOLL_CTL_MOD).
This is useful in multi-threaded servers. Without it, two threads could both be woken by epoll_wait for the same fd, leading to race conditions. With EPOLLONESHOT, only one thread handles the fd at a time:
Thread pool with EPOLLONESHOT
Thread 1 Thread 2
| |
v v
epoll_wait() epoll_wait()
| |
v |
fd 5 ready (blocked, fd 5 is disabled)
| |
handle fd 5 |
| |
re-arm fd 5 |
(EPOLL_CTL_MOD) |
| v
| fd 5 ready (new event)
| |
Real-World Usage: How nginx Uses epoll
nginx is perhaps the most well-known epoll user. Its event loop in src/event/modules/ngx_epoll_module.c follows this pattern:
nginx worker process (single-threaded)
+---> epoll_wait(epfd, events, max, timer)
| |
| v
| For each ready event:
| |
| +---+---+---+---+
| | | |
| v v v
| accept read write
| new from to
| conn client client
| | | |
| +---+---+---+---+
| |
| Process timer events
| |
+---------+
nginx uses edge-triggered mode with non-blocking sockets. Each worker process runs a single-threaded event loop, handling thousands of connections concurrently without threads. The worker:
- Calls
epoll_waitto collect ready events. - Processes each event (accept new connections, read requests, write responses).
- Handles timer-based events (keepalive timeouts, upstream timeouts).
- Loops back to
epoll_wait.
This is why a single nginx worker can handle tens of thousands of concurrent connections with minimal memory and CPU overhead.
Comparison: select vs. poll vs. epoll
Feature select poll epoll
------- ------ ---- -----
Year 1983 1997 2002
Max FDs FD_SETSIZE unlimited unlimited
(typically 1024)
Per-call cost O(n) O(n) O(ready)
Registration re-register re-register register once
every call every call (persistent)
Kernel impl bitmap scan array scan callback + ready list
Trigger mode level only level only level or edge
Memory copy fd_set copied pollfd array events copied only
both ways copied both ways kernel -> user
Portability POSIX POSIX Linux only
The key differentiators are the persistent registration (no re-copying the entire fd set each call) and the callback-driven ready list (no kernel-side scan). Together, these give epoll O(1) per ready event regardless of total monitored count.
Cross-Platform Alternatives
epoll is Linux-specific. Other operating systems have their own high-performance I/O multiplexers:
- kqueue (FreeBSD, macOS): Similar callback-driven design, supports file system events natively.
- IOCP (Windows): Completion-based model — you get notified after I/O completes, not when it’s ready to start.
- io_uring (Linux 5.1+): The successor for high-throughput I/O. Uses shared ring buffers between kernel and userspace, reducing system call overhead.
Libraries like libuv (Node.js), libevent, and libev abstract over these differences, giving you a portable event loop that uses epoll on Linux, kqueue on macOS, and IOCP on Windows.
References
- Linux kernel epoll implementation
fs/eventpoll.c - epoll(7) man page man7.org
- epoll_create(2) man page man7.org
- epoll_ctl(2) man page man7.org
- epoll_wait(2) man page man7.org
- nginx epoll module
src/event/modules/ngx_epoll_module.c - The C10K problem kegel.com
- Scalable Event Multiplexing: epoll vs kqueue kernel.org LWN
- io_uring and the future of async I/O kernel.dk