System Design - How the Linux Kernel Network Stack Works

Table of contents

Context
Step 1: The NIC Receives a Frame
Step 2: Hardware Interrupt and NAPI
Step 3: Building an sk_buff
Step 4: GRO (Generic Receive Offload)
Step 5: The Network Layer (IP)
Step 6: The Transport Layer (TCP)
Step 7: Socket Receive Queue and Waking the Application
Putting It All Together: The Timeline
Key Tuning Parameters
Where Common Tools Hook In
References

Context

When your application calls recv() on a TCP socket, it gets back bytes. But those bytes traveled a long path inside the kernel before arriving. Understanding this path helps you reason about latency, packet drops, tuning parameters like ring buffer sizes, and why tools like tcpdump see packets at specific points.

This article traces the life of an incoming network packet from the moment it hits the physical NIC (Network Interface Card) to the moment your application reads it. We focus on the receive path (ingress) since that is the more complex direction.

  The Big Picture: Packet Receive Path

  +--------------------+
  |   Application      |   recv() / read() / epoll_wait()
  +--------+-----------+
           |  copy_to_user
  +--------v-----------+
  |   Socket Layer     |   per-socket receive queue (sk->sk_receive_queue)
  +--------+-----------+
           |
  +--------v-----------+
  |   Transport (TCP)  |   sequence reassembly, ACK generation
  +--------+-----------+
           |
  +--------v-----------+
  |   Network (IP)     |   routing, netfilter hooks (iptables)
  +--------+-----------+
           |
  +--------v-----------+
  |   NAPI / softirq   |   budget-based polling, GRO aggregation
  +--------+-----------+
           |
  +--------v-----------+
  |   Driver / Ring    |   DMA ring buffer, hardware interrupts
  +--------+-----------+
           |
  +--------v-----------+
  |   NIC Hardware     |   wire -> PHY -> MAC -> DMA to RAM
  +--------------------+

Step 1: The NIC Receives a Frame

Modern NICs are sophisticated devices. When an Ethernet frame arrives on the wire:

The PHY (physical layer chip) converts electrical/optical signals into digital bits.
The MAC (media access controller) verifies the frame check sequence (FCS/CRC) and strips preamble.
The NIC uses DMA (Direct Memory Access) to copy the frame into a pre-allocated region of host RAM called a ring buffer (or descriptor ring).

The ring buffer is a circular array of descriptors, each pointing to a pre-allocated memory buffer (typically 2KB or a page). The driver sets these up at initialization:

  Ring Buffer (simplified)

  head (NIC writes here)
    |
    v
  +------+------+------+------+------+------+------+------+
  | desc | desc | desc | desc | desc | desc | desc | desc |
  |  0   |  1   |  2   |  3   |  4   |  5   |  6   |  7   |
  +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+
     |      |      |      |      |      |      |      |
     v      v      v      v      v      v      v      v
  [buf0] [buf1] [buf2] [buf3] [buf4] [buf5] [buf6] [buf7]
                   ^
                   |
               tail (driver refills here)

The NIC owns descriptors from head to tail (wrapping). After DMA-ing a frame into the buffer, the NIC updates the descriptor’s status bits and advances its internal head pointer. You can inspect ring buffer sizes with:

ethtool -g eth0
# Ring parameters for eth0:
# Pre-set maximums:
# RX:    4096
# Current hardware settings:
# RX:    256

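If the ring fills up (the driver cannot consume descriptors fast enough), the NIC drops incoming frames without notifying the kernel stack. The drops show up only in NIC statistics, for example ethtool -S eth0 | grep -E 'rx_missed|rx_no_buffer' (counter names vary by driver).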
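
The ownership handshake and the drop-on-full behavior described in this step can be modeled in user space. The following is a minimal sketch, not driver code: the names rx_desc, rx_ring, nic_receive, and driver_consume are hypothetical, the ring size is an arbitrary small value, and a memcpy stands in for the DMA write.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 8      /* hypothetical; real rings hold hundreds of slots */
#define BUF_SIZE  2048   /* one 2KB buffer per descriptor */

/* Toy descriptor: a status bit plus the length written by the "NIC". */
struct rx_desc {
    bool          done;  /* set once a frame has been written */
    size_t        len;
    unsigned char buf[BUF_SIZE];
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    unsigned int   head;     /* next slot the NIC will fill */
    unsigned int   tail;     /* next slot the driver will consume */
    unsigned long  dropped;  /* frames lost because no slot was free */
};

/* NIC side: place a frame in the next free descriptor, or drop it. */
static bool nic_receive(struct rx_ring *r, const void *frame, size_t len)
{
    struct rx_desc *d = &r->desc[r->head];

    if (d->done || len > BUF_SIZE) {  /* slot not recycled yet, or too big */
        r->dropped++;
        return false;
    }
    memcpy(d->buf, frame, len);       /* stands in for the DMA write */
    d->len = len;
    d->done = true;
    r->head = (r->head + 1) % RING_SIZE;
    return true;
}

/* Driver side: consume one completed descriptor and hand it back. */
static bool driver_consume(struct rx_ring *r, size_t *len_out)
{
    struct rx_desc *d = &r->desc[r->tail];

    if (!d->done)
        return false;                 /* nothing to process */
    *len_out = d->len;
    d->done = false;                  /* ownership returns to the NIC */
    r->tail = (r->tail + 1) % RING_SIZE;
    return true;
}
```

Zero-initialize the rx_ring before use. Filling all eight slots without calling driver_consume() makes the ninth nic_receive() fail and increments dropped, which is the toy equivalent of a rising rx_missed counter.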

Step 2: Hardware Interrupt and NAPI

After placing a frame in the ring buffer, the NIC raises a hardware interrupt (IRQ). The kernel’s interrupt handler (registered by the driver) runs immediately:

// Simplified from drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
static irqreturn_t ixgbe_msix_clean_rings(int irq, void *data)
{
    struct ixgbe_q_vector *q_vector = data;

    // Disable further interrupts from this queue
    ixgbe_irq_disable_queues(q_vector->adapter, q_vector->ring_mask);

    // Schedule NAPI polling
    napi_schedule_irqoff(&q_vector->napi);

    return IRQ_HANDLED;
}

The interrupt handler does almost nothing — it just disables the NIC’s interrupt for that queue and schedules a NAPI poll. This is critical: if we processed every packet in interrupt context, a flood of packets would cause an “interrupt storm” (livelock), starving the rest of the system.

NAPI (New API) is the kernel’s solution. It switches between interrupt-driven mode (low load) and polling mode (high load):

  NAPI State Machine

        packet arrives
             |
             v
  +---------------------+     ring empty
  |  Interrupt fires    |<-----------------+
  |  (HW IRQ)           |                  |
  +----------+----------+                  |
             |                             |
             v disable IRQ                 |
  +---------------------+                  |
  |  napi_schedule()    |                  |
  |  (mark NAPI_SCHED)  |                  |
  +----------+----------+                  |
             |                             |
             v runs in softirq             |
  +---------------------+                  |
  |  napi_poll()        +------------------+
  |  process up to      |  done (< budget)
  |  'budget' packets   |  re-enable IRQ
  |  (default: 64)      |
  +----------+----------+
             |
             | still more packets (hit budget)
             v
  +---------------------+
  |  stay in poll mode  |
  |  (no IRQ needed)    |
  +---------------------+

The poll function runs in softirq context (specifically NET_RX_SOFTIRQ). It processes packets in a loop up to a budget (default 64 packets per poll cycle). The relevant kernel code is in net/core/dev.c:

static int napi_poll(struct napi_struct *n, struct list_head *repoll)
{
    int work, weight;

    weight = n->weight;  // typically 64
    work = n->poll(n, weight);  // driver's poll function

    if (work < weight) {
        // Ring drained: the driver's poll callback has already called
        // napi_complete_done() and re-enabled its IRQ
        return work;
    }
    // Hit budget: stay scheduled for another round
    list_add_tail(&n->poll_list, repoll);
    return work;
}
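
To make the switch between interrupt mode and polling mode concrete, here is a small user-space model of the loop. It is a sketch under simplifying assumptions: toy_napi, toy_irq, toy_poll, and toy_net_rx_action are hypothetical names, the device is reduced to a counter of pending frames, and the overall softirq limits (packet budget and time limit) are left out.

```c
#include <stdbool.h>

#define NAPI_WEIGHT 64   /* per-poll budget, matching the default above */

/* Toy device: only a count of frames waiting in its RX ring. */
struct toy_napi {
    int  pending;        /* frames sitting in the ring */
    bool irq_enabled;    /* is the device interrupt armed? */
    bool scheduled;      /* is the device on the poll list? */
};

/* Hard-IRQ handler: mask the interrupt and defer the real work. */
static void toy_irq(struct toy_napi *n)
{
    n->irq_enabled = false;
    n->scheduled = true;
}

/* Driver poll callback: process at most `budget` frames. */
static int toy_poll(struct toy_napi *n, int budget)
{
    int work = n->pending < budget ? n->pending : budget;

    n->pending -= work;
    if (work < budget) {         /* ring drained: leave polling mode */
        n->scheduled = false;
        n->irq_enabled = true;
    }
    return work;
}

/* Softirq stand-in: poll while the device stays scheduled.
 * Returns the number of poll rounds that were needed. */
static int toy_net_rx_action(struct toy_napi *n)
{
    int rounds = 0;

    while (n->scheduled) {
        toy_poll(n, NAPI_WEIGHT);
        rounds++;
    }
    return rounds;
}
```

With 150 pending frames the model needs three rounds (64, 64, 22). With exactly 128 it also needs three, because a round that consumes its full budget cannot tell whether the ring is empty and must poll once more.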

Step 3: Building an sk_buff

Inside the driver’s poll function, each received frame is wrapped in the kernel’s core networking data structure: struct sk_buff (socket buffer). This structure is defined in include/linux/skbuff.h and has dozens of fields. The essential layout:

  sk_buff structure (simplified)

  struct sk_buff {
      // -- Linked-list pointers --
      struct sk_buff  *next, *prev;

      // -- Timing --
      ktime_t         tstamp;          // receive timestamp

      // -- Device --
      struct net_device *dev;          // which NIC

      // -- Protocol headers (pointers into data) --
      unsigned char   *head;           // start of allocated buffer
      unsigned char   *data;           // current data start
      unsigned char   *tail;           // current data end
      unsigned char   *end;            // end of allocated buffer

      // -- Layer pointers --
      __u16           transport_header; // offset to TCP/UDP
      __u16           network_header;   // offset to IP
      __u16           mac_header;       // offset to Ethernet

      // -- Length --
      unsigned int    len;             // total data length
      unsigned int    data_len;        // length in fragments

      // -- Protocol info --
      __be16          protocol;        // ETH_P_IP, ETH_P_IPV6, ...
      struct sock     *sk;             // owning socket (set later)
  };

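The head/data/tail/end pointers define the buffer space. The snapshot below shows a linear sk_buff after the IP layer has pulled its header, so data points at the TCP header and skb->len covers only the bytes between data and tail. Headroom is data minus head, so headers that have already been pulled count as headroom:

  Buffer layout inside sk_buff

  head                  data            tail                end
   |                     |               |                   |
   v                     v               v                   v
  +------+--------+------+---------+-----+-------------------+
  | head | L2 hdr | L3   | L4 hdr  | pay |   tailroom        |
  | room | (ETH)  | (IP) | (TCP)   | load|                   |
  +------+--------+------+---------+-----+-------------------+

  <----- headroom ------>
                         <- skb->len --->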

As the packet moves up through protocol layers, each layer calls skb_pull() to advance the data pointer past its own header, “consuming” that header from the perspective of the next layer.
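
The pointer arithmetic behind this can be sketched in a few lines of user-space C. The toy_skb type and the toy_skb_* helpers below are hypothetical stand-ins that mimic the behavior of skb_reserve(), skb_put(), and skb_pull() on a linear buffer. One deliberate difference is that the toy put returns NULL on overflow instead of panicking.

```c
#include <stddef.h>

/* Toy stand-in for the four buffer pointers of struct sk_buff. */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int   len;
};

/* Attach a buffer and reserve headroom (like skb_reserve()). */
static void toy_skb_init(struct toy_skb *skb, unsigned char *buf,
                         size_t size, size_t headroom)
{
    skb->head = buf;
    skb->data = skb->tail = buf + headroom;
    skb->end  = buf + size;
    skb->len  = 0;
}

/* Like skb_put(): extend the data area at the tail.
 * Returns the old tail, or NULL if there is not enough tailroom. */
static unsigned char *toy_skb_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *old = skb->tail;

    if (n > (size_t)(skb->end - skb->tail))
        return NULL;
    skb->tail += n;
    skb->len  += n;
    return old;
}

/* Like skb_pull(): drop n bytes from the front.
 * Returns the new data pointer, or NULL if n exceeds the data length. */
static unsigned char *toy_skb_pull(struct toy_skb *skb, unsigned int n)
{
    if (n > skb->len)
        return NULL;
    skb->len  -= n;
    skb->data += n;
    return skb->data;
}
```

For a 154-byte frame (14-byte Ethernet header, 20-byte IP header, 20-byte TCP header, 100 bytes of payload), pulling 14 and then 20 bytes leaves len at 120 with data pointing at the TCP header. No bytes are copied at any point.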

Step 4: GRO (Generic Receive Offload)

Before passing packets up the stack, NAPI applies GRO — merging multiple small packets belonging to the same TCP flow into one large sk_buff. This dramatically reduces per-packet overhead for the upper layers:

  Without GRO:               With GRO:
  5 packets, 5 TCP trips     1 merged packet, 1 TCP trip

  [1500B] -+                 [7500B] --------> TCP layer
  [1500B]  |                   (5 segments merged)
  [1500B]  +-> 5x TCP
  [1500B]  |   processing
  [1500B] -+

GRO is the software receive-side counterpart of transmit-side segmentation offload (GSO in software, TSO in hardware). It groups packets by flow (same source/dest IP+port, same TCP connection) and merges their payloads when the segments are contiguous. The combined sk_buff uses a frag_list or frags array to avoid copying data. Check GRO status:

ethtool -k eth0 | grep generic-receive-offload
# generic-receive-offload: on
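
The merge rule can be illustrated with a simplified model. This is a sketch, not the kernel's algorithm: flow_key, segment, toy_gro_merge, and toy_gro_receive are hypothetical names, only the most recently held segment is considered for merging, and the real checks on TCP flags, options, and size limits are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

struct segment {
    struct flow_key key;
    uint32_t seq;   /* TCP sequence number of the first payload byte */
    uint32_t len;   /* payload length in bytes */
};

static bool same_flow(const struct flow_key *a, const struct flow_key *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport;
}

/* Try to append `next` to the held segment; true if it was merged. */
static bool toy_gro_merge(struct segment *held, const struct segment *next)
{
    if (!same_flow(&held->key, &next->key))
        return false;
    if (next->seq != held->seq + held->len)   /* must be contiguous */
        return false;
    held->len += next->len;
    return true;
}

/* Coalesce an array of segments in place; returns how many remain. */
static int toy_gro_receive(struct segment *segs, int n)
{
    int out = 0;

    for (int i = 0; i < n; i++) {
        if (out > 0 && toy_gro_merge(&segs[out - 1], &segs[i]))
            continue;
        segs[out++] = segs[i];
    }
    return out;
}
```

Five contiguous 1448-byte segments of one flow collapse into a single 7240-byte segment, so the layers above see one packet instead of five.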

Step 5: The Network Layer (IP)

After GRO, each sk_buff enters the IP layer via ip_rcv() in net/ipv4/ip_input.c. From there, the IP receive path:

Validates the IP header (version, checksum, length sanity).
Passes through netfilter PREROUTING hooks — this is where iptables/nftables rules run (DNAT, connection tracking).
Makes a routing decision: is this packet for us (local delivery) or should it be forwarded?
For local delivery, calls ip_local_deliver().
Passes through netfilter INPUT hooks (firewall filtering).
Strips the IP header and hands the packet to the transport layer.
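
The header validation in the first step can be sketched in user space. The function names ip_checksum and ipv4_header_ok are hypothetical, and the code works on raw bytes instead of struct iphdr. The checksum itself is the standard Internet checksum (RFC 1071): summing a valid header, including its checksum field, yields zero after complementing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of big-endian 16-bit words, folded and inverted. */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    if (len & 1)                       /* pad a trailing odd byte */
        sum += (uint32_t)hdr[len - 1] << 8;
    while (sum >> 16)                  /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Version, length, and checksum sanity checks on an IPv4 packet. */
static bool ipv4_header_ok(const uint8_t *pkt, size_t pkt_len)
{
    if (pkt_len < 20)
        return false;

    unsigned int version = pkt[0] >> 4;
    size_t ihl     = (size_t)(pkt[0] & 0x0f) * 4;     /* header bytes */
    size_t tot_len = ((size_t)pkt[2] << 8) | pkt[3];  /* header + data */

    if (version != 4 || ihl < 20)
        return false;
    if (ihl > pkt_len || tot_len < ihl || tot_len > pkt_len)
        return false;
    return ip_checksum(pkt, ihl) == 0;
}
```

Any single-bit change in the header, such as a modified TTL without a recomputed checksum, makes ipv4_header_ok() return false, and a packet like that is dropped before any netfilter hook sees it.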

  Netfilter hook points (IPv4)

              +----------+
  incoming -->| PREROUTE |--+
  packet      +----------+  |
                            v
                     +-----------+
                     |  Routing  |
                     |  Decision |
                     +-----+-----+
                           |
              +------------+------------+
              |                         |
              v (for us)                v (forward)
        +-----------+            +-----------+
        |   INPUT   |            |  FORWARD  |
        +-----------+            +-----------+
              |                         |
              v                         v
        local process              +-----------+
                                   | POSTROUTE |
                                   +-----------+
                                        |
                                        v
                                   outgoing

The routing lookup uses the FIB (Forwarding Information Base) — essentially the kernel’s routing table (ip route show). For locally-destined packets, the result says “deliver to local transport protocol handler.”

Step 6: The Transport Layer (TCP)

The IP layer calls into tcp_v4_rcv() in net/ipv4/tcp_ipv4.c. TCP processing is the most complex part of the receive path:

Socket lookup: Find the struct sock matching this 4-tuple (src_ip, src_port, dst_ip, dst_port). Uses a hash table for O(1) lookup.
State machine: Handle the TCP state (LISTEN, SYN_RECV, ESTABLISHED, etc.).
Sequence validation: Is the sequence number within the receive window?
Reassembly: Place out-of-order segments in the out-of-order queue (ofo_queue); deliver in-order data to the receive queue (sk->sk_receive_queue).
ACK generation: Schedule or immediately send ACKs. Delayed ACK coalesces acknowledgments; Linux's delayed-ACK timer ranges from 40 ms (TCP_DELACK_MIN) up to 200 ms (TCP_DELACK_MAX).
Window update: Advertise new receive window to the sender.
Congestion handling: Update congestion window if needed.
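
The sequence validation and reassembly steps can be modeled with a short sketch. Everything here is a simplification with hypothetical names (toy_tcp, seg, toy_tcp_data_queue): the out-of-order queue is a small array instead of the kernel's rb-tree, and partially overlapping segments and receive-window checks are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

#define OFO_MAX 16   /* hypothetical cap on stashed segments */

struct seg { uint32_t seq, len; };

struct toy_tcp {
    uint32_t   rcv_nxt;        /* next in-order sequence number expected */
    uint32_t   queued_bytes;   /* bytes delivered to the receive queue */
    struct seg ofo[OFO_MAX];   /* out-of-order segments */
    int        ofo_cnt;
};

/* Wrap-safe "a comes before b" in 32-bit sequence space. */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

static void deliver(struct toy_tcp *tp, uint32_t len)
{
    tp->rcv_nxt      += len;
    tp->queued_bytes += len;
}

/* Deliver stashed segments that have become in-order. */
static void drain_ofo(struct toy_tcp *tp)
{
    bool progress = true;

    while (progress) {
        progress = false;
        for (int i = 0; i < tp->ofo_cnt; i++) {
            if (tp->ofo[i].seq == tp->rcv_nxt) {
                deliver(tp, tp->ofo[i].len);
                tp->ofo[i] = tp->ofo[--tp->ofo_cnt];  /* remove entry */
                progress = true;
                break;
            }
        }
    }
}

/* Returns true if the segment was accepted (queued or stashed). */
static bool toy_tcp_data_queue(struct toy_tcp *tp, struct seg s)
{
    if (s.seq == tp->rcv_nxt) {          /* in order: deliver now */
        deliver(tp, s.len);
        drain_ofo(tp);
        return true;
    }
    if (seq_before(s.seq, tp->rcv_nxt) || tp->ofo_cnt == OFO_MAX)
        return false;                    /* stale duplicate or no room */
    tp->ofo[tp->ofo_cnt++] = s;          /* hole in front: stash it */
    return true;
}
```

If segments starting at 1100 and 1200 arrive before the one starting at 1000, the first two wait in the out-of-order array. When the missing segment arrives, all 300 bytes become readable at once and rcv_nxt jumps to 1300.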

  TCP receive processing (simplified)

  tcp_v4_rcv(skb)
       |
       v
  tcp_v4_do_rcv(sk, skb)
       |
       +--> state == ESTABLISHED?
       |        |
       |        v
       |    tcp_rcv_established(sk, skb)   <-- fast path
       |        |
       |        +--> header prediction (fast path check)
       |        |        |
       |        |     +--v--------------+    +------------------+
       |        |     | In-order?       |--->| Add to           |
       |        |     | seq == rcv_nxt  |    | sk_receive_queue |
       |        |     +-----------------+    +------------------+
       |        |
       |        |     +-----------------+
       |        |     | Out of order    |---> ofo_queue
       |        |     +-----------------+     (rb-tree)
       |        |
       |        +--> wake up blocked reader (if any)
       |
       +--> other states: tcp_rcv_state_process()

The header prediction fast path in tcp_rcv_established() is a performance optimization. If the incoming segment is the next expected one (sequence == rcv_nxt), has no special flags, and the receive window is open, the kernel skips most of the complex state machine logic and directly queues the data — hitting this fast path is the common case for bulk data transfer.
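
The three conditions in that paragraph can be written as a single predicate. This is an illustrative approximation only: the kernel compares a precomputed word of header bits rather than testing conditions one by one, and the names toy_rx_state, fast_path_eligible, and the TCPF_* constants are hypothetical. The flag bit values are the standard TCP header ones, and treating PSH as harmless is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP flag bits as they appear in the header's flags byte. */
#define TCPF_FIN 0x01
#define TCPF_SYN 0x02
#define TCPF_RST 0x04
#define TCPF_PSH 0x08
#define TCPF_ACK 0x10
#define TCPF_URG 0x20

struct toy_rx_state {
    uint32_t rcv_nxt;   /* next expected sequence number */
    uint32_t rcv_wnd;   /* bytes of receive window still open */
};

/* True when a segment qualifies for the simplified fast path. */
static bool fast_path_eligible(const struct toy_rx_state *st,
                               uint32_t seq, uint32_t len, uint8_t flags)
{
    /* ACK must be set, PSH is tolerated, anything else is "special". */
    if ((flags & ~TCPF_PSH) != TCPF_ACK)
        return false;
    if (seq != st->rcv_nxt)        /* not the next in-order segment */
        return false;
    return len <= st->rcv_wnd;     /* payload must fit the open window */
}
```

A plain data segment with ACK (optionally with PSH) at exactly rcv_nxt passes. A segment carrying FIN, or one that arrives ahead of rcv_nxt, falls through to the slow path.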

Step 7: Socket Receive Queue and Waking the Application

Once TCP places data on sk->sk_receive_queue, it checks if a process is blocked waiting on this socket:

// Simplified from net/ipv4/tcp_input.c
static void tcp_data_ready(struct sock *sk)
{
    // Wake up any process blocked in recv() or epoll_wait()
    sk->sk_data_ready(sk);
}

If the application is blocked in recv(), the kernel wakes it via the socket’s wait queue. If it is using epoll, the socket is added to epoll’s ready list and epoll_wait() returns.

The actual data copy to user space happens in tcp_recvmsg():

// Simplified from net/ipv4/tcp.c
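int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags)
{
    struct sk_buff *skb;
    size_t used;
    int offset = 0;   // bytes of the current skb already consumed (simplified)
    int copied = 0;
    int err;

    // Lock the socket
    lock_sock(sk);

    // Walk sk_receive_queue, copy data to user buffer
    skb_queue_walk(&sk->sk_receive_queue, skb) {
        used = min_t(size_t, skb->len - offset, len - copied);
        // copy_to_user: kernel buffer -> user space buffer
        err = skb_copy_datagram_msg(skb, offset, msg, used);
        if (err)
            break;
        copied += used;
        // the real code frees each sk_buff once it is fully consumed
    }

    // Update TCP receive window (now have more buffer space)
    tcp_cleanup_rbuf(sk, copied);

    release_sock(sk);
    return copied;
}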
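
From the application's side, the wake-up and the copy look like the sketch below. The helper wait_and_recv is a hypothetical name. It uses poll() instead of epoll to stay short, and it works on any stream socket, so a local socketpair can stand in for a TCP connection when trying it out.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait until fd is readable (up to timeout_ms), then read once.
 * Returns bytes read, 0 on timeout or end of stream, -1 on error. */
static ssize_t wait_and_recv(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);   /* sleeps on the wait queue */

    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;                            /* nothing queued yet */
    return recv(fd, buf, len, 0);            /* copy from receive queue */
}
```

Calling it on an idle socket with a zero timeout returns 0 immediately. Once the peer sends data, the same call returns the number of bytes copied into buf.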

Putting It All Together: The Timeline

  Time -->

  NIC wire   NIC DMA     IRQ      softirq/NAPI    IP         TCP         App
    |          |          |            |            |           |           |
    v          v          v            v            v           v           v
  frame    ring buf    disable     poll driver   validate   seq check   recv()
  arrives  written     IRQ,        build skb,   route,     reassemble  returns
           (DMA)       schedule    GRO merge    netfilter  queue data  data
                       NAPI

  |<----- hardware ---->|<---------- kernel (softirq) -------->|<- syscall->|
        ~1-5 us              ~5-20 us (per packet)               ~1-3 us

Rough, order-of-magnitude latencies on modern hardware (10 Gbps NIC, bare metal); actual figures vary with the NIC, driver, CPU, and load:

NIC DMA + interrupt: 1-5 microseconds
NAPI poll + GRO + IP + TCP: 5-20 microseconds per packet
System call overhead for recv(): 1-3 microseconds
Total NIC-to-application: ~10-30 microseconds for a single packet

Key Tuning Parameters

| Parameter | What it controls | How to check/set |
|---|---|---|
| Ring buffer size | How many frames the NIC can DMA before the driver must poll | `ethtool -g eth0` (check), `ethtool -G eth0 rx 4096` (set) |
| NAPI weight | Packets processed per poll cycle | Set by the driver (default `NAPI_POLL_WEIGHT`, 64); `sysctl net.core.dev_weight` covers the per-CPU backlog queue |
| `net.core.netdev_budget` | Total packets processed per softirq round | `sysctl net.core.netdev_budget` (default 300) |
| `net.core.rmem_max` | Max socket receive buffer | `sysctl net.core.rmem_max` |
| `net.ipv4.tcp_rmem` | TCP auto-tuning min/default/max | `sysctl net.ipv4.tcp_rmem` |
| IRQ affinity | Which CPU handles which NIC queue | `/proc/irq/<N>/smp_affinity` |
| RPS/RFS | Software receive steering to spread load | `/sys/class/net/eth0/queues/rx-0/rps_cpus` |
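
The sysctl entries in the table are plain files under /proc/sys, with the dots in the name replaced by slashes. As a rough sketch, a program can read them directly. The helper names sysctl_to_path and read_sysctl_long are hypothetical, and the simple dot-to-slash mapping assumes no path component itself contains a dot (which some interface names do).

```c
#include <stdio.h>
#include <string.h>

/* Map "net.core.netdev_budget" to "/proc/sys/net/core/netdev_budget".
 * Returns 0 on success, -1 if the result does not fit in `out`. */
static int sysctl_to_path(const char *name, char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "/proc/sys/%s", name);

    if (n < 0 || (size_t)n >= outlen)
        return -1;
    for (char *p = out + strlen("/proc/sys/"); *p; p++)
        if (*p == '.')
            *p = '/';
    return 0;
}

/* Read the first integer of a sysctl. Returns 0 on success, -1 if the
 * entry is missing, unreadable, or does not start with a number. */
static int read_sysctl_long(const char *name, long *value)
{
    char path[256];
    FILE *f;
    int ok;

    if (sysctl_to_path(name, path, sizeof path) != 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

For multi-valued entries such as net.ipv4.tcp_rmem, read_sysctl_long() returns only the first field (the minimum). Writing new values requires root and is usually better done with sysctl -w.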

Where Common Tools Hook In

Understanding where observability tools tap into this pipeline helps you interpret their output:

     +--------+  +----------+  +-------+  +-------+  +--------+
     | XDP/   |  | tcpdump  |  | nf/   |  | TCP   |  | app    |
     | eBPF   |  | libpcap  |  | ipt   |  | probe |  | trace  |
     +---+----+  +----+-----+  +---+---+  +---+---+  +---+----+
         |            |            |          |          |
  =======v============v============v==========v==========v======
  NIC -> ring -> NAPI/skb -> IP layer -> TCP layer -> socket
  ==============================================================

  XDP/eBPF:  runs in the driver BEFORE sk_buff allocation (native mode; earliest hook)
  tcpdump:   AF_PACKET tap; sees each packet once its sk_buff is built, before the IP layer
  iptables:  netfilter hooks in IP layer (PREROUTING, INPUT, etc.)
  TCP tracepoints: ftrace/bpftrace probes in TCP functions

Note: tcpdump captures happen before iptables INPUT rules, which is why you can see packets in tcpdump that your firewall drops. Packets dropped by a native XDP program, by contrast, never reach tcpdump.

References

Linux kernel source: net/core/dev.c — core network device handling and NAPI
Linux kernel source: net/ipv4/tcp_input.c — TCP receive processing
Linux kernel source: include/linux/skbuff.h — sk_buff definition
“Understanding Linux Network Internals” by Christian Benvenuti (O’Reilly, 2006)
Linux NAPI documentation: Documentation/networking/napi.rst
Memory and Networking: Linux Foundation wiki on sk_buff
Blog: “Illustrated Guide to Monitoring and Tuning the Linux Networking Stack” by packagecloud.io