Skip to content
JZLeetCode
Go back

System Design - How Go's Goroutine Scheduler Works

Table of contents

Open Table of contents

Context: Why User-Space Scheduling Matters

Every program needs a way to do multiple things at once. Operating systems provide threads, but OS threads are expensive:

If your web server needs to handle 100,000 concurrent connections, you cannot afford 100,000 OS threads. That is 100 GB of stack space alone.

Go’s answer: goroutines. A goroutine starts with only 2 KB of stack (which grows dynamically) and is scheduled entirely in user space. The Go runtime multiplexes potentially millions of goroutines onto a small pool of OS threads. This is called M:N scheduling — M goroutines on N OS threads.

    Traditional 1:1 Threading              Go's M:N Scheduling

  +-------+  +-------+  +-------+      +--+ +--+ +--+ +--+ +--+ +--+
  |Thread1|  |Thread2|  |Thread3|      | G| | G| | G| | G| | G| | G|  goroutines
  +---+---+  +---+---+  +---+---+      +-++ +-++ +-++ +--+ +--+ +-++
      |          |          |             |    |    |              |
      v          v          v             +----+----+    +---------+
  +---+---+  +---+---+  +---+---+             |         |
  | OS Th |  | OS Th |  | OS Th |          +--+--+   +--+--+
  +---+---+  +---+---+  +---+---+          | M(1)|   | M(2)|   OS threads
      |          |          |               +--+--+   +--+--+
  ----+----------+----------+----              |         |
            Kernel                    ---------+---------+----------
                                                  Kernel

The magic happens inside the Go runtime scheduler, a piece of code compiled into every Go binary. Let’s take it apart.

The GMP Model

Go’s scheduler is built on three core abstractions, known as GMP:

LetterNameWhat it is
GGoroutineA unit of work — your go func(). Has its own stack, instruction pointer, and state.
MMachineAn OS thread. The thing that actually executes instructions on a CPU core.
PProcessorA logical CPU / scheduling context. Holds a local run queue of goroutines. The number of Ps equals GOMAXPROCS.

The key insight: a goroutine (G) can only run when an M acquires a P. The P is the “ticket to schedule.” This three-way binding decouples the number of goroutines from the number of OS threads and from the number of CPU cores.

                         GMP Relationship

          +------+    +------+    +------+    +------+
          |  G   |    |  G   |    |  G   |    |  G   |   (millions possible)
          +--+---+    +--+---+    +--+---+    +--+---+
             |           |           |           |
             v           v           v           v
          +-----------------------------------+-------+
          |        Local Run Queue (P0)       | LRQ P1|
          +-----------------+-----------------+---+---+
                            |                     |
                     +------+------+       +------+------+
                     |   P0        |       |   P1        |   (GOMAXPROCS = 2)
                     +------+------+       +------+------+
                            |                     |
                     +------+------+       +------+------+
                     |   M0        |       |   M1        |   OS threads
                     +------+------+       +------+------+
                            |                     |
                  ----------+---------------------+---------
                                   Kernel

Data Structures (Source Code)

The core structs live in src/runtime/runtime2.go:

The g struct (goroutine)

type g struct {
    stack       stack   // lo and hi bounds of the goroutine stack
    stackguard0 uintptr // for stack growth check in function prologues
    m           *m      // current M executing this G (nil if not running)
    sched       gobuf   // saved registers (SP, PC, etc.) for context switch
    atomicstatus atomic.Uint32 // goroutine state
    goid         uint64 // unique goroutine ID
    preempt      bool   // preemption signal
    // ... many more fields
}

The m struct (OS thread)

type m struct {
    g0      *g     // special goroutine for scheduling stack
    curg    *g     // current user goroutine running on this M
    p       puintptr // attached P (nil if not executing user code)
    nextp   puintptr // P to attach on next schedule
    spinning bool   // is this M looking for work?
    // ...
}

The p struct (processor)

type p struct {
    id          int32
    status      uint32 // idle, running, syscall, etc.
    m           muintptr // back-link to the M currently using this P
    runqhead    uint32
    runqtail    uint32
    runq        [256]guintptr // local run queue (fixed-size ring buffer!)
    runnext     guintptr      // next G to run (hot cache, size 1)
    // ...
}

Notice that each P has a local run queue that holds up to 256 goroutines in a lock-free ring buffer. There is also a global run queue protected by a mutex, used as overflow.

                     Scheduler Queues

   +------------------------------------------------------------------+
   |                    Global Run Queue (mutex)                       |
   |   [ G ] [ G ] [ G ] [ G ] ...                                    |
   +------------------------------------------------------------------+
          |                                          |
          | (steal/put when local is full/empty)     |
          v                                          v
   +--------------------+                    +--------------------+
   |   P0 Local Queue   |                    |   P1 Local Queue   |
   |  [G][G][G][G]      |                    |  [G][G]            |
   |  runnext: G        |                    |  runnext: G        |
   +--------+-----------+                    +--------+-----------+
            |                                         |
            v                                         v
        +-------+                                 +-------+
        |  M0   | --running--> [current G]        |  M1   | --running--> [current G]
        +-------+                                 +-------+

The Scheduling Loop

Every M runs an infinite scheduling loop. The entry point is schedule() in src/runtime/proc.go:

func schedule() {
    mp := getg().m

    // Every 61st schedule tick, check the global queue to prevent starvation
    if mp.schedtick%61 == 0 {
        if gp := globrunqget(pp, 1); gp != nil {
            // found one from global queue
        }
    }

    // Try local queue first
    if gp, inheritTime := runqget(pp); gp != nil {
        // found one from local queue
    }

    // Nothing local, go looking (blocking)
    gp = findRunnable()

    execute(gp)
}

The simplified flow:

                  schedule()
                      |
                      v
        +----------------------------+
        | Every 61st tick: check     |   (prevents global queue starvation)
        | global run queue           |
        +-------------+--------------+
                      |
                      v (nothing found)
        +----------------------------+
        | runqget(local P queue)     |   (check runnext, then ring buffer)
        +-------------+--------------+
                      |
                      v (nothing found)
        +----------------------------+
        | findRunnable()             |   (work stealing, global queue,
        |   - try global queue       |    netpoll, other Ps)
        |   - try netpoll           |
        |   - try stealing from     |
        |     other P's local queue |
        +-------------+--------------+
                      |
                      v
        +----------------------------+
        | execute(gp)               |   (set gp.status = running,
        |   gogo(&gp.sched)        |    restore registers, jump to PC)
        +----------------------------+

When a goroutine finishes (goexit), blocks on a channel, or hits a preemption point, control returns to schedule() and the loop repeats.

Work Stealing

When a P’s local run queue is empty, the runtime does not just sit idle. It steals work from other Ps. This is implemented in stealWork():

func stealWork(now int64) (gp *g, inheritTime bool, tnow, newStatus int64) {
    pp := getg().m.p.ptr()
    ranTimer := false

    const stealTries = 4
    for i := 0; i < stealTries; i++ {
        stealTimersOrRunnable := i == stealTries-1

        // Iterate over Ps in random order
        for enum := stealOrder.start(cheaprand()); !enum.done(); enum.next() {
            p2 := allp[enum.position()]
            if p2 == pp {
                continue
            }
            // Try to steal half of p2's run queue
            if gp := runqsteal(pp, p2, stealTimersOrRunnable); gp != nil {
                return gp, false, now, 0
            }
        }
    }
    return nil, false, now, 0
}

Key details:

     Work Stealing Example

     P0 (empty)                P1 (has 6 Gs)
     +---------+               +---------+
     |  [ ]    |  <---steal--- | [G][G][G][G][G][G] |
     +---------+               +---------+
                                     |
                                     v (after steal)
     P0 (got 3)                P1 (kept 3)
     +---------+               +---------+
     | [G][G][G]|              | [G][G][G] |
     +---------+               +---------+

Preemption

A scheduler is useless if a goroutine can run forever without yielding. Go uses two preemption mechanisms:

Cooperative Preemption (since Go 1.2)

The Go compiler inserts a stack growth check at the beginning of every function:

// Pseudo-assembly for a function prologue
MOV  SP, R1
CMP  R1, g.stackguard0   // compare SP against the stack guard
BLS  morestack           // if SP <= guard, call morestack

When the runtime wants to preempt a goroutine, it sets g.stackguard0 = stackPreempt (a special sentinel value). The next function call in that goroutine triggers the check, sees the sentinel, and yields control back to the scheduler.

This is called “cooperative” because the goroutine must make a function call for preemption to happen. A tight loop like for {} with no function calls cannot be preempted this way.

Asynchronous Preemption (since Go 1.14)

Go 1.14 added signal-based preemption to handle tight loops. The sysmon goroutine (a background monitor thread) detects goroutines running for more than 10ms and sends a SIGURG signal to the OS thread:

// In sysmon (simplified)
func retake(now int64) uint32 {
    for i := 0; i < len(allp); i++ {
        pp := allp[i]
        pd := &pp.sysmontick
        if pp.status == _Prunning {
            t := int64(pp.schedtick)
            if pd.schedtick != t {
                pd.schedtick = t
                pd.schedwhen = now
            } else if pd.schedwhen+forcePreemptNS <= now { // 10ms
                preemptone(pp)
            }
        }
    }
}

func preemptone(pp *p) bool {
    mp := pp.m.ptr()
    gp := mp.curg
    gp.preempt = true
    gp.stackguard0 = stackPreempt
    // Send async preemption signal
    if preemptMSupported {
        signalM(mp, sigPreempt) // SIGURG
    }
    return true
}

The signal handler (asyncPreempt) saves the current state and switches back to the scheduler. This ensures even infinite loops like for { x++ } get preempted.

Goroutine States

A goroutine transitions through these states (defined in src/runtime/runtime2.go):

                     Goroutine State Machine

                          go func()
                              |
                              v
                       +-----------+
                       | _Grunnable|   (in a run queue, waiting to be picked up)
                       +-----+-----+
                             |
                   schedule()|
                             v
                       +-----------+
              +------->| _Grunning |   (executing on an M/P)
              |        +-----+-----+
              |              |
              |   +----------+----------+
              |   |          |          |
              |   v          v          v
              | block     preempt    finished
              |   |          |          |
              |   v          v          v
              | +---------+ +--------+ +-------+
              | |_Gwaiting| |_Grunnable| |_Gdead |
              | +---------+ +--------+ +-------+
              |       |
              |       | (woken up: channel recv, timer, I/O ready)
              |       v
              |  +-----------+
              +--| _Grunnable|
                 +-----------+
StateValueMeaning
_Gidle0Just allocated, not yet initialized
_Grunnable1On a run queue, ready to run
_Grunning2Currently executing on M/P
_Gsyscall3In a system call, M may lose its P
_Gwaiting4Blocked (channel, mutex, I/O, sleep, etc.)
_Gdead6Finished execution or just created
_Gpreempted9Stopped at a safe point by async preemption

Netpoller Integration

One of Go’s biggest strengths is that goroutines doing I/O do not block OS threads. The secret is the netpoller — a Go-runtime wrapper around OS multiplexing (epoll on Linux, kqueue on macOS).

When a goroutine does a network read and the socket is not ready:

  Goroutine calls net.Read()
         |
         v
  runtime sets socket non-blocking
  calls epoll_ctl(ADD, fd, ...)
         |
         v
  goroutine is parked (_Gwaiting)
  M is free to run other goroutines!
         |
         v
  ... later, data arrives ...
         |
         v
  netpoll() returns list of ready fds
  (called from findRunnable or sysmon)
         |
         v
  goroutines are marked _Grunnable
  and put back on run queues

The implementation lives in src/runtime/netpoll_epoll.go (Linux) and src/runtime/netpoll_kqueue.go (macOS):

// netpoll checks for ready network connections.
// Returns a list of goroutines that become runnable.
func netpoll(delay int64) gList {
    // calls epoll_wait (Linux) or kevent (macOS)
    var events [128]epollevent
    n := epollwait(epfd, &events[0], int32(len(events)), waitms)
    
    var toSchedule gList
    for i := int32(0); i < n; i++ {
        pd := *(**pollDesc)(unsafe.Pointer(&events[i].data))
        // wake up goroutines waiting on this fd
        netpollready(&toSchedule, pd, mode)
    }
    return toSchedule
}

This design means a Go program handling 100,000 concurrent TCP connections uses only GOMAXPROCS OS threads (typically equal to the number of CPU cores), not 100,000 threads.

How go func() Creates a Goroutine

When you write go myFunction(), the compiler transforms it into a call to newproc:

func newproc(fn *funcval) {
    gp := getg()         // get current goroutine
    pc := getcallerpc()  // for stack trace
    
    // Switch to g0 stack to create the new goroutine
    systemstack(func() {
        newg := newproc1(fn, gp, pc, false, waitReasonZero)

        pp := getg().m.p.ptr()
        runqput(pp, newg, true) // put on local P's queue

        if mainStarted {
            wakep() // wake an idle P if available
        }
    })
}

The newproc1 function allocates a g struct (or reuses one from the free list), sets up the initial 2 KB stack, and configures the saved registers so that when the goroutine is first scheduled, it starts executing fn.

  go myFunc(arg1, arg2)
         |
         | compiler rewrites to:
         v
  newproc(&funcval{fn: myFunc})
         |
         v
  newproc1: allocate/reuse g struct
         |  set up 2KB stack
         |  save entry point in g.sched.pc
         v
  runqput(currentP, newG, true)
         |
         |  runnext slot empty? put there (hot path)
         |  else: push to tail of ring buffer
         |  ring buffer full? put half into global queue
         v
  wakep(): wake idle M/P pair if available

System Calls and Thread Handoff

When a goroutine makes a blocking system call (e.g., file I/O, CGO), the M enters the kernel and cannot run other goroutines. The scheduler handles this by detaching the P:

  Before syscall:            During syscall:            After syscall:

  +----+    +----+           +----+      +----+         +----+    +----+
  | P0 |--->| M0 |           | P0 |--->  | M2 |         | P0 |--->| M0 |
  +----+    +----+           +----+      +----+         +----+    +----+
              |                            |                        |
              v              | M0 |        v                        v
           [G(user)]         | in  |    [other G]               [G(user)]
                             |syscall|
                             +----+
                               |
                               v
                            [G blocked in kernel]

The mechanism in code (src/runtime/proc.go):

  1. entersyscall() — called before the syscall. Saves state, marks P as _Psyscall.
  2. sysmon detects stalled P — if the M has been in a syscall for >20us (or one sysmon tick), sysmon calls handoffp() to give the P to another (or new) M.
  3. exitsyscall() — when the syscall returns, M tries to re-acquire its old P. If that P was handed off, M tries to get any idle P. If no P is available, M parks itself and the goroutine goes to the global queue.
func handoffp(pp *p) {
    // If local run queue is not empty, start an M to run it
    if !runqempty(pp) || sched.runqsize != 0 ||
        sched.nmspinning.Load() == 0 && sched.npidle.Load() > 0 {
        startm(pp, false, false)
        return
    }
    // ... else put P on idle list
    pidleput(pp, 0)
}

This is why you sometimes see more M (OS threads) than P (GOMAXPROCS): extra Ms are parked or blocked in syscalls while Ps continue scheduling other goroutines.

Performance Considerations

GOMAXPROCS

GOMAXPROCS controls the number of Ps — the maximum number of goroutines executing simultaneously on CPUs. Since Go 1.5, it defaults to the number of available CPU cores. Setting it too low underutilizes CPUs. Setting it higher than available cores adds scheduling overhead without benefit for CPU-bound work.

import "runtime"

func init() {
    runtime.GOMAXPROCS(4) // use 4 Ps
}

Goroutine Stack Growth

Goroutines start with a 2 KB stack (configurable but rarely changed). When a function’s prologue detects the stack is too small, the runtime allocates a new, larger stack (typically 2x the old size) and copies the old stack contents to the new one. This is called a copyable stack or segmented stack replacement (Go switched from segmented to copying stacks in Go 1.4).

  Initial:     After growth:

  +------+     +-------------+
  | 2 KB |     |    4 KB     |
  | stack|     |   stack     |
  |      |     |  (copied    |
  +------+     |   content)  |
               +-------------+

This means goroutine creation is cheap (O(1)O(1), just allocating a small stack), and stacks only grow when needed. In practice, most goroutines never grow beyond their initial 2-8 KB.

The sysmon Thread

The runtime spawns a special M that runs without a P — the system monitor (sysmon). It runs in a loop and:

Putting It All Together

Here is the full lifecycle of a goroutine from birth to death:

  1. go myFunc()
         |
  2. newproc -> allocate G, put on P's local queue
         |
  3. schedule() on some M picks up G
         |
  4. execute(G) -> gogo(&G.sched) -> jump to myFunc
         |
  5. G runs on CPU...
         |
         +--- hits function call -> stack check -> preemption?
         |         |
         |         +-- yes: save state, back to schedule()
         |         +-- no:  continue running
         |
         +--- blocks on channel/mutex/IO
         |         |
         |         +-- gopark(): G -> _Gwaiting, back to schedule()
         |         +-- ... later: goready(): G -> _Grunnable, onto a queue
         |
         +--- makes syscall
         |         |
         |         +-- entersyscall(): P may be handed off
         |         +-- exitsyscall(): try to reclaim P
         |
  6. myFunc returns -> goexit() -> G -> _Gdead -> put on free list
         |
  7. schedule() picks next G (loop back to step 3)

Summary

ConceptPurpose
G (goroutine)Lightweight unit of execution with its own stack
M (machine/thread)OS thread that executes code
P (processor)Scheduling context with a local run queue; limits parallelism to GOMAXPROCS
Local run queuePer-P, lock-free, 256-slot ring buffer
Global run queueOverflow queue, protected by mutex
Work stealingIdle P steals half of another P’s queue
Cooperative preemptionStack-check in function prologues
Async preemptionSIGURG signal for tight loops (Go 1.14+)
Netpollerepoll/kqueue integration so I/O doesn’t block threads
sysmonBackground thread for preemption, GC triggers, netpoll, syscall retake
HandoffP detaches from M blocked in syscall

Go’s scheduler achieves excellent throughput for concurrent workloads by keeping OS threads busy (work stealing), avoiding blocking them (netpoller, handoff), and doing all context switches in user space (~tens of nanoseconds vs. microseconds for kernel switches).

References

  1. Go runtime source — proc.go: github.com/golang/go/blob/master/src/runtime/proc.go
  2. Go runtime source — runtime2.go (structs): github.com/golang/go/blob/master/src/runtime/runtime2.go
  3. Go runtime source — netpoll_epoll.go: github.com/golang/go/blob/master/src/runtime/netpoll_epoll.go
  4. Original GMP design doc by Dmitry Vyukov (2012): docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw
  5. Go 1.14 Release Notes (async preemption): go.dev/doc/go1.14
  6. “The Go scheduler” by Daniel Morsing: morsmachine.dk/go-scheduler
  7. “Go’s work-stealing scheduler” by Jaana Dogan: rakyll.org/scheduler
  8. Go runtime source — proc.go sysmon: search for func sysmon() in proc.go
Share this post on:

Previous Post
LeetCode 236 Lowest Common Ancestor of a Binary Tree
Next Post
LeetCode 1143 Longest Common Subsequence