System Design - Kubernetes Pod Scheduling, Custom Schedulers, and Karpenter

Table of contents

Context
The Default Scheduler: kube-scheduler
- A Concrete Example
The Scheduling Framework
Running Multiple Schedulers
Case Study: TiDB’s Custom Scheduler
Where Karpenter Fits In
- Karpenter vs. Cluster Autoscaler
Putting It All Together
Summary
References

Context

When you run kubectl apply -f my-pod.yaml, Kubernetes needs to decide which node in your cluster should run that pod. This decision is called scheduling. It sounds simple, but the scheduler must consider CPU and memory requests, affinity rules, topology constraints, storage locality, and more — all in milliseconds.

A common misconception is that tools like Karpenter replace the Kubernetes scheduler. They don’t. Karpenter provisions nodes (creates or removes VMs), while the scheduler assigns pods to nodes. They work in sequence: when the scheduler cannot place a pod because no node has enough capacity, Karpenter notices and spins up a new node. Once the node is ready, the scheduler assigns the pod to it.

Some workloads need scheduling logic that the default scheduler doesn’t provide. For example, TiDB (a distributed database) needs to spread its storage pods across failure domains so that losing one rack doesn’t lose a quorum. TiDB solves this with a custom scheduler called tidb-scheduler. Let’s start from the default scheduler and build up to understand how all these pieces fit together.

The Default Scheduler: kube-scheduler

The default Kubernetes scheduler is called kube-scheduler. It runs as a control-plane component and watches the API server for newly created pods that don’t yet have a spec.nodeName (i.e., unscheduled pods).

For each unscheduled pod, the scheduler filters the nodes, scores the survivors, and then binds the pod to the winner:

                  kube-scheduler: Pod Placement Pipeline

  Unscheduled Pod
        |
        v
  +---------------------+
  |   1. FILTERING      |   "Which nodes CAN run this pod?"
  |                     |
  |   Check each node:  |
  |   - Enough CPU?     |
  |   - Enough memory?  |
  |   - Matching taints |
  |     and tolerations?|
  |   - Affinity rules? |
  |   - Port available? |
  +----------+----------+
             |
             v
       Feasible Nodes
      (nodes that pass)
             |
             v
  +----------+----------+
  |   2. SCORING        |   "Which node is BEST?"
  |                     |
  |   Score each node:  |
  |   - Spread evenly?  |
  |   - Least requested |
  |     resources?      |
  |   - Data locality?  |
  |   - Image already   |
  |     pulled?         |
  +----------+----------+
             |
             v
       Highest Score
       (winner node)
             |
             v
  +----------+----------+
  |   3. BINDING        |   "Assign pod to node"
  |                     |
  |   Update pod's      |
  |   spec.nodeName     |
  |   via API server    |
  +---------------------+

A Concrete Example

Imagine you have three nodes and submit a pod requesting 2 CPUs and 4 GiB of memory:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:latest
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"

The scheduler runs through its pipeline:

  Node A: 8 CPU, 16 Gi (6 CPU free, 10 Gi free)   --> PASS filter
  Node B: 4 CPU, 8 Gi  (1 CPU free, 3 Gi free)    --> FAIL (not enough CPU)
  Node C: 8 CPU, 16 Gi (4 CPU free, 12 Gi free)   --> PASS filter

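  Scoring (simplified to the share of CPU still free, in the spirit of the
  NodeResourcesFit plugin's LeastAllocated strategy, formerly LeastRequestedPriority):
    Node A: 6 of 8 CPU free -> score 75    (more headroom)
    Node C: 4 of 8 CPU free -> score 50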

  Winner: Node A

The scheduler then creates a Binding object, which sets the pod’s spec.nodeName to Node A, and the kubelet on Node A starts the container.
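The walkthrough above can be sketched in a few lines of Go. This is a toy model, not kube-scheduler code: the `Node` type and the `filter`, `score`, and `pick` functions are hypothetical names, and scoring looks only at the share of free CPU, as the example does.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a node: total and free CPU
// cores and memory in GiB. Real nodes expose many more dimensions.
type Node struct {
	Name             string
	CPU, FreeCPU     int
	MemGi, FreeMemGi int
}

// filter keeps only the nodes with enough free CPU and memory for the request.
func filter(nodes []Node, cpu, memGi int) []Node {
	var feasible []Node
	for _, n := range nodes {
		if n.FreeCPU >= cpu && n.FreeMemGi >= memGi {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score rates a node from 0 to 100 by its share of free CPU, matching the
// simplified numbers in the walkthrough.
func score(n Node) int {
	return n.FreeCPU * 100 / n.CPU
}

// pick runs filter, then score, and returns the best node's name,
// or "" when no node is feasible.
func pick(nodes []Node, cpu, memGi int) string {
	best, bestScore := "", -1
	for _, n := range filter(nodes, cpu, memGi) {
		if s := score(n); s > bestScore {
			best, bestScore = n.Name, s
		}
	}
	return best
}

func main() {
	nodes := []Node{
		{"Node A", 8, 6, 16, 10},
		{"Node B", 4, 1, 8, 3},
		{"Node C", 8, 4, 16, 12},
	}
	fmt.Println(pick(nodes, 2, 4)) // Node A
}
```

The real scheduler combines several weighted score plugins and accounts for the incoming pod's own requests, so its numbers will differ from this sketch.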

The Scheduling Framework

Internally, the filter-and-score pipeline is implemented via the Scheduling Framework, a plugin-based architecture introduced as an alpha feature in Kubernetes 1.15. The framework defines extension points, which are hooks where plugins can inject logic:

  Pod enters queue
       |
       v
  [PreEnqueue]     Can this pod enter the active queue?
       |
       v
  [QueueSort]      Order pods in the queue (only one plugin allowed)
       |
       v
  === Scheduling Cycle (runs serially, one pod at a time) ===
       |
       v
  [PreFilter]      Pre-process pod info, check cluster conditions
       |
       v
  [Filter]         Remove infeasible nodes (runs concurrently per node)
       |
       +--> No feasible nodes? --> [PostFilter] (e.g., preemption)
       |
       v
  [PreScore]       Generate shared state for scoring
       |
       v
  [Score]          Rank feasible nodes (integer scores)
       |
       v
  [NormalizeScore] Scale scores to a common range
       |
       v
  [Reserve]        Optimistically claim resources (stateful)
       |
       v
  [Permit]         Final gate: approve, deny, or wait
       |
       v
  === Binding Cycle (runs concurrently) ===
       |
       v
  [PreBind]        Pre-binding work (e.g., provision network volume)
       |
       v
  [Bind]           Assign pod to node (update API server)
       |
       v
  [PostBind]       Cleanup, logging, metrics

Each built-in scheduling feature is a plugin. For example:

Plugin	Extension Points	What It Does
`NodeResourcesFit`	Filter, Score	Checks CPU/memory requests fit; scores by utilization
`NodeAffinity`	Filter, Score	Enforces `nodeAffinity` rules from pod spec
`TaintToleration`	Filter, Score	Matches taints on nodes with tolerations on pods
`InterPodAffinity`	Filter, Score	Enforces pod affinity/anti-affinity
`PodTopologySpread`	Filter, Score	Spreads pods across topology domains (zones, nodes)
`VolumeBinding`	PreFilter, Filter, Reserve, PreBind, Score	Checks that the node has, or can bind, the volumes the pod requests

This plugin architecture makes the scheduler extensible. You can enable, disable, or reorder plugins via KubeSchedulerConfiguration. You can also write your own plugins and compile them into a custom scheduler binary.
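To make the plugin idea concrete, here is a minimal Go model of a pipeline built from pluggable filter and score functions. It is a sketch under stated assumptions: `FilterFunc`, `ScoreFunc`, `normalize`, and `run` are hypothetical names and do not match the real framework interfaces, which take richer arguments. The only framework fact it borrows is that normalized node scores fall in the range 0 to 100.

```go
package main

import "fmt"

// FilterFunc and ScoreFunc are hypothetical stand-ins for the Filter and
// Score extension points.
type FilterFunc func(node string) bool
type ScoreFunc func(node string) int

// normalize rescales non-negative raw scores so the highest becomes 100.
func normalize(raw []int) []int {
	top := 0
	for _, s := range raw {
		if s > top {
			top = s
		}
	}
	out := make([]int, len(raw))
	if top == 0 {
		return out
	}
	for i, s := range raw {
		out[i] = s * 100 / top
	}
	return out
}

// run applies every filter, sums the normalized scores of every scorer,
// and returns the highest-scoring node ("" if none is feasible).
func run(nodes []string, filters []FilterFunc, scorers []ScoreFunc) string {
	var feasible []string
	for _, n := range nodes {
		ok := true
		for _, f := range filters {
			if !f(n) {
				ok = false
				break
			}
		}
		if ok {
			feasible = append(feasible, n)
		}
	}
	total := make([]int, len(feasible))
	for _, sc := range scorers {
		raw := make([]int, len(feasible))
		for i, n := range feasible {
			raw[i] = sc(n)
		}
		for i, s := range normalize(raw) {
			total[i] += s
		}
	}
	best, bestScore := "", -1
	for i, n := range feasible {
		if total[i] > bestScore {
			best, bestScore = n, total[i]
		}
	}
	return best
}

func main() {
	freeCPU := map[string]int{"node-a": 6, "node-b": 1, "node-c": 4}
	hasImage := map[string]int{"node-c": 1}
	untainted := func(n string) bool { return n != "node-b" }
	byCPU := func(n string) int { return freeCPU[n] }
	byImage := func(n string) int { return hasImage[n] }
	fmt.Println(run([]string{"node-a", "node-b", "node-c"},
		[]FilterFunc{untainted}, []ScoreFunc{byCPU, byImage})) // node-c
}
```

Adding or removing a function from either slice changes the outcome without touching `run`, which is the property the plugin architecture provides.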

Running Multiple Schedulers

Kubernetes supports running multiple schedulers side by side. Each pod specifies which scheduler should handle it via the spec.schedulerName field:

apiVersion: v1
kind: Pod
metadata:
  name: tidb-pd-0
spec:
  schedulerName: tidb-scheduler   # <-- use TiDB's custom scheduler
  containers:
  - name: pd
    image: pingcap/pd:latest

If schedulerName is omitted, the pod goes to default-scheduler (kube-scheduler). Each scheduler watches only for pods addressed to it.

A custom scheduler is deployed as a regular Deployment in the cluster:

  +-----------------------------------------------------+
  |                Kubernetes Cluster                   |
  |                                                     |
  |  Control Plane                                      |
  |  +------------------+   +----------------------+    |
  |  | default-scheduler|   |   tidb-scheduler     |    |
  |  | (kube-scheduler) |   |   (custom)           |    |
  |  +--------+---------+   +----------+-----------+    |
  |           |                        |                |
  |           |  watches pods with     |  watches pods  |
  |           |  schedulerName=        |  with          |
  |           |  "default-scheduler"   |  schedulerName=|
  |           |                        |"tidb-scheduler"|
  |           v                        v                |
  |  +--------+---------+   +----------+-----------+    |
  |  |   nginx pod      |   |   tikv-0 pod         |    |
  |  |   redis pod      |   |   pd-0 pod           |    |
  |  |   my-app pod     |   |   tidb-0 pod         |    |
  |  +------------------+   +----------------------+    |
  +-----------------------------------------------------+

Case Study: TiDB’s Custom Scheduler

TiDB is a distributed NewSQL database. Its architecture has three main components that run as pods in Kubernetes:

PD (Placement Driver): The brain — manages metadata and timestamps. Uses Raft consensus, so it needs an odd number of replicas (typically 3 or 5).
TiKV: The storage engine — stores data in Raft groups with 3 replicas by default.
TiDB: The SQL layer — stateless, can run anywhere.

The default Kubernetes scheduler doesn’t understand TiDB’s replication topology. If it happens to place 2 out of 3 PD pods on the same node, and that node goes down, PD loses its Raft quorum and the entire cluster becomes unavailable. TiDB solves this with tidb-scheduler.

How tidb-scheduler Works

The TiDB Operator deploys tidb-scheduler as a scheduler extender: an HTTP service that a dedicated kube-scheduler instance calls during filtering. When the TidbCluster custom resource specifies tidb-scheduler, the pod templates include schedulerName: tidb-scheduler. That scheduler instance picks up these pods, and the extender applies component-specific predicates (filters) to the candidate nodes.

The core logic lives in pkg/scheduler/scheduler.go. A simplified sketch of it:

// Filter selects eligible nodes for a TiDB component pod.
func (s *scheduler) Filter(args *extender.Args) (*extender.Result, error) {
    pod := args.Pod
    component := pod.Labels["app.kubernetes.io/component"]

    // Select predicates based on component type
    var predicates []Predicate
    switch component {
    case "pd", "tikv":
        predicates = append(predicates, NewHA())    // HA spreading
    case "tidb":
        if featureGate.Enabled(StableScheduling) {
            predicates = append(predicates, NewStableScheduling())
        }
    }

    // Apply each predicate to narrow down feasible nodes
    nodes := args.Nodes
    for _, predicate := range predicates {
        nodes = predicate.Filter(nodes, pod)
    }
    return &extender.Result{Nodes: nodes}, nil
}

The HA Predicate: Spreading Pods Across Failure Domains

The HA predicate in pkg/scheduler/predicates/ha.go ensures PD and TiKV pods are spread across topology domains (nodes, racks, or zones).

The topology key is configurable via an annotation on the TidbCluster. The default is kubernetes.io/hostname (spread across nodes). You can set it to topology.kubernetes.io/zone for zone-level spreading.

PD spreading rule: No more than a minority of replicas on one topology. The formula:

$\text{maxPodsPerTopology} = \max\left(1,\ \left\lfloor \frac{\text{replicas} + 1}{2} \right\rfloor - 1\right)$

For 3 PD replicas: $\lfloor(3+1)/2\rfloor - 1 = 1$, so at most 1 PD pod per node. Losing any single node then still leaves a majority (2 out of 3) alive, so quorum is preserved.

  3 PD replicas across 3 nodes (max 1 per node):

  +--------+   +--------+   +--------+
  | Node A |   | Node B |   | Node C |
  |        |   |        |   |        |
  | [pd-0] |   | [pd-1] |   | [pd-2] |
  +--------+   +--------+   +--------+

  Node B goes down:
  - pd-0 on A: alive
  - pd-1 on B: LOST
  - pd-2 on C: alive
  -> 2/3 alive = quorum maintained, cluster stays available
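The PD rule translates directly into integer arithmetic. The following Go sketch mirrors the formula above. The function names are hypothetical, and `quorumSurvives` is an added helper that checks whether a Raft majority remains after one topology domain is lost.

```go
package main

import "fmt"

// maxPDPodsPerTopology mirrors the formula in the text:
// floor((replicas+1)/2) - 1, with a minimum of 1.
func maxPDPodsPerTopology(replicas int) int {
	m := (replicas+1)/2 - 1 // integer division floors for positive values
	if m < 1 {
		m = 1
	}
	return m
}

// quorumSurvives reports whether a strict majority of replicas is still
// alive after losing `lost` of them.
func quorumSurvives(replicas, lost int) bool {
	return replicas-lost >= replicas/2+1
}

func main() {
	for _, r := range []int{1, 3, 5, 7} {
		m := maxPDPodsPerTopology(r)
		fmt.Printf("replicas=%d max=%d quorumAfterLoss=%v\n", r, m, quorumSurvives(r, m))
	}
}
```

For 3, 5, and 7 replicas the worst-case loss of one full domain leaves a majority. With a single replica the minimum of 1 applies and no placement can preserve quorum, which is why PD is normally run with at least 3 replicas.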

TiKV spreading rule: TiKV uses 3-copy Raft groups by default. With 3+ replicas, the scheduler requires at least 3 topology domains and distributes evenly:

$\text{maxPodsPerTopology} = \left\lceil \frac{\text{replicas}}{3} \right\rceil$

For 5 TiKV replicas: $\lceil 5/3 \rceil = 2$. Pods distribute as 1-2-2 (in any order) across 3 nodes.

  5 TiKV replicas across 3 nodes (max 2 per node):

  +----------+   +----------+   +----------+
  | Node A   |   | Node B   |   | Node C   |
  |          |   |          |   |          |
  | [tikv-0] |   | [tikv-1] |   | [tikv-3] |
  |          |   | [tikv-2] |   | [tikv-4] |
  +----------+   +----------+   +----------+

  Node B goes down:
  - Lost 2 out of 5 TiKV instances
  - Each Raft group has 3 replicas spread across nodes
  - At most 1 replica per group lost -> quorum maintained
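A short Go sketch shows how the TiKV cap could drive a filter. It is an illustration of the rule as stated above, not the operator's code: `maxTiKVPodsPerTopology` and `haFilter` are hypothetical names, and the formula is applied only for the 3+ replica case the text describes.

```go
package main

import (
	"fmt"
	"sort"
)

// maxTiKVPodsPerTopology computes ceil(replicas/3) with integer math.
func maxTiKVPodsPerTopology(replicas int) int {
	return (replicas + 2) / 3
}

// haFilter returns the topology domains that can take one more pod
// without exceeding the cap, sorted for stable output.
func haFilter(podsPerDomain map[string]int, limit int) []string {
	var ok []string
	for domain, n := range podsPerDomain {
		if n < limit {
			ok = append(ok, domain)
		}
	}
	sort.Strings(ok)
	return ok
}

func main() {
	limit := maxTiKVPodsPerTopology(5)
	// Four of five TiKV pods already placed; node-b is at the cap.
	placed := map[string]int{"node-a": 1, "node-b": 2, "node-c": 1}
	fmt.Println(limit, haFilter(placed, limit)) // 2 [node-a node-c]
}
```

The fifth pod may land on node-a or node-c, giving a 2-2-1 or 1-2-2 layout, but never a third pod on node-b.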

Scheduling Serialization

A subtle problem: if the scheduler evaluates two TiKV pods concurrently, both might see the same cluster state and both decide to land on the same node, violating the HA constraint.

tidb-scheduler solves this with scheduling serialization. It uses an annotation (AnnPVCPodScheduling) as a lock. As the constant's name indicates, the annotation lives on the pod's PVC:

  tikv-0 scheduling:
    1. Set annotation "scheduling=tikv-0" on tikv-0's PVC
    2. Filter nodes, pick Node A
    3. Bind tikv-0 to Node A
    4. Wait for PVC to bind
    5. Clear annotation

  tikv-1 scheduling:
    1. See annotation "scheduling=tikv-0" -> wait
    2. Annotation cleared -> proceed
    3. Set annotation "scheduling=tikv-1"
    4. Filter nodes (now sees tikv-0 on Node A)
    5. Pick Node B
    ...

This ensures each pod sees the effect of previously scheduled pods, maintaining HA invariants.
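The locking protocol can be modeled in memory. The sketch below is a stand-in, assuming a single process: `schedulingLock`, `tryAcquire`, and `release` are hypothetical names, and a mutex-guarded field replaces the annotation that tidb-scheduler stores in the cluster.

```go
package main

import (
	"fmt"
	"sync"
)

// schedulingLock is an in-memory stand-in for the annotation used as a lock.
type schedulingLock struct {
	mu     sync.Mutex
	holder string // pod currently being scheduled; "" when free
}

// tryAcquire claims the lock for pod. It succeeds when the lock is free
// or already held by the same pod (so a retry does not deadlock).
func (l *schedulingLock) tryAcquire(pod string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" || l.holder == pod {
		l.holder = pod
		return true
	}
	return false
}

// release clears the lock, but only if pod is the current holder.
func (l *schedulingLock) release(pod string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == pod {
		l.holder = ""
	}
}

func main() {
	var lock schedulingLock
	fmt.Println(lock.tryAcquire("tikv-0")) // true
	fmt.Println(lock.tryAcquire("tikv-1")) // false: tikv-0 is still being scheduled
	lock.release("tikv-0")                 // tikv-0 bound and its PVC bound
	fmt.Println(lock.tryAcquire("tikv-1")) // true
}
```

A pod that fails to acquire the lock is simply rejected for this attempt and retried later, by which time the earlier pod's placement is visible.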

Where Karpenter Fits In

Karpenter is not a scheduler. It is a node provisioner that works alongside the scheduler.

Here is the sequence when a pod cannot be scheduled due to lack of capacity:

  1. You submit a pod requesting 8 CPUs
  2. kube-scheduler tries to place it
  3. No node has 8 free CPUs -> pod marked Unschedulable
  4. Karpenter detects the Unschedulable pod
  5. Karpenter provisions a new EC2 instance (node)
  6. New node registers with the cluster
  7. kube-scheduler sees the new node, places the pod

  +--------------------+       +-------------------+
  |   kube-scheduler   |       |    Karpenter      |
  |                    |       |                   |
  |  "Where should     |       |  "Do we need      |
  |   this pod run?"   |       |   more nodes?"    |
  +--------+-----------+       +--------+----------+
           |                            |
           v                            v
  +--------+-----------+       +--------+----------+
  |  Assigns pods to   |       |  Creates/removes  |
  |  existing nodes    |       |  cloud VMs        |
  +--------------------+       +-------------------+
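The division of labor can be simulated in a few lines of Go. This is a toy model with hypothetical names (`node`, `schedule`, `provision`, `place`): the "provisioner" simply creates a node sized to the request, whereas Karpenter chooses from real instance types.

```go
package main

import "fmt"

type node struct {
	name    string
	freeCPU int
}

// schedule plays the scheduler: it returns the index of the first node
// with enough free CPU, or -1 if the pod is unschedulable.
func schedule(nodes []node, cpu int) int {
	for i, n := range nodes {
		if n.freeCPU >= cpu {
			return i
		}
	}
	return -1
}

// provision plays the node provisioner: it returns a new node large
// enough for the pending request.
func provision(cpu int) node {
	return node{name: "provisioned-node", freeCPU: cpu}
}

// place tries to schedule, provisions a node on failure, then retries.
// It returns the updated node list, the chosen node's name, and whether
// a new node had to be created.
func place(nodes []node, cpu int) ([]node, string, bool) {
	provisioned := false
	i := schedule(nodes, cpu)
	if i < 0 {
		nodes = append(nodes, provision(cpu))
		provisioned = true
		i = schedule(nodes, cpu)
	}
	nodes[i].freeCPU -= cpu
	return nodes, nodes[i].name, provisioned
}

func main() {
	nodes := []node{{"node-a", 4}, {"node-b", 2}}
	nodes, where, grew := place(nodes, 8)
	fmt.Println(where, grew, len(nodes)) // provisioned-node true 3
	nodes, where, grew = place(nodes, 2)
	fmt.Println(where, grew, len(nodes)) // node-a false 3
}
```

Note that `provision` never picks a node for the pod. Placement always goes back through `schedule`, which is the point of the separation.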

Karpenter vs. Cluster Autoscaler

Before Karpenter, the standard tool for node provisioning was Cluster Autoscaler. The key differences:

Aspect	Cluster Autoscaler	Karpenter
Scope	Scales existing node groups	Provisions individual nodes
Instance selection	Fixed instance types per group	Chooses instance types dynamically
Speed	Typically minutes (adjusts ASG desired count)	Typically faster (launches instances directly; the node still has to boot)
Consolidation	Limited	Actively consolidates underutilized nodes
Cloud support	Multi-cloud	Primarily AWS (community providers for Azure/GCP)

Karpenter’s NodePool custom resource defines constraints:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
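    spec:
      nodeClassRef:              # required by karpenter.sh/v1; assumes an EC2NodeClass named "default" exists
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default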
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m5.xlarge", "m5.2xlarge", "m6g.xlarge"]
  limits:
    cpu: "100"       # max 100 CPUs across all nodes in this pool
  disruption:
    consolidateAfter: 30s

When Karpenter sees unschedulable pods, it chooses from the allowed list an instance type that satisfies their resource requests and scheduling constraints, favoring lower-cost options, and launches it. Once the node registers, the scheduler takes over.
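As a rough model of that selection step, the Go sketch below picks the cheapest allowed instance type that fits a single pod's requests. It is a simplification under stated assumptions: `instanceType` and `cheapestFit` are hypothetical names, the hourly prices are placeholder numbers rather than real AWS prices, and the model ignores batching of pods, spot capacity, and the gap between a node's capacity and its allocatable resources.

```go
package main

import "fmt"

// instanceType is a hypothetical catalog entry. The vCPU and memory sizes
// match the named EC2 types; the prices are placeholders.
type instanceType struct {
	name   string
	cpu    int     // vCPUs
	memGi  int     // memory in GiB
	hourly float64 // placeholder price, NOT a real AWS price
}

// cheapestFit returns the lowest-priced type that satisfies the request,
// and false when nothing in the list is large enough.
func cheapestFit(types []instanceType, cpu, memGi int) (instanceType, bool) {
	var best instanceType
	found := false
	for _, t := range types {
		if t.cpu < cpu || t.memGi < memGi {
			continue
		}
		if !found || t.hourly < best.hourly {
			best, found = t, true
		}
	}
	return best, found
}

func main() {
	allowed := []instanceType{
		{"m5.xlarge", 4, 16, 0.20},
		{"m5.2xlarge", 8, 32, 0.40},
		{"m6g.xlarge", 4, 16, 0.15},
	}
	small, _ := cheapestFit(allowed, 2, 4)
	large, _ := cheapestFit(allowed, 8, 16)
	_, ok := cheapestFit(allowed, 16, 64)
	fmt.Println(small.name, large.name, ok) // m6g.xlarge m5.2xlarge false
}
```

When nothing fits, as in the last call, the pod stays pending. Widening the NodePool's allowed instance types is the usual remedy.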

Putting It All Together

Here is how all the pieces interact in a TiDB-on-Kubernetes deployment:

  kubectl apply TidbCluster CR
         |
         v
  TiDB Operator (tidb-controller-manager)
  - Creates StatefulSets for PD, TiKV, TiDB
  - Sets schedulerName: tidb-scheduler in pod templates
         |
         v
  Pods created (Unscheduled, schedulerName=tidb-scheduler)
         |
         v
  tidb-scheduler picks up pods
  - Applies HA predicate: spread across nodes/zones
  - Filters nodes, scores, binds
         |
         +--> Not enough nodes?
         |         |
         |         v
         |    Pod stays Unschedulable
         |         |
         |         v
         |    Karpenter detects it
         |    - Provisions new node matching constraints
         |    - Node joins cluster
         |         |
         |         v
         |    tidb-scheduler retries
         |    - New node is now feasible
         |
         v
  Pod bound to node
  - kubelet starts container
  - PD/TiKV joins the cluster

Summary

Component	Role	When It Acts
kube-scheduler	Assigns pods to existing nodes	Pod created without `nodeName`
Scheduling Framework	Plugin-based filter/score pipeline	Inside kube-scheduler
Custom scheduler (e.g., tidb-scheduler)	Domain-specific scheduling logic	Pods with matching `schedulerName`
Karpenter	Provisions/removes cloud nodes	Pods stuck as Unschedulable
Cluster Autoscaler	Scales node groups up/down	Pods stuck as Unschedulable

The scheduler answers “where should this pod run?” Karpenter answers “do we have enough infrastructure to run it?” Custom schedulers like tidb-scheduler answer “where should this pod run given my application’s topology requirements?”

References

Kubernetes documentation: kube-scheduler
Kubernetes documentation: Scheduling Framework
Kubernetes documentation: Configure Multiple Schedulers
Karpenter documentation: Concepts
TiDB Operator documentation: Architecture
TiDB Operator source: pkg/scheduler/scheduler.go
TiDB Operator source: pkg/scheduler/predicates/ha.go
Kubernetes Scheduler Plugins repository