System Design - How Protocol Buffers (Protobuf) Encoding Works

Open Table of contents

Context
The .proto Schema
Wire Types
Varint Encoding: The Core Trick
Encoding a Complete Message
How the Decoder Works
Signed Integers: ZigZag Encoding
Nested Messages and Repeated Fields
Source Code: Google’s C++ Implementation
Why Not Just Use JSON?
How Protobuf Achieves Forward and Backward Compatibility
References

Context

When two services communicate over a network, they need to agree on how to turn structured data (objects, structs, messages) into bytes that travel over the wire, and back again. This is serialization.

You could use JSON. It is human-readable and universal. But JSON is text-based: numbers become strings of ASCII digits, field names repeat in every single message, and parsing requires scanning for quotes, colons, and commas character by character.

In high-throughput systems — think a microservice handling 100,000 RPC calls per second — JSON’s overhead adds up. Google invented Protocol Buffers (protobuf) in 2001 to solve this at scale. Today, protobuf is the default serialization format behind gRPC, used by Google, Netflix, Square, and many others.

Protobuf messages are:

Smaller than JSON (no field name repetition, compact numeric encoding)
Faster to parse (no string scanning — fixed rules decode each byte)
Schema-driven (a .proto file defines the structure at compile time)

Let’s look at exactly how bytes are laid out on the wire.

The .proto Schema

Before encoding, you define your message structure:

syntax = "proto3";

message Person {
  string name   = 1;   // field number 1
  int32  age    = 2;   // field number 2
  string email  = 3;   // field number 3
}

The numbers (1, 2, 3) are field tags — not values, but identifiers. They are the key to protobuf’s compactness: on the wire, field names like "name" or "email" never appear. Only the small integer tag travels with the data.

Wire Types

Every piece of data on the wire is prefixed by a tag byte that encodes two things:

The field number (which field is this?)
The wire type (how many bytes follow?)

Tag byte layout:

  +--+--+--+--+--+--+--+--+
  |  field_number  |  type |
  +--+--+--+--+--+--+--+--+
   bits 7..3         bits 2..0

  tag = (field_number << 3) | wire_type

There are six wire types, but most data uses three:

  Wire Type   Meaning                  Used For
  ---------   ----------------------   ---------------------------
  0           Varint                   int32, int64, uint32, bool
  1           64-bit fixed             double, fixed64
  2           Length-delimited         string, bytes, embedded msg
  5           32-bit fixed             float, fixed32

Wire types 3 and 4 (start/end group) are deprecated.

Varint Encoding: The Core Trick

The most important encoding in protobuf is the varint — a variable-length integer that uses fewer bytes for smaller numbers.

The rule is simple: each byte uses 7 bits for data and 1 bit (the MSB) as a continuation flag.

  Encoding the number 300:

  300 in binary = 1 0010 1100

  Split into 7-bit groups (little-endian order):
    group 1:  010 1100   (low 7 bits)
    group 2:  000 0010   (next 7 bits)

  Add continuation bits (MSB = 1 means "more bytes follow"):
    byte 1:  1_010 1100  = 0xAC  (MSB=1: more coming)
    byte 2:  0_000 0010  = 0x02  (MSB=0: last byte)

  On the wire: AC 02

Small numbers (0–127) fit in a single byte. The number 1 is just 0x01. This is why protobuf is compact for typical data — most field tags and many values are small.

  Varint examples:

  Value     Encoded bytes       Size
  -----     ---------------     ----
  1         01                  1 byte
  127       7F                  1 byte
  128       80 01               2 bytes
  300       AC 02               2 bytes
  16384     80 80 01            3 bytes

Encoding a Complete Message

Let’s encode a Person message with concrete values:

Person {
  name  = "Al"     // field 1, wire type 2 (length-delimited)
  age   = 25       // field 2, wire type 0 (varint)
  email = "a@b.c"  // field 3, wire type 2 (length-delimited)
}

Step by step:

  Field: name (field_number=1, wire_type=2)
  Tag:   (1 << 3) | 2 = 0x0A
  Length: 2 (two bytes of UTF-8)
  Data:  0x41 0x6C ("Al")

  Field: age (field_number=2, wire_type=0)
  Tag:   (2 << 3) | 0 = 0x10
  Data:  0x19 (varint for 25)

  Field: email (field_number=3, wire_type=2)
  Tag:   (3 << 3) | 2 = 0x1A
  Length: 5
  Data:  0x61 0x40 0x62 0x2E 0x63 ("a@b.c")

  Complete message on the wire (14 bytes):

  0A 02 41 6C 10 19 1A 05 61 40 62 2E 63
  |     |     |  |  |     |
  |     "Al"  |  25 |     "a@b.c"
  tag1        tag2  tag3

The equivalent JSON {"name":"Al","age":25,"email":"a@b.c"} is 38 bytes — nearly 3x larger.

How the Decoder Works

Decoding is a simple loop:

  decode(bytes):
      while bytes remaining:
          tag_byte = read_varint(bytes)
          field_number = tag_byte >> 3
          wire_type    = tag_byte & 0x07

          switch wire_type:
              case 0:  value = read_varint(bytes)
              case 1:  value = read_fixed_64(bytes)
              case 2:  length = read_varint(bytes)
                       value  = read_bytes(bytes, length)
              case 5:  value = read_fixed_32(bytes)

          store(field_number, value)

  Decoding flow for our Person message:

  Bytes: 0A 02 41 6C 10 19 1A 05 61 40 62 2E 63
         ^
         |
  +------+------+------+------+------+------+
  | Read 0x0A   | tag: field=1, type=2       |
  | Read 0x02   | length = 2                 |
  | Read 2 bytes| "Al"                       |
  +-------------+----------------------------+
  | Read 0x10   | tag: field=2, type=0       |
  | Read 0x19   | varint = 25               |
  +-------------+----------------------------+
  | Read 0x1A   | tag: field=3, type=2       |
  | Read 0x05   | length = 5                 |
  | Read 5 bytes| "a@b.c"                    |
  +-------------+----------------------------+

Notice that the decoder does not need the .proto schema to skip fields. If it encounters an unknown field number, it knows exactly how many bytes to skip based on the wire type alone. This is how protobuf achieves forward compatibility — old code can skip new fields it doesn’t recognize.

Signed Integers: ZigZag Encoding

Standard varints are efficient for positive numbers, but negative numbers in two’s complement have their MSB set, making them large (10 bytes for any negative int64!). Protobuf solves this with ZigZag encoding for sint32/sint64 types:

  ZigZag maps signed integers to unsigned:

  Original    Encoded
  --------    -------
   0           0
  -1           1
   1           2
  -2           3
   2           4
  ...         ...

  Formula:  zigzag(n) = (n << 1) ^ (n >> 31)   // for int32

This keeps small-magnitude negative numbers small on the wire:

  Value    varint(sint32)   varint(int32)
  -----    --------------   -------------
  -1       01 (1 byte)      FF FF FF FF FF FF FF FF FF 01 (10 bytes!)
  -2       03 (1 byte)      FE FF FF FF FF FF FF FF FF 01 (10 bytes!)

Nested Messages and Repeated Fields

Embedded messages use wire type 2 (length-delimited) — they are just bytes within bytes:

message Address {
  string city = 1;
}
message Person {
  string  name    = 1;
  Address address = 4;
}

  Encoding Person { name="Jo", address={ city="NY" } }:

  0A 02 4A 6F        field 1: "Jo"
  22 04              field 4: length-delimited, 4 bytes follow
     0A 02 4E 59    nested Address: field 1 = "NY"
       ^--- this is a full protobuf message inside the outer one

Repeated fields (arrays) are encoded as packed varints for numeric types:

message Scores {
  repeated int32 values = 1;
}
// Scores { values = [3, 270, 86942] }

  Tag:    0x0A (field 1, wire type 2)
  Length: 0x06 (6 bytes of packed data)
  Data:   03 8E 02 9E A7 05

  Breakdown:
    03         -> 3
    8E 02      -> 270   (0001 0001110 -> remove continuation -> 270)
    9E A7 05   -> 86942

Source Code: Google’s C++ Implementation

The core encoding lives in google/protobuf/io/coded_stream.h. Key pieces:

Writing a varint (from coded_stream.cc):

uint8_t* CodedOutputStream::WriteVarint64ToArrayInline(
    uint64_t value, uint8_t* target) {
  while (value >= 0x80) {
    *target = static_cast<uint8_t>(value | 0x80);
    value >>= 7;
    ++target;
  }
  *target = static_cast<uint8_t>(value);
  return target + 1;
}

Reading a varint (from coded_stream.cc):

bool CodedInputStream::ReadVarint64Slow(uint64_t* value) {
  uint64_t result = 0;
  int count = 0;
  uint32_t b;
  do {
    b = ReadRaw(1);
    result |= static_cast<uint64_t>(b & 0x7F) << (7 * count);
    ++count;
  } while (b & 0x80);
  *value = result;
  return true;
}

The write path is branchless-optimized for 1-2 byte varints (the common case). The read path uses a loop but processes exactly one byte per iteration with no allocation.

Why Not Just Use JSON?

A side-by-side comparison:

  Feature            Protobuf                JSON
  -------            --------                ----
  Size               Compact (binary)        Verbose (text)
  Parse speed        O(n), no scanning       O(n), string scanning
  Schema             Required (.proto)       Optional
  Human-readable     No                      Yes
  Forward compat     Yes (skip unknown)      Partial
  Field ordering     Not required            Not required
  Null handling      Default values omitted  Explicit null

  Size comparison for 1000 Person records:

  +---------------------------+
  |  JSON:      ~45 KB        |
  |  Protobuf:  ~14 KB        |  ~3x smaller
  +---------------------------+
  |  JSON parse:   12 ms      |
  |  Proto parse:   2 ms      |  ~6x faster
  +---------------------------+
  (Approximate; varies by data and implementation)

The tradeoff is clear: if humans need to read the data (config files, APIs for browser clients), use JSON. If machines talk to machines at high volume (internal RPCs, storage formats, streaming pipelines), protobuf wins.

How Protobuf Achieves Forward and Backward Compatibility

The field-tag design makes schema evolution safe:

  Rule 1: Never reuse a field number
  Rule 2: New fields get new numbers
  Rule 3: Use 'reserved' for retired fields

  v1 schema:              v2 schema (new field added):
  message Person {        message Person {
    string name = 1;        string name = 1;
    int32  age  = 2;        int32  age  = 2;
  }                         string phone = 4;  // new!
                          }

  Old code reading v2 data:
    sees tag 4 → unknown field → reads wire_type to skip → done
    (no crash, no corruption)

  New code reading v1 data:
    never sees tag 4 → field 'phone' stays at default ("")
    (no crash, no corruption)

This is why protobuf field numbers are permanent. Once assigned, a number means that field forever.

References

Protocol Buffers encoding documentation: protobuf.dev/programming-guides/encoding
Protocol Buffers source code: github.com/protocolbuffers/protobuf
Google’s original protobuf paper (2008): “Protocol Buffers: Google’s Data Interchange Format”
Varint encoding: protobuf.dev/programming-guides/encoding/#varints
gRPC uses protobuf as its IDL and serialization format: grpc.io