[Protobuf] How Varint encoding works and the importance of field numbers

What is a protocol buffer?
1. Protobuf Basics
2. Importance of field numbers
wire format
Usage scenarios for Protobuf

What is a protocol buffer?

Protocol Buffers (protobuf) is a format for serializing structured data, developed by Google.
Protobuf typically serializes data more compactly than XML or JSON, and the resulting messages are faster to parse.
When an app and a server exchange data using Protobuf, both sides must share the same message definition (.proto file). This is necessary for the message to be serialized and deserialized correctly.

Protobuf Basics

.proto file: The schema of a protobuf message is described in a .proto file. This file defines the message types and specifies the name, type, and number of each field.
Message: The basic unit used to define a data structure in protobuf. A message is made up of fields.
Field types: Fields can be of various types, including scalar types (numeric, string, boolean, etc.), other messages, and enumerated types.
Field number: Each field is assigned a number that uniquely identifies it within the message. This number plays an important role in the wire format.

syntax = "proto3";

message Person {
  string name = 1;
  int32 id = 2;
  bool has_pet = 3;
}

Reference: Language Guide (proto 3). Covers how to use the proto3 revision of the Protocol Buffers language in your project.

Importance of field numbers

Field numbers are used to uniquely identify each field within the serialized data. When the receiver deserializes the message, it uses these numbers to properly interpret the data in each field and reconstruct the original message structure.

Therefore, field numbers must be unique within a message definition, and once a number is in use it must not be changed, so that existing serialized data and deployed code stay compatible.

For example, if you set name = "Alice", id = 123, and has_pet = true for the message definition just described, the entire serialized message is the following sequence of bytes:

0a 05 41 6c 69 63 65 10 7b 18 01

In the message byte sequence, the field number is encoded in a key that precedes each field value. The key is calculated from the field number and the wire type of that field as (field number << 3) | wire type.

0a: Key of the name field. The field number is 1 and the wire type is 2 (length-delimited), so the key is (1 << 3) | 2 = 0x0a.
05: Varint giving the length of the string “Alice” (5 bytes).
41 6c 69 63 65: UTF-8 encoding of “Alice”.
10: Key of the id field. The field number is 2 and the wire type is 0 (varint), so the key is (2 << 3) | 0 = 0x10 (16 in decimal).
7b: Varint representing the number 123.
18: Key of the has_pet field. The field number is 3 and the wire type is 0, so the key is (3 << 3) | 0 = 0x18 (24 in decimal).
01: Varint representing true (has_pet).
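
The key calculation in the list above can be reproduced with a few lines of Python. This is only a minimal sketch: make_key and split_key are illustrative names, not part of any protobuf library.

```python
# Wire type numbers used in this article's examples.
WIRE_VARINT = 0  # int32, bool, ...
WIRE_LEN = 2     # string and other length-delimited values


def make_key(field_number, wire_type):
    """Pack a field number and a wire type into one key value."""
    return (field_number << 3) | wire_type


def split_key(key):
    """Recover (field_number, wire_type) from a key value."""
    return key >> 3, key & 0x07


for name, number, wire_type in [
    ("name", 1, WIRE_LEN),
    ("id", 2, WIRE_VARINT),
    ("has_pet", 3, WIRE_VARINT),
]:
    print(f"{name}: key = 0x{make_key(number, wire_type):02x}")
# name: key = 0x0a
# id: key = 0x10
# has_pet: key = 0x18
```

The low 3 bits of the key carry the wire type and the remaining bits carry the field number, which is why the receiver can recover both from a single value.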

The receiver reads this byte sequence from the beginning and parses each key to identify which fields have what type of data. The values of each field are then restored using appropriate decoding techniques. Through this process, the original message structure is accurately reproduced.
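
To make the receiving side concrete, here is a rough Python sketch of that parsing loop. It is a hand-written illustration that handles only the two wire types used in this example (0 and 2); read_varint and parse_fields are hypothetical helper names, and a real application would use the code generated by protoc instead.

```python
def read_varint(buf, pos):
    """Return (value, next_pos) for the varint that starts at buf[pos]."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # the low 7 bits are payload
        if (byte & 0x80) == 0:           # continuation bit clear: last byte
            return value, pos
        shift += 7


def parse_fields(buf):
    """Return a list of (field_number, wire_type, raw_value) tuples."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited: length, then that many bytes
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


data = bytes.fromhex("0a 05 41 6c 69 63 65 10 7b 18 01")
for field in parse_fields(data):
    print(field)
# (1, 2, b'Alice')
# (2, 0, 123)
# (3, 0, 1)
```

Note that the bytes alone only give field numbers and raw values. Mapping field 1 to name, or interpreting the 1 in field 3 as true, requires the shared .proto definition.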

wire format

The wire format defines how a message is serialized as a sequence of bytes. It is compact and adds very little overhead when parsing and serializing messages. Varint encoding is one of the building blocks of the wire format.

Reference: Encoding. Explains how Protocol Buffers encodes data to files or to the wire.

Basic Varint encoding mechanism

A varint uses a variable number of bytes depending on the size of the integer, so small values take up little space.
The most significant bit of each byte (MSB, the leftmost bit) is called the continuation bit and indicates whether the next byte is also part of the varint. If this bit is 1, the next byte is also part of this number.
The remaining 7 bits of each byte are used as the payload, and these bits are combined to form the final number.
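
The mechanism can be sketched as a small encoder. This is a simplified illustration for non-negative integers only; encode_varint is an assumed helper name, not a protobuf library function.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        payload = value & 0x7F  # take the low 7 bits
        value >>= 7
        if value:
            out.append(payload | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(payload)         # last byte: continuation bit stays 0
            return bytes(out)


for n in (1, 127, 128, 300, 2**32 - 1):
    encoded = encode_varint(n)
    print(n, "->", encoded.hex(" "), f"({len(encoded)} byte(s))")
# 1 -> 01 (1 byte(s))
# 127 -> 7f (1 byte(s))
# 128 -> 80 01 (2 byte(s))
# 300 -> ac 02 (2 byte(s))
# 4294967295 -> ff ff ff ff 0f (5 byte(s))
```

Values up to 127 fit in a single byte, while a full 32-bit value needs 5 bytes, one more than a fixed 4-byte integer. Varints pay off when most values are small.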

message Test1 {
  optional int32 a = 1;
}

If we set field a to 150 in the Test1 message, the encoded message is the byte sequence 08 96 01. Because a has field number 1 and wire type 0 (varint), its key is 08, and the value 150 is encoded as the varint 96 01.

Size comparison after numerical serialization (VarInt, JSON, XML)

Compare the data size when serializing the number 150 as a varint, as JSON, and as XML:

VarInt: 96 01, which is 2 bytes (3 bytes, 08 96 01, once the field key is included).
JSON: Simply the string 150; in UTF-8, standard ASCII characters (including the digits 0-9) are encoded in 1 byte each, so 150 requires 3 bytes. For objects containing keys (e.g., {"a": 150}), the additional characters must also be taken into account. In this example, the characters {, ", a, ", :, space (optional), 1, 5, 0, and } are included, for a total of 10 bytes (9 bytes if the space is omitted).
XML: <a>150</a>. <a> is 3 bytes, 150 is 3 bytes, and </a> is 4 bytes, for a total of 10 bytes.

So the value takes 2 bytes as a varint, 9 to 10 bytes as JSON, and 10 bytes as XML. Protobuf, which uses varints, can reduce the data size.
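
These byte counts can be checked with a short Python script. It is only a sanity check of the numbers above: the varint is built by hand rather than with a protobuf library, and the key byte 0x08 assumes field number 1 with wire type 0, as in the Test1 message.

```python
import json

value = 150

# Build the varint payload by hand (7 bits per byte, low group first).
varint = bytearray()
n = value
while True:
    byte = n & 0x7F
    n >>= 7
    if n:
        varint.append(byte | 0x80)
    else:
        varint.append(byte)
        break

field = bytes([0x08]) + bytes(varint)  # key for field 1, wire type 0, then the value
json_compact = json.dumps({"a": value}, separators=(",", ":")).encode("utf-8")
json_spaced = json.dumps({"a": value}).encode("utf-8")
xml = f"<a>{value}</a>".encode("utf-8")

print("varint only:     ", len(varint))        # 2
print("protobuf field:  ", len(field))         # 3
print("JSON (no space): ", len(json_compact))  # 9
print("JSON (space):    ", len(json_spaced))   # 10
print("XML:             ", len(xml))           # 10
```

Even with the key byte included, the protobuf field is 3 bytes, because the field name never appears on the wire.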

Why does varint-encoding 150 give 96 01?

Converting 150 to binary would be 10010110. However, Protocol Buffers varint encodes the number in a series of 7-bit groups. Then, one bit is added before each group to indicate whether that group is the last or not. This is the continuation bit.

If the continuation bit is 1, it means “numbers still going on”; if it is 0, it is the last group.

To encode 150 (binary 10010110) as a varint, the bits are split into groups of 7, and a continuation bit is added to each group. Because 150 is too large to fit in 7 bits (7 bits can hold at most 2^7 - 1 = 127), it must be split across two bytes.

The first 7 bits (counting from the right) are 0010110, which is the least significant 7 bits of 150.
The remainder is 0000001. These are the upper bits.

Add continuation bits before these groups

Add continuation bit 1 to 0010110 to make it 10010110. This means “still continues”.
Add continuation bit 0 to 0000001 to make it 00000001. This means “this is the end”.

The final result is the byte sequence 10010110 00000001, or 96 01 in hexadecimal, with the least significant group first. This is the result of encoding 150 as a varint.
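
The same steps can be traced in Python by printing each byte as a bit string. This is a small sketch for illustration; varint_groups is an assumed helper name.

```python
def varint_groups(value):
    """Return the varint bytes of a non-negative integer as 8-character bit strings."""
    groups = []
    while True:
        payload = value & 0x7F             # low 7 bits of what is left
        value >>= 7
        continuation = 1 if value else 0   # 1 means more groups follow
        groups.append(f"{continuation}{payload:07b}")
        if not value:
            return groups


groups = varint_groups(150)
print(groups)                                     # ['10010110', '00000001']
print(bytes(int(g, 2) for g in groups).hex(" "))  # 96 01
```

The first character of each string is the continuation bit and the other seven are the payload, matching the two groups derived by hand.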

Backward compatibility with Protobuf

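Protobuf is designed with backward compatibility in mind. As long as new fields are added under new field numbers and existing fields are left untouched, already deployed services and new services remain compatible.
Instead of deleting a field, it is recommended to deprecate it and add a new field. If a field is removed anyway, its number should be reserved so that it is never reused.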

// Old version
message Person {
  string name = 1;
  int32 id = 2;
}

// New version (field added)
message Person {
  string name = 1;
  int32 id = 2;
  bool has_pet = 3; // new field
}

// New version (field removed)
message Person {
  string name = 1;
  int32 id = 2;
  // The has_pet field has been removed; reserve its number so it is never reused.
  reserved 3;
}
}
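
The reason adding a field is safe can be illustrated with a toy parser in Python. This is a simplified sketch, not how the official runtimes are implemented: the schema dictionaries and the parse function are hypothetical, only wire types 0 and 2 are handled, and unknown fields are simply ignored here.

```python
def read_varint(buf, pos):
    """Return (value, next_pos) for the varint that starts at buf[pos]."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def parse(buf, schema):
    """Decode buf using schema, a dict of field number -> (name, default value)."""
    message = {name: default for name, default in schema.values()}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        if number in schema:
            name, default = schema[number]
            message[name] = bool(value) if isinstance(default, bool) else value
        # Field numbers missing from the schema are skipped, not treated as errors.
    return message


OLD_SCHEMA = {1: ("name", ""), 2: ("id", 0)}
NEW_SCHEMA = {1: ("name", ""), 2: ("id", 0), 3: ("has_pet", False)}

old_bytes = bytes.fromhex("0a 05 41 6c 69 63 65 10 7b")  # written without has_pet
new_bytes = old_bytes + bytes.fromhex("18 01")            # written with has_pet = true

print(parse(new_bytes, OLD_SCHEMA))  # {'name': 'Alice', 'id': 123}
print(parse(old_bytes, NEW_SCHEMA))  # {'name': 'Alice', 'id': 123, 'has_pet': False}
```

An old reader skips field 3 because the wire type tells it how many bytes to step over, and a new reader falls back to the default value when field 3 is absent. Reusing number 3 for a different field later would break this, which is why the number must stay fixed.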

Advanced field types

Repeated Fields: Used when the same field appears multiple times. That is, multiple values can be in one field. They are treated as arrays or lists.
Map Fields: Fields to hold key/value pairs.
Oneof Fields: Ensures that only one of several fields has a value at the same time.

syntax = "proto3";

message Person {
  string name = 1;
  int32 id = 2;
  bool has_pet = 3;
  repeated string emails = 4;
  map<string, string> attributes = 5;

  oneof contact_info {
    string email = 6;
    string phone = 7;
  }
}

Incidentally, when a Person message like this is serialized by Protobuf, it becomes binary data like the bytes shown below. This binary data is what is actually sent and received during communication.

0a 05 41 6c 69 63 65
10 7b
18 01
22 11 61 6c 69 63 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d
22 16 61 6c 69 63 65 2e 77 6f 72 6b 40 65 78 61 6d 70 6c 65 2e 63 6f 6d
2a 09 0a 03 61 67 65 12 02 33 30
2a 10 0a 04 63 69 74 79 12 08 4e 65 77 20 59 6f 72 6b
32 11 63 6f 6e 74 61 63 74 40 61 6c 69 63 65 2e 63 6f 6d

Expressed as JSON, the data in this message has the following structure.

{
  "name": "Alice",
  "id": 123,
  "has_pet": true,
  "emails": [
    "alice@example.com",
    "alice.work@example.com"
  ],
  "attributes": {
    "age": "30",
    "city": "New York"
  },
  "email": "contact@alice.com"
}
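
To show how repeated, map, and oneof fields end up on the wire, here is a rough hand-rolled encoder in Python. It is a sketch under a few assumptions: all helper names are illustrative, only string values are handled, and each map entry is written as a small nested message with the key in field 1 and the value in field 2.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while True:
        payload = value & 0x7F
        value >>= 7
        if value:
            out.append(payload | 0x80)
        else:
            out.append(payload)
            return bytes(out)


def len_field(field_number, payload):
    """Encode a length-delimited field: key, length, then the payload bytes."""
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


def string_field(field_number, text):
    return len_field(field_number, text.encode("utf-8"))


def map_entry(field_number, key, value):
    """A map entry is encoded like a nested message with fields 1 (key) and 2 (value)."""
    return len_field(field_number, string_field(1, key) + string_field(2, value))


# repeated string emails = 4: the same key (0x22) is written once per element.
for email in ["alice@example.com", "alice.work@example.com"]:
    print(string_field(4, email)[:2].hex(" "), "+", len(email), "bytes of text")
# 22 11 + 17 bytes of text
# 22 16 + 22 bytes of text

# map<string, string> attributes = 5: one entry per key/value pair.
print(map_entry(5, "age", "30").hex(" "))
# 2a 09 0a 03 61 67 65 12 02 33 30

# oneof contact_info: the member that is set (email = 6) is written like a normal field.
print(string_field(6, "contact@alice.com")[:2].hex(" "))
# 32 11
```

A oneof adds nothing extra to the wire; it only restricts which of its member fields may be set at the same time. The order of map entries in the output is not guaranteed by Protobuf.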

service definition

Protobuf can define RPC services as well as data messages. This allows it to be used in conjunction with RPC frameworks such as gRPC.

syntax = "proto3";

service PersonService {
  rpc GetPerson (PersonRequest) returns (PersonResponse);
}

message PersonRequest {
  int32 id = 1;
}

message PersonResponse {
  Person person = 1;
}

Usage scenarios for Protobuf

Microservice Communication

Background: In a microservice architecture, multiple services need to communicate with each other. It is important that this communication is efficient.
Example of use: Using Protobufs in data exchange between services reduces message size and communication overhead.

Mobile Applications

Background: Mobile applications involve frequent data transmission. Bandwidth constraints and battery consumption issues require efficient data serialization.
Example of use: Performance can be improved by serializing data retrieved from APIs and data exchanged between applications with Protobuf.

Game Development

Background: Game development requires the efficient exchange of large amounts of data in real time.
Example of use: Using Protobuf in communications between game servers and clients can reduce communication latency and improve game responsiveness.

How to actually define a protobuf message in a .proto file and compile it for use with the Go language is described here.