Dynamic gRPC Routing with Envoy xDS

Note: This is a personal study note based on a system I worked on. It’s not a tutorial—just my attempt to internalize the key concepts. Details have been generalized for clarity.

This post covers dynamic gRPC routing using Envoy Proxy and an xDS control plane.

The Problem

In Kubernetes, pods are ephemeral. When a user requests a compute resource, the current system creates a new pod with a dynamically assigned IP. To route traffic to the correct pod, we need a way to dynamically update the routing configuration.


Architecture Overview

    [User's Infrastructure]              [Our K8s Cluster]
  ┌─────────────────────┐         ┌────────────────────────────────┐
  │                     │         │                                │
  │  ┌─────────────┐    │   TLS   │  ┌─────────────────┐           │
  │  │   Client    │────┼────────▶│  │  Envoy Proxy    │           │
  │  │  (gRPC)     │    │         │  │  (Data Plane)   │           │
  │  └─────────────┘    │         │  └────────┬────────┘           │
  │                     │         │           │                    │
  └─────────────────────┘         │           ▼                    │
                                  │  ┌─────────────┐               │
                                  │  │  User Pod   │               │
                                  │  │  10.1.2.3   │               │
                                  │  └─────────────┘               │
                                  │           ▲                    │
                                  │           │ xDS API            │
                                  │  ┌────────┴────────┐           │
                                  │  │  Python xDS     │◀─ K8s API │
                                  │  │ (Control Plane) │           │
                                  │  └─────────────────┘           │
                                  └────────────────────────────────┘
| Component | Location | Role |
| --- | --- | --- |
| 💻 Client | User’s server | Initiates the gRPC connection and sets the :authority (Host) header |
| 🚀 Envoy Proxy | Our K8s | TLS termination, L7 routing based on the Host (:authority) header |
| 🧠 Python xDS | Our K8s | Watches K8s, pushes route configs to Envoy |
| 📦 User Pod | Our K8s | Ephemeral backend, receives routed traffic |

xDS Mechanism

xDS ("x Discovery Service", where x stands for the resource type: cluster, route, and so on) is Envoy’s API for dynamic configuration. Unlike static configs, xDS treats routing rules as data that can be updated at runtime without a restart.

Key APIs

| API | Purpose | Example |
| --- | --- | --- |
| CDS (Cluster) | Define backend endpoints | “Cluster user-123 points to 10.1.2.3:50051” |
| RDS (Route) | Define routing rules | “Host session-123.example.org → Cluster user-123” |

Flow

  1. User requests a pod → K8s creates pod
  2. Pod Manager detects pod is ready
  3. Pod Manager calls POST /xds/tasks to register the route
  4. xDS server adds the route to its internal list
  5. Envoy polls CDS/RDS → receives updated config
  6. Client connects with the assigned Host header
  7. Envoy routes traffic to the pod
  8. On pod termination → Pod Manager calls DELETE /xds/tasks
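The bookkeeping behind steps 3, 4, 5, and 8 can be sketched as a small in-memory registry. This is only an illustration under assumptions: `RouteRegistry` and its method names are hypothetical, and the real handlers behind POST/DELETE /xds/tasks are not shown in these notes.

```python
import threading


class RouteRegistry:
    """Hypothetical in-memory map of session host -> backend pod."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = {}   # host -> (pod_address, port)
        self._version = 0  # bumped on every effective change

    def register(self, host, pod_address, port):
        """What a POST /xds/tasks handler might do (steps 3 and 4)."""
        with self._lock:
            self._tasks[host] = (pod_address, port)
            self._version += 1

    def unregister(self, host):
        """What a DELETE /xds/tasks handler might do (step 8)."""
        with self._lock:
            # Only bump the version if something was actually removed.
            if self._tasks.pop(host, None) is not None:
                self._version += 1

    def snapshot(self):
        """Consistent (version, tasks) view for Envoy's next poll (step 5)."""
        with self._lock:
            return str(self._version), dict(self._tasks)


registry = RouteRegistry()
registry.register("session-123.example.org", "10.1.2.3", 50051)
print(registry.snapshot())
registry.unregister("session-123.example.org")
print(registry.snapshot())
```

The lock matters because registration calls and Envoy's polls arrive on different request threads; `snapshot()` returns a copy so a response is built from one consistent state.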

Envoy Configuration

Key points for envoy.yaml:

1. HTTP Connection Manager (L7)

gRPC runs over HTTP/2. Use envoy.http_connection_manager, not TCP proxy:

filter_chains:
  - filters:
      - name: envoy.http_connection_manager
        config:
          codec_type: auto
          stat_prefix: ingress_http
          rds:
            route_config_name: "my_routes"
            config_source:
              api_config_source:
                api_type: REST
                cluster_names: [xds_cluster]
                refresh_delay: 5s
          http_filters:
            - name: envoy.router
              config: {}

2. TLS with ALPN

Clients negotiate HTTP/2 over TLS through ALPN, so the listener must advertise h2:

tls_context:
  common_tls_context:
    alpn_protocols: ["h2"]
    tls_certificates:
      - certificate_chain:
          filename: "/certs/server.crt"
        private_key:
          filename: "/certs/server.key"
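One way to check the ALPN setup from the client side is with Python's standard `ssl` module. This is a rough sketch, not part of the original system: the function names are made up, and the host and port you pass in are whatever your listener uses. With a self-signed certificate the default context will reject the handshake unless you load your own CA.

```python
import socket
import ssl


def make_h2_context():
    """Client TLS context that offers only h2 during ALPN."""
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2"])
    return ctx


def negotiated_protocol(host, port, timeout=5.0):
    """Connect, complete the TLS handshake, and return the ALPN result.

    Returns "h2" if the server agreed to HTTP/2, or None if no
    protocol was negotiated.
    """
    ctx = make_h2_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

If `negotiated_protocol(...)` returns `None` against the Envoy listener, the `alpn_protocols` setting is the first thing to check, since gRPC clients will not fall back to HTTP/1.1.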

3. Dynamic Resources

Point CDS/RDS to your xDS server:

dynamic_resources:
  cds_config:
    api_config_source:
      api_type: REST
      cluster_names: [xds_cluster]
      refresh_delay: 5s

Python xDS Server

The control plane exposes two REST endpoints:

CDS Endpoint (/v2/discovery:clusters)

Returns available clusters:

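from flask import Flask

app = Flask(__name__)


# Envoy's REST xDS client POSTs a DiscoveryRequest to this path.
@app.route('/v2/discovery:clusters', methods=['POST'])
def clusters():
    # Flask 1.1+ serializes a returned dict to JSON.
    return {
        "version_info": "1",
        "resources": [
            {
                "@type": "type.googleapis.com/envoy.api.v2.Cluster",
                "name": "user-123",
                "connect_timeout": "1s",
                # Enum values use their upper-case proto names.
                "type": "STRICT_DNS",
                "lb_policy": "ROUND_ROBIN",
                # Speak HTTP/2 to the pod, which gRPC requires.
                "http2_protocol_options": {},
                "hosts": [
                    {
                        "socket_address": {
                            "address": "pod-abc123.default.svc.cluster.local",
                            "port_value": 50051
                        }
                    }
                ]
            }
        ]
    }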
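The handler above hard-codes a single cluster. A sketch of how the same response could be built from a table of live pods follows; `build_cluster`, `build_cds_response`, and the shape of `pods` are hypothetical names for illustration, and the field layout simply mirrors the example above.

```python
def build_cluster(name, address, port):
    """One v2 Cluster resource pointing at a single gRPC backend."""
    return {
        "@type": "type.googleapis.com/envoy.api.v2.Cluster",
        "name": name,
        "connect_timeout": "1s",
        "type": "STRICT_DNS",
        "lb_policy": "ROUND_ROBIN",
        "http2_protocol_options": {},
        "hosts": [
            {"socket_address": {"address": address, "port_value": port}}
        ],
    }


def build_cds_response(version, pods):
    """Full CDS response; pods maps cluster name -> (address, port)."""
    return {
        "version_info": version,
        # Sort by name so the output is stable between polls.
        "resources": [
            build_cluster(name, address, port)
            for name, (address, port) in sorted(pods.items())
        ],
    }


response = build_cds_response("7", {"user-123": ("10.1.2.3", 50051)})
print(response["resources"][0]["name"])
```

Keeping the resource order stable makes it easier to compare consecutive responses when debugging.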

RDS Endpoint (/v2/discovery:routes)

Returns routing rules:

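# Registered on the same Flask `app` as the CDS endpoint.
@app.route('/v2/discovery:routes', methods=['POST'])
def routes():
    return {
        "version_info": "1",
        "resources": [
            {
                "@type": "type.googleapis.com/envoy.api.v2.RouteConfiguration",
                # Must match route_config_name in envoy.yaml.
                "name": "my_routes",
                "virtual_hosts": [{
                    "name": "user-123-route",
                    # `domains` is a list; entries are matched against the
                    # Host/:authority value, including the port if present.
                    "domains": ["session-abc123.example.com:17888"],
                    "routes": [{
                        "match": {"prefix": "/", "grpc": {}},
                        "route": {
                            "cluster": "user-123",
                            # "0s" disables the route timeout for long-lived streams.
                            "timeout": "0s"
                        }
                    }]
                }]
            }
        ]
    }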
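With many sessions, each one becomes its own virtual host inside the single route configuration. A minimal sketch of generating that structure, assuming a hypothetical `sessions` mapping of domain to cluster name (the helper names are also made up for illustration):

```python
def build_virtual_host(domain, cluster):
    """Virtual host that sends all gRPC traffic for one domain to one cluster."""
    return {
        "name": f"{cluster}-route",
        "domains": [domain],
        "routes": [{
            "match": {"prefix": "/", "grpc": {}},
            "route": {"cluster": cluster, "timeout": "0s"},
        }],
    }


def build_rds_response(version, sessions, route_config_name="my_routes"):
    """Full RDS response; sessions maps domain -> cluster name."""
    return {
        "version_info": version,
        "resources": [{
            "@type": "type.googleapis.com/envoy.api.v2.RouteConfiguration",
            "name": route_config_name,
            "virtual_hosts": [
                build_virtual_host(domain, cluster)
                for domain, cluster in sorted(sessions.items())
            ],
        }],
    }


response = build_rds_response(
    "7", {"session-abc123.example.com:17888": "user-123"}
)
print(response["resources"][0]["virtual_hosts"][0]["name"])
```

Every route must name a cluster that the CDS endpoint also returns, so in practice both responses would be derived from the same source of truth.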

Retrospective (added: Sep 2024)

1. Replace REST Polling with ADS (gRPC Streaming)

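The 5-second polling interval is the biggest weakness. Envoy can lag the real cluster state by up to one polling interval, so in a dynamic environment traffic can be blackholed while pods scale or crash: a new pod is not yet routable, and a terminated pod may still receive requests.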

Better approach: Use ADS (Aggregated Discovery Service) — a persistent gRPC stream where the control plane pushes updates to Envoy in real-time.

| Approach | Update Latency | Mechanism |
| --- | --- | --- |
| REST Polling (current) | ~5 seconds | Envoy pulls config periodically |
| ADS Streaming | < 500ms end-to-end | Control plane pushes on change |
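The polling delay can be estimated with a little arithmetic. The sketch below is a simplified model, not a measurement: it assumes Envoy polls at exact multiples of the interval and ignores the time to fetch and apply the config. `propagation_delay` is a name invented for this note.

```python
import random


def propagation_delay(change_time, poll_interval=5.0):
    """Seconds from a config change until the next poll picks it up.

    Polls are assumed to happen at 0, poll_interval, 2 * poll_interval, ...
    A change landing exactly on a poll is counted as having missed it.
    """
    elapsed_since_poll = change_time % poll_interval
    return poll_interval - elapsed_since_poll


# Changes arrive at random moments; average the wait over many samples.
rng = random.Random(0)
delays = [propagation_delay(rng.uniform(0.0, 1000.0)) for _ in range(10_000)]
print(f"mean delay:  {sum(delays) / len(delays):.2f}s")
print(f"worst delay: {max(delays):.2f}s")
```

Under this model the mean comes out close to half the interval and the worst case approaches the full 5 seconds, which is the window in which a new pod is unreachable or a dead pod still receives traffic.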

2. Rewrite Control Plane in Go

Maintaining a custom Python xDS server (using the deprecated v2 API) is technical debt.

Better approach: Use go-control-plane, the Go control-plane library maintained by the Envoy project and used by Istio, Contour, and other production systems.

Benefits:

  • Type safety: Protobuf-generated Go structs catch errors at compile time
  • SnapshotCache: Built-in diffing and version management
  • Concurrency: Go handles K8s Watch events and gRPC streams efficiently
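To make the version-management point concrete: the Python handlers in this note always return `"version_info": "1"`. One simple alternative is to derive the version from the content, so it changes exactly when the resources change. This is a hypothetical helper written for this note, not go-control-plane's API and not how SnapshotCache is implemented.

```python
import hashlib
import json


def version_for(resources):
    """Short, deterministic version string derived from the resources."""
    # Canonical JSON: sorted keys and fixed separators, so logically
    # equal configs always serialize to the same bytes.
    canonical = json.dumps(resources, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


before = [{"name": "user-123", "port_value": 50051}]
same = [{"port_value": 50051, "name": "user-123"}]  # same content, new key order
after = [{"name": "user-123", "port_value": 50052}]

print(version_for(before) == version_for(same))   # True
print(version_for(before) == version_for(after))  # False
```

A content-derived version also makes it easy to tell from logs whether two polls saw the same configuration.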

3. Use Delta xDS for Incremental Updates

The current design requires sending the entire config on every change. With many sessions, this becomes expensive.

Better approach: Envoy v3 API supports Incremental (Delta) xDS — only send the added/removed resources instead of the full state.

| Mode | Payload | Use Case |
| --- | --- | --- |
| State of the World | Full config every time | Small, infrequent changes |
| Delta/Incremental | Only +/- changes | Large config, frequent updates |
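The difference between the two modes comes down to a diff between consecutive snapshots. A minimal sketch of that computation, where `diff_resources` is a hypothetical helper and the dicts map resource name to resource body:

```python
def diff_resources(old, new):
    """Return (added_or_changed, removed) between two named-resource maps.

    A delta-style update would carry only these two pieces, while a
    state-of-the-world update resends everything in `new`.
    """
    added_or_changed = {
        name: body for name, body in new.items() if old.get(name) != body
    }
    removed = sorted(name for name in old if name not in new)
    return added_or_changed, removed


old = {"user-1": {"port": 50051}, "user-2": {"port": 50051}}
new = {"user-2": {"port": 50052}, "user-3": {"port": 50051}}

changed, removed = diff_resources(old, new)
print(changed)  # {'user-2': {'port': 50052}, 'user-3': {'port': 50051}}
print(removed)  # ['user-1']
```

With thousands of sessions and one pod starting or stopping at a time, the delta stays at one or two entries while the full state keeps growing.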

These improvements would keep Envoy’s strengths (HTTP/2, gRPC, TLS termination, observability) while eliminating the control plane bottlenecks.
