Note: This is a personal study note based on a system I worked on. It’s not a tutorial—just my attempt to internalize the key concepts. Details have been generalized for clarity.
This post covers dynamic gRPC routing using Envoy Proxy and an xDS control plane.
The Problem
In Kubernetes, pods are ephemeral. When a user requests a compute resource, the current system creates a new pod with a dynamically assigned IP. To route traffic to the correct pod, we need a way to dynamically update the routing configuration.
Architecture Overview
| Component | Location | Role |
|---|---|---|
| 💻 Client | User’s server | Initiates gRPC connection with authority header |
| 🚀 Envoy Proxy | Our K8s | TLS termination, L7 routing based on Host header |
| 🧠 Python xDS | Our K8s | Watches K8s, pushes route configs to Envoy |
| 📦 User Pod | Our K8s | Ephemeral backend, receives routed traffic |
xDS Mechanism
xDS (“x Discovery Service”, where x stands for any resource type) is Envoy’s API for dynamic configuration. Unlike static configs, xDS treats routing rules as data that can be updated at runtime without a restart.
Key APIs
| API | Purpose | Example |
|---|---|---|
| CDS (Cluster) | Define backend clusters and their endpoints | “Cluster user-123 points to 10.1.2.3:50051” |
| RDS (Route) | Define routing rules | “Host session-123.example.org → Cluster user-123” |
Flow
- User requests a pod → K8s creates pod
- Pod Manager detects pod is ready
- Pod Manager calls `POST /xds/tasks` to register the route
- xDS server adds the route to its internal list
- Envoy polls CDS/RDS → receives updated config
- Client connects with the assigned `Host` header
- Envoy routes traffic to the pod
- On pod termination → Pod Manager calls `DELETE /xds/tasks`
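The registration step can be sketched as follows. This is a hypothetical reconstruction: the `/xds/tasks` path comes from the flow above, but the payload fields (`host`, `cluster`, `ip`, `port`) and the function names are assumptions, not the original code.

```python
# Hypothetical sketch of the Pod Manager's registration call.
# Payload shape and helper names are assumed, not the original implementation.
import json
import urllib.request


def task_payload(session: str, ip: str, port: int) -> dict:
    """Route registration: map the client's Host header to the pod's address."""
    return {
        "host": f"{session}.example.org",  # matched against :authority by Envoy
        "cluster": session,                # cluster naming scheme is an assumption
        "ip": ip,
        "port": port,
    }


def register_route(xds_url: str, session: str, ip: str, port: int) -> None:
    body = json.dumps(task_payload(session, ip, port)).encode()
    req = urllib.request.Request(
        f"{xds_url}/xds/tasks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # xDS server stores the route for the next poll
```

On pod termination, the same path would be called with `DELETE` to remove the route.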
Envoy Configuration
Key points for envoy.yaml:
1. HTTP Connection Manager (L7)
gRPC runs over HTTP/2. Use envoy.http_connection_manager, not TCP proxy:
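A minimal sketch of the listener filter chain (reconstructed from memory; the listener name, port, and route config name are illustrative):

```yaml
# Sketch only — names and ports are illustrative, not the original config.
static_resources:
  listeners:
    - name: grpc_listener
      address:
        socket_address: { address: 0.0.0.0, port_value: 443 }
      filter_chains:
        - filters:
            - name: envoy.http_connection_manager   # L7, not envoy.tcp_proxy
              config:
                stat_prefix: ingress_grpc
                codec_type: AUTO                    # negotiates HTTP/2 with gRPC clients
                rds:
                  route_config_name: dynamic_routes
                  config_source:
                    api_config_source:
                      api_type: REST
                      cluster_names: [xds_cluster]
```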
2. TLS with ALPN
ALPN negotiation is required for HTTP/2:
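Without `h2` in the ALPN list, clients silently fall back to HTTP/1.1 and gRPC fails. A sketch of the TLS context (certificate paths are illustrative):

```yaml
# Sketch only — certificate paths are illustrative.
filter_chains:
  - tls_context:
      common_tls_context:
        alpn_protocols: ["h2"]        # advertise HTTP/2 so gRPC can be negotiated
        tls_certificates:
          - certificate_chain: { filename: "/etc/envoy/certs/tls.crt" }
            private_key: { filename: "/etc/envoy/certs/tls.key" }
```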
3. Dynamic Resources
Point CDS/RDS to your xDS server:
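A sketch of the bootstrap wiring (the xDS service name, port, and polling interval are illustrative):

```yaml
# Sketch only — service address and refresh interval are illustrative.
dynamic_resources:
  cds_config:
    api_config_source:
      api_type: REST
      cluster_names: [xds_cluster]    # static cluster pointing at the Python xDS server
      refresh_delay: 5s               # the polling interval discussed in the retrospective
static_resources:
  clusters:
    - name: xds_cluster
      type: STRICT_DNS
      connect_timeout: 1s
      hosts:
        - socket_address: { address: xds-server.default.svc, port_value: 8000 }
```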
Python xDS Server
The control plane exposes two REST discovery endpoints that Envoy polls, alongside the `/xds/tasks` management API used by the Pod Manager:
CDS Endpoint (/v2/discovery:clusters)
Returns available clusters:
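The response body can be sketched as below. This is a hypothetical reconstruction of the v2 `DiscoveryResponse` shape; the helper names (`make_cluster`, `cds_response`) and the task registry structure are assumptions, not the original code.

```python
# Hypothetical sketch: build a v2 CDS DiscoveryResponse from the task registry.
# Helper names and registry shape are assumptions, not the original code.

def make_cluster(name: str, ip: str, port: int) -> dict:
    """One STATIC cluster pointing at a single pod IP, speaking HTTP/2."""
    return {
        "@type": "type.googleapis.com/envoy.api.v2.Cluster",
        "name": name,
        "connect_timeout": "1s",
        "type": "STATIC",
        "http2_protocol_options": {},  # backends are gRPC, so force HTTP/2
        "load_assignment": {
            "cluster_name": name,
            "endpoints": [{
                "lb_endpoints": [{
                    "endpoint": {"address": {"socket_address": {
                        "address": ip, "port_value": port,
                    }}}
                }]
            }],
        },
    }


def cds_response(tasks: dict, version: str) -> dict:
    """tasks maps cluster name -> {"ip": ..., "port": ...}."""
    return {
        "version_info": version,
        "resources": [make_cluster(n, t["ip"], t["port"])
                      for n, t in tasks.items()],
    }
```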
RDS Endpoint (/v2/discovery:routes)
Returns routing rules:
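A matching sketch for the route side, mapping a `Host` header to a cluster (again a hypothetical reconstruction; `rds_response` and the `routes` mapping are assumed names):

```python
# Hypothetical sketch: build a v2 RDS DiscoveryResponse.
# Helper name and mapping shape are assumptions, not the original code.

def rds_response(routes: dict, version: str) -> dict:
    """routes maps a Host header (authority) -> cluster name."""
    virtual_hosts = [
        {
            "name": host,
            "domains": [host],  # matched against the client's :authority header
            "routes": [{
                "match": {"prefix": "/"},       # all gRPC paths on this host
                "route": {"cluster": cluster},  # forward to the pod's cluster
            }],
        }
        for host, cluster in routes.items()
    ]
    return {
        "version_info": version,
        "resources": [{
            "@type": "type.googleapis.com/envoy.api.v2.RouteConfiguration",
            "name": "dynamic_routes",
            "virtual_hosts": virtual_hosts,
        }],
    }
```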
Retrospective (added: Sep 2024)
1. Replace REST Polling with ADS (gRPC Streaming)
The 5-second polling interval is the biggest weakness. In a dynamic environment it creates a traffic black hole during pod scaling or crashes: a newly ready pod is unreachable, and a dead pod still receives traffic, until the next poll.
Better approach: Use ADS (Aggregated Discovery Service) — a persistent gRPC stream where the control plane pushes updates to Envoy in real-time.
| Approach | Update Latency | Mechanism |
|---|---|---|
| REST Polling (current) | ~5 seconds | Envoy pulls config periodically |
| ADS Streaming | < 500ms end-to-end | Control plane pushes on change |
2. Rewrite Control Plane in Go
Maintaining a custom Python xDS server (using the deprecated v2 API) is technical debt.
Better approach: Use go-control-plane, the official Envoy SDK used by Istio, Contour, and other production systems.
Benefits:
- Type safety: Protobuf-generated Go structs catch errors at compile time
- SnapshotCache: Built-in diffing and version management
- Concurrency: Go handles K8s Watch events and gRPC streams efficiently
3. Use Delta xDS for Incremental Updates
The current design requires sending the entire config on every change. With many sessions, this becomes expensive.
Better approach: Envoy v3 API supports Incremental (Delta) xDS — only send the added/removed resources instead of the full state.
| Mode | Payload | Use Case |
|---|---|---|
| State of the World | Full config every time | Small, infrequent changes |
| Delta/Incremental | Only +/- changes | Large config, frequent updates |
These improvements would keep Envoy’s strengths (HTTP/2, gRPC, TLS termination, observability) while eliminating the control plane bottlenecks.