How we prevented log flooding attacks from user-defined code by implementing buffering, batching, and rate limiting.
Situation
System Architecture
- 🏃 Runner Pod:
  - Executes user Python scripts (untrusted environment)
  - Isolated per user
  - Managed as Kubernetes Jobs
- 🎛️ Manager:
  - Creates/deletes Runner Jobs via Kubernetes API
  - Receives logs via gRPC stream
- 🗄️ Database:
  - Stores execution logs for user queries
The Incident
A user’s script contained an infinite loop that called the logging function at extremely high frequency.
The original design used synchronous writes — every gRPC log message triggered an immediate database INSERT. This caused:
- Connection pool exhaustion: Write requests saturated all available connections
- Disk I/O saturation: High-frequency small writes overwhelmed storage
- Platform-wide outage: All services depending on the database experienced timeouts (DoS)
Task
Redesign the log ingestion pipeline with these goals:
- 🛡️ Protect the DB: Single user’s misbehavior must not exhaust database resources
- 🔒 Isolation: Bad actors only affect themselves (lost logs), not other users
- 🔌 Decoupling: Manager must treat Runner as untrusted; defense lives server-side
Action
We replaced synchronous writes with asynchronous batch processing.
💡 The strategies below are ordered by defense priority (outer to inner layer). During the incident, we implemented B→C→D first as a hotfix; Strategy A (Rate Limiting) was identified retrospectively as the most critical first-line defense.
Strategy A: Rate Limiting (First-Line Defense)
Rate limiting at the gRPC receiver layer provides the earliest protection:
- Per-connection token bucket: Limit log messages per second per Runner stream
- Immediate rejection: Drop excessive messages with zero memory allocation
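The per-connection token bucket can be sketched as follows in Go (a minimal sketch; `TokenBucket`, the burst of 5, and the 100 msg/s refill rate are illustrative values, not our production configuration):

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket is a minimal per-connection rate limiter.
// capacity = burst size, refillRate = tokens added per second.
type TokenBucket struct {
	capacity   float64
	tokens     float64
	refillRate float64
	lastRefill time.Time
}

func NewTokenBucket(capacity, refillRate float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, tokens: capacity, refillRate: refillRate, lastRefill: time.Now()}
}

// Allow returns true if a log message may pass; otherwise the caller
// drops the message immediately, without ever buffering it.
func (b *TokenBucket) Allow() bool {
	now := time.Now()
	b.tokens += now.Sub(b.lastRefill).Seconds() * b.refillRate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.lastRefill = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// Allow a burst of 5 messages, then refill at 100 msg/s.
	bucket := NewTokenBucket(5, 100)
	accepted := 0
	for i := 0; i < 1000; i++ {
		if bucket.Allow() {
			accepted++
		}
	}
	fmt.Println("accepted:", accepted) // most of the 1000 messages are dropped
}
```

In a real gRPC server this check would sit in the stream receive loop (or a stream interceptor), one bucket per Runner stream, so a flooding client is throttled before any allocation or DB work happens.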
Strategy B: Buffering & Batching
Replaced “write-on-receive” with in-memory buffers.
Design Logic:
- Maintain a per-task log buffer in memory
- On log receive → push to buffer
- Flush triggers (whichever comes first):
  - Size threshold: Buffer reaches N entries (e.g., 50)
  - Time threshold: T seconds since last flush (e.g., 2s)
- Use bulk insert: `INSERT INTO logs VALUES (...), (...), (...)`
- Time-based flush via background goroutine to handle low-frequency logs
Production Considerations:
- Use a fixed-size worker pool for DB writes — avoid spawning unbounded goroutines
- Implement backpressure: when workers are busy, block or drop new flushes
- Use sharded locks or per-task channels to reduce lock contention
- Implement task cleanup when Runner terminates to prevent memory leaks
Strategy C: Memory Protection
Prevent memory exhaustion from oversized payloads:
Design Logic:
- String truncation: Limit log message to 256 characters
- Buffer overflow guard: If buffer exceeds threshold, drop oldest entries
Production Considerations:
- Consider ring buffer instead of slice shifting for O(1) operations and stable memory
- Implement a global memory budget, not just per-task limits
  - Example: 100 entries × 10,000 tasks = 1M entries in memory — still dangerous
- Consider per-user quotas in addition to per-task limits
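Both guards can be sketched as a fixed-capacity ring buffer in Go; `LogRing` and the 100-entry / 256-character limits mirror the examples above but are otherwise illustrative:

```go
package main

import "fmt"

const (
	maxMsgLen  = 256 // per-message truncation limit
	maxEntries = 100 // per-task buffer capacity
)

// LogRing is a fixed-capacity ring buffer: when full, the oldest
// entry is overwritten in place, so per-task memory stays bounded
// and pushes are O(1) with no slice shifting.
type LogRing struct {
	entries [maxEntries]string
	head    int // index of the oldest entry
	count   int
}

// Push truncates oversized messages and drops the oldest entry on overflow.
func (r *LogRing) Push(msg string) {
	if len(msg) > maxMsgLen {
		msg = msg[:maxMsgLen]
	}
	tail := (r.head + r.count) % maxEntries
	r.entries[tail] = msg
	if r.count < maxEntries {
		r.count++
	} else {
		r.head = (r.head + 1) % maxEntries // overwrite oldest
	}
}

func (r *LogRing) Len() int { return r.count }

func main() {
	var ring LogRing
	for i := 0; i < 250; i++ {
		ring.Push(fmt.Sprintf("entry-%d", i))
	}
	fmt.Println(ring.Len()) // prints 100: capacity is capped
}
```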
Strategy D: Execution Timeout
Added a hard timeout for user script execution. If a script runs longer than the allowed duration, the Runner Pod is terminated. This limits the attack window — even if an infinite loop starts flooding logs, the damage is bounded by time.
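Since Runners are Kubernetes Jobs, the hard timeout can be expressed declaratively via `activeDeadlineSeconds` (a sketch; the name, image, and 600-second deadline are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: runner-job              # illustrative name
spec:
  activeDeadlineSeconds: 600    # kill the Runner Pod after 10 minutes
  backoffLimit: 0               # do not retry a timed-out script
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: runner:latest  # illustrative image
```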
Limitations & Future Improvements
💡 The strategies above were implemented as a hotfix during an incident. For long-term stability, consider:
- Decouple log ingestion: Use message queue (Kafka, NATS) instead of direct DB writes
- Separate storage: Logs are secondary data — store in dedicated systems (Elasticsearch, Loki, ClickHouse) instead of transactional RDBMS
- Sidecar pattern: Let Kubernetes handle log collection via Fluentd/Fluent-bit
Result
- System maintained 100% availability under similar attack conditions
- Malicious user’s logs were partially lost (buffer overflow); all other users were unaffected
- Key lesson: the backend must retain absolute control over traffic from user-generated code; never trust the client
- Trade-off accepted: dropping secondary data (logs) to preserve core services