How we prevented log flooding attacks from user-defined code by implementing buffering, batching, and rate limiting.
Situation
System Architecture
- 🏃 Runner Pod:
  - Executes user Python scripts (untrusted environment)
  - Isolated per user
  - Managed as Kubernetes Jobs
- 🎛️ Manager:
  - Creates/deletes Runner Jobs via Kubernetes API
  - Receives logs via gRPC stream
- 🗄️ Database:
  - Stores execution logs for user queries
The Incident
A user’s script contained an infinite loop that called the logging function at extremely high frequency.
The original design used synchronous writes — every gRPC log message triggered an immediate database INSERT. This caused:
- Connection pool exhaustion: Write requests saturated all available connections
- Disk I/O saturation: High-frequency small writes overwhelmed storage
- Platform-wide outage: All services depending on the database experienced timeouts (DoS)
Task
Redesign the log ingestion pipeline with these goals:
- 🛡️ Protect the DB: Single user’s misbehavior must not exhaust database resources
- 🔒 Isolation: Bad actors only affect themselves (lost logs), not other users
- 🔌 Decoupling: Manager must treat Runner as untrusted; defense lives server-side
Action
We replaced synchronous writes with asynchronous batch processing.
💡 The strategies below are ordered by defense priority (outer to inner layer). During the incident, we implemented B→C→D first as a hotfix; Strategy A (Rate Limiting) was identified retrospectively as the most critical first-line defense.
Strategy A: Rate Limiting (First-Line Defense)
Rate limiting at the gRPC receiver layer provides the earliest protection:
- Per-connection token bucket: Limit log messages per second per Runner stream
- Immediate rejection: Drop excessive messages with zero memory allocation
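The per-connection token bucket can be sketched as follows in Go (a minimal sketch; `TokenBucket`, the burst of 5, and the 100 msg/s refill rate are illustrative values, not our production configuration):

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket is a minimal per-connection rate limiter.
// capacity = burst size, refillRate = tokens added per second.
type TokenBucket struct {
	capacity   float64
	tokens     float64
	refillRate float64
	lastRefill time.Time
}

func NewTokenBucket(capacity, refillRate float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, tokens: capacity, refillRate: refillRate, lastRefill: time.Now()}
}

// Allow returns true if a log message may pass; otherwise the caller
// drops the message immediately, without ever buffering it.
func (b *TokenBucket) Allow() bool {
	now := time.Now()
	b.tokens += now.Sub(b.lastRefill).Seconds() * b.refillRate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.lastRefill = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// Allow a burst of 5 messages, then refill at 100 msg/s.
	bucket := NewTokenBucket(5, 100)
	accepted := 0
	for i := 0; i < 1000; i++ {
		if bucket.Allow() {
			accepted++
		}
	}
	fmt.Println("accepted:", accepted) // most of the 1000 messages are dropped
}
```

In a real gRPC server this check would sit in the stream receive loop (or a stream interceptor), one bucket per Runner stream, so a flooding client is throttled before any allocation or DB work happens.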
Strategy B: Buffering & Batching
Replaced “write-on-receive” with in-memory buffers.
Design Logic:
- Maintain a per-task log buffer in memory
- On log receive → push to buffer
- Flush triggers (whichever comes first):
  - Size threshold: Buffer reaches N entries (e.g., 50)
  - Time threshold: T seconds since last flush (e.g., 2s)
- Use bulk insert: `INSERT INTO logs VALUES (...), (...), (...)`
- Time-based flush via background goroutine to handle low-frequency logs
Production Considerations:
- Use a fixed-size worker pool for DB writes — avoid spawning unbounded goroutines
- Implement backpressure: when workers are busy, block or drop new flushes
- Use sharded locks or per-task channels to reduce lock contention
- Implement task cleanup when Runner terminates to prevent memory leaks
Strategy C: Memory Protection
Prevent memory exhaustion from oversized payloads:
Design Logic:
- String truncation: Limit log message to 256 characters
- Buffer overflow guard: If buffer exceeds threshold, drop oldest entries
Production Considerations:
- Consider ring buffer instead of slice shifting for O(1) operations and stable memory
- Implement a global memory budget, not just per-task limits
  - Example: 100 entries × 10,000 tasks = 1M entries in memory — still dangerous
- Consider per-user quotas in addition to per-task limits
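Both guards can be sketched as a fixed-capacity ring buffer in Go; `LogRing` and the 100-entry / 256-character limits mirror the examples above but are otherwise illustrative:

```go
package main

import "fmt"

const (
	maxMsgLen  = 256 // per-message truncation limit
	maxEntries = 100 // per-task buffer capacity
)

// LogRing is a fixed-capacity ring buffer: when full, the oldest
// entry is overwritten in place, so per-task memory stays bounded
// and pushes are O(1) with no slice shifting.
type LogRing struct {
	entries [maxEntries]string
	head    int // index of the oldest entry
	count   int
}

// Push truncates oversized messages and drops the oldest entry on overflow.
func (r *LogRing) Push(msg string) {
	if len(msg) > maxMsgLen {
		msg = msg[:maxMsgLen]
	}
	tail := (r.head + r.count) % maxEntries
	r.entries[tail] = msg
	if r.count < maxEntries {
		r.count++
	} else {
		r.head = (r.head + 1) % maxEntries // overwrite oldest
	}
}

func (r *LogRing) Len() int { return r.count }

func main() {
	var ring LogRing
	for i := 0; i < 250; i++ {
		ring.Push(fmt.Sprintf("entry-%d", i))
	}
	fmt.Println(ring.Len()) // prints 100: capacity is capped
}
```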
Strategy D: Execution Timeout
Added a hard timeout for user script execution. If a script runs longer than the allowed duration, the Runner Pod is terminated. This limits the attack window — even if an infinite loop starts flooding logs, the damage is bounded by time.
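Since Runners are Kubernetes Jobs, the hard timeout can be expressed declaratively via `activeDeadlineSeconds` (a sketch; the name, image, and 600-second deadline are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: runner-job              # illustrative name
spec:
  activeDeadlineSeconds: 600    # kill the Runner Pod after 10 minutes
  backoffLimit: 0               # do not retry a timed-out script
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: runner:latest  # illustrative image
```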
Limitations & Future Improvements
💡 The strategies above were implemented as a hotfix during an incident. For long-term stability, consider:
- Decouple log ingestion: Use message queue (Kafka, NATS) instead of direct DB writes
- Separate storage: Logs are secondary data — store in dedicated systems (Elasticsearch, Loki, ClickHouse) instead of transactional RDBMS
- Sidecar pattern: Let Kubernetes handle log collection via Fluentd/Fluent-bit
Result
- System maintained 100% availability under similar attack conditions
- Malicious user’s logs were partially lost (buffer overflow); all other users were unaffected
- Key lesson: the backend must retain absolute control over traffic from user-generated code; never trust the client
- Trade-off accepted: dropping secondary data (logs) to preserve core services