Defending Against Database DoS from Untrusted Code

How we prevented log flooding attacks from user-defined code by implementing buffering, batching, and rate limiting.

Situation

System Architecture

  • 🏃 Runner Pod:
    • Executes user Python scripts (untrusted environment)
    • Isolated per user
    • Managed as Kubernetes Jobs
  • 🎛️ Manager:
    • Creates/deletes Runner Jobs via Kubernetes API
    • Receives logs via gRPC stream
  • 🗄️ Database:
    • Stores execution logs for user queries

The Incident

A user’s script contained an infinite loop that called the logging function at extremely high frequency.

The original design used synchronous writes — every gRPC log message triggered an immediate database INSERT. This caused:

  1. Connection pool exhaustion: Write requests saturated all available connections
  2. Disk I/O saturation: High-frequency small writes overwhelmed storage
  3. Platform-wide outage: All services depending on the database experienced timeouts (DoS)

Task

Redesign the log ingestion pipeline with these goals:

  • 🛡️ Protect the DB: Single user’s misbehavior must not exhaust database resources
  • 🔒 Isolation: Bad actors only affect themselves (lost logs), not other users
  • 🔌 Decoupling: Manager must treat Runner as untrusted; defense lives server-side

Action

We replaced synchronous writes with asynchronous batch processing.

💡 The strategies below are ordered by defense priority (outer to inner layer). During the incident, we implemented B→C→D first as a hotfix; Strategy A (Rate Limiting) was identified retrospectively as the most critical first-line defense.

Strategy A: Rate Limiting (First-Line Defense)

Rate limiting at the gRPC receiver layer provides the earliest protection:

  • Per-connection token bucket: Limit log messages per second per Runner stream
  • Immediate rejection: Drop excess messages on arrival, before any buffering or allocation
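A minimal sketch of the per-stream token bucket (type and function names are illustrative, not from the actual codebase). Each Runner stream gets a bucket sized for a small burst plus a steady refill rate; anything beyond that is rejected before it touches the buffer layer. A real gRPC receiver would guard the bucket with a mutex, since handlers run concurrently:

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket is a minimal per-stream rate limiter.
// capacity = allowed burst size; refill = tokens restored per second.
// Not goroutine-safe: a production receiver would add a sync.Mutex.
type tokenBucket struct {
	capacity float64
	tokens   float64
	refill   float64
	last     time.Time
}

func newTokenBucket(capacity, refillPerSec float64) *tokenBucket {
	return &tokenBucket{
		capacity: capacity,
		tokens:   capacity,
		refill:   refillPerSec,
		last:     time.Now(),
	}
}

// allow reports whether one log message may be accepted.
// Excess messages are dropped immediately — no buffering, no allocation.
func (b *tokenBucket) allow() bool {
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.refill
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	tb := newTokenBucket(5, 1) // burst of 5, then 1 msg/sec steady state
	accepted := 0
	for i := 0; i < 10; i++ { // a tight loop simulating a log flood
		if tb.allow() {
			accepted++
		}
	}
	fmt.Println("accepted:", accepted) // → accepted: 5
}
```

The burst/refill numbers here are placeholders; in practice they would be tuned so that legitimate logging never hits the limit while a tight `while True: log(...)` loop is capped within milliseconds.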

Strategy B: Buffering & Batching

Replaced “write-on-receive” with in-memory buffers.

Design Logic:

  1. Maintain per-task log buffer in memory
  2. On log receive → push to buffer
  3. Flush triggers (whichever comes first):
    • Size threshold: Buffer reaches N entries (e.g., 50)
    • Time threshold: T seconds since last flush (e.g., 2s)
  4. Use bulk insert: INSERT INTO logs VALUES (...), (...), (...)
  5. Time-based flush via background goroutine to handle low-frequency logs

Production Considerations:

  • Use a fixed-size worker pool for DB writes — avoid spawning unbounded goroutines
  • Implement backpressure: when workers are busy, block or drop new flushes
  • Use sharded locks or per-task channels to reduce lock contention
  • Implement task cleanup when Runner terminates to prevent memory leaks

Strategy C: Memory Protection

Prevent memory exhaustion from oversized payloads:

Design Logic:

  • String truncation: Cap each log message at 256 characters
  • Buffer overflow guard: If buffer exceeds threshold, drop oldest entries

Production Considerations:

  • Consider ring buffer instead of slice shifting for O(1) operations and stable memory
  • Implement global memory budget, not just per-task limits
    • Example: 100 entries × 10,000 tasks = 1M entries in memory — still dangerous
  • Consider per-user quotas in addition to per-task limits

Strategy D: Execution Timeout

Added a hard timeout for user script execution. If a script runs longer than the allowed duration, the Runner Pod is terminated. This limits the attack window — even if an infinite loop starts flooding logs, the damage is bounded by time.
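Since Runners are managed as Kubernetes Jobs, the hard timeout can be enforced declaratively with `activeDeadlineSeconds` rather than with application code. A sketch of the Job spec (names, image, and the 300-second deadline are illustrative placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: runner-job            # illustrative name
spec:
  activeDeadlineSeconds: 300  # Kubernetes terminates the pod after 5 minutes
  backoffLimit: 0             # do not retry a timed-out script
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: runner:latest  # illustrative image
```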


Limitations & Future Improvements

💡 The strategies above were implemented as a hotfix during an incident. For long-term stability, consider:

  • Decouple log ingestion: Use message queue (Kafka, NATS) instead of direct DB writes
  • Separate storage: Logs are secondary data — store in dedicated systems (Elasticsearch, Loki, ClickHouse) instead of transactional RDBMS
  • Sidecar pattern: Let Kubernetes handle log collection via Fluentd/Fluent-bit

Result

  • System maintained 100% availability under similar attack conditions
  • Malicious user’s logs partially lost (buffer overflow), all other users unaffected
  • Key lesson: backend must have absolute traffic control for user-generated code — never trust the client
  • Trade-off accepted: dropping secondary data (logs) to preserve core services
Licensed under CC BY-NC-SA 4.0