alpha

Vector

Vector sink resilience patterns for ClickHouse and disk-buffered pipelines.

#vector #clickhouse #resilience #systemd

ClickHouse Sink Failure Cascade

When ClickHouse becomes unavailable (restart, maintenance), Vector’s default configuration triggers three cascading failures:

  1. File descriptor exhaustion — Retry attempts accumulate open HTTP connections. Vector hits the OS limit (EMFILE, errno 24) and the http_server source crashes with Too many open files.
  2. Event loss from backpressure — Default in-memory buffer fills, backpressure propagates to sources. HTTP sources drop incoming events with "Source send cancelled". No replay possible.
  3. Process crash — The http_server source task exits fatally, taking Vector down.
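The first failure mode is visible before the crash: the open-fd count climbs toward the soft limit during an outage. A minimal check, assuming a single `vector` process (the demo below uses the current shell, `$$`, so it runs anywhere on Linux):

```shell
# Compare a process's open-fd count to its "Max open files" soft limit.
pid=$$                     # in production: pid=$(pidof vector)
open=$(ls "/proc/$pid/fd" | wc -l)
limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
echo "open=$open limit=$limit"
```

If `open` is within a few percent of `limit` while ClickHouse is down, EMFILE is imminent.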

Raise File Descriptor Limit

Vector MUST run with a raised file descriptor limit under systemd; the default soft limit (typically 1024) is far too low for a retry storm.

# /etc/systemd/system/vector.service.d/override.conf
[Service]
LimitNOFILE=262144

Apply with systemctl daemon-reload && systemctl restart vector, then verify on the running process:

grep "open files" /proc/$(pidof vector)/limits

Disk Buffers on All Sinks

Every sink MUST use disk buffers. In-memory buffers lose events on restart and cause backpressure cascades.

sinks:
  clickhouse_from_files:
    buffer:
      type: "disk"
      max_size: 5368709120  # 5GB
      when_full: "block"

  clickhouse_from_http:
    buffer:
      type: "disk"
      max_size: 5368709120  # 5GB
      when_full: "drop_newest"

Minimum disk buffer size: 256MB (268435488 bytes). Buffered data is synced to disk every 500ms, so it survives forced restarts.
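The byte values are worth sanity-checking: the "5GB" figure above is 5 GiB exactly, and the 268435488-byte floor sits just above 256 MiB (268435456):

```shell
# GiB/MiB arithmetic behind the buffer sizes above.
echo $((5 * 1024 * 1024 * 1024))        # -> 5368709120 (the 5GB max_size)
echo $((268435488 - 256 * 1024 * 1024)) # -> 32 (bytes above 256 MiB)
```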

block vs drop_newest

The when_full strategy depends on whether the source supports replay.

  • File sources (type: file) → block. Vector tracks read position via checkpoints. When the sink recovers, it resumes from where it stopped. No data loss.
  • HTTP sources (type: http_server) → drop_newest. No replay mechanism. Dropping new events is better than blocking the source and triggering "Source send cancelled" on all incoming requests.
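Putting both rules together, a sketch of how the sources pair with the sinks above. Source and sink names are illustrative, and the clickhouse sink's required options (endpoint, table, database) are omitted:

```yaml
sources:
  app_logs:
    type: file
    include: ["/var/log/app/*.log"]   # checkpointed; safe to block

  app_http:
    type: http_server
    address: "0.0.0.0:8080"           # no replay; never block this path

sinks:
  clickhouse_from_files:
    type: clickhouse
    inputs: ["app_logs"]
    buffer:
      type: "disk"
      max_size: 5368709120
      when_full: "block"        # file checkpoints make blocking lossless

  clickhouse_from_http:
    type: clickhouse
    inputs: ["app_http"]
    buffer:
      type: "disk"
      max_size: 5368709120
      when_full: "drop_newest"  # shed load instead of stalling the listener
```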