
Large Cluster State Renders Cluster Unresponsive

Open · esatterwhite opened this issue 1 year ago · 1 comment

Describe the bug

Crashing nodes accumulate in the cluster state as dead nodes for a long while. As the state grows, it takes longer and more memory to synchronize it across the cluster, particularly when a new node is added (or restarts). We have seen initial sync times upwards of 20 seconds when the cluster is under stress, with the cluster state in the 5-10 MB range. In many cases this causes Kubernetes to kill the pod, triggering another restart, which adds yet another dead node plus a new node to the state.

This constant startup loop also puts a lot of pressure on the control plane, which can become slow to respond, at which point the cluster becomes mostly unresponsive.

We have seen this problem when the database server (postgres) is unavailable or unresponsive for a period of time. While it is down, everything in the write path crashes repeatedly, which leads to the growing-state problem. The only course of action seems to be to stop everything / scale all components to 0 so that there is no starting cluster state.
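As a rough illustration only (not Quickwit's or chitchat's actual gossip code), the back-of-envelope sketch below shows how a few hundred dead-node entries, each carrying a handful of gossiped key/value pairs, can add up to the multi-megabyte snapshots we are seeing; all per-node sizes in it are assumptions:

```rust
// Hypothetical back-of-envelope model of gossip state growth; the per-node
// numbers are assumptions for illustration, not measurements of Quickwit internals.
fn main() {
    let dead_nodes: u64 = 300;      // nodes accumulated while postgres was down
    let live_nodes: u64 = 34;       // 30 indexers + 3 metastore + 1 control plane
    let keys_per_node: u64 = 40;    // assumed key/value entries gossiped per node
    let bytes_per_entry: u64 = 400; // assumed average key + value + version overhead

    let per_node = keys_per_node * bytes_per_entry;
    let total = (dead_nodes + live_nodes) * per_node;
    println!(
        "~{} KiB per node, ~{:.1} MiB total state",
        per_node / 1024,
        total as f64 / (1024.0 * 1024.0)
    );
    // With these assumptions the snapshot lands in the 5-10 MB range reported above,
    // and every new or restarted node has to pull the whole thing on its first sync.
}
```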

Steps to reproduce (if applicable)

Our ingestion setup is 3 metastore nodes, 1 postgres primary, 30 indexer pods, and 1 control-plane node. We are typically doing 500 MB/s of ingest across 3K indexes.

After running for some time, kill the postgres instance and leave it down for 10 minutes so the cluster state accumulates a few hundred dead nodes. Then restart postgres and wait for ingestion to become healthy again. A small helper for watching the state size during this window is sketched below.
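To watch the state grow during the outage, something like the sketch below can poll the REST cluster endpoint on one node and print the size of the returned snapshot. It assumes the default REST port (7280) and the /api/v1/cluster path exposed by our 0.8 deployment; adjust both for yours. std-only, no external crates.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::thread::sleep;
use std::time::Duration;

// Fetch the cluster-state snapshot over plain HTTP/1.0 (so the server closes the
// connection when done) and return the approximate body size in bytes.
fn fetch_cluster_state_bytes(addr: &str) -> std::io::Result<usize> {
    let mut stream = TcpStream::connect(addr)?;
    write!(stream, "GET /api/v1/cluster HTTP/1.0\r\nHost: {addr}\r\n\r\n")?;
    let mut response = Vec::new();
    stream.read_to_end(&mut response)?;
    // Body size = total response minus the header section.
    let header_end = response
        .windows(4)
        .position(|w| w == b"\r\n\r\n")
        .map(|p| p + 4)
        .unwrap_or(0);
    Ok(response.len() - header_end)
}

fn main() {
    let addr = "127.0.0.1:7280"; // default Quickwit REST port; change for your setup
    loop {
        match fetch_cluster_state_bytes(addr) {
            Ok(bytes) => println!("cluster state snapshot: ~{bytes} bytes"),
            Err(err) => println!("fetch failed: {err}"),
        }
        sleep(Duration::from_secs(30));
    }
}
```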

Expected behavior

If it isn't really feasible to prune the dead-node list or bound the size of the state because of the details of scuttlebutt, the cluster should be much more tolerant of the database being down. Crashing may not be the best course of action, as it ultimately prevents the cluster from ever becoming healthy again; one alternative is sketched below.
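For illustration only (this is not Quickwit's actual metastore client), a minimal sketch of the behavior we would expect instead of a crash loop: retry the failing metastore operation with capped exponential backoff so the process stays up and the gossip state never accumulates restart churn. The function and error values here are hypothetical stand-ins.

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical sketch: retry a failing call with capped exponential backoff
// instead of letting the process crash and respawn. `op` stands in for whatever
// write-path metastore call is failing while postgres is down.
fn retry_with_backoff<T, E: std::fmt::Display>(mut op: impl FnMut() -> Result<T, E>) -> T {
    let mut delay = Duration::from_millis(500);
    let max_delay = Duration::from_secs(30);
    loop {
        match op() {
            Ok(value) => return value,
            Err(err) => {
                eprintln!("metastore unavailable ({err}), retrying in {delay:?}");
                sleep(delay);
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}

fn main() {
    // Toy usage: pretend the metastore recovers after a few attempts.
    let mut attempts = 0;
    let value = retry_with_backoff(|| {
        attempts += 1;
        if attempts < 4 { Err("connection refused") } else { Ok(attempts) }
    });
    println!("succeeded after {value} attempts");
}
```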

Configuration:

// build info
{
  "build": {
    "build_date": "2024-09-19T16:35:19Z",
    "build_profile": "release",
    "build_target": "aarch64-unknown-linux-gnu",
    "cargo_pkg_version": "0.8.0",
    "commit_date": "2024-09-19T15:34:03Z",
    "commit_hash": "f3dd8f27129a418a0d30642f8e7da2235d37cfaf",
    "commit_short_hash": "f3dd8f2",
    "commit_tags": [
      "qw-airmail-20240919"
    ],
    "version": "0.8.0-nightly"
  },
  "runtime": {
    "num_cpus": 16,
    "num_threads_blocking": 14,
    "num_threads_non_blocking": 2
  }
}
# node.yaml
version: '0.8'
listen_address: 0.0.0.0
grpc:
  max_message_size: 100MiB # default 20 MiB
metastore:
  postgres:
    max_connections: 50
searcher:
  fast_field_cache_capacity: 4G # default 1G
  aggregation_memory_limit: 1000M # default 500M
  partial_request_cache_capacity: 64M # default 64M
  max_num_concurrent_split_searches: 20  # default 100
  split_cache:
    max_num_bytes: 7500G
    max_num_splits: 1000000
    num_concurrent_downloads: 6
indexer:
  enable_cooperative_indexing: true
  cpu_capacity: 1000m
  merge_concurrency: 3
ingest_api:
  max_queue_disk_usage: 8GiB # default 2 GiB
  max_queue_memory_usage: 4GiB # default 1 GiB
  shard_throughput_limit: 5mb # default 5mb
storage:
  s3:
    region: us-east-1
    endpoint: https://s3.amazonaws.com

esatterwhite · Sep 24 '24 19:09

related to #5135

esatterwhite · Sep 24 '24 20:09