
Cluster did not start properly

Open · fulmicoton opened this issue 2 months ago · 1 comment

There might be a race condition. It was observed on a cluster that had already been started, and is possibly caused by Postgres being unreachable when the metastore first started.

NAME                                   READY   STATUS             RESTARTS          AGE
postgres-84ddcdbf87-hx9w2              1/1     Running            0                 25h
simian-control-plane-b58cdfc69-98ddz   0/1     CrashLoopBackOff   385 (3m56s ago)   25h
simian-indexer-0                       0/1     CrashLoopBackOff   206 (2m19s ago)   26h
simian-indexer-1                       1/1     Running            0                 25h
simian-indexer-10                      0/1     CrashLoopBackOff   383 (4m19s ago)   25h
simian-indexer-11                      1/1     Running            0                 26h
simian-indexer-12                      0/1     CrashLoopBackOff   260 (2m8s ago)    26h
simian-indexer-2                       0/1     CrashLoopBackOff   385 (3m10s ago)   25h
simian-indexer-3                       0/1     CrashLoopBackOff   264 (5m1s ago)    26h
simian-indexer-4                       0/1     CrashLoopBackOff   262 (3m41s ago)   26h
simian-indexer-5                       0/1     CrashLoopBackOff   206 (105s ago)    26h
simian-indexer-6                       0/1     CrashLoopBackOff   264 (19s ago)     26h
simian-indexer-7                       0/1     CrashLoopBackOff   385 (4m38s ago)   25h
simian-indexer-8                       0/1     CrashLoopBackOff   266 (104s ago)    26h
simian-indexer-9                       0/1     CrashLoopBackOff   242 (4m10s ago)   26h
simian-janitor-5c6558b84c-rq8qn        0/1     OOMKilled          0                 26h
simian-janitor-5c6558b84c-tlf2w        0/1     CrashLoopBackOff   221 (3m24s ago)   14h
simian-metastore-646bc867cf-mmdkm      1/1     Running            0                 26h
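A quick way to see the scale of the problem is to tally the STATUS column of the listing above. The snippet below is only a sketch: `summarize_pods` is a hypothetical helper name, and it assumes the plain-text output of `kubectl get pods` with STATUS as the third whitespace-separated column. The sample holds four rows copied from the listing, with the column padding collapsed.

```python
from collections import Counter


def summarize_pods(kubectl_output: str) -> Counter:
    """Count pods per STATUS in the text output of `kubectl get pods`."""
    counts = Counter()
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if not fields or fields[0] == "NAME":
            continue
        # Columns are NAME, READY, STATUS, RESTARTS..., AGE; STATUS is the third field.
        counts[fields[2]] += 1
    return counts


# Four rows taken from the listing above.
sample = """\
NAME                                   READY   STATUS             RESTARTS          AGE
postgres-84ddcdbf87-hx9w2              1/1     Running            0                 25h
simian-indexer-0                       0/1     CrashLoopBackOff   206 (2m19s ago)   26h
simian-janitor-5c6558b84c-rq8qn        0/1     OOMKilled          0                 26h
simian-metastore-646bc867cf-mmdkm      1/1     Running            0                 26h
"""

for status, count in summarize_pods(sample).most_common():
    print(f"{status}: {count}")
```

Counted by hand over the full listing, the 18 pods break down into 13 in CrashLoopBackOff, 4 Running, and 1 OOMKilled.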

Example of the complete logs from one run (newest entries first):

2024-06-02 18:28:59.078	2024-06-02T18:28:59.078Z  INFO quickwit_cluster::change: node `simian-indexer-6` has joined the cluster node_id=simian-indexer-6 generation_id=1717352938469772851
2024-06-02 18:28:58.678	2024-06-02T18:28:58.678Z  INFO quickwit_cluster::change: node `simian-indexer-6` has left the cluster node_id=simian-indexer-6 generation_id=1717352852083972609
2024-06-02 18:28:53.878	2024-06-02T18:28:53.878Z  INFO quickwit_cluster::change: node `simian-indexer-5` has left the cluster node_id=simian-indexer-5 generation_id=1717352843047783355
2024-06-02 18:28:19.478	2024-06-02T18:28:19.478Z  INFO quickwit_cluster::change: node `simian-indexer-8` has left the cluster node_id=simian-indexer-8 generation_id=1717352808599976864
2024-06-02 18:28:16.279	2024-06-02T18:28:16.279Z  INFO quickwit_cluster::change: node `simian-indexer-12` has left the cluster node_id=simian-indexer-12 generation_id=1717352805375518164
2024-06-02 18:27:48.879	2024-06-02T18:27:48.878Z  INFO quickwit_cluster::change: node `simian-indexer-4` has left the cluster node_id=simian-indexer-4 generation_id=1717352771444320018
2024-06-02 18:27:39.078	2024-06-02T18:27:39.078Z  INFO quickwit_cluster::change: node `simian-indexer-8` has joined the cluster node_id=simian-indexer-8 generation_id=1717352808599976864
2024-06-02 18:27:39.078	2024-06-02T18:27:39.078Z  INFO quickwit_cluster::change: node `simian-indexer-6` has joined the cluster node_id=simian-indexer-6 generation_id=1717352852083972609
2024-06-02 18:27:39.078	2024-06-02T18:27:39.078Z  INFO quickwit_cluster::change: node `simian-indexer-5` has joined the cluster node_id=simian-indexer-5 generation_id=1717352843047783355
2024-06-02 18:27:39.078	2024-06-02T18:27:39.078Z  INFO quickwit_cluster::change: node `simian-indexer-4` has joined the cluster node_id=simian-indexer-4 generation_id=1717352771444320018
2024-06-02 18:27:39.078	2024-06-02T18:27:39.078Z  INFO quickwit_cluster::change: node `simian-indexer-12` has joined the cluster node_id=simian-indexer-12 generation_id=1717352805375518164
2024-06-02 18:27:39.078	2024-06-02T18:27:39.078Z  INFO quickwit_cluster::grpc_gossip: no peer nodes to pull the cluster state from
2024-06-02 18:27:38.678	2024-06-02T18:27:38.678Z  INFO quickwit_cluster::change: node `simian-indexer-0` has joined the cluster node_id=simian-indexer-0 generation_id=1717352858676324553
2024-06-02 18:27:38.677	2024-06-02T18:27:38.677Z  INFO quickwit_serve: connecting to metastore
2024-06-02 18:27:38.676	2024-06-02T18:27:38.676Z  INFO quickwit_cluster::cluster: joining cluster cluster_id=simian-simian node_id=simian-indexer-0 generation_id=1717352858676324553 enabled_services={Indexer} gossip_listen_addr=0.0.0.0:7282 gossip_advertise_addr=10.232.17.5:7282 grpc_advertise_addr=10.232.17.5:7281 peer_seed_addrs=simian-headless:7282
2024-06-02 18:27:38.675	2024-06-02T18:27:38.674Z  INFO quickwit_common: setting `QW_DISABLE_TOKIO_LIFO_SLOT` from default value=false
2024-06-02 18:27:38.674	2024-06-02T18:27:38.674Z  INFO quickwit_telemetry::sender: telemetry to quickwit is disabled
2024-06-02 18:27:38.674	2024-06-02T18:27:38.674Z  INFO quickwit_cli::service: setting services from override services=indexer
2024-06-02 18:27:38.674	2024-06-02T18:27:38.674Z  INFO quickwit_cli: loaded node config config_uri=file:///quickwit/node.yaml config=NodeConfig { cluster_id: "simian-simian", node_id: NodeId("simian-indexer-0"), enabled_services: {Metastore, Indexer, Janitor, ControlPlane, Searcher}, gossip_listen_addr: 0.0.0.0:7282, grpc_listen_addr: 0.0.0.0:7281, gossip_advertise_addr: 10.232.17.5:7282, grpc_advertise_addr: 10.232.17.5:7281, gossip_interval: 200ms, peer_seeds: ["simian-headless"], data_dir_path: "/quickwit/qwdata", metastore_uri: Uri { uri: "file:///quickwit/qwdata/indexes#polling_interval=30s" }, default_index_root_uri: Uri { uri: "s3://quickwit-indexes/simian1/" }, rest_config: RestConfig { listen_addr: 0.0.0.0:7280, cors_allow_origins: [], extra_headers: {} }, grpc_config: GrpcConfig { max_message_size: 21.0 MB }, storage_configs: StorageConfigs([S3(S3StorageConfig { access_key_id: None, secret_access_key: None, region: Some("us-east1"), endpoint: Some("https://storage.googleapis.com"), force_path_style_access: false, disable_multi_object_delete: true })]), metastore_configs: MetastoreConfigs([]), indexer_config: IndexerConfig { split_store_max_num_bytes: 149.0 GB, split_store_max_num_splits: 1000, max_concurrent_split_uploads: 12, max_merge_write_throughput: Some(100.0 MB), merge_concurrency: 2, enable_otlp_endpoint: false, enable_cooperative_indexing: true, cpu_capacity: CpuCapacity(3955) }, searcher_config: SearcherConfig { aggregation_memory_limit: 500.0 MB, aggregation_bucket_limit: 65000, fast_field_cache_capacity: 1000.0 MB, split_footer_cache_capacity: 500.0 MB, partial_request_cache_capacity: 64.0 MB, max_num_concurrent_split_searches: 100, max_num_concurrent_split_streams: 100, split_cache: None }, ingest_api_config: IngestApiConfig { max_queue_memory_usage: 2.1 GB, max_queue_disk_usage: 4.3 GB, replication_factor: 1, content_length_limit: 10.5 MB }, jaeger_config: JaegerConfig { enable_endpoint: true, lookback_period_hours: 72, max_trace_duration_secs: 3600, max_fetch_spans: 10000 } }
2024-06-02 18:27:38.673	2024-06-02T18:27:38.673Z  INFO quickwit_cli::service: quickwit version: 0.8.0 (x86_64-unknown-linux-gnu 2024-05-31T05:26:31Z c49bd58)
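Because the log above is listed newest first, the join/leave churn is easier to follow once the `quickwit_cluster::change` events are put in chronological order and paired by `generation_id`. The following is a rough sketch rather than an official tool: `membership_durations` and `EVENT_RE` are hypothetical names, and the regex assumes the exact message wording seen in this log. The resulting durations only reflect what the node that wrote the log (here `simian-indexer-0`) observed through gossip, not when the peers actually started or stopped.

```python
import re
from datetime import datetime

# Matches the membership events exactly as they are worded in the log above.
EVENT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z\s+INFO quickwit_cluster::change: "
    r"node `(?P<node>[^`]+)` has (?P<event>joined|left) the cluster "
    r".*generation_id=(?P<gen>\d+)"
)


def membership_durations(log_text: str) -> dict:
    """Map (node_id, generation_id) to the seconds between its join and leave events."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S.%f")
            events.append((ts, match["node"], match["event"], match["gen"]))
    # The log is newest first, so sort into chronological order before pairing.
    events.sort(key=lambda event: event[0])

    joined, durations = {}, {}
    for ts, node, event, gen in events:
        key = (node, gen)
        if event == "joined":
            joined[key] = ts
        elif key in joined:
            durations[key] = (ts - joined.pop(key)).total_seconds()
    return durations


# Two entries from the log above (ISO timestamp column only), still newest first.
sample = """\
2024-06-02T18:27:48.878Z  INFO quickwit_cluster::change: node `simian-indexer-4` has left the cluster node_id=simian-indexer-4 generation_id=1717352771444320018
2024-06-02T18:27:39.078Z  INFO quickwit_cluster::change: node `simian-indexer-4` has joined the cluster node_id=simian-indexer-4 generation_id=1717352771444320018
"""

for (node, gen), seconds in membership_durations(sample).items():
    print(f"{node} (generation {gen}) was visible for {seconds:.1f}s")
```

Pairing on `generation_id` rather than on the node name alone keeps successive restarts of the same pod apart, which matters here because `simian-indexer-6` appears with two different generations.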

fulmicoton · Jun 03 '24 00:06