`Terminated` state results in unhealthy ingesters
Describe the bug
When ingesters are shut down, they enter the `Terminated` state. Memberlist treats this state as unexpected, so the heartbeat fails and the instance is marked as unhealthy. Recovering requires manual intervention, which effectively breaks autoscaling.
To Reproduce
Steps to reproduce the behavior:
- Start Cortex v1.15.3 using Helm chart v2.1.0
- Use HPA to scale down Cortex ingesters
Expected behavior
Ingesters should scale down and remove themselves from the ring without errors.
Environment:
- Infrastructure: EKS
- Deployment tool: Helm chart v2.1.0
Additional Context
Logs
{"caller":"logging.go:76","level":"debug","msg":"GET //ingester/shutdown (301) 73.436µs","traceID":"1bf635dc8c6c3d4e","ts":"2023-11-21T16:41:39.79265651Z"}
{"caller":"lifecycler.go:498","level":"info","msg":"lifecycler loop() exited gracefully","ring":"ingester","ts":"2023-11-21T16:41:39.8043733Z"}
{"caller":"lifecycler.go:811","level":"info","msg":"changing instance state from","new_state":"LEAVING","old_state":"ACTIVE","ring":"ingester","ts":"2023-11-21T16:41:39.804427334Z"}
{"caller":"ingester.go:2586","level":"info","msg":"starting to flush and ship TSDB blocks","ts":"2023-11-21T16:41:39.804546549Z"}
{"caller":"compact.go:519","duration":"234.25592ms","level":"info","maxt":1700582400000,"mint":1700581137870,"msg":"write block","org_id":"fake","ts":"2023-11-21T16:41:40.038875302Z","ulid":"01HFSC4H6WJD5XV7H90F0P6D4V"}
{"block":"01HEQEWTXD8ZKSDDDE9071TP70","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.042351899Z"}
{"block":"01HEQY3KTDAJA0TJHPRCZ0MBQN","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.046284574Z"}
{"block":"01HEQHJBED85MPHZTEXSAS9SYD","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.049584673Z"}
{"block":"01HEQEWW1KRNEBKS9K4Y42RVMT","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.052457795Z"}
{"caller":"truncateMemory","duration":"52.163691ms","level":"info","msg":"Head GC completed","org_id":"fake","ts":"2023-11-21T16:41:40.104683711Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=127.0.0.6:54087","ts":"2023-11-21T16:41:40.10990833Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: cortex-store-gateway-1-1a5d9a43 (timeout reached)","ts":"2023-11-21T16:41:40.892540819Z"}
{"caller":"grpc_logging.go:46","duration":"76.461µs","level":"debug","method":"/grpc.health.v1.Health/Check","msg":"gRPC (success)","ts":"2023-11-21T16:41:40.927996371Z"}
{"caller":"compact.go:519","duration":"1.423570173s","level":"info","maxt":1700584899375,"mint":1700582400000,"msg":"write block","org_id":"fake","ts":"2023-11-21T16:41:41.528432979Z","ulid":"01HFSC4HG89P99GJVSEBSTFP1K"}
{"caller":"truncateMemory","duration":"202.667137ms","level":"info","msg":"Head GC completed","org_id":"fake","ts":"2023-11-21T16:41:41.732417054Z"}
{"caller":"checkpoint.go:100","from_segment":578,"level":"info","mint":1700584899375,"msg":"Creating checkpoint","org_id":"fake","to_segment":579,"ts":"2023-11-21T16:41:41.732951452Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=127.0.0.6:58933","ts":"2023-11-21T16:41:41.979575777Z"}
{"caller":"head.go:1240","duration":"1.523683363s","first":578,"last":579,"level":"info","msg":"WAL checkpoint complete","org_id":"fake","ts":"2023-11-21T16:41:43.256181134Z"}
{"caller":"ingester.go:2368","compactReason":"forced","level":"debug","msg":"TSDB blocks compaction completed successfully","ts":"2023-11-21T16:41:43.256293661Z","user":"fake"}
{"caller":"shipper.go:334","id":"01HFSC4H6WJD5XV7H90F0P6D4V","level":"info","msg":"upload new block","org_id":"fake","ts":"2023-11-21T16:41:43.301936682Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4H6WJD5XV7H90F0P6D4V/chunks/000001","from":"/data/tsdb/fake/thanos/upload/01HFSC4H6WJD5XV7H90F0P6D4V/chunks/000001","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.333067008Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4H6WJD5XV7H90F0P6D4V/index","from":"/data/tsdb/fake/thanos/upload/01HFSC4H6WJD5XV7H90F0P6D4V/index","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.427698215Z"}
{"caller":"shipper.go:334","id":"01HFSC4HG89P99GJVSEBSTFP1K","level":"info","msg":"upload new block","org_id":"fake","ts":"2023-11-21T16:41:43.500269397Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4HG89P99GJVSEBSTFP1K/chunks/000001","from":"/data/tsdb/fake/thanos/upload/01HFSC4HG89P99GJVSEBSTFP1K/chunks/000001","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.660061181Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4HG89P99GJVSEBSTFP1K/index","from":"/data/tsdb/fake/thanos/upload/01HFSC4HG89P99GJVSEBSTFP1K/index","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.856623646Z"}
{"caller":"memberlist_logger.go:74","level":"warn","msg":"Was able to connect to cortex-store-gateway-1-1a5d9a43 but other probes failed, network may be misconfigured","ts":"2023-11-21T16:41:43.890882572Z"}
{"caller":"ingester.go:2279","level":"debug","msg":"shipper successfully synchronized TSDB blocks with storage","ts":"2023-11-21T16:41:43.984722874Z","uploaded":2,"user":"fake"}
{"caller":"ingester.go:2595","level":"info","msg":"finished flushing and shipping TSDB blocks","ts":"2023-11-21T16:41:43.984859001Z"}
{"caller":"lifecycler.go:871","final_sleep":"30s","level":"info","msg":"lifecycler entering final sleep before shutdown","ts":"2023-11-21T16:41:43.985246801Z"}
{"caller":"signals.go:55","level":"info","msg":"=== received SIGINT/SIGTERM ===\n*** exiting","ts":"2023-11-21T16:41:44.816310571Z"}
{"caller":"module_service.go:96","level":"info","module":"ingester-service","msg":"module stopped","ts":"2023-11-21T16:41:44.816429019Z"}
{"caller":"module_service.go:86","level":"debug","module":"server","msg":"stopping","ts":"2023-11-21T16:41:44.816563052Z"}
{"caller":"module_service.go:109","level":"debug","module":"runtime-config","msg":"module waiting for","ts":"2023-11-21T16:41:44.816598457Z","waiting_for":"ingester-service"}
{"caller":"module_service.go:86","level":"debug","module":"runtime-config","msg":"stopping","ts":"2023-11-21T16:41:44.816632226Z"}
{"caller":"module_service.go:96","level":"info","module":"runtime-config","msg":"module stopped","ts":"2023-11-21T16:41:44.816643075Z"}
{"caller":"module_service.go:109","level":"debug","module":"memberlist-kv","msg":"module waiting for","ts":"2023-11-21T16:41:44.816657622Z","waiting_for":"ingester-service"}
{"caller":"module_service.go:86","level":"debug","module":"memberlist-kv","msg":"stopping","ts":"2023-11-21T16:41:44.816672603Z"}
{"caller":"memberlist_client.go:612","level":"info","msg":"leaving memberlist cluster","ts":"2023-11-21T16:41:44.816698917Z"}
{"caller":"module_service.go:96","level":"info","module":"memberlist-kv","msg":"module stopped","ts":"2023-11-21T16:41:45.841625286Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: cortex-distributor-7d7d5b59b8-9t7ks-7824768a (timeout reached)","ts":"2023-11-21T16:41:45.89149416Z"}
{"caller":"memberlist_logger.go:74","level":"info","msg":"Suspect cortex-distributor-7d7d5b59b8-9t7ks-7824768a has failed, no acks received","ts":"2023-11-21T16:41:48.891631962Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:49.804559785Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:54.804679488Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:59.805094041Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:42:04.805275687Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:42:09.805392347Z"}
{"caller":"lifecycler.go:877","level":"debug","msg":"unregistering instance from ring","ring":"ingester","ts":"2023-11-21T16:42:13.986349184Z"}
{"caller":"ingester.go:772","err":"failed to unregister from the KV store, ring: ingester: unexpected state: Terminated","level":"warn","msg":"failed to stop ingester lifecycler","ts":"2023-11-21T16:42:13.986629129Z"}
{"caller":"logging.go:76","level":"debug","msg":"GET /ingester/shutdown (204) 34.185574054s","traceID":"1a7457db41a31f14","ts":"2023-11-21T16:42:13.989845983Z"}
{"caller":"server_service.go:50","level":"info","msg":"server stopped","ts":"2023-11-21T16:42:14.148840428Z"}
{"caller":"module_service.go:96","level":"info","module":"server","msg":"module stopped","ts":"2023-11-21T16:42:14.148922944Z"}
{"caller":"cortex.go:423","level":"info","msg":"Cortex stopped","ts":"2023-11-21T16:42:14.148952283Z"}
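Timeline check
To make the ordering in the logs above easier to see, here is a minimal sketch that measures how long after the `memberlist-kv` "module stopped" line the lifecycler is still trying to reach the KV store. The helper names `parse_ts` and `kv_timeline` are hypothetical and not part of Cortex, and `SAMPLE` holds four lines copied from the logs above. One reading of the result is that the KV module is already stopped when the lifecycler writes, which would match the `unexpected state: Terminated` errors, but that is an assumption rather than a confirmed root cause.

```python
import json
from datetime import datetime, timezone

# Four lines copied verbatim from the logs in this issue.
SAMPLE = """\
{"caller":"lifecycler.go:871","final_sleep":"30s","level":"info","msg":"lifecycler entering final sleep before shutdown","ts":"2023-11-21T16:41:43.985246801Z"}
{"caller":"module_service.go:96","level":"info","module":"memberlist-kv","msg":"module stopped","ts":"2023-11-21T16:41:45.841625286Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:49.804559785Z"}
{"caller":"lifecycler.go:877","level":"debug","msg":"unregistering instance from ring","ring":"ingester","ts":"2023-11-21T16:42:13.986349184Z"}
"""


def parse_ts(ts):
    """Parse a UTC timestamp like the ones above, truncating to microseconds."""
    base, _, frac = ts.rstrip("Z").partition(".")
    micros = int(frac[:6].ljust(6, "0")) if frac else 0
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def kv_timeline(lines):
    """Return (kv_stopped_at, lifecycler events logged after that time).

    Each event is a tuple: (seconds after the KV stop, msg, err or None).
    """
    kv_stopped = None
    late = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        when = parse_ts(record["ts"])
        is_kv_stop = (
            record.get("module") == "memberlist-kv"
            and record.get("msg") == "module stopped"
        )
        if is_kv_stop:
            kv_stopped = when
        elif kv_stopped is not None and record.get("caller", "").startswith("lifecycler.go"):
            offset = (when - kv_stopped).total_seconds()
            late.append((offset, record["msg"], record.get("err")))
    return kv_stopped, late


if __name__ == "__main__":
    stopped_at, events = kv_timeline(SAMPLE.splitlines())
    print("memberlist-kv stopped at", stopped_at.isoformat())
    for offset, msg, err in events:
        print(f"+{offset:.3f}s {msg} err={err}")
```

Running the same function over the full log output should list every lifecycler line emitted after the KV module stopped, including the repeated heartbeat failures and the final unregister attempt.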