ingesters stuck in LEAVING state when increasing their number of tokens

Open agardiman opened this issue 2 years ago • 2 comments

Describe the bug

When I change the CLI option -ingester.ring.num-tokens, the pods leave the ring in the LEAVING state. When they restart, they detect that their entry is already present in the ring and get stuck in that state, never reaching ACTIVE. This happens when the ingesters are configured NOT to deregister from the ring at shutdown.

To Reproduce

Steps to reproduce the behavior:

  1. Start (Mimir 2.4.0) ingesters with -ingester.ring.num-tokens=64
  2. Change -ingester.ring.num-tokens to something bigger, even just 65, and let the ingesters restart
  3. The restarted ingesters never come up as ACTIVE

Expected behavior

The new ingesters detect that they already have some tokens locally but should have more, so they generate the missing tokens, insert them in the ring, reach the ACTIVE state, and start ingesting metrics.

Environment

  • Infrastructure: Kubernetes
  • Deployment tool: jsonnet
  • Mimir version: 2.4.0

Additional Context

The following are the logs from an affected instance:

➜  ~ kubectl logs -f ingester-zone-a-0
level=info ts=2022-12-21T10:38:44.197631625Z caller=main.go:210 msg="Starting application" version="(version=2.4.0, branch=HEAD, revision=32137ee)"
level=info ts=2022-12-21T10:38:44.198422782Z caller=server.go:306 http=[::]:80 grpc=[::]:9095 msg="server listening on addresses"
...
level=info ts=2022-12-21T10:38:44.204627871Z caller=memberlist_client.go:436 msg="Using memberlist cluster label and node name" cluster_label= node=ingester-zone-a-0-1abc387f
...
level=info ts=2022-12-21T10:38:44.211937824Z caller=memberlist_client.go:543 msg="memberlist fast-join starting" nodes_found=9 to_join=4
level=info ts=2022-12-21T10:38:44.219120657Z caller=memberlist_client.go:563 msg="memberlist fast-join finished" joined_nodes=4 elapsed_time=13.697477ms
level=info ts=2022-12-21T10:38:44.219183764Z caller=memberlist_client.go:576 msg="joining memberlist cluster" join_members=dns+gossip-ring.cortex.svc.cluster.local:7946
level=info ts=2022-12-21T10:38:44.236074303Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=9 elapsed_time=16.890905ms
...
level=info ts=2022-12-21T10:38:44.400459071Z caller=module_service.go:82 msg=initialising module=memberlist-kv
...
level=info ts=2022-12-21T10:38:44.523936265Z caller=mimir.go:762 msg="Application started"
level=info ts=2022-12-21T10:38:44.525500762Z caller=lifecycler.go:612 msg="existing entry found in ring" state=LEAVING tokens=64 ring=ingester

As an experiment, if I instead set the option to unregister at shutdown, the instance unregisters from the ring, then finds the old 64 tokens on the file system, adds the remaining tokens (leaving the old 64 untouched), and registers again with the old 64 tokens plus the new ones.

agardiman, Dec 22 '22 09:12

I think you've hit this issue: https://github.com/grafana/dskit/issues/73

If so, it's a known issue. There was some work on it a while ago (https://github.com/grafana/dskit/pull/79), but we haven't had time to follow up on it yet.

pracucci, Dec 22 '22 16:12

It does seem to be the same issue! I left some comments on the PR. If no one is working on it, I'm happy to help.

agardiman, Dec 23 '22 10:12