thanos
Receiver: Running with number of replicas lower than replication factor in a hashring is accepted
Thanos, Prometheus and Golang version used:
0.28.0-rc.0
What happened:
I was trying out the new RC locally with --receive.hashrings-algorithm=ketama, 6 replicas with replication factor 3. During my tests, some of my replicas were never able to get into a ready state.
After more digging I found out it occurs when my setup gets into a state where the number of endpoints in the hashring is lower than the replication factor. I think the problem is twofold, depending on which hashing algorithm is used:
- In my case I used --receive.hashrings-algorithm=ketama. This caused the hashring creation logic to get into an infinite loop, since we're not able to pre-calculate replicas for sections. As a result, the hashring change channel blocks forever and the hashring configuration is never obtained, meaning that although the receiver is running, it never becomes ready because storage is never initialized
- In case of using the default hashmod algorithm, this issue might not be so obvious, since we're not doing such pre-calculation. However, it would still mean some replication requests land on the same nodes, which is not desired behavior
What you expected to happen: I'd expect receiver not to hang forever (in case of Ketama algorithm) and to handle configuration where the replication factor cannot be guaranteed (e.g. log an error, exit receiver).
How to reproduce it (as minimally and precisely as possible):
- Create hashring configuration with 2 endpoints
- Create a receiver setup with --receive.hashrings-algorithm=ketama and --receive.replication-factor=3
- Watch the receiver replicas never becoming ready
Anything else we need to know: I noticed this when I was experimenting with https://github.com/observatorium/thanos-receive-controller, which automatically changes the hashring, but users could hit this issue even with erroneous hashring config files
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
Hi @matej-g
It seems I am hitting the same symptoms as you. Did you get this resolved? Could you share your receiver parameters and hashring configuration?
Created #6168 for it now.
Was there any fix for this issue? I am also experiencing the same with version 0.32.5 of Thanos.
Are you experiencing deadlock or does receiver fail to start with an error?