thanos Receiver: Running with number of replicas lower than replication factor in a hashring is accepted

Thanos, Prometheus and Golang version used: 0.28.0-rc.0

What happened: I was trying out the new RC locally with --receive.hashrings-algorithm=ketama, 6 replicas with replication factor 3. During my tests, some of my replicas were never able to get into a ready state.

After more digging, I found that this occurs when my setup gets into a state where the number of endpoints in the hashring is lower than the replication factor. I think the problem is twofold, depending on which hashing algorithm is used:

With --receive.hashrings-algorithm=ketama (my case), the hashring creation logic gets into an infinite loop, because it cannot pre-calculate the replicas for each section when there are fewer endpoints than the replication factor. The hashring change channel then blocks forever and the receiver never obtains a hashring configuration. Although the receiver process is running, its storage is never initialized, so it never becomes ready.
With the default hashmod algorithm, the issue is less obvious, since no such pre-calculation is done. However, some replication requests still land on the same node, which is not the desired behavior.
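To illustrate both failure modes, here is a minimal Go sketch. It is a simplified model and not the actual Thanos code: the function names `hashmodReplicas` and `distinctReplicas`, the endpoint names, and the ring layout are all made up for illustration. The first function assumes hashmod-style selection picks replica i at index (hash+i) modulo the endpoint count. The second walks a ring looking for distinct endpoints and is capped at one full pass; without that cap, a loop of this shape would never terminate when the ring holds fewer distinct endpoints than the replication factor.

```go
package main

import "fmt"

// hashmodReplicas models hashmod-style selection (an assumption, not the
// Thanos implementation): replica i goes to endpoints[(hash+i) % len].
func hashmodReplicas(endpoints []string, hash uint64, replicationFactor int) []string {
	if len(endpoints) == 0 {
		return nil
	}
	out := make([]string, 0, replicationFactor)
	for i := 0; i < replicationFactor; i++ {
		out = append(out, endpoints[(hash+uint64(i))%uint64(len(endpoints))])
	}
	return out
}

// distinctReplicas walks a hypothetical ring from position start and collects
// replicationFactor distinct endpoints. The walk is bounded to one full pass
// over the ring; ok is false if not enough distinct endpoints were found.
func distinctReplicas(ring []string, start, replicationFactor int) (replicas []string, ok bool) {
	seen := make(map[string]struct{}, replicationFactor)
	for step := 0; step < len(ring) && len(replicas) < replicationFactor; step++ {
		e := ring[(start+step)%len(ring)]
		if _, dup := seen[e]; dup {
			continue // same endpoint again, keep walking
		}
		seen[e] = struct{}{}
		replicas = append(replicas, e)
	}
	return replicas, len(replicas) == replicationFactor
}

func main() {
	endpoints := []string{"receive-0", "receive-1"}

	// Two endpoints, replication factor 3: one endpoint is picked twice.
	fmt.Println(hashmodReplicas(endpoints, 7, 3))

	// Ring with several sections but only two distinct endpoints.
	ring := []string{"receive-0", "receive-1", "receive-0", "receive-1"}
	replicas, ok := distinctReplicas(ring, 1, 3)
	fmt.Println(replicas, ok)
}
```

With two endpoints and a replication factor of 3, the first call returns a list in which one endpoint appears twice, and the second call reports `ok == false` instead of spinning forever.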

What you expected to happen: I'd expect the receiver not to hang forever (with the ketama algorithm) and to handle a configuration where the replication factor cannot be guaranteed (e.g. log an error, exit the receiver).

How to reproduce it (as minimally and precisely as possible):

Create hashring configuration with 2 endpoints
Create a receiver setup with --receive.hashrings-algorithm=ketama and --receive.replication-factor=3
Watch the receiver replicas never becoming ready

Anything else we need to know: I noticed this while experimenting with https://github.com/observatorium/thanos-receive-controller, which changes the hashring automatically, but users could also hit this issue with an erroneous hashring config file.

Aug 24 '22 14:08 matej-g

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Nov 13 '22 15:11 stale[bot]

Hi @matej-g
I seem to be hitting the same symptoms as you. Did you get this resolved? Could you share your receiver and hashing parameters?

Feb 24 '23 14:02 JayChanggithub

Created #6168 for it now.

Feb 27 '23 16:02 MichaHoffmann

Was there any fix for this issue? I am also experiencing the same with version 0.32.5 of Thanos.

Jan 12 '24 23:01 dmilind

> Was there any fix for this issue? I am also experiencing the same with version 0.32.5 of Thanos.

Are you experiencing deadlock or does receiver fail to start with an error?

Jan 13 '24 09:01 MichaHoffmann
