receive + receive controller: Eliminate downtime when scaling up/down hashring replicas.

Open bwplotka opened this issue 3 years ago • 4 comments

We hit cases where, after introducing more replicas, the Thanos controller updates the hashring and the receive ring becomes unstable because one node is expected but still down. We need to find a way to improve this state; it's quite fragile at the moment.

Mitigation: turn off thanos-receive-controller, increase the replicas, then turn the controller back on.
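
A minimal sketch of that mitigation with kubectl, assuming the controller runs as a Deployment named thanos-receive-controller and the receivers as a StatefulSet named thanos-receive in namespace thanos (all names and replica counts here are placeholders):

# Pause the controller so it stops rewriting the hashring ConfigMap
kubectl -n thanos scale deployment thanos-receive-controller --replicas=0

# Scale the receivers and wait for every new pod to become Ready
kubectl -n thanos scale statefulset thanos-receive --replicas=6
kubectl -n thanos rollout status statefulset thanos-receive

# Re-enable the controller so it regenerates the hashring with the new endpoints
kubectl -n thanos scale deployment thanos-receive-controller --replicas=1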

bwplotka avatar Mar 18 '21 16:03 bwplotka

Thanks @bwplotka. What we ended up doing was something a little cruder (since this was just a test env): we stopped all receivers, rm -rf'ed the receive PVs (yes, we lost 2 hours of data that was not yet persisted in object storage), and then restarted the receivers with a higher replica count. It seemed to work more efficiently (less memory + CPU in aggregate) for the same workload. Will try your suggestion and see how it works. But being able to increase the number of replicas on the fly is a real need, of course. BTW @bwplotka, do we have any recommendations on running an odd vs. even number of replicas?
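
For reference, a rough sketch of that cruder approach, assuming the receivers run as a StatefulSet named thanos-receive whose PVCs carry the label app=thanos-receive (names, labels, and the replica count are placeholders; deleting the PVCs discards any blocks not yet uploaded to object storage):

# Stop all receivers
kubectl -n thanos scale statefulset thanos-receive --replicas=0

# Delete the persistent volumes backing the receivers (loses data not yet in object storage)
kubectl -n thanos delete pvc -l app=thanos-receive

# Restart with the higher replica count
kubectl -n thanos scale statefulset thanos-receive --replicas=6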

bjoydeep avatar Mar 18 '21 17:03 bjoydeep

https://github.com/observatorium/thanos-receive-controller/pull/70 might help :)

spaparaju avatar Mar 23 '21 05:03 spaparaju

We hit the same kind of issue when terminating a k8s node that was hosting replicas, which eventually made us lose the quorum. We use a "Chaos Monkey" script that randomly terminates 1 EC2 instance per day in our EKS cluster.

It took approximately 30 minutes for the quorum to be restored (with no manual action).

Logs

level=error ts=2022-01-03T15:48:28.468568897Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:00:15.101622584Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:03:05.711160692Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"
 
level=error ts=2022-01-03T16:07:22.526307825Z caller=handler.go:366 component=receive component=receive-handler err="2 errors: replicate write request for endpoint thanos-receive-default-receivers-16.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-default-receivers-17.thanos-receive-default-receivers.receivers.svc.cluster.local:10901: conflict" msg="internal server error"

I see 2 things here:

  • eliminate or reduce downtime when pods move around, e.g. during scaling (this issue)
  • identify the primary receivers of the quorum so they can be scheduled on different nodes, and forward to a live primary (I can maybe create another issue)

(I lack detailed knowledge of how it works internally.)

r0mdau avatar Feb 02 '22 17:02 r0mdau

If quorum is lost, does the Receiver stop ingesting samples altogether? Is there a metric that can be used to fire an alert when quorum is lost?

I am struggling to understand best practices around scaling the hashring. If http_requests_total{code="200"} on the Receiver goes to 0, does this imply that no metrics are being ingested?
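
As a hedged sketch of one way to watch for that condition, one could query the rate of successful requests on the metric mentioned above through the Prometheus HTTP API; the job and handler label values here are assumptions and would need to match your setup:

# Returns a non-empty result when the Receiver has served no successful
# requests over the last 5 minutes (label values are assumptions)
curl -s 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{job="thanos-receive",handler="receive",code="200"}[5m])) == 0'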

michael-burt avatar May 11 '22 18:05 michael-burt