
Add feature to wait on ready replicas on scaling up

Open matej-g opened this issue 2 years ago • 3 comments

Based on work done in #89

This change adds a flag, --allow-only-ready-replicas, that changes the behavior of the controller on a scale up: if enabled, the controller first waits for all replicas to be ready before adding them to the hashring. The feature is documented as well.
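The behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `replica` type, the `nextEndpoints` function, and the endpoint names are hypothetical, and the real controller works with Kubernetes objects rather than plain structs.

```go
package main

import "fmt"

// replica is a hypothetical, simplified view of one receive pod.
type replica struct {
	Endpoint string
	Ready    bool
}

// nextEndpoints decides which endpoints to publish in the hashring.
// current is the list already published; desired is the state of the
// StatefulSet after the change. On a scale up, the current list is kept
// until every desired replica reports ready (assumed behavior).
func nextEndpoints(current []string, desired []replica) []string {
	all := make([]string, 0, len(desired))
	allReady := true
	for _, r := range desired {
		all = append(all, r.Endpoint)
		if !r.Ready {
			allReady = false
		}
	}
	// Scale up with at least one unready replica: do not touch the hashring yet.
	if len(desired) > len(current) && !allReady {
		return current
	}
	return all
}

func main() {
	current := []string{"receive-0:10901"}
	desired := []replica{
		{Endpoint: "receive-0:10901", Ready: true},
		{Endpoint: "receive-1:10901", Ready: false},
	}
	fmt.Println(nextEndpoints(current, desired)) // new replica is held back

	desired[1].Ready = true
	fmt.Println(nextEndpoints(current, desired)) // both endpoints are published
}
```

The point of the sketch is that a partially ready scale up leaves the published hashring unchanged, so writes are never routed to a pod that cannot accept them yet.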

matej-g avatar Aug 26 '22 13:08 matej-g

Hi,

I have built an image from this PR, and unfortunately the controller doesn't seem to behave as expected. What I expected: while scaling up my receive STS, the controller updates the configmaps only when the replicas are ready.

My architecture looks like this: thanos distributor -> n* thanos receive ingester. The hashring file is present only on the distributor and is managed by the thanos receive controller, my replication.factor is at 3.

My controller config:

  containers:
  - args:
    - --configmap-name=thanos-receive-base
    - --configmap-generated-name=thanos-receive-generated
    - --file-name=hashrings.json
    - --namespace=$(NAMESPACE)
    - --allow-only-ready-replicas=true

I scaled up from 15 to 18 replicas, so 3 replicas were added in a single update, and my service went down with the errors below:

level=debug ts=2022-09-01T12:15:31.639283401Z caller=handler.go:555 component=receive component=receive-handler tenant=lgcy msg="failed to handle request" err="2 errors: replicate write request for endpoint thanos-test-receiver-16.thanos-test-receiver-headless.meta-monitoring-test:10901: quorum not reached: target not available; replicate write request for endpoint thanos-test-receiver-15.thanos-test-receiver-headless.meta-monitoring-test:10901: quorum not reached: target not available"

The configmap generated by the controller was updated with all the pods, including the ones that were not ready and the ones that had not been created yet.
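For context on the "quorum not reached" error, here is a small arithmetic sketch. It assumes the usual majority rule, quorum = floor(replication factor / 2) + 1; the function names are hypothetical and the exact rule may differ between Thanos versions.

```go
package main

import "fmt"

// writeQuorum returns the number of successful replica writes required,
// assuming a simple majority rule: floor(rf/2) + 1.
func writeQuorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

// quorumReached reports whether enough replica writes succeeded.
func quorumReached(replicationFactor, succeeded int) bool {
	return succeeded >= writeQuorum(replicationFactor)
}

func main() {
	rf := 3
	fmt.Println(writeQuorum(rf))      // 2
	fmt.Println(quorumReached(rf, 2)) // true: one target may be unavailable
	fmt.Println(quorumReached(rf, 1)) // false: two targets unavailable
}
```

Under that assumption, a write replicated to three endpoints fails as soon as two of them are pods that are in the hashring but not yet available, which matches the two failing endpoints in the log above.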

Did I miss something in my test? Thanks in advance.

lud97x avatar Sep 01 '22 13:09 lud97x

My controller's serviceaccount didn't have read permission on the pod resources. After I edited the controller role, it is working fine:

  - apiGroups: 
    - ""
    resources: 
    - "pods"
    verbs:
    - "get"
    - "watch"
    - "list"
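To illustrate why read access to pods matters here: a readiness check needs the pod's status conditions, which the controller can only see if it is allowed to get, list and watch pods. The sketch below uses hypothetical local types (`pod`, `condition`) instead of the Kubernetes client types, and relies only on the Kubernetes convention that a ready pod carries a condition of type "Ready" with status "True".

```go
package main

import "fmt"

// condition is a hypothetical stand-in for a Kubernetes pod condition.
type condition struct {
	Type   string
	Status string
}

// pod is a hypothetical, simplified pod with only the fields needed here.
type pod struct {
	Name       string
	Conditions []condition
}

// isReady reports whether the pod has a "Ready" condition with status "True".
func isReady(p pod) bool {
	for _, c := range p.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	// No Ready condition reported yet: treat the pod as not ready.
	return false
}

func main() {
	starting := pod{Name: "receiver-15", Conditions: []condition{{Type: "Ready", Status: "False"}}}
	running := pod{Name: "receiver-0", Conditions: []condition{{Type: "Ready", Status: "True"}}}
	fmt.Println(isReady(starting), isReady(running)) // false true
}
```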

lud97x avatar Sep 01 '22 14:09 lud97x

So I have tried again and this PR doesn't work. I have built an image based on https://github.com/observatorium/thanos-receive-controller/pull/89 and it works as expected. I have noticed that some parts are missing in your PR, for example https://github.com/michael-burt/thanos-receive-controller/blob/allow-only-ready-replicas/main.go#L595

lud97x avatar Sep 02 '22 12:09 lud97x

Hey @lud97x thanks for taking the time to try this out! So after you adjusted the service account permissions, did it work?

The reason for the difference is explained in the README update I added as part of this PR. I removed that part because reacting to every pod (un)readiness change could lead to frequent hashring changes.

matej-g avatar Sep 08 '22 12:09 matej-g