
ring: Members of the ring stay `unhealthy` after leaving the ring.

Open kavirajk opened this issue 3 years ago • 11 comments

Problem

Recently we started seeing lots of "unhealthy" instances on the distributor ring in Loki clusters during normal rollouts. These "unhealthy" members are instances that have already left the ring but failed to unregister themselves from the underlying KV store (Consul in this case).

I did some investigation and would like to record the behavior here.

Investigation

So what happens when a member of the ring leaves?

It needs to unregister() itself from the underlying KV store. Unregistering basically means removing its instance ID from the shared key/value entry called collectors/distributor (or another key, depending on the component).

This key/value entry collectors/distributor is shared among all the members of the ring, and every member needs to update it before it can successfully leave the ring.

Shared access to collectors/distributor is provided through CAS operations, which give us a synchronization mechanism for a key/value entry shared by multiple members of the ring.

The way CAS (Check-And-Set) works is based on the current ModifyIndex, a piece of metadata about the key/value entry you are trying to set (or update).

This index is a monotonically increasing counter that gets updated every time any member updates this particular KV entry.

The CAS operation is similar to a PUT operation, except that it takes an additional parameter called cas, an integer representing the ModifyIndex of the key/value entry you are trying to update.

Two cases.

  1. If cas is set to 0 (or left empty), then the KV entry is updated only if it doesn't already exist.
  2. If cas is set to a non-zero value, then the KV entry is updated only if that value matches the current ModifyIndex of the entry.
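
For illustration, here is a minimal sketch of these two cases using the official Consul Go client (github.com/hashicorp/consul/api); the key name and values are just examples, and error handling is kept deliberately simple:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	kv := client.KV()

	// Case 1: ModifyIndex == 0 -- the write succeeds only if the key
	// does not exist yet.
	created, _, err := kv.CAS(&api.KVPair{
		Key:         "collectors/distributor",
		Value:       []byte("initial ring state"),
		ModifyIndex: 0,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("created:", created)

	// Case 2: read the current ModifyIndex and pass it back; the write
	// succeeds only if nobody has updated the key in the meantime.
	pair, _, err := kv.Get("collectors/distributor", nil)
	if err != nil || pair == nil {
		log.Fatal("key not found or error: ", err)
	}
	ok, _, err := kv.CAS(&api.KVPair{
		Key:         "collectors/distributor",
		Value:       []byte("updated ring state"),
		ModifyIndex: pair.ModifyIndex,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("cas succeeded:", ok) // false if the index was stale
}
```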

For example, checking the current ModifyIndex in one of our clusters:

curl -s http://127.0.0.1:8500/v1/kv/collectors/distributor | jq .[0].ModifyIndex
8149486

Here the ModifyIndex is 8149486. If I try to CAS with a lower (stale) value, it returns false:

curl -s --request PUT --data 10 http://127.0.0.1:8500/v1/kv/collectors/distributor?cas=100 && echo
false

This is what is happening in our distributor ring. A "LEAVING" distributor tries a maximum of 10 times to set the value with the ModifyIndex it read previously. But in the meantime some other distributor may have updated the entry, making that ModifyIndex out of date. As a result, these distributors run out of retries.
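
To make the failure mode concrete, here is a rough sketch of the general shape of such a bounded CAS retry loop. This is a simplified illustration, not dskit's actual code: the kvStore and ringDesc interfaces, errCASConflict sentinel, and the constants are stand-ins invented for the example (dskit's real kv.Client.CAS is callback-based).

```go
package ringexample

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Stand-ins for dskit's ring description and KV client, simplified for illustration.
type ringDesc interface {
	RemoveInstance(id string)
}

type kvStore interface {
	// Get returns the current ring description and its ModifyIndex.
	Get(ctx context.Context, key string) (ringDesc, uint64, error)
	// CAS writes desc only if the key's ModifyIndex still equals modifyIndex.
	CAS(ctx context.Context, key string, desc ringDesc, modifyIndex uint64) error
}

// errCASConflict is a hypothetical sentinel meaning "ModifyIndex was stale".
var errCASConflict = errors.New("CAS conflict: ModifyIndex out of date")

const (
	maxCASRetries = 10              // the default number of attempts described above
	casRetryDelay = 1 * time.Second // illustrative fixed delay between attempts
)

// unregister removes this instance from the shared ring key. With many
// distributors leaving at once, every attempt can lose the CAS race, the
// loop runs out of retries, and the instance stays in the ring as "unhealthy".
func unregister(ctx context.Context, store kvStore, instanceID string) error {
	for attempt := 0; attempt < maxCASRetries; attempt++ {
		desc, modifyIndex, err := store.Get(ctx, "collectors/distributor")
		if err != nil {
			return err
		}
		desc.RemoveInstance(instanceID)

		err = store.CAS(ctx, "collectors/distributor", desc, modifyIndex)
		if err == nil {
			return nil // successfully left the ring
		}
		if !errors.Is(err, errCASConflict) {
			return err // a real error, not just a lost race
		}
		time.Sleep(casRetryDelay) // another member updated the key first; try again
	}
	return fmt.Errorf("failed to unregister %s after %d CAS attempts", instanceID, maxCASRetries)
}
```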

Proposal

The ideal solution is a bit tricky, as we need some kind of synchronization on the same key/value entry collectors/distributor shared by all the members of the ring, which is exactly what ModifyIndex gives us anyway.

So tweaking the MaxCasRetry (and/or CasRetryDelay) configs would be a good option IMHO. Unfortunately, dskit currently doesn't allow tweaking these retry arguments, but I think we can make them configurable. Setting the retry count equal to the number of replicas always resulted in no unhealthy members (tested even with 100 distributor replicas).

So I think it's better to:

  1. Make these flags configurable (a rough sketch of what that could look like follows below).
  2. Document this behavior in the ring and kv packages.
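
As a rough sketch of point 1, the knobs could look something like the following. The field names, flag names, and defaults here are illustrative suggestions, not dskit's final API; flag registration follows the usual dskit/Loki prefixed-flags convention.

```go
package consul

import (
	"flag"
	"time"
)

// Config holds the Consul KV client options. MaxCasRetries and
// CasRetryDelay are the two knobs proposed above.
type Config struct {
	MaxCasRetries int           `yaml:"max_cas_retries"`
	CasRetryDelay time.Duration `yaml:"cas_retry_delay"`
}

// RegisterFlagsWithPrefix registers the retry-related flags with a
// component-specific prefix (e.g. "distributor.ring.").
func (cfg *Config) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
	f.IntVar(&cfg.MaxCasRetries, prefix+"consul.cas-retries", 10,
		"Maximum number of CAS attempts when updating the ring key.")
	f.DurationVar(&cfg.CasRetryDelay, prefix+"consul.cas-retry-delay", time.Second,
		"Delay between CAS retry attempts.")
}
```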

NOTES

  1. When CAS returns false (because of an unmatched ModifyIndex), it is treated as a non-error case, e.g. this is handled in the dskit code here.

  2. The default of 10 retries may work for the ingester (it has an expensive and time-consuming graceful shutdown process), so most likely one of the CAS retries will succeed. But that's not the case for the distributor: it can shut down almost instantly, so there is a high chance of lots of distributors leaving the ring and accessing the shared key/value entry at the same time.

  3. While investigating, I also found that the error returned by the unregister operation is lost if the ring is wrapped in a BasicService and the service doesn't have any listeners.

  4. CAS documentation from the Consul website.

Any thoughts, suggestions or criticisms are welcome!

kavirajk avatar Jan 23 '22 21:01 kavirajk

I think the retries and delay should be configurable, but I would hope for a better solution, although I don't have an idea.

/cc @pstibrany ?

cyriltovena avatar Jan 24 '22 07:01 cyriltovena

Does having a lot of pods shut down at the same time not cause other problems, like difficulty updating routing for the Service?

How many are we talking, in what time?

How come the replacement pods can insert themselves in the ring and become ready at the same rate?

bboreham avatar Jan 24 '22 08:01 bboreham

Does having a lot of pods shut down at the same time not cause other problems, like difficulty updating routing for the Service?

No, not really (AFAIS). The only problem we saw was this retry loop completing without success.

How many are we talking, in what time?

We saw 22 unhealthy members in a cluster with 50 distributor replicas. This happens during normal rollouts (no OOMs or any unexpected CrashLoopBackOff).

How come the replacement pods can insert themselves in the ring and become ready at the same rate?

Good point. Not sure, I will check how members register themselves and come back to you.

kavirajk avatar Jan 24 '22 08:01 kavirajk

Do you have minReadySeconds set on the manifest?

bboreham avatar Jan 24 '22 08:01 bboreham

How come the replacement pods can insert themselves in the ring and become ready at the same rate?

I think that's because we have initRing called in the main loop of the ring itself? (So even if you remove entries from the underlying Consul, or Consul restarts, these members try to register again during the loop cycle.)

But that's not the case with unregister().

kavirajk avatar Jan 24 '22 08:01 kavirajk

If the program doesn’t become ready until registered, and if you have minReadySeconds set, then the rollout will slow down until the additions do succeed.

I’m still interested to know in what time those 22 pods failed to update. Or perhaps: what is the overall rate of CAS operations during the rollout?

bboreham avatar Jan 24 '22 08:01 bboreham

I’m still interested to know in what time those 22 pods failed to update. Or perhaps: what is the overall rate of CAS operations during the rollout?

@kavirajk can you share some metrics on that?

If the program doesn’t become ready until registered, and if you have minReadySeconds set, then the rollout will slow down until the additions do succeed.

Yeah, I think tweaking the deployment could be useful to slow this down and avoid it. We don't need it to go super fast. Can we try this too?

cyriltovena avatar Jan 24 '22 10:01 cyriltovena

I guess there are different tradeoffs here. I tried the following cases to get some metrics about the CAS rate.

  1. In a normal rollout, the CAS operation rate reaches almost 700 reqs per sec (with 100 replicas).


I used maxSurge and maxUnavailable of 25%.

This resulted in around 10 unhealthy members. The rollout completed in under 2 minutes.

  2. With the following k8s configs: maxSurge: 5, maxUnavailable: 1.

The CAS rate is much lower (30 reqs per sec).

The problem is that the rollout is very slow: it took more than 10 minutes to complete (but there were no unhealthy members).

kavirajk avatar Jan 24 '22 13:01 kavirajk

I guess we can play with these k8s configs to reach good default values for distributors, but it looks like we always have to pay some rollout time as a cost.

kavirajk avatar Jan 24 '22 13:01 kavirajk

Here is another one. Keep the k8s configs the same (maxSurge and maxUnavailable at 25%), but just increase the retries from 10 -> 100.

The rollout took the same ~2 mins, with no unhealthy members. But the CAS rate increased almost 2x (1500 reqs per sec) at peak.


kavirajk avatar Jan 24 '22 13:01 kavirajk

On one side we want no unhealthy members (ease of operation). On the other side we don't want to overload Consul with a high CAS rate. And we also need to be careful about rollout time.

These are contradicting goals, and we need to find some sweet spot with sane config values in k8s and CAS retries.

Finally, I will also see how backoff + jitter affects the CAS retries (currently our Consul KV client doesn't support any backoff).
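
For reference, a minimal sketch of what a backoff-with-jitter retry could look like, assuming we reuse dskit's own backoff package (whether the Consul client would actually use that package, and which values to use, are open questions; the attempt callback is a placeholder for a single CAS try):

```go
package ringexample

import (
	"context"
	"time"

	"github.com/grafana/dskit/backoff"
)

// casWithBackoff retries attempt (a single CAS try) with exponential
// backoff and jitter instead of a fixed delay, so that many distributors
// leaving at the same time are less likely to keep colliding on the same
// key. The concrete values below are illustrative, not proposed defaults.
func casWithBackoff(ctx context.Context, attempt func(ctx context.Context) error) error {
	b := backoff.New(ctx, backoff.Config{
		MinBackoff: 100 * time.Millisecond,
		MaxBackoff: 1 * time.Second,
		MaxRetries: 10,
	})
	var lastErr error
	for b.Ongoing() {
		if lastErr = attempt(ctx); lastErr == nil {
			return nil
		}
		b.Wait() // sleeps for a jittered, exponentially growing interval (or until ctx is done)
	}
	if lastErr != nil {
		return lastErr
	}
	return b.Err() // context canceled or retries exhausted
}
```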

kavirajk avatar Jan 24 '22 13:01 kavirajk