
the ring never removes old ingester even if the ingester pod is evicted

Open wuyafang opened this issue 5 years ago • 45 comments

I have a similar problem to #1502. When my ingester pod was evicted, a new ingester pod was created. Now the ring has two ingesters, but only one (the new one) is healthy. The old one is not removed from the ring, even if I delete the evicted pod manually. The ring information is as follows:

| Ingester | State | Address | Last heartbeat | Tokens | Ownership |
| --- | --- | --- | --- | --- | --- |
| ingester-7fc8759d7f-nzb6g | ACTIVE | 172.16.0.62:9095 | 2019-07-19 03:33:32 +0000 UTC | 128 | 45.739077787319914% |
| ingester-7fc8759d7f-wmnms | Unhealthy | 172.16.0.93:9095 | 2019-07-18 14:46:18 +0000 UTC | 128 | 54.260922212680086% |
The new ingester's status is always unready, and the distributor logs this error:

level=warn ts=2019-07-19T03:41:45.413839063Z caller=server.go:1995 traceID=daf4028f530860f msg="POST /api/prom/push (500) 727.847µs Response: \"at least 1 live ingesters required, could only find 0\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 3742; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.11.0; X-Forwarded-For: 172.16.0.17; X-Forwarded-Host: perf.monitorefk.huawei.com; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Original-Uri: /api/prom/push; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 172.16.0.17; X-Request-Id: 62a470dc6de7a83c8974e3411fa63e40; X-Scheme: https; X-Scope-Orgid: custom; "

I wonder if there is any solution to deal with this situation automatically? Maybe check against the replication factor and remove the excess unhealthy ingesters from the ring?

wuyafang avatar Jul 19 '19 04:07 wuyafang

If the ingester shut down cleanly, even on eviction, then it would not be in the ring. So, the first task is to find out why it did not shut down cleanly, and if possible fix that.

Everything else you report is deliberate. We return not-ready to halt a rolling update.

bboreham avatar Jul 19 '19 05:07 bboreham

Actually I don’t understand "could only find 0", since your ring shows 1 active.

bboreham avatar Jul 19 '19 05:07 bboreham

What do you mean by "shut down cleanly"?

I tried to reproduce the problem with delete pod --force, and a new ingester pod was created by the deployment controller immediately. Now I get a ring with two ingesters (one active, the other unhealthy), an unready state for my new ingester (because there is an unhealthy pod, so 503 is returned for the readiness check), and a distributor log like this:
level=warn ts=2019-07-19T08:18:25.511613228Z caller=logging.go:49 traceID=5eabc2f0f6b837e9 msg="POST /api/prom/push (500) 266.759µs Response: \"at least 2 live ingesters required, could only find 1\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 5162; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.11.1; X-Forwarded-For: 100.95.185.106; X-Forwarded-Host: 100.95.137.223; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Original-Uri: /api/prom/push; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 100.95.185.106; X-Request-Id: 89b05193a60c7935ac6a7bcd090b9a16; X-Scheme: https; X-Scope-Orgid: primary

I'm confused because my -distributor.replication-factor is 1, so by minSuccess := (replicationFactor / 2) + 1 my distributor should only need 1 live ingester, but the log tells me I need two.

So is there something I've misunderstood?

I also wonder when an ingester is added to the ring and when it is removed. Does Consul do it by itself, or does the ingester tell it what to do? I notice that when an ingester starts and shuts down, it updates the ring. But what if the ingester shuts down uncleanly? Is there any way to automatically clean the unhealthy ingester out of the ring?

By the way, after I restart my Consul, the ring only contains the active ingester and everything works well.

wuyafang avatar Jul 19 '19 07:07 wuyafang

I mean the ingester went through its exit sequence, rather than being abruptly terminated from outside.

There are two main cases: hand-over to another ingester, and flush to store. In both cases the time required is a function of how much data is in memory.

When using an explicitly provisioned store (eg DynamoDB) it would be nice to scale up specifically for a “save everything” operation. There’s no code to do that currently.

bboreham avatar Jul 19 '19 08:07 bboreham

I see now... you do this:

    replicationFactor := r.cfg.ReplicationFactor
    if len(ingesters) > replicationFactor {
        replicationFactor = len(ingesters)
    }

so the replicationFactor is now 2, instead of what I set with -distributor.replication-factor. This is there to handle nodes joining/leaving, but it causes the write failures in my case above.
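
To illustrate the effect (a minimal, self-contained sketch, not the actual distributor code; the function name is mine):

    package main

    import "fmt"

    // effectiveMinSuccess mirrors the snippet above: the replication factor is
    // raised to the number of ring entries, and minSuccess is derived from it.
    // Illustrative only, not the real Cortex function.
    func effectiveMinSuccess(configuredRF, ringEntries int) int {
        rf := configuredRF
        if ringEntries > rf {
            rf = ringEntries
        }
        return rf/2 + 1
    }

    func main() {
        // RF=1, but the ring still holds the dead ingester's entry (2 entries in total):
        // minSuccess becomes 2 while only 1 live ingester exists, so writes fail.
        fmt.Println(effectiveMinSuccess(1, 2)) // prints 2
    }

With RF=1 and one stale entry this gives minSuccess = 2 with only 1 live ingester, which matches the "at least 2 live ingesters required, could only find 1" error above.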

wuyafang avatar Jul 19 '19 09:07 wuyafang

That sounds like the same problem as #1290. @tomwilkie can you remember what that check is for? https://github.com/cortexproject/cortex/blob/7cf06900f6fea998724a46a8534ce13ee94c1b67/pkg/ring/replication_strategy.go#L20

bboreham avatar Jul 19 '19 09:07 bboreham

Actually, if I deploy one ingester with replicationFactor 1, the ingester pod can be evicted because of low memory and the kubelet will then start another ingester pod.

However, the previous ingester didn't exit cleanly, so the corresponding entry in the Consul ring will never be cleaned up. At this moment:

replicationFactor == len(ingesters) = 2 (1 evicted ingester and 1 running ingester)

minSuccess = (replicationFactor / 2) + 1 = 2

However len(liveIngesters) = 1 < minSuccess ---> there is a deadlock: the unhealthy ingester is never cleaned from the ring, so we'll never reach minSuccess.

Two problems here:

  1. Why, when len(ingesters) > replicationFactor, is replicationFactor set to len(ingesters)? ---> Having more ingesters in the ring than the original replicationFactor is usually not normal.

  2. If an ingester has been unhealthy for a long time, why not clean it out of the ring? ---> The residual ingester affects the replication strategy.

@bboreham @tomwilkie @csmarchbanks Any ideas?

YaoZengzeng avatar Jul 29 '19 13:07 YaoZengzeng

The current design requires that you set terminationGracePeriodSeconds long enough to shut down the first ingester cleanly.

Your point 1 seems the same as #1290

Point 2 is because we don't have enough experience of situations that need this. We would probably add it as an option if someone were to submit a PR.

bboreham avatar Jul 29 '19 13:07 bboreham

@bboreham If an ingester is killed because of OOM (ingesters actually consume a lot of memory, and this is very common in k8s, at least in my k8s environment), then it never gets the terminationGracePeriodSeconds to shut down gracefully 😂

For point 1, I think replicationFactor is configured by the user, so keeping it constant may be more reasonable.

For point 2, I may need to read more code to better understand the design intent. If it's necessary, I'd like to make a PR to fix it.

YaoZengzeng avatar Jul 29 '19 13:07 YaoZengzeng

"killed because of OOM" is not the same thing as "evicted". A pod that is OOM-killed will restart with the same identity on the same node, hence pick up the same entry in the Cortex ring. Unless you have evidence to the contrary?

bboreham avatar Jul 29 '19 14:07 bboreham

@bboreham You are right.

In our environment, the kubelet was configured with hard eviction, so the ingester pod was evicted without a grace period.

However, even with the kubelet configured for soft eviction, I have no idea how to set eviction-max-pod-grace-period, because the grace period an ingester needs depends on the amount of data it holds.

If the ingester still can't exit cleanly within the configured grace period, the problem remains unsolved.

YaoZengzeng avatar Jul 30 '19 01:07 YaoZengzeng

Hi, apparently I ran into the same issue, although I also saw msg="error removing stale clients" err="too many failed ingesters" in the logs.

Having a look at the code, my assumption is the following (please correct me if I'm wrong):

  • pkg/ingester/client/pool.go#removeStaleClients() calls p.ring.GetAll()
  • pkg/ring/ring.go#GetAll() returns the too many failed ingesters error because of the way the logic is implemented using maxErrors.

Some questions:

  • why does GetAll need to check against the non-healthy instances at all?
  • if that check is needed, what is the purpose behind calculating maxErrors from the Unhealthy instances instead of counting the ACTIVE ones against the ReplicationFactor? (a rough sketch of my reading of that logic follows below) https://github.com/cortexproject/cortex/blob/1ca4ad0de1208d574531e27ac7576bfb5c03756c/pkg/ring/ring.go#L258-L286
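
For context, here is a simplified, self-contained paraphrase of how I read that maxErrors logic (the type and function names are stand-ins, not the exact code):

    package main

    import (
        "errors"
        "fmt"
    )

    // ringEntry stands in for the ring's per-ingester descriptor.
    type ringEntry struct {
        addr    string
        healthy bool
    }

    // getAllSketch paraphrases the linked logic as I understand it: the error
    // budget starts at replicationFactor/2 and is spent on every unhealthy
    // entry, regardless of how many healthy instances there are.
    func getAllSketch(entries []ringEntry, replicationFactor int) ([]ringEntry, error) {
        maxErrors := replicationFactor / 2
        healthy := make([]ringEntry, 0, len(entries))
        for _, e := range entries {
            if !e.healthy {
                maxErrors--
                continue
            }
            healthy = append(healthy, e)
        }
        if maxErrors < 0 {
            return nil, errors.New("too many failed ingesters")
        }
        return healthy, nil
    }

    func main() {
        ring := []ringEntry{
            {addr: "172.16.0.62:9095", healthy: true},
            {addr: "172.16.0.93:9095", healthy: false},
        }
        _, err := getAllSketch(ring, 1)
        fmt.Println(err) // too many failed ingesters
    }

With RF=1 and one unhealthy entry left behind by the evicted pod, the budget goes negative immediately, which is exactly the "too many failed ingesters" error above.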

After changing the linked code (lines 278-280) to the following, I stopped receiving "error removing stale client":

    if len(ingesters) < r.cfg.ReplicationFactor / 2 + 1 {
        return ReplicationSet{}, fmt.Errorf("not enough healthy ingesters (ingesters: %d, replicationFactor: %d)", len(ingesters), r.cfg.ReplicationFactor)
    }

Also, when I enabled distributor.health-check-ingesters, the Unhealthy ingesters got cleaned up properly (so #1264 might be related as well), EXCEPT when it was an OOM, or when using ECS and your ingester comes back with the same identity (as you mentioned) on the same IP/host, meaning there's no logic to transition from LEAVING back to ACTIVE. Shouldn't there be some logic created around that?

Note that distributor.client-cleanup-period (def 15s) & distributor.health-check-ingesters (def: false) controls the ingester cleanup, which will clean up Unhealthy ingesters if you have the health-check-ingesters enabled. https://github.com/cortexproject/cortex/blob/1ca4ad0de1208d574531e27ac7576bfb5c03756c/pkg/ingester/client/pool.go#L75-L92
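
To make the scope of that cleanup explicit, here is a simplified sketch of what I understand the loop to do (illustrative types and empty stubs, not the real pool.go):

    package main

    import (
        "fmt"
        "time"
    )

    // pool stands in for the distributor's ingester client pool; in the real
    // code clients is keyed by ingester address and holds gRPC clients.
    type pool struct {
        clients map[string]struct{}
    }

    // Stubs for illustration: removeStaleClients drops clients whose address
    // has left the ring; cleanUnhealthy health-checks each client and drops
    // the failing ones.
    func (p *pool) removeStaleClients() {}
    func (p *pool) cleanUnhealthy()     {}

    // loop sketches the periodic cleanup driven by client-cleanup-period, with
    // the health check gated behind the health-check-ingesters flag.
    func (p *pool) loop(cleanupPeriod time.Duration, healthCheck bool, stop <-chan struct{}) {
        ticker := time.NewTicker(cleanupPeriod)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                p.removeStaleClients()
                if healthCheck {
                    p.cleanUnhealthy()
                }
            case <-stop:
                return
            }
        }
    }

    func main() {
        fmt.Println("illustration only; see the linked pool.go for the real logic")
    }

Note that both helpers only mutate the distributor's in-memory client map; nothing here removes the entry from the ring in Consul.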

miklosbarabasForm3 avatar Jul 30 '19 15:07 miklosbarabasForm3

Note that distributor.client-cleanup-period (def 15s) & distributor.health-check-ingesters (def: false) controls the ingester cleanup, which will clean up Unhealthy ingesters if you have the health-check-ingesters enabled.

I tried this, but it just removes the unhealthy ingester from the distributor pool (which holds ingester clients) instead of removing it from the Consul ring. It doesn't work for me.

wuyafang avatar Jul 31 '19 08:07 wuyafang

Don't read too much into the words - that's removing them from one data structure in memory. There is no code to remove ingesters from the ring when they are suspected to be dead, and this was deliberate.

bboreham avatar Jul 31 '19 09:07 bboreham

@bboreham what was the idea behind not removing ingesters from the ring when they are suspected to be dead?

and then what is the purpose of removeStaleClients and cleanUnhealthy? (which only remove the unhealthy ingesters from the distributor pool)

miklosbarabasForm3 avatar Jul 31 '19 09:07 miklosbarabasForm3

what was the idea behind not removing ingesters from the ring when they are suspected to be dead?

Risky, easy to get wrong, not necessary day one.

and then what is the purpose of removeStaleClients

that was to fix #217

bboreham avatar Jul 31 '19 12:07 bboreham

Here's an example scenario we want to avoid: Cortex is running under Kubernetes, and a rolling update begins:

  • Kubernetes sends SIGTERM to one old ingester and starts one new ingester.
  • A bug in the new code means hand-over from old to new fails.
  • Terminating ingester starts to flush all data, but can't flush quickly enough and runs out of time.
  • Kubernetes terminates the first ingester and removes its pod entry.

Now, if we allow the rolling update to proceed, the same thing will happen in each case and we will lose the unflushed data from all ingesters, which could be a significant proportion of all data in the last 12 hours.

With the current code the rolling update is halted because there will be an "unhealthy" entry for the old ingester in the ring, and this means the new ingester will never show "ready" to Kubernetes.
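
To spell out the mechanism (a minimal sketch, not Cortex's actual readiness code; treating "unhealthy" as a stale heartbeat is an assumption based on the ring page above):

    package main

    import (
        "fmt"
        "time"
    )

    // ringEntry is an illustrative stand-in for a ring entry as shown on the
    // ring page above (address plus last-heartbeat timestamp).
    type ringEntry struct {
        addr          string
        lastHeartbeat time.Time
    }

    // checkReady sketches the readiness gate described above: while any entry
    // in the ring looks unhealthy (stale heartbeat), the ingester keeps
    // reporting not-ready, which is what halts a Kubernetes rolling update.
    func checkReady(entries []ringEntry, heartbeatTimeout time.Duration, now time.Time) error {
        for _, e := range entries {
            if now.Sub(e.lastHeartbeat) > heartbeatTimeout {
                return fmt.Errorf("entry %s is unhealthy in the ring", e.addr)
            }
        }
        return nil
    }

    func main() {
        now := time.Now()
        entries := []ringEntry{
            {addr: "new-ingester:9095", lastHeartbeat: now},
            {addr: "old-ingester:9095", lastHeartbeat: now.Add(-13 * time.Hour)}, // died uncleanly
        }
        fmt.Println(checkReady(entries, time.Minute, now)) // not ready until the stale entry is gone
    }

The stale entry keeps this check failing until it is removed (for example via the "Forget" button on the ring page), which is the manual intervention this halt is designed to force.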

bboreham avatar Aug 25 '19 16:08 bboreham

@bboreham Yes, it's exactly the scenario we encountered and it's annoying.

YaoZengzeng avatar Aug 26 '19 02:08 YaoZengzeng

I think you would find losing half the data more annoying than having to operate the system manually when there is a fault.

bboreham avatar Aug 26 '19 06:08 bboreham

I also hit this issue. I was able to work around it by completely wiping the slate clean, but it's not ideal.

siennathesane avatar Oct 19 '19 20:10 siennathesane

If you indeed hit the same issue please follow the steps in https://github.com/cortexproject/cortex/issues/1521#issuecomment-513101297

If your issue is different please file it separately.

bboreham avatar Oct 20 '19 10:10 bboreham

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 03 '20 11:02 stale[bot]

Hi -- FYI I've found this /ready behavior plays badly with StatefulSets using an "Ordered" Pod Management Policy (the default). I believe the fix is easy -- use a "Parallel" policy -- but documenting the problematic scenario:

Suppose you have 3 SS replicas with "Ordered" policy:

  • pod-0, pod-1, pod-2 are all running
  • pod-1 & pod-2 have the power yanked at (approximately) the same time
  • pod-1 is re-started by the SS replica controller
  • pod-0 is marked as unhealthy, because it can't talk to pod-2
  • pod-1 becomes healthy
  • The replica controller is wedged because pod-0 is still unhealthy. pod-2 is never started

I experienced this running with preemptible nodes (I know, I know) and confirmed with manual testing. If the "Parallel" policy is used instead then pod-1 & pod-2 start in parallel and pick up their former places in the ring.

jgraettinger avatar Jun 07 '20 03:06 jgraettinger

pod-0 is marked as unhealthy, because it can't talk to pod-2

Why is pod-0 marked as unhealthy? I can't understand this.

pracucci avatar Jun 08 '20 06:06 pracucci

Why is pod-0 marked as unhealthy? I can't understand this.

I'm not sure. Looking at the code, the /ready state is supposed to latch. My observations in a couple runs of the above were that it didn't, or could become unlatched somehow.

jgraettinger avatar Jun 15 '20 22:06 jgraettinger

Why is pod-0 marked as unhealthy? I can't understand this.

I'm not sure. Looking at the code, the /ready state is supposed to latch. My observations in a couple runs of the above were that it didn't, or could become unlatched somehow.

I'm asking because we also run ingesters as statefulsets (in several clusters) and we've never experienced the issue you're describing, so I'm trying to understand how we could reproduce such a scenario. Once an ingester switches to ready it should never go back to not-ready, unless a critical issue occurs. Do you see any valuable information in the logs of the ingester switching from ready to not-ready?

pracucci avatar Jun 16 '20 06:06 pracucci

The container could have restarted.

bboreham avatar Jun 16 '20 08:06 bboreham

Now that chunks storage is deprecated and we use blocks storage, we no longer "hand-over" from one ingester to another. So one justification for this behaviour has disappeared.

Happy to hear experience reports from people who did automate it.

bboreham avatar Jul 27 '21 12:07 bboreham

The ingester.autoforget_unhealthy configuration item has existed in Loki since this pull request was merged: https://github.com/grafana/loki/pull/3919.

Would it be possible to add the same functionality into Cortex?

Or is there another way to facilitate the same behaviour as Loki's ingester.autoforget_unhealthy?

ctorrisi avatar Oct 20 '21 06:10 ctorrisi