The ring never removes the old ingester even if the ingester pod is evicted
I have a similar problem to #1502. When my ingester pod was evicted, a new ingester pod was created. Now the ring has two ingesters, but only one (the new one) is healthy. The old one is not removed from the ring, even if I delete the evicted pod manually. The ring information is as follows:
| Instance | State | Address | Last Heartbeat | Tokens | Ownership |
|---|---|---|---|---|---|
| ingester-7fc8759d7f-nzb6g | ACTIVE | 172.16.0.62:9095 | 2019-07-19 03:33:32 +0000 UTC | 128 | 45.739077787319914% |
| ingester-7fc8759d7f-wmnms | Unhealthy | 172.16.0.93:9095 | 2019-07-18 14:46:18 +0000 UTC | 128 | 54.260922212680086% |
The ingester's status is always unready, and the distributor logs this error:
```
level=warn ts=2019-07-19T03:41:45.413839063Z caller=server.go:1995 traceID=daf4028f530860f msg="POST /api/prom/push (500) 727.847µs Response: \"at least 1 live ingesters required, could only find 0\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 3742; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.11.0; X-Forwarded-For: 172.16.0.17; X-Forwarded-Host: perf.monitorefk.huawei.com; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Original-Uri: /api/prom/push; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 172.16.0.17; X-Request-Id: 62a470dc6de7a83c8974e3411fa63e40; X-Scheme: https; X-Scope-Orgid: custom; "
```
I wonder if there is any way to deal with this situation automatically? Maybe check the replication factor and remove excess unhealthy ingesters from the ring?
If the ingester shut down cleanly, even on eviction, then it would not be in the ring. So, the first task is to find out why it did not shut down cleanly, and if possible fix that.
Everything else you report is deliberate. We return not-ready to halt a rolling update.
Actually I don't understand `could only find 0`, since your ring shows 1 active.
What do you mean by "shut down cleanly"?
I tried to reproduce the problem with `delete pod --force`, and a new ingester pod was created by the deployment controller immediately.
Now the ring has two ingesters (one active, the other unhealthy), my new ingester is unready (the readiness check returns 503 because of the unhealthy entry), and the distributor logs this:
```
level=warn ts=2019-07-19T08:18:25.511613228Z caller=logging.go:49 traceID=5eabc2f0f6b837e9 msg="POST /api/prom/push (500) 266.759µs Response: \"at least 2 live ingesters required, could only find 1\\n\" ws: false; Connection: close; Content-Encoding: snappy; Content-Length: 5162; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.11.1; X-Forwarded-For: 100.95.185.106; X-Forwarded-Host: 100.95.137.223; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Original-Uri: /api/prom/push; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 100.95.185.106; X-Request-Id: 89b05193a60c7935ac6a7bcd090b9a16; X-Scheme: https; X-Scope-Orgid: primary
```
I'm confused: my `-distributor.replication-factor` is 1, so by `minSuccess := (replicationFactor / 2) + 1` my distributor should only need 1 live ingester, but the log tells me it needs two.
So is there anything I misunderstood?
I wonder when ingesters are added to and removed from the ring. Does Consul do it by itself, or does the ingester tell it what to do? I notice that when an ingester starts or shuts down, it updates the ring. But what if the ingester shuts down uncleanly? Is there any way to automatically clean the unhealthy pod out of the ring?
By the way, after I restart Consul, the ring only has the active ingester and everything works well.
I mean the ingester went through its exit sequence, rather than being abruptly terminated from outside.
There are two main cases: hand-over to another ingester, and flush to store. In both cases the time required is a function of how much data is in memory.
When using an explicitly provisioned store (eg DynamoDB) it would be nice to scale up specifically for a “save everything” operation. There’s no code to do that currently.
> I'm confused because my `-distributor.replication-factor` is 1, so by `minSuccess := (replicationFactor / 2) + 1`, my distributor should only need 1 live ingester, but the log tells me I need two.
I know... you do this:
```go
replicationFactor := r.cfg.ReplicationFactor
if len(ingesters) > replicationFactor {
	replicationFactor = len(ingesters)
}
```
So the replicationFactor is now 2, instead of what I set with `-distributor.replication-factor`. This is there to handle nodes joining/leaving, but it causes the write failures in my case above.
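To make the effect concrete, here is a minimal, self-contained sketch of how that bump interacts with the quorum rule; `minSuccessFor` and its arguments are invented names for illustration, not the actual Cortex code:

```go
package main

import "fmt"

// minSuccessFor mimics the quorum rule discussed above: the effective
// replication factor is raised to the number of ring entries whenever the
// ring holds more entries than the configured factor.
func minSuccessFor(configuredRF, ringEntries int) int {
	rf := configuredRF
	if ringEntries > rf {
		rf = ringEntries // stale unhealthy entries still count here
	}
	return rf/2 + 1
}

func main() {
	// RF=1, one entry in the ring: one live ingester is enough.
	fmt.Println(minSuccessFor(1, 1)) // prints 1

	// RF=1, but a stale unhealthy entry is left behind: minSuccess becomes 2,
	// while only 1 live ingester exists, so every write fails with
	// "at least 2 live ingesters required, could only find 1".
	fmt.Println(minSuccessFor(1, 2)) // prints 2
}
```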
That sounds like the same problem as #1290. @tomwilkie can you remember what that check is for? https://github.com/cortexproject/cortex/blob/7cf06900f6fea998724a46a8534ce13ee94c1b67/pkg/ring/replication_strategy.go#L20
Actually, if I deploy one ingester with replicationFactor 1, the ingester pod can be evicted because of low memory, and the kubelet starts another ingester pod.
However, the previous ingester didn't exit cleanly, so the corresponding entry in the Consul ring will never be cleaned up. At that moment:
replicationFactor == len(ingesters) == 2 (1 evicted ingester and 1 running ingester)
minSuccess = (replicationFactor / 2) + 1 = 2
However len(liveIngesters) = 1 < minSuccess
---> So there is a deadlock: the unhealthy ingester is never cleaned from the ring and we never reach minSuccess.
Two problems here:
- Why, when len(ingesters) > replicationFactor, is replicationFactor set to len(ingesters)? Having more ingesters in the ring than the configured replicationFactor is usually not normal.
- If an ingester is unhealthy for a long time, why not clean it out of the ring? The residual ingester affects the replication strategy.
@bboreham @tomwilkie @csmarchbanks Any ideas ?
The current design requires that you set `terminationGracePeriodSeconds` long enough to shut down the first ingester cleanly.
Your point 1 seems the same as #1290
Point 2 is because we don't have enough experience of situations that need this. We would probably add it as an option if someone were to submit a PR.
@bboreham If the ingester is killed because of OOM (actually the ingester consumes a lot of memory, which is very common in k8s, at least in my environment), then it never gets the `terminationGracePeriodSeconds` to shut down gracefully 😂
For point 1, I think replicationFactor is configured by the user, so keeping it constant may be more reasonable.
For point 2, I may need to read more code to better understand the design intent. If it's necessary, I'd like to make a PR to fix it.
"killed because of OOM" is not the same thing as "evicted". A pod that is OOM-killed will restart with the same identity on the same node, hence pick up the same entry in the Cortex ring. Unless you have evidence to the contrary?
@bboreham You are right.
In our environment, the kubelet was configured with hard eviction, so the ingester pod was evicted without any grace period.
However, even with the kubelet configured for soft eviction, I have no idea how to set `eviction-max-pod-grace-period`, because the grace period an ingester needs may depend on the amount of data it holds.
If the ingester still can't exit cleanly within the configured grace period, the problem remains unsolved.
Hi,
Apparently I ran into the same issue, although I also saw `msg="error removing stale clients" err="too many failed ingesters"` in the logs.
Having a look at the code, my assumption is the following (please correct me if I'm wrong):
- `pkg/ingester/client/pool.go#removeStaleClients()` calls `p.ring.GetAll()`
- `pkg/ring/ring.go#GetAll()` returns an error with `too many failed ingesters` because of the way the logic is implemented using `maxErrors`.
Some questions:
- why does GetAll need to check against the non-healthy instances at all?
- if that check is needed, what is the purpose behind calculating the maxErrors from the Unhealthy instances instead of calculating the ACTIVE ones and using the ReplicationFactor? https://github.com/cortexproject/cortex/blob/1ca4ad0de1208d574531e27ac7576bfb5c03756c/pkg/ring/ring.go#L258-L286
After changing the aforementioned code (lines 278-280) to the following, I stopped receiving `error removing stale client`:
```go
if len(ingesters) < r.cfg.ReplicationFactor/2+1 {
	return ReplicationSet{}, fmt.Errorf("not enough healthy ingesters (ingesters: %d, replicationFactor: %d)", len(ingesters), r.cfg.ReplicationFactor)
}
```
Also, when I enabled `distributor.health-check-ingesters`, the Unhealthy ingesters got cleaned up properly (so #1264 might be related as well), EXCEPT when it was an OOM, or when using ECS and your ingester comes back with the same identity (as you mentioned) on the same IP/host. That means there's no logic to transition from LEAVING back to ACTIVE. Shouldn't there be some logic around that?
Note that `distributor.client-cleanup-period` (default 15s) and `distributor.health-check-ingesters` (default false) control the ingester cleanup, which will clean up Unhealthy ingesters if you have `health-check-ingesters` enabled.
https://github.com/cortexproject/cortex/blob/1ca4ad0de1208d574531e27ac7576bfb5c03756c/pkg/ingester/client/pool.go#L75-L92
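To illustrate what that cleanup covers, here is a rough, self-contained sketch of the idea; the types and names are invented, not the real pool implementation. Clients that fail a health check are dropped from the distributor's in-memory pool, while the ring entry in Consul is left alone:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// ingesterClient is a stand-in for the per-ingester gRPC client the
// distributor keeps; only a health check is sketched here.
type ingesterClient interface {
	Check(ctx context.Context) error
}

// clientPool is a rough, invented sketch of the distributor's client pool.
type clientPool struct {
	mtx     sync.Mutex
	clients map[string]ingesterClient // keyed by ingester address
}

// cleanUnhealthy mirrors what -distributor.health-check-ingesters enables:
// clients that fail a health check are dropped from this in-memory map.
// Nothing here talks to Consul, so the ring entry is left behind.
func (p *clientPool) cleanUnhealthy(ctx context.Context) {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	for addr, c := range p.clients {
		if err := c.Check(ctx); err != nil {
			delete(p.clients, addr)
		}
	}
}

type fakeClient struct{ healthy bool }

func (f fakeClient) Check(context.Context) error {
	if f.healthy {
		return nil
	}
	return errors.New("unhealthy")
}

func main() {
	p := &clientPool{clients: map[string]ingesterClient{
		"172.16.0.62:9095": fakeClient{healthy: true},
		"172.16.0.93:9095": fakeClient{healthy: false},
	}}
	p.cleanUnhealthy(context.Background())
	// 1 client left in the pool; the Unhealthy ring entry in Consul is untouched.
	fmt.Println(len(p.clients))
}
```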
> Note that `distributor.client-cleanup-period` (default 15s) and `distributor.health-check-ingesters` (default false) control the ingester cleanup, which will clean up Unhealthy ingesters if you have `health-check-ingesters` enabled.
I tried this, but it just removes unhealthy ingesters from the distributor pool (which holds the ingester clients) instead of removing them from the Consul ring. It doesn't work for me.
Don't read too much into the words - that's removing them from one data structure in memory. There is no code to remove ingesters from the ring when they are suspected to be dead, and this was deliberate.
@bboreham what was the idea behind not removing ingesters from the ring when they are suspected to be dead? And then what is the purpose of `removeStaleClients` and `cleanUnhealthy`? (Which only remove the unhealthy ingesters from the distributor pool.)
> what was the idea behind not removing ingesters from the ring when they are suspected to be dead?
Risky, easy to get wrong, not necessary day one.
> and then what is the purpose of `removeStaleClients`?

That was to fix #217.
Here's an example scenario we want to avoid: Cortex is running under Kubernetes, and a rolling update begins:
- Kubernetes sends SIGTERM to one old ingester and starts one new ingester.
- A bug in the new code means hand-over from old to new fails.
- Terminating ingester starts to flush all data, but can't flush quickly enough and runs out of time.
- Kubernetes terminates the first ingester and removes its pod entry.
Now, if we allow the rolling update to proceed, the same thing will happen in each case and we will lose the unflushed data from all ingesters, which could be a significant proportion of all data in the last 12 hours.
With the current code the rolling update is halted because there will be an "unhealthy" entry for the old ingester in the ring, and this means the new ingester will never show "ready" to Kubernetes.
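As a toy illustration of that behaviour (invented names, not the real handler): readiness only flips to OK once every ring entry looks healthy, and the result latches, so a stale unhealthy entry keeps a new ingester returning 503 and halts the rollout:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// ringEntry is an invented stand-in for an instance descriptor in the ring.
type ringEntry struct {
	State         string
	LastHeartbeat time.Time
}

// readiness sketches the behaviour described above: the handler reports ready
// only once every ring entry looks healthy, and the result latches so a later
// blip does not flip it back to not-ready.
type readiness struct {
	mtx     sync.Mutex
	latched bool
	ring    func() []ringEntry // in Cortex this data would come from Consul
}

func (rd *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	rd.mtx.Lock()
	defer rd.mtx.Unlock()
	if rd.latched {
		w.WriteHeader(http.StatusOK)
		return
	}
	for _, e := range rd.ring() {
		if e.State != "ACTIVE" || time.Since(e.LastHeartbeat) > time.Minute {
			// A stale entry (e.g. an evicted pod that never left the ring)
			// keeps the new ingester not-ready, which halts the rollout.
			http.Error(w, "ring has unhealthy instances", http.StatusServiceUnavailable)
			return
		}
	}
	rd.latched = true
	w.WriteHeader(http.StatusOK)
}

func main() {
	h := &readiness{ring: func() []ringEntry {
		return []ringEntry{
			{State: "ACTIVE", LastHeartbeat: time.Now()},
			{State: "Unhealthy", LastHeartbeat: time.Now().Add(-13 * time.Hour)},
		}
	}}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", "/ready", nil))
	fmt.Println(rec.Code) // 503 while the stale entry is in the ring
}
```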
@bboreham Yes, it's exactly the scenario we encountered and it's annoying.
I think you would find losing half the data more annoying than having to operate the system manually when there is a fault.
I also hit this issue. I was able to work around it by completely wiping the slate clean, but it's not ideal.
If you indeed hit the same issue please follow the steps in https://github.com/cortexproject/cortex/issues/1521#issuecomment-513101297
If your issue is different please file it separately.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Hi -- FYI I've found this `/ready` behavior plays badly with StatefulSets using an "Ordered" pod management policy (the default). I believe the fix is easy -- use a "Parallel" policy -- but documenting the problematic scenario:
Suppose you have 3 SS replicas with "Ordered" policy:
- pod-0, pod-1, pod-2 are all running
- pod-1 & pod-2 have the power yanked at (approximately) the same time
- pod-1 is re-started by the SS replica controller
- pod-0 is marked as unhealthy, because it can't talk to pod-2
- pod-1 becomes healthy
- The replica controller is wedged because pod-0 is still unhealthy. pod-2 is never started
I experienced this running with preemptible nodes (I know, I know) and confirmed with manual testing. If the "Parallel" policy is used instead then pod-1 & pod-2 start in parallel and pick up their former places in the ring.
> pod-0 is marked as unhealthy, because it can't talk to pod-2
Why is pod-0 marked as unhealthy? I can't understand this.
> Why is pod-0 marked as unhealthy? I can't understand this.

I'm not sure. Looking at the code, the `/ready` state is supposed to latch. My observations in a couple of runs of the above were that it didn't, or could become unlatched somehow.
> Why is pod-0 marked as unhealthy? I can't understand this.
>
> I'm not sure. Looking at the code, the `/ready` state is supposed to latch. My observations in a couple of runs of the above were that it didn't, or could become unlatched somehow.
I'm asking because we also run ingesters as StatefulSets (in several clusters) and we've never experienced the issue you're describing, so I'm trying to understand how we could reproduce such a scenario. Once an ingester switches to ready it should never go back to not-ready, unless a critical issue occurs. Do you see any valuable information in the logs of the ingester switching from ready to not-ready?
The container could have restarted.
Now that chunks storage is deprecated and we use blocks storage, we no longer "hand-over" from one ingester to another. So one justification for this behaviour has disappeared.
Happy to hear experience reports from people who did automate it.
The `ingester.autoforget_unhealthy` configuration item has existed in Loki since this pull request was merged: https://github.com/grafana/loki/pull/3919.
Would it be possible to add the same functionality to Cortex? Or is there another way to get the same behaviour as Loki's `ingester.autoforget_unhealthy`?
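For reference, the general shape of such an auto-forget loop might look like the sketch below; the names are invented and this is not Loki's actual implementation, just the idea of periodically forgetting ring members whose heartbeat is older than some threshold:

```go
package main

import (
	"fmt"
	"time"
)

// member is an invented stand-in for an instance descriptor stored in the ring KV.
type member struct {
	Addr          string
	State         string
	LastHeartbeat time.Time
}

// autoForget returns the addresses that should be forgotten: entries that have
// not heartbeated for longer than forgetAfter. A real implementation would run
// this periodically and remove the entries from the ring KV store (e.g. Consul)
// via a compare-and-swap.
func autoForget(ring []member, forgetAfter time.Duration, now time.Time) []string {
	var stale []string
	for _, m := range ring {
		if now.Sub(m.LastHeartbeat) > forgetAfter {
			stale = append(stale, m.Addr)
		}
	}
	return stale
}

func main() {
	now := time.Now()
	ring := []member{
		{Addr: "172.16.0.62:9095", State: "ACTIVE", LastHeartbeat: now.Add(-5 * time.Second)},
		{Addr: "172.16.0.93:9095", State: "Unhealthy", LastHeartbeat: now.Add(-13 * time.Hour)},
	}
	// For example, forget anything silent for more than ten minutes.
	fmt.Println(autoForget(ring, 10*time.Minute, now)) // [172.16.0.93:9095]
}
```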