autoforget_unhealthy for Tempo ingesters
Just wondering if the autoforget_unhealthy feature (implemented in Loki; see https://github.com/grafana/loki/pull/3919) has been made available for Tempo? We are deploying Tempo distributed on ECS Fargate and it would be nice if we didn't have to click the forget button manually every time we do a rolling update :)
Currently only our compactors support autoforgetting unhealthy instances. This was the component our community was having the most trouble with (#1081).
Does Fargate have a concept similar to a "statefulset"? An ingester should come back up with the same name and access to the same disk, which should prevent this issue from occurring. I'm guessing this doesn't quite work with Fargate?
Even if this doesn't work, an ingester that shuts down cleanly should remove itself from the ring, and this shouldn't be an issue. Are you seeing unhealthy ingesters hanging out in the ring regularly?
@joe-elliott Thanks for the prompt response, appreciate it!
Does Fargate have a similar idea to "statefulset"?
Unfortunately no :(
Are you seeing unhealthy ingesters hanging out in the ring regularly?
Yes, at the moment, because we have not implemented graceful shutdown using the shutdown endpoint, and queries are failing with error finding ingesters in Querier.FindTraceByID: too many unhealthy instances in the ring. Let me give it a try and see if the ingester removes itself from the ring after a graceful shutdown.
Nevertheless, this still feels like an amazing feature for certain use cases, e.g. when graceful shutdown is not possible (an ingester being OOM-killed, for example). Just curious if there is any plan to include this autoforget_unhealthy feature in Tempo any time soon? Seems like Cortex is trying to implement this too right now (https://github.com/cortexproject/cortex/issues/1521)
So in the past we had discussed adding a config option to completely flush the ingesters on shutdown for situations like this. Currently you have to hit the /flush endpoint manually. This would help in this situation b/c your ingesters would automatically dump all data to the backend and remove themselves from the ring cleanly. Naturally this would require an increased shutdown grace period.
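For anyone scripting this, hitting the /flush endpoint manually is just a few HTTP calls. A minimal sketch, assuming placeholder ingester addresses and that your ingesters accept a POST on /flush on their HTTP port:

```python
# Minimal sketch: manually trigger a flush on each ingester before a rolling update.
# The hostnames and port below are placeholders; adjust them to your deployment.
import urllib.request

INGESTERS = ["ingester-0:3100", "ingester-1:3100", "ingester-2:3100"]  # placeholders

for host in INGESTERS:
    url = f"http://{host}/flush"
    req = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(f"{url} -> HTTP {resp.status}")
```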
Auto forgetting ingesters should technically only be necessary if there are other issues going on with shutdown. A gracefully exited ingester should not remain in the ring.
Going to ping @mdisibio to see if he has any thoughts as well since he did the auto-forget compactor work.
Are you using RF=3 and rolling update 1 pod at a time? These are required to prevent data loss when an ingester's local disk is deleted without calling /flush.
Regarding the two options: adding automatic /flush seems better because it avoids data loss under more scenarios (stateless ingesters as described here, but also other tasks like scaling down). Auto-forget could make the cluster look healthy, but mask underlying data loss.
Long-term I think both changes are useful. Auto-forget seems simple on the surface, but requires switching the ring to the new BasicLifecycler, which may not be trivial.
Are you using RF=3 and rolling update 1 pod at a time
Not at the moment, as we are still in the initial development phase, but good point there
@joe-elliott I am slightly confused about the /flush vs /shutdown endpoints. Based on the documentation, it seems like I don't have to call the /flush endpoint if I use the /shutdown endpoint?
Also, when I hit the /shutdown endpoint, I occasionally see this log line in the ingester that is shutting down: Abandoning op in flush queue because ingester is shutting down. Does this indicate something undesirable?
Abandoning op in flush queue because ingester is shutting down shouldn't happen from calling /shutdown. Is it possible that the pod was deleted before shutdown could complete? Check for the following log messages, and they should be logged in this order:
msg="shutdown handler called": Logged when/shutdownis calledmsg="shutdown handler complete": Logged after all blocks are flushed to the backend.msg="module stopped" module=ingester: Logged right after message 2, and indicates the ingester is fully stopped. Now the pod is safe to delete.
log-shutdown-endpoint-only.txt is what I have retrieved from one of the ingester containers when shutting down. Doesn't seem like the container was deleted before shutdown could complete though.
In the attached log, I do see the 3 log messages you pointed out above, but I can also still see the line Abandoning op in flush queue because ingester is shutting down in there.
I am running this on ECS and I have set graceful_shutdown_timeout: 120s in the Tempo config and also stopTimeout: 120s (the time, in seconds, to wait before the container is forcefully killed if it doesn't exit normally on its own) in the ECS task definition.
Also, it seems like the /shutdown endpoint didn't manage to flush the traces to backend storage, as I don't see the older traces in Grafana anymore (I am getting the error Status: 404 Not Found Body: trace not found).
Interestingly, I have tried calling the /flush endpoint before the /shutdown endpoint and it looks like that helps (kindly refer to log-both-flush-and-shutdown-endpoint.txt for the log messages):
- I am seeing log lines (e.g. "flushing block", "object uploaded to s3" objectName=...) from calling the /flush endpoint, which tells me that the flushing did happen. These lines didn't appear when calling the /shutdown endpoint only.
- I am able to find the older traces in Grafana afterwards.
I keep running into this problem due to OOM kills based on the limits I've set, but it could also come from the underlying OS if a node is overcommitted on memory, which can happen easily.
Could someone point me to how to debug Tempo's ring, please?
I keep reading about an endpoint/UI, but this doesn't seem to be documented anywhere...?
(update)
I managed to find these management UIs documented here: https://cortexmetrics.io/docs/api/
Each component has a different ring UI (ingester/ring, compactor/ring, ...), served on port 3100, not on the gossip-http port 7496 from the tempo-distributed chart.
(When using the single binary, simple scalable, or other forms of deployment, all the endpoints will be present on the same pod.)
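A quick way to eyeball those rings without clicking through the UI, assuming the paths and port mentioned above (this just looks for the string "Unhealthy" in the status pages, so treat it as a rough check; hostnames are placeholders):

```python
# Rough ring check: fetch per-component ring status pages and look for "Unhealthy".
# Hostnames, port, and paths below are assumptions based on my deployment.
import urllib.request

RING_PAGES = [
    "http://distributor:3100/ingester/ring",   # ingester ring page
    "http://compactor:3100/compactor/ring",    # compactor ring page
]

for url in RING_PAGES:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        status = "has Unhealthy entries" if "Unhealthy" in body else "looks healthy"
        print(f"{url}: {status}")
    except OSError as err:  # urllib errors are OSError subclasses
        print(f"{url}: request failed ({err})")
```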
I'm using Grafana's tempo-distributed chart and I enabled autoscaling for the ingesters, which is a terrible idea, because when it scales up and later scales down this error shows up
Yes, autoscaling on the ingesters is a bad idea :). We need to remove this option from the chart.
https://github.com/grafana/helm-charts/pull/1057#issuecomment-1111453689
Tempo's http endpoints are documented here:
https://grafana.com/docs/tempo/latest/api_docs/
and more extensive ring documentation can be found here:
https://grafana.com/docs/tempo/latest/operations/consistent_hash_ring/
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.