
Improve frontend and querier error handling

Open zalegrala opened this issue 3 years ago • 3 comments

Here we aim to improve the error handling between the querier and the frontend. Several errors can be passed in either direction depending on the situation, including context cancellation and gRPC "queue stopped" messages.

In some cases, we simply avoid logging the error, recognizing it as part of the normal connection lifecycle in the worker-frontend relationship. If a worker fails to contact the frontend, it will retry regardless of the error; this behavior is unchanged. Frontends are only removed from the worker by the DNS record handling for the frontend address.
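For illustration, here is a minimal sketch of how a worker might classify these lifecycle errors so they can be retried silently. This is not Tempo's actual code; the helper name and the "queue stopped" message matching are assumptions:

```go
package worker

import (
	"context"
	"errors"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isBenignDisconnect reports whether an error returned from the
// frontend stream is part of the normal connection lifecycle and
// therefore should be retried without logging. (Hypothetical helper;
// the real checks in Tempo may differ.)
func isBenignDisconnect(err error) bool {
	if err == nil || errors.Is(err, context.Canceled) {
		return true
	}
	if s, ok := status.FromError(err); ok {
		// A frontend that is shutting down rejects new work with an
		// error like "queue stopped"; treat that as routine as well.
		if s.Code() == codes.Unavailable || s.Message() == "queue stopped" {
			return true
		}
	}
	return false
}
```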

Additionally, a one-second delay was added to the frontend shutdown. This accommodates the situation where both the frontend and querier modules run in the same binary. Without the delay, there is a race between the frontend and the querier over which one learns of the shutdown request first. The delay allows the query worker to notify the frontend of a shutdown request so that both components can shut down cleanly.
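A sketch of what that grace period looks like as a dskit-style stopping hook (the type and teardown are stand-ins, not Tempo's real frontend code):

```go
package frontend

import "time"

// Frontend stands in for the query-frontend module; only the
// shutdown hook matters for this sketch.
type Frontend struct{}

// stopping runs when the frontend module is asked to shut down. The
// one-second sleep gives an in-process query worker time to observe
// the shutdown and disconnect before the request queue is torn down,
// avoiding the race described above.
func (f *Frontend) stopping(_ error) error {
	time.Sleep(time.Second)
	// ...proceed with the normal queue teardown here.
	return nil
}
```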

Checklist

  • [ ] Tests updated
  • [ ] Documentation added
  • [ ] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

zalegrala avatar Oct 31 '22 21:10 zalegrala

Generally ok with this PR but have some thoughts.

  • Much of this code was inherited from cortex and reused across mimir and loki. I would like to prevent divergence between things if possible - do you know if their frontends have updated error handling logic too?

  • Is there an alternative to sleeping in the frontend shutdown?

mdisibio avatar Nov 02 '22 12:11 mdisibio

It doesn't look like Loki is handling this currently, but I'm not sure what to make of that since we copied the code.

A few other solutions to the graceful disconnect came to mind when I was working on this.

  • Update the service handling to support ordering. This might get strange since in microservices mode these two modules aren't run in the same binary.
  • Update the logic in the frontend to allow connections for termination notification instead of refusing with "queue closed". This might get strange because we want to avoid taking connections, but just enough to allow for disconnect.
  • We could potentially rework the shutdown on both sides.

We could also decide not to take this change and just deal with the errors in the logs. I put this work up because when restarting an SSB cluster, the amount of error and warning logs for a completely normal restart of the sts left a less than great taste in my mouth. But we could choose to accept the extra logging.

zalegrala avatar Nov 02 '22 14:11 zalegrala

Perhaps let me take a look at some e2e testing. Ideas welcome.

zalegrala avatar Nov 03 '22 17:11 zalegrala

I've added a commit to replace the frontend shutdown sleep with a module dependency. I included SingleBinary as part of the target, but I don't expect this to be as much of an issue there compared to ScalableSingleBinary. I was also able to confirm that shutdown in the services handling is ordered properly, and included a test upstream in case it's helpful.
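Roughly, the wiring looks like the sketch below (module names and init functions are assumptions for illustration, not Tempo's exact constants). Declaring the querier as dependent on the frontend makes the dskit module manager start the frontend first and stop it after the querier, so workers can notify the frontend of shutdown before its queue closes, with no sleep needed:

```go
package app

import (
	"github.com/grafana/dskit/modules"
	"github.com/grafana/dskit/services"
)

// initFrontend and initQuerier are hypothetical module constructors
// standing in for the real ones.
func initFrontend() (services.Service, error) { return services.NewIdleService(nil, nil), nil }
func initQuerier() (services.Service, error)  { return services.NewIdleService(nil, nil), nil }

// registerModules sketches the dependency-based fix: dependencies
// start before their dependents and stop after them, which yields the
// desired shutdown ordering between querier and frontend.
func registerModules(mm *modules.Manager) error {
	mm.RegisterModule("query-frontend", initFrontend)
	mm.RegisterModule("querier", initQuerier)
	return mm.AddDependency("querier", "query-frontend")
}
```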

zalegrala avatar Dec 02 '22 16:12 zalegrala

This PR has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. This pull request will be closed in 15 days if there is no new activity. Please apply keepalive label to exempt this Pull Request.

github-actions[bot] avatar Feb 01 '23 00:02 github-actions[bot]