Aim for error-free rolling bounces and upgrades of Temporal
Currently a rolling bounce or rolling upgrade of Temporal results in lots of error messages related to shard stealing. Error messages include:
- Failed to update shard
- Error updating ack level for shard
- Error updating timer ack level for shard
These errors make it harder to identify real issues during a deployment and train operators to ignore error messages that may be important.
Ideally, we should be able to do a rolling bounce or rolling upgrade of Temporal without encountering these errors on a routine basis.
One thing to keep in mind here is that Temporal is very often deployed using Kubernetes. Any solution should be compatible with the default Kubernetes Deployment RollingUpdate strategy.
I wonder if there is a way to throttle these error messages: log them as warnings while incrementing a counter, and only issue an error once the counter exceeds a threshold? In practice that is what I have done in my monitoring systems for Temporal: I ignore a certain level of these errors during deploys, with that level established by looking at historical deploys.
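To make the counter idea concrete, here is a minimal sketch of how it could look on the monitoring side. It is not Temporal code: the `ErrorBudget` class, the `budget` value, and the prefix matching are all hypothetical, and the message prefixes are simply the ones quoted in this issue.

```python
import logging

# Message prefixes taken from the issue description; matching by prefix is an assumption.
NOISY_PREFIXES = (
    "Failed to update shard",
    "Error updating ack level for shard",
    "Error updating timer ack level for shard",
)


class ErrorBudget:
    """Hypothetical helper: report the first `budget` noisy errors as warnings."""

    def __init__(self, budget):
        self.budget = budget  # e.g. derived from historical deploys
        self.seen = 0         # noisy errors counted so far

    def level_for(self, message):
        """Return the log level to use for one error message."""
        if not message.startswith(NOISY_PREFIXES):
            return logging.ERROR  # unknown errors are never downgraded
        self.seen += 1
        if self.seen <= self.budget:
            return logging.WARNING
        return logging.ERROR  # budget exceeded: surface as a real error
```

A real setup would also need to reset `seen` between deploys, which this sketch leaves out.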
Another alternative might be to detect that a shard rebalance or ring change event is in progress and, for some duration after it occurs, drop these messages to warning level.
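The grace-period alternative could be sketched roughly as follows. Again, this is only an illustration under assumptions: `RingChangeGrace`, `record_ring_change`, and `grace_seconds` are made-up names, and how a ring change would actually be detected is left open. Timestamps are passed in explicitly so the logic is easy to test; in practice they might come from `time.monotonic()`.

```python
import logging


class RingChangeGrace:
    """Hypothetical helper: downgrade errors for a while after a ring change."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.last_change = None  # timestamp of the most recent ring change, if any

    def record_ring_change(self, now):
        """Call whenever a shard rebalance or ring change is observed."""
        self.last_change = now

    def level_for(self, now):
        """Return WARNING inside the grace window, ERROR otherwise."""
        in_window = (
            self.last_change is not None
            and now - self.last_change <= self.grace_seconds
        )
        return logging.WARNING if in_window else logging.ERROR
```

Errors that persist past the window are reported at error level again, so a deployment that never settles would still be visible.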
We are seeing this issue a lot when hosting Temporal in ECS. Typically we see ~1,100 errors on every deployment, and three distinct error messages make up about 80% of them:
- Error updating queue state
- service failures
- failed reaching server: Frontend is not healthy yet
These errors tend to subside within 10 minutes of the deployment, so I believe they are related to this rolling bounce issue with Temporal.
We ultimately want alerting that accurately tells us when there is a real issue, but currently the error logs are very noisy. Since this issue has been open for a few years, I'm curious whether there are any updates on it or whether folks have found any workarounds.