oncall
Celery workers graceful shutdown & restart
Currently the Celery workers shut themselves down with SIGTERM from inside the container, which increments the pod's Restarts counter each time.
It would be better to use the Kubernetes API for this task to ensure a graceful restart of the service, e.g. the client-go example.

We added this task to the core team's backlog.
I think we can remove the forced celery restart from the hobby and helm installations. SIGTERM is issued once every 65 minutes as an additional measure in case the pod is stuck, leaking memory, or hit by some other bug. It doesn't break anything (see docs), but the counter is indeed annoying. The restart is not necessary in the hobby env, by the definition of hobby, and not necessary in the helm setup because it has probes, so I propose making forced restarts optional.
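To make the proposal concrete, here is a minimal sketch of what an optional forced restart could look like. It is not the actual OnCall implementation: the environment variable `CELERY_FORCED_RESTART_MINUTES` and both function names are hypothetical, and the sketch assumes the restart is triggered by a timer inside the worker's main process. It relies only on the fact that Celery treats SIGTERM as a warm shutdown, so running tasks finish before the worker exits.

```python
import os
import signal
import threading
from typing import Mapping, Optional

# Hypothetical setting name: unset, empty, or 0 disables forced restarts.
ENV_VAR = "CELERY_FORCED_RESTART_MINUTES"


def forced_restart_interval(env: Mapping[str, str]) -> Optional[float]:
    """Return the restart interval in seconds, or None if restarts are disabled."""
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        return None
    minutes = float(raw)
    return minutes * 60 if minutes > 0 else None


def schedule_forced_restart(
    pid: int, env: Mapping[str, str] = os.environ
) -> Optional[threading.Timer]:
    """Send SIGTERM to `pid` once the configured interval has elapsed.

    Returns the started timer, or None when forced restarts are disabled.
    """
    interval = forced_restart_interval(env)
    if interval is None:
        return None
    timer = threading.Timer(interval, os.kill, args=(pid, signal.SIGTERM))
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    return timer
```

With this shape, a deployment that wants the current behaviour would set the variable to 65, while the hobby and helm setups could leave it unset and rely on their own supervision.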
@iskhakov this restart is something we learned the hard way... I believe we shouldn't degrade even the "hobby" environment. I also doubt the probes properly check each worker. I remember there were issues where the probe was testing one worker while another one could get stuck.
@Matvey-Kuk I'm sure the probes will restart the container if it doesn't reply to celery ping, and this works for any worker (pod). We can figure out something similar for hobby if necessary.
I don't think we should restart the pod through kubeapi from the application as proposed in this issue.
@iskhakov agree, thank you :)