Celery workers graceful shutdown & restart

Open geowatson opened this issue 2 years ago • 6 comments

Currently, Celery workers are shut down with SIGTERM from within the container itself, which increments the pod's Restarts counter each time.

It would be better to use the Kubernetes API for this task to ensure a graceful restart of the service, e.g. the client-go example.

Jun 21 '22 09:06 geowatson

[Screenshot 2022-07-05 at 16 26 37]

Fair comment from our Slack community: tons of pod restarts look weird.

Jul 05 '22 13:07 Matvey-Kuk

We added this task to the core team's backlog.

Jul 06 '22 06:07 Matvey-Kuk

I think we can remove the forced celery restart from hobby and helm installations. SIGTERM is issued once every 65 minutes as an additional safeguard in case the pod is stuck, has leaked memory, or has hit some other bug. It doesn't break things (see docs), but the counter is indeed annoying. Forced restarts are not necessary in the hobby env, by the definition of hobby, and not necessary in the helm setup because it has probes, so I propose making them optional.
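A minimal sketch of what an optional forced restart could look like, assuming the interval is read from an environment variable. The thread does not show how the restart is implemented today, so the variable name `FORCED_RESTART_INTERVAL_MINUTES` and both helper functions are hypothetical. It relies only on the fact that a Celery worker performs a warm shutdown when it receives SIGTERM.

```python
import os
import signal
import threading
from typing import Mapping, Optional

# Hypothetical setting name; the thread does not say how the option would be exposed.
ENV_VAR = "FORCED_RESTART_INTERVAL_MINUTES"


def restart_interval_seconds(env: Optional[Mapping[str, str]] = None) -> Optional[float]:
    """Return the restart interval in seconds, or None when forced restarts are off."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        return None  # unset or empty: forced restarts are disabled
    minutes = float(raw)
    return minutes * 60 if minutes > 0 else None


def schedule_forced_restart(
    env: Optional[Mapping[str, str]] = None, pid: Optional[int] = None
) -> Optional[threading.Timer]:
    """Start a timer that sends SIGTERM to the worker process, if enabled."""
    interval = restart_interval_seconds(env)
    if interval is None:
        return None
    target = os.getpid() if pid is None else pid
    # SIGTERM asks the Celery worker for a warm shutdown: running tasks finish first.
    timer = threading.Timer(interval, os.kill, args=(target, signal.SIGTERM))
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    return timer
```

With the variable unset (hobby or helm installs, under this proposal) nothing is scheduled. Setting it to `65` would reproduce the interval mentioned above.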

Jul 14 '22 10:07 iskhakov

@iskhakov this restart is something we learned the hard way... I believe we shouldn't degrade even the "hobby" environment. I also doubt the probes are properly probing each worker. I remember some issues where the probe was testing one worker while another one could get stuck.

Jul 14 '22 10:07 Matvey-Kuk

@Matvey-Kuk I'm sure that the probes will restart the container if it doesn't reply to celery ping; this works for any worker (pod). We can figure out something similar for hobby if necessary.

I don't think we should restart the pod through kubeapi from the application as proposed in this issue.
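For illustration only, a per-worker ping check of the kind discussed here might be sketched as follows. This is not the project's actual probe: the function names are hypothetical, and `app` is assumed to be a Celery application object. Celery's `app.control.ping(destination=..., timeout=...)` returns a list of replies shaped like `{"celery@host": {"ok": "pong"}}`, and passing a single worker name as `destination` keeps another worker's reply from masking a stuck one.

```python
from typing import Any, Dict, List, Optional


def worker_replied(replies: Optional[List[Dict[str, Any]]], worker_name: str) -> bool:
    """Return True only if the named worker itself answered the ping."""
    for reply in replies or []:
        answer = reply.get(worker_name)
        if isinstance(answer, dict) and answer.get("ok") == "pong":
            return True
    return False


def probe(app: Any, worker_name: str, timeout: float = 5.0) -> int:
    """Return 0 if the worker is alive, 1 otherwise (usable as a probe exit code)."""
    # Address the ping to one worker so replies from other workers are not counted.
    replies = app.control.ping(destination=[worker_name], timeout=timeout)
    return 0 if worker_replied(replies, worker_name) else 1
```

A probe script could call `sys.exit(probe(app, worker_name))`, where `worker_name` is the node name of the worker running in that container.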

Jul 15 '22 07:07 iskhakov

@iskhakov agree, thank you :)

Jul 18 '22 08:07 Matvey-Kuk
