
Celery workers graceful shutdown & restart

Open geowatson opened this issue 2 years ago • 6 comments

Currently, Celery workers are shut down with SIGTERM from inside the container itself, which increments the pod's Restarts counter every time.

It would be better to use the Kubernetes API for this task to ensure a graceful restart of the service, e.g. the client-go example.
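To make the proposal concrete, here is a minimal sketch in Python (not the client-go code referred to above) of the patch body that `kubectl rollout restart` applies: it sets the `kubectl.kubernetes.io/restartedAt` annotation on the pod template, which makes the controller replace the pods. The function name is made up for illustration, and the API call mentioned in the comment assumes the official `kubernetes` Python client with hypothetical deployment and namespace names.

```python
import json
from datetime import datetime, timezone


def build_rollout_restart_patch(now=None):
    """Build a patch body that triggers a rolling restart of a Deployment.

    Changing the pod template annotation causes the controller to replace
    the pods one by one, the same mechanism `kubectl rollout restart` uses.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": now.isoformat()
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Assumption: with the official `kubernetes` Python client this body could
    # be sent via AppsV1Api().patch_namespaced_deployment(name, namespace, body),
    # where name and namespace are whatever the installation uses.
    print(json.dumps(build_rollout_restart_patch(), indent=2))
```

Because the pods are replaced rather than having their containers restarted in place, this approach should not increase the Restarts counter, at the cost of giving the application permission to patch its own Deployment.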

geowatson avatar Jun 21 '22 09:06 geowatson

[Screenshot 2022-07-05 at 16 26 37]

Fair comment from our Slack community. Tons of pod restarts look weird.

Matvey-Kuk avatar Jul 05 '22 13:07 Matvey-Kuk

We added this task to the core team's backlog.

Matvey-Kuk avatar Jul 06 '22 06:07 Matvey-Kuk

I think we can remove the forced Celery restart from the hobby and Helm installations. SIGTERM is issued once every 65 minutes as an additional measure in case the pod is stuck, has leaked memory, or has hit some other bug. It doesn't break anything (see docs), but the counter is indeed annoying. The forced restart is not necessary in the hobby environment, by the very definition of hobby, and not necessary in the Helm setup because it has probes, so I propose making forced restarts optional.
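As an illustration of why a periodic SIGTERM "doesn't break things", here is a toy sketch of a warm shutdown: the signal handler only records the request, and the loop stops between tasks rather than in the middle of one. This is not Celery's implementation; the `GracefulWorker` class and its methods are hypothetical names used only to show the pattern.

```python
import os
import signal


class GracefulWorker:
    """Toy worker loop that finishes the current task before stopping."""

    def __init__(self):
        self.stop_requested = False
        self.completed = 0

    def handle_sigterm(self, signum, frame):
        # Only record the request; the loop decides when it is safe to stop.
        self.stop_requested = True

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self, tasks):
        for task in tasks:
            if self.stop_requested:
                break  # stop between tasks, never in the middle of one
            task()
            self.completed += 1
        return self.completed


if __name__ == "__main__":
    worker = GracefulWorker()
    worker.install()
    tasks = [
        lambda: print("task 1"),
        # Simulate the periodic SIGTERM arriving while task 2 is running.
        lambda: os.kill(os.getpid(), signal.SIGTERM),
        lambda: print("task 3 (skipped after SIGTERM)"),
    ]
    print("completed:", worker.run(tasks))
```

The task that was running when the signal arrived still completes; only the tasks that had not started yet are skipped and left for the next worker process.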

iskhakov avatar Jul 14 '22 10:07 iskhakov

@iskhakov this restart is something we learned the hard way... I believe we shouldn't degrade even the "hobby" environment. I also doubt the probes are properly checking each worker. I remember issues where the probe was testing one worker while another one could get stuck.

Matvey-Kuk avatar Jul 14 '22 10:07 Matvey-Kuk

@Matvey-Kuk I'm sure that the probes will restart the container if it doesn't reply to the Celery ping, and this works for any worker (pod). We can figure out something similar for hobby if necessary.

I don't think we should restart the pod through the Kubernetes API from the application, as proposed in this issue.
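A rough sketch of the per-worker check discussed here, assuming the replies have the shape returned by Celery's `app.control.ping()`, i.e. a list of `{worker_name: {"ok": "pong"}}` dicts. The helper name and the worker names are hypothetical, and this is not the project's actual probe.

```python
def unresponsive_workers(expected, replies):
    """Return the expected worker names that did not answer the ping."""
    answered = set()
    for reply in replies:
        for name, payload in reply.items():
            if payload.get("ok") == "pong":
                answered.add(name)
    return sorted(set(expected) - answered)


if __name__ == "__main__":
    # Assumption: in a real probe the replies would come from something like
    # app.control.ping(timeout=1.0) on the Celery application object.
    replies = [{"celery@worker-1": {"ok": "pong"}}]
    missing = unresponsive_workers(["celery@worker-1", "celery@worker-2"], replies)
    # A probe script would exit non-zero when any expected worker is missing.
    print("unresponsive:", missing)
```

Checking the replies against an explicit list of expected workers addresses the concern that one worker answers while another is stuck.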

iskhakov avatar Jul 15 '22 07:07 iskhakov

@iskhakov agree, thank you :)

Matvey-Kuk avatar Jul 18 '22 08:07 Matvey-Kuk