[BUG] Pods stuck on Terminating with finalizer
Describe the bug
Sometimes the finalizer is not removed from pods, which results in pods getting stuck in the Terminating state. It is not easily reproducible, but on different occasions we found pods stuck in Terminating. To resolve them we had to manually patch the pod and remove the finalizer. This might be related to other incidents we experienced, but we also noticed pods over 100 days old stuck in this state.
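For reference, the manual workaround amounts to clearing the pod's finalizers so the API server can finish the pending deletion. A minimal client-go sketch of that patch (the namespace and pod name are placeholders; a `kubectl patch` with the same merge patch body is equivalent):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Equivalent of:
	//   kubectl patch pod stuck-pod -n my-namespace -p '{"metadata":{"finalizers":null}}'
	// Clearing the finalizers lets the API server complete the deletion.
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err = client.CoreV1().Pods("my-namespace").Patch(
		context.Background(), "stuck-pod", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
}
```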
Expected behavior
All pods stuck in the Terminating state should eventually be handled by flytepropeller, possibly via a retry mechanism that tolerates transient errors but eventually cleans up the pod.
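A minimal sketch of the kind of retry we have in mind, assuming the cleanup is just removing the finalizer (the function name and backoff values are made up for illustration):

```go
package cleanup

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// removeFinalizerWithRetry keeps trying to strip the finalizers from a
// terminating pod, tolerating transient API errors, until it succeeds or the
// backoff is exhausted. Backoff values are arbitrary.
func removeFinalizerWithRetry(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	backoff := wait.Backoff{Steps: 5, Duration: 2 * time.Second, Factor: 2.0, Jitter: 0.1}
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		_, err := client.CoreV1().Pods(namespace).Patch(
			ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
		if err != nil {
			// Treat every error as transient here; a real implementation would
			// distinguish permanent failures (e.g. NotFound means we're done).
			return false, nil
		}
		return true, nil
	})
}
```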
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
Not exactly sure how this part works, but would it be reasonable to extend the plugin interface so a plugin could periodically be called to perform cleanup work?
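Something along these lines, with entirely made-up names; nothing here exists in the current flyteplugins interfaces, it is just to illustrate the idea:

```go
package plugincleanup

import (
	"context"
	"log"
	"time"
)

// PeriodicCleaner is a hypothetical optional interface: a plugin implementing
// it would be invoked on a fixed interval so it can clean up leaked resources
// (e.g. pods whose owning workflow is gone).
type PeriodicCleaner interface {
	// Cleanup should be idempotent and tolerate partial failures.
	Cleanup(ctx context.Context) error
}

// runCleanupLoop shows how the propeller side could drive such plugins.
func runCleanupLoop(ctx context.Context, interval time.Duration, plugins []PeriodicCleaner) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, p := range plugins {
				if err := p.Cleanup(ctx); err != nil {
					// Log and continue; the next tick retries.
					log.Printf("plugin cleanup failed: %v", err)
				}
			}
		}
	}
}
```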
@ckiosidis @honnix any chance that these pods were part of subworkflows? I just submitted a PR fixing an issue with aborting subworkflows where FlytePropeller failed to clean up Pods. I'm looking for any other location where this could happen, but I'm not finding anything right now.
@hamersaw Yes, we do have a lot of subworkflows, and it wouldn't be surprising if the stuck pods are part of subworkflows. It is awesome that you found the bug, and we look forward to testing the fix. Thanks a lot!
Hey @hamersaw, I confirmed this by checking some Pods currently stuck in our cluster. The executions contain subworkflows. The pods are indeed orphans: the flyteworkflow k8s resources are no longer in the cluster, and the executions finished with errors.
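In case it helps others check their clusters, a rough client-go sketch of such a check: list pods that are marked for deletion but still carry finalizers, and print their owner references so the corresponding flyteworkflow CRs can be cross-checked manually (nothing here is specific to our setup):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List all pods and keep the ones that are marked for deletion but still
	// carry finalizers, i.e. stuck in Terminating.
	pods, err := client.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if p.DeletionTimestamp == nil || len(p.Finalizers) == 0 {
			continue
		}
		// Print owner references so the owning flyteworkflow can be looked up.
		fmt.Printf("%s/%s finalizers=%v owners=%v\n",
			p.Namespace, p.Name, p.Finalizers, p.OwnerReferences)
	}
}
```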