[BUG] Pods stuck on Terminating with finalizer
Describe the bug
Sometimes the finalizer is not removed from pods, which results in pods getting stuck in the Terminating state. It is not easily reproducible, but on different occasions we found pods stuck in Terminating. To resolve them we had to manually patch the pod and remove the finalizer. This might be related to other incidents we experienced, but we also noticed pods over 100 days old stuck in this state.
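For reference, the manual workaround amounts to clearing the pod's finalizers so the API server can finish the pending deletion. A minimal client-go sketch of that patch (the namespace and pod name are placeholders; a `kubectl patch` with the same merge patch body is equivalent):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Equivalent of:
	//   kubectl patch pod stuck-pod -n my-namespace -p '{"metadata":{"finalizers":null}}'
	// Clearing the finalizers lets the API server complete the deletion.
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err = client.CoreV1().Pods("my-namespace").Patch(
		context.Background(), "stuck-pod", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
}
```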
Expected behavior
All pods stuck in the Terminating state should eventually be handled by flytepropeller, possibly via a retry mechanism that tolerates transient errors but eventually cleans up the pod.
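A minimal sketch of the kind of retry we have in mind, assuming the cleanup is just removing the finalizer (the function name and backoff values are made up for illustration):

```go
package cleanup

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// removeFinalizerWithRetry keeps trying to strip the finalizers from a
// terminating pod, tolerating transient API errors, until it succeeds or the
// backoff is exhausted. Backoff values are arbitrary.
func removeFinalizerWithRetry(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	backoff := wait.Backoff{Steps: 5, Duration: 2 * time.Second, Factor: 2.0, Jitter: 0.1}
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		_, err := client.CoreV1().Pods(namespace).Patch(
			ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
		if err != nil {
			// Treat every error as transient here; a real implementation would
			// distinguish permanent failures (e.g. NotFound means we're done).
			return false, nil
		}
		return true, nil
	})
}
```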
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
Not exactly sure how this part works, but would it be reasonable to extend the plugin interface so a plugin could periodically be called to perform cleanup work?
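Something along these lines, with entirely made-up names; nothing here exists in the current flyteplugins interfaces, it is just to illustrate the idea:

```go
package plugincleanup

import (
	"context"
	"log"
	"time"
)

// PeriodicCleaner is a hypothetical optional interface: a plugin implementing
// it would be invoked on a fixed interval so it can clean up leaked resources
// (e.g. pods whose owning workflow is gone).
type PeriodicCleaner interface {
	// Cleanup should be idempotent and tolerate partial failures.
	Cleanup(ctx context.Context) error
}

// runCleanupLoop shows how the propeller side could drive such plugins.
func runCleanupLoop(ctx context.Context, interval time.Duration, plugins []PeriodicCleaner) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, p := range plugins {
				if err := p.Cleanup(ctx); err != nil {
					// Log and continue; the next tick retries.
					log.Printf("plugin cleanup failed: %v", err)
				}
			}
		}
	}
}
```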
@ckiosidis @honnix any chance that these pods were part of subworkflows? I just submitted a PR fixing an issue with aborting subworkflows where FlytePropeller failed to clean up Pods. I'm looking for any other location where this could happen, but I'm not finding anything right now.
@hamersaw Yes, we do have a lot of subworkflows, and it wouldn't be surprising if the stuck pods are part of subworkflows. It is awesome that you found the bug, and we look forward to testing the fix. Thanks a lot!
Hey @hamersaw, I confirmed this by checking some Pods currently stuck in our cluster. The executions contain subworkflows. The pods are indeed orphans: the flyteworkflow k8s resources are no longer in the cluster, and the executions finished with errors.
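In case it helps others check their clusters, a rough client-go sketch of such a check: list pods that are marked for deletion but still carry finalizers, and print their owner references so the corresponding flyteworkflow CRs can be cross-checked manually (nothing here is specific to our setup):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List all pods and keep the ones that are marked for deletion but still
	// carry finalizers, i.e. stuck in Terminating.
	pods, err := client.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if p.DeletionTimestamp == nil || len(p.Finalizers) == 0 {
			continue
		}
		// Print owner references so the owning flyteworkflow can be looked up.
		fmt.Printf("%s/%s finalizers=%v owners=%v\n",
			p.Namespace, p.Name, p.Finalizers, p.OwnerReferences)
	}
}
```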