feat: Add finalizer to workflow pod to prevent 'pod deleted'. Fixes #8783 Continuing Work of #9058
Fixes #8783. Continuing the work of #9058.
Motivation
Building upon the foundational work of #9058, this PR advances the implementation to address our needs. See: https://github.com/argoproj/argo-workflows/pull/9058#issuecomment-1869324681
Modifications
Verification
Unit and e2e tests
I'm surprised, but I've just become aware that the Argo Server deletes Pods as part of RetryWorkflow() (here). I figured that would be something only the Controller does, but in any case, I guess this is an example where pod deletion happens outside of these calls. I suppose the labelPodCompleted code would happen in any case, but I'm not sure what happens if you try to invoke this on an incomplete Workflow.
I suppose the labelPodCompleted code would happen in any case
I agree. In my local tests, retried pods always have the labelPodCompleted and the finalizer is removed.
but I'm not sure what happens if you try to invoke this on an incomplete Workflow.
I believe the feature flag is intended for situations like the one described above.
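For context, the gating pattern for such a flag could look roughly like the sketch below; only the env var name ARGO_POD_STATUS_CAPTURE_FINALIZER (which appears in the test setup later in this thread) comes from the PR, while the finalizer constant and helper name are illustrative:

package featureflag

import (
	"os"

	corev1 "k8s.io/api/core/v1"
)

// The finalizer name is illustrative; the real constant lives in the Argo codebase.
const podStatusFinalizer = "workflows.argoproj.io/status"

// addFinalizerIfEnabled appends the pod finalizer only when the feature flag
// is turned on, so existing installations keep the old behavior by default.
func addFinalizerIfEnabled(pod *corev1.Pod) {
	if os.Getenv("ARGO_POD_STATUS_CAPTURE_FINALIZER") != "true" {
		return
	}
	pod.Finalizers = append(pod.Finalizers, podStatusFinalizer)
}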
Would you mind adding to the Description more information on how you decided on your approach and your investigation of corner cases?
Here's another case: let's say that somebody is prematurely deleting their Workflow before it's done. We want to make sure the Finalizer gets removed from the Pods. I'm not sure if this calls "terminateContainers" - where your logic was before - or not, but here is a possible alternative solution: say there's a Finalizer on the Workflow, which causes the Workflow Controller to first remove the finalizer on all the Pods before removing the Finalizer on the Workflow itself (a rough sketch follows).
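A minimal sketch of that alternative, assuming client-go and the workflows.argoproj.io/workflow label; the helper name and the blanket clearing of finalizers are illustrative, not the actual Argo code:

package finalizer

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// stripPodFinalizers would run while the Workflow's own finalizer is still
// present: it clears the finalizer on every pod the Workflow owns, and only
// after it succeeds would the controller remove the Workflow finalizer and let
// the Workflow deletion complete.
func stripPodFinalizers(ctx context.Context, kube kubernetes.Interface, namespace, wfName string) error {
	pods, err := kube.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "workflows.argoproj.io/workflow=" + wfName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		// JSON merge patch that clears finalizers; a real implementation would
		// remove only the workflow pod finalizer and leave any others intact.
		patch := []byte(`{"metadata":{"finalizers":null}}`)
		if _, err := kube.CoreV1().Pods(namespace).Patch(ctx, pod.Name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return err
		}
	}
	return nil
}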
Generally speaking, I think our strategy for analyzing this should be to determine what all the ways are for a pod to be deleted, and try to make sure that all of those will be handled. If I search the code I see 3 instances of Delete() called for Pods:
- workflow/controller/controller.go
- server/workflow/workflow_server.go
- server/workflowarchive/archived_workflow_server.go
- implicitly through deletion of the parent Workflow which has an OwnerReference to it
Would you mind adding to the Description more information on how you decided on your approach
I utilized the approach from the original PR under the assumption that it had already undergone review. Other parts of my approach are based on feedback received during the review of the original PR, such as the addition of a feature flag, and I have also included some tests for the feature flag. Would you like me to add this explanation to the Description of this PR?
and your investigation of corner cases?
My verification for this PR primarily consists of e2e and unit tests. I haven't identified any additional corner cases beyond those covered in the unit and e2e tests.
Here's another case: let's say that somebody is prematurely deleting their Workflow before it's done. We want to make sure the Finalizer gets removed from the Pods. I'm not sure if this calls "terminateContainers" - where your logic was before
I think this is controlled by applyExecutionControl(), which does invoke terminateContainers.
In a previous discussion, I removed the finalizer removal logic from both terminateContainers and killContainers based on the understanding that it was redundant.
However, considering the scenario you described, it seems we might need to revisit this decision and potentially re-add the logic to handle such cases.
Generally speaking, I think our strategy for analyzing this should be to determine what all the ways are for a pod to be deleted, and try to make sure that all of those will be handled. If I search the code I see 3 instances of Delete() called for Pods:
I concur that the Delete() function is called for Pods in the 3 instances you mentioned above. I have been reviewing the code to identify any additional logic or potential corner cases related to pod deletion. However, I have yet to find any further instances beyond those already discussed.
If you have concerns that the implementation in this PR may not be comprehensive enough, I would greatly appreciate the involvement of another reviewer with more in-depth knowledge of pod deletion logic to provide additional insights.
Regarding the feature flag, I understand it addresses concerns about the pod finalizer such as the one above, and it is currently set to false by default. After this PR is merged, if any issues arise when this flag is true, my expectation is that we would address them in a subsequent PR. This is how I envision the purpose and use of the feature flag.
Please let me know if my understanding needs correction or clarification.
Hey @isubasinghe - would you be interested in also reviewing this PR? I believe you may have worked on pod deletion-related PRs unless I’m mistaken?
Thank you @sakai-ast ! I apologize for all the iterations, including potentially needing to add back code that we decided to remove.
I agree on the feature flag really helping to lessen the concerns of scenarios missed. At the same time, it would be great to identify any possible scenarios ahead of time and test anything we're not sure of.
If we think the logic for terminateContainers should go back, then maybe re-add that, and then I'll be comfortable if you could just test these 2 cases and make sure the finalizer is removed and pods get deleted:
- use of "Retry Workflow" through the Argo Server (and with that, see if you're able to retry the workflow even before the Workflow has completed)
- premature deletion of a Workflow, e.g. through kubectl delete
@juliev0 Yeah I did do some work on pod deletion I think, I can have a look now.
@juliev0 Yeah I did do some work on pod deletion I think, I can have a look now.
Appreciate it. Thanks! Note that there is a feature flag for this, so less risk.
@juliev0 I have conducted tests based on your scenarios in the current PR implementation.
Preparation
To enable ARGO_POD_STATUS_CAPTURE_FINALIZER=true with RUN_MODE=kubernetes, I added the following to the workflow-controller deployment settings (because locally the UI in the browser only works properly when RUN_MODE=kubernetes).
--- a/test/e2e/manifests/mixins/workflow-controller-deployment.yaml
+++ b/test/e2e/manifests/mixins/workflow-controller-deployment.yaml
@@ -8,4 +8,7 @@ spec:
     spec:
       containers:
         - name: workflow-controller
-          imagePullPolicy: Never
\ No newline at end of file
+          imagePullPolicy: Never
+          env:
+            - name: ARGO_POD_STATUS_CAPTURE_FINALIZER
+              value: "true"
Built the current code into images and started:
make workflow-controller-image argocli-image IMAGE_NAMESPACE=finalizer-test
make start PROFILE=minimal RUN_MODE=kubernetes IMAGE_NAMESPACE=finalizer-test AUTH_MODE=client STATIC_FILES=false LOG_LEVEL=debug API=true UI=true
Tests
Retry Workflow through Argo Server
RETRY, SUBMIT, RESUBMIT
I observed that RetryWorkflow() is called from the UI's RETRY button. The RETRY button is enabled only after the workflow completes, meaning we cannot retry a workflow before it completes.
When retried, all pods already have workflows.argoproj.io/completed: "true" and the finalizer is removed. Here are server logs when the RETRY button is pressed.
time="2024-01-16T11:25:27.063Z" level=info msg="Deleting pod" podDeleted=fail-4wdjs
time="2024-01-16T11:25:27.071Z" level=info msg="Workflow to be dehydrated" Workflow Size=995
time="2024-01-16T11:25:27.077Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=RetryWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2024-01-16T11:25:27Z" grpc.time_ms=20.98 span.kind=server system=grpc
time="2024-01-16T11:25:27.077Z" level=info duration=22.54035ms method=PUT path=/api/v1/workflows/argo/fail-4wdjs/retry size=1799 status=0
(Screenshots: the RETRY button; during SUBMIT and RESUBMIT the RETRY button does not appear.)
TERMINATE
Terminating a running workflow using the TERMINATE button worked as expected, with related pods marked completed and finalizers removed.
time="2024-01-16T12:12:12.259Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=TerminateWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2024-01-16T12:12:12Z" grpc.time_ms=17.113 span.kind=server system=grpc
time="2024-01-16T12:12:12.260Z" level=info duration=18.29799ms method=PUT path=/api/v1/workflows/argo/sleep-svbpc/terminate size=2122 status=0
Premature Deletion of Workflow via kubectl delete
Deleting a running workflow using kubectl delete properly removed the workflow's pods, even when the pods had finalizers, and the workflow status changed to failed. This appears to be the correct behavior.
Problem
On the UI
Deleting a running workflow via DELETE button removes the workflow, but related pods get stuck in the Terminating state due to remaining finalizers.
time="2024-01-16T11:57:18.461Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2024-01-16T11:57:18Z" grpc.time_ms=18.524 span.kind=server system=grpc
time="2024-01-16T11:57:18.461Z" level=info duration=19.956665ms method=DELETE path=/api/v1/workflows/argo/fail-mp2k4 size=2 status=0
This behavior is due to DeleteWorkflow() not accounting for pods with finalizers. It attempts to delete the workflow before the controller has dealt with the related pods. I believe additional logic is needed to address this, such as controller reconciliation of pod termination status and finalizer removal when eligible (a rough sketch of the idea follows).
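To make that proposal concrete, here is a minimal sketch of such a reconciliation loop, assuming client-go; the helper names, the period, and the blanket clearing of finalizers are illustrative rather than this PR's actual implementation:

package reconcile

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// podFinished is a rough eligibility check: the pod ran to completion, or the
// controller already marked it completed.
func podFinished(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodSucceeded ||
		pod.Status.Phase == corev1.PodFailed ||
		pod.Labels["workflows.argoproj.io/completed"] == "true"
}

// reconcileStuckPods periodically looks for workflow pods stuck in Terminating
// because a finalizer is still set, and clears the finalizers on the eligible
// ones so Kubernetes can finish deleting them.
func reconcileStuckPods(ctx context.Context, kube kubernetes.Interface, namespace string) {
	wait.UntilWithContext(ctx, func(ctx context.Context) {
		pods, err := kube.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
			// Only pods created by the workflow controller carry this label.
			LabelSelector: "workflows.argoproj.io/workflow",
		})
		if err != nil {
			return
		}
		for _, pod := range pods.Items {
			if pod.DeletionTimestamp == nil || len(pod.Finalizers) == 0 || !podFinished(&pod) {
				continue
			}
			// JSON merge patch that clears finalizers; a real implementation would
			// remove only the workflow pod finalizer.
			patch := []byte(`{"metadata":{"finalizers":null}}`)
			_, _ = kube.CoreV1().Pods(namespace).Patch(ctx, pod.Name, types.MergePatchType, patch, metav1.PatchOptions{})
		}
	}, 30*time.Second)
}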
Argo CLI
Using argo delete for a running workflow results in the same issue as using the DELETE button, with related pods remaining in the Terminating state.
Conclusion
The tests suggest that merely adding logic to remove finalizers in terminateContainers might not be sufficient. Implementing a routine to periodically identify and remove finalizers from pods stuck in the Terminating status, as proposed above (controller reconciliation of pod termination status and finalizer removal when eligible), would more reliably prevent situations where pods that should be deleted are not.
Please review this and let me know if additional implementation is needed.
Thank you so much for all of the testing @sakai-ast . Do you remember the functionality that was in this PR that we decided to remove in the DeleteFunc from controller.go? I wonder if the issue you're seeing would be addressed if that were added back?
And then if we did that, maybe we don't need the terminateContainers logic? Does it make sense?
@juliev0
I have re-added the finalizer removal logic in the DeleteFunc
https://github.com/argoproj/argo-workflows/pull/12413/commits/b1c54ac8f28e855573690330a55610e4dcc49002
and the following issues have been resolved.
Deleting a running workflow via DELETE button removes the workflow, but related pods get stuck in the Terminating state due to remaining finalizers.
Using argo delete for a running workflow results in the same issue as using the DELETE button, with related pods remaining in the Terminating state.
I'm not sure exactly how it works, but I added the following temporary log line to the DeleteFunc for debugging purposes, and the logs showed the function working properly.
--- a/workflow/controller/controller.go
+++ b/workflow/controller/controller.go
@@ -1050,6 +1050,7 @@ func (wfc *WorkflowController) addWorkflowInformerHandlers(ctx context.Context)
 			if err := enablePodForDeletion(ctx, pods, &p); err != nil {
 				log.WithError(err).Error("Failed to enable pod for deletion")
 			}
+			log.Info("pod is enabled for pod deletion in DeleteFunc()")
 		}
 		key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
Controller logs (argo delete executed):
time="2024-01-18T05:26:14.749Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=sleep-grln9
time="2024-01-18T05:26:14.749Z" level=info msg=reconcileAgentPod namespace=argo workflow=sleep-grln9
time="2024-01-18T05:26:14.749Z" level=info msg="Workflow to be dehydrated" Workflow Size=1155
time="2024-01-18T05:26:14.762Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=350104 workflow=sleep-grln9
time="2024-01-18T05:26:16.950Z" level=info msg="pod is enabled for pod deletion in DeleteFunc()"
How did you notice this, and do you know how the DeleteFunc gets called?
And then if we did that, maybe we don't need the terminateContainers logic? Does it make sense?
Yes it makes sense.
@juliev0 I have re-added the finalizer removal logic in the DeleteFunc (b1c54ac) and the following issues have been resolved.
[...]
How did you notice this, and do you know how the DeleteFunc gets called? [...]
Great! So, the DeleteFunc is basically a callback called by the internal K8S Informer framework (described here) when the Workflow is scheduled for deletion.
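For illustration, this is roughly how a DeleteFunc gets wired up through client-go's informer machinery; a Pod informer is used as a stand-in here, while Argo registers an equivalent handler for Workflows:

package informerdemo

import (
	"log"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newDeletionLogger registers a DeleteFunc on a shared Pod informer purely to
// show the callback pattern: the informer framework invokes DeleteFunc when it
// observes the object's deletion, which is how the controller's Workflow
// DeleteFunc gets called.
func newDeletionLogger(clientset kubernetes.Interface) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	informer := factory.Core().V1().Pods().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// DeletionHandlingMetaNamespaceKeyFunc also handles the tombstone
			// (DeletedFinalStateUnknown) case.
			key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
			if err != nil {
				log.Printf("failed to get key: %v", err)
				return
			}
			log.Printf("observed deletion of %s", key)
		},
	})
	return informer
}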
🍾
If the workflow-controller is stopped for a while, e.g. for maintenance or an upgrade, and workflows are deleted by somebody during that time, will this still work properly? During the gap, some pods complete, but the finalizer has no chance to get removed. How can we ensure that those pods can be deleted?
You can't ensure the pods will be deleted. The deletion will hang until either the controller is started again or someone deletes the finalizer on the pod through another means. That is the point of this change: deletion waits on the controller, which prevents inconsistency of state between the workflow and its pods.
If you are happy with the alternative, namely that a workflow may have unexpectedly missing pods, you may delete the finalizer manually from the pods. This is scriptable.
You can't ensure the pods will be deleted. The deletion will hang until either the controller is started again or someone deletes the finalizer on the pod through another means. That is the point of this change: deletion waits on the controller, which prevents inconsistency of state between the workflow and its pods.
If you are happy with the alternative, namely that a workflow may have unexpectedly missing pods, you may delete the finalizer manually from the pods. This is scriptable.
Keeping the finalizer is unnecessary for consistency between the workflow and its pods: once the workflow-controller has finished processing a pod (that is, node.Fulfilled()), the pod can be deleted safely. Otherwise, this may induce a memory leak in the kube-apiserver due to the large number of retained pods (the kube-controller-manager garbage-collects Failed and Succeeded pods). Since the finalizer is added by the workflow-controller, it should be responsible for removing it when the corresponding workflow is deleted. Removing the finalizer with a script is not appropriate.
I can make some optimizations for the situation where a pod's owner workflow has been deleted, and open a new PR for that.