
feat: add finalizer to workflow pod to prevent "pod deleted". Fixes #8783

rohankmr414 opened this pull request 2 years ago • 16 comments

Signed-off-by: Rohan Kumar [email protected]

Fixes #8783
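To sketch the idea (illustrative only; the finalizer name and helper shapes below are assumptions, not the exact code in this PR): the controller adds a finalizer to every workflow pod it creates, so the pod object cannot disappear before the controller has assessed its terminal state; once the outcome is recorded, the controller removes its finalizer and the deletion can complete.

```go
// Illustrative sketch only; not the code in this PR. The finalizer name is assumed.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

const podFinalizer = "workflows.argoproj.io/finalizer" // hypothetical name

// addFinalizer runs when the controller builds the pod spec, before creation.
func addFinalizer(pod *corev1.Pod) {
	pod.Finalizers = append(pod.Finalizers, podFinalizer)
}

// removeFinalizer runs once the controller has recorded the pod's outcome
// (e.g. labelled it completed); only then can Kubernetes finish deleting it.
func removeFinalizer(ctx context.Context, kube kubernetes.Interface, namespace, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := kube.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return nil // already gone, nothing to do
		}
		if err != nil {
			return err
		}
		// Remove only our own entry and keep any other finalizers intact.
		kept := pod.Finalizers[:0]
		for _, f := range pod.Finalizers {
			if f != podFinalizer {
				kept = append(kept, f)
			}
		}
		pod.Finalizers = kept
		_, err = kube.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```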

rohankmr414 avatar Jun 25 '22 14:06 rohankmr414

@rohankmr414 could you write a test case or add an example of this to your PR description?

@juliev0 @sarabala1979 the e2e tests for this PR are all failing with a 15m0s timeout; something to note, I assume

dpadhiar avatar Jun 27 '22 22:06 dpadhiar

@rohankmr414 Regarding the test failures, we should determine whether they're "real" or just intermittent. Have you tried running them locally? You should be able to run "make [testname]", e.g. "make TestMarkDaemonedPodSucceeded".

juliev0 avatar Jun 27 '22 23:06 juliev0

@rohankmr414 This is a nice fix. Please add a unit test and an e2e test, and fix the test failures.

sarabala1979 avatar Jun 29 '22 15:06 sarabala1979

=== RUN   TestDaemonPodSuite/TestMarkDaemonedPodSucceeded
Submitting workflow  daemoned-pod-completed-
Waiting 1m0s for workflow metadata.name=daemoned-pod-completed-x6jsz
 ? daemoned-pod-completed-x6jsz Workflow 0s      

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ◷ daemoned                     Pod      0s      

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ◷ daemoned                     Pod      0s      PodInitializing

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ● daemoned                     Pod      3s      
 └ ◷ daemon-dependent             Pod      0s      

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ● daemoned                     Pod      3s      
 └ ◷ daemon-dependent             Pod      0s      PodInitializing

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ● daemoned                     Pod      3s      
 └ ● daemon-dependent             Pod      0s      

 ● daemoned-pod-completed-x6jsz   Workflow 0s      
 └ ● daemoned-pod-completed-x6jsz DAG      0s      
 └ ● daemoned                     Pod      3s      
 └ ● daemon-dependent             Pod      0s      

 ✔ daemoned-pod-completed-x6jsz   Workflow 9s      
 └ ✔ daemoned-pod-completed-x6jsz DAG      9s      
 └ ✔ daemoned                     Pod      3s      
 └ ✔ daemon-dependent             Pod      5s      

 ✔ daemoned-pod-completed-x6jsz   Workflow 9s      
 └ ✔ daemoned-pod-completed-x6jsz DAG      9s      
 └ ✔ daemoned                     Pod      3s      
 └ ✔ daemon-dependent             Pod      5s      

Condition "to be Succeeded" met after 9s
Checking expectation daemoned-pod-completed-x6jsz
daemoned-pod-completed-x6jsz : Succeeded 
=== PASS: DaemonPodSuite/TestMarkDaemonedPodSucceeded
    --- PASS: TestDaemonPodSuite/TestMarkDaemonedPodSucceeded (9.54s)
PASS

Screenshot from 2022-07-13 19-56-19

@sarabala1979 @juliev0 For some reason, the tests are passing in my local env with the finalizer; not sure what's happening here.

rohankmr414 avatar Jul 13 '22 14:07 rohankmr414

Hmm, sometimes there are intermittent failures, unfortunately. We have a bug logged for that (there's an idea that the issue is caused by some overall timeout being too short). Can you push an "empty commit" (i.e. add a space or line or something) to re-trigger the tests? Then you can see whether the same tests fail every time or only intermittently. (Sorry for the trouble.)

juliev0 avatar Jul 13 '22 16:07 juliev0

@juliev0 Actually, I did a rebase and force push, which re-triggered the tests, and the same tests are failing again. I think the reason is that in some tests pods are being terminated and then getting stuck in a Terminating state because of the finalizer, which somehow prevents the next test from running. It happened in my local env as well when running make test-functional E2E_SUITE_TIMEOUT=15m STATIC_FILES=false: after the test TestDaemonTemplateRef, TestMarkDaemonedPodSucceeded is prevented from running, as in https://github.com/argoproj/argo-workflows/runs/7320582957?check_suite_focus=true

Screenshot from 2022-07-13 22-11-45
Screenshot from 2022-07-13 22-12-08

If I remove the finalizer manually, the stuck pods terminate and the tests run fine.
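For anyone else who hits this locally, here's the kind of one-off cleanup I mean (a throwaway sketch, assuming the test pods live in the argo namespace and carry a workflows.argoproj.io/workflow label; adjust for your setup):

```go
// Throwaway cleanup sketch for a local test cluster; the namespace, label
// selector, and "clear all finalizers" shortcut are assumptions for this sketch.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	kube := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	pods, err := kube.CoreV1().Pods("argo").List(ctx, metav1.ListOptions{
		LabelSelector: "workflows.argoproj.io/workflow", // assumed label on workflow pods
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Only touch pods that are already terminating but held back by a finalizer.
		if p.DeletionTimestamp == nil || len(p.Finalizers) == 0 {
			continue
		}
		// Merge-patching finalizers to null clears the list and unblocks deletion;
		// equivalent to: kubectl patch pod <name> --type=merge -p '{"metadata":{"finalizers":null}}'
		patch := []byte(`{"metadata":{"finalizers":null}}`)
		if _, err := kube.CoreV1().Pods(p.Namespace).Patch(ctx, p.Name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			fmt.Println("failed to unstick", p.Name, ":", err)
			continue
		}
		fmt.Println("removed finalizers from", p.Name)
	}
}
```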

rohankmr414 avatar Jul 13 '22 16:07 rohankmr414

Oh, so it was after you did a rebase, meaning some recent commit in master caused it to change for you? Sorry about that.

When you say "some test pods are being terminated" do you mean the Controller is not the one deleting them and therefore they're not going through your logic to remove the finalizer? Unfortunately, I'm not an expert on those tests. Can you investigate and modify the tests if needed?

juliev0 avatar Jul 13 '22 18:07 juliev0

@juliev0 the error is not because of the rebase; I guess it is caused by some bug in the pod deletion logic

rohankmr414 avatar Jul 13 '22 18:07 rohankmr414

Okay, thanks for the clarification. If you are able to investigate the bug that would be great.

juliev0 avatar Jul 13 '22 19:07 juliev0

@juliev0 @sarabala1979 The issue occurs when running multiple tests: some pods from the previous test are terminated without being labelled as completed, which prevents the next test from running. When the tests are run individually, the pods are labelled as completed successfully and cleaned up later; not really sure what's happening here.

rohankmr414 avatar Jul 24 '22 18:07 rohankmr414

Sorry for the lack of response. I'm new to the project so don't have any insider information. :) It looks like you've made some recent commits - are you able to address this test issue?

juliev0 avatar Jul 26 '22 22:07 juliev0

I ran these tests locally and they seem to be working. I don't know about the k8s instance that gets spun up to run the tests in CI, but I'm wondering if there are some parameters we can play around with to eliminate the flakiness. Here are the outputs from my Docker Desktop k8s cluster:

➜  argo-workflows git:(rohankmr414-finalizer) ✗ go test -v -run=PodCleanupSuite/TestOnWorkflowCompletionLabelNotMatch --tags=functional --count=1 ./test/e2e/pod_cleanup_test.go
=== RUN   TestPodCleanupSuite
=== RUN   TestPodCleanupSuite/TestOnWorkflowCompletionLabelNotMatch
Submitting workflow  test-pod-cleanup-on-workflow-completion-label-not-match-
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Pending
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Pending
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Pending
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Pending
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Running
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Running
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Running
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Failed
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Complete
Pod condition met
=== PASS: PodCleanupSuite/TestOnWorkflowCompletionLabelNotMatch
--- PASS: TestPodCleanupSuite (7.72s)
    --- PASS: TestPodCleanupSuite/TestOnWorkflowCompletionLabelNotMatch (7.69s)
PASS
ok  	command-line-arguments	8.465s
➜  argo-workflows git:(rohankmr414-finalizer) ✗ go test -v -run=ClusterWorkflowTemplateSuite/TestNestedClusterWorkflowTemplate --tags=functional --count=1 ./test/e2e/...
=== RUN   TestClusterWorkflowTemplateSuite
=== RUN   TestClusterWorkflowTemplateSuite/TestNestedClusterWorkflowTemplate
Creating cluster workflow template cluster-workflow-template-nested-template
Creating cluster workflow template cluster-workflow-template-whalesay-template
Submitting workflow  cwft-wf-
Waiting 1m0s for workflow metadata.name=cwft-wf-nk6cs
 ? cwft-wf-nk6cs Workflow 0s

 ● cwft-wf-nk6cs            Workflow  0s
 └ ◷ step                   Pod       0s
 └ ● step                   Steps     0s
 └ ● cwft-wf-nk6cs          Steps     0s
 └ ● [0]                    StepGroup 0s
 └ ● call-whalesay-template Steps     0s
 └ ● [0]                    StepGroup 0s
 └ ● [0]                    StepGroup 0s

 ● cwft-wf-nk6cs            Workflow  0s
 └ ● [0]                    StepGroup 0s
 └ ◷ step                   Pod       0s      PodInitializing
 └ ● step                   Steps     0s
 └ ● cwft-wf-nk6cs          Steps     0s
 └ ● [0]                    StepGroup 0s
 └ ● call-whalesay-template Steps     0s
 └ ● [0]                    StepGroup 0s

 ● cwft-wf-nk6cs            Workflow  0s
 └ ● step                   Steps     0s
 └ ● cwft-wf-nk6cs          Steps     0s
 └ ● [0]                    StepGroup 0s
 └ ● call-whalesay-template Steps     0s
 └ ● [0]                    StepGroup 0s
 └ ● [0]                    StepGroup 0s
 └ ● step                   Pod       0s

 ● cwft-wf-nk6cs            Workflow  0s
 └ ● step                   Pod       0s
 └ ● step                   Steps     0s
 └ ● cwft-wf-nk6cs          Steps     0s
 └ ● [0]                    StepGroup 0s
 └ ● call-whalesay-template Steps     0s
 └ ● [0]                    StepGroup 0s
 └ ● [0]                    StepGroup 0s

 ✔ cwft-wf-nk6cs            Workflow  7s
 └ ✔ step                   Steps     7s
 └ ✔ cwft-wf-nk6cs          Steps     7s
 └ ✔ [0]                    StepGroup 7s
 └ ✔ call-whalesay-template Steps     7s
 └ ✔ [0]                    StepGroup 7s
 └ ✔ [0]                    StepGroup 7s
 └ ✔ step                   Pod       3s

Condition "to be Succeeded" met after 7s
=== PASS: ClusterWorkflowTemplateSuite/TestNestedClusterWorkflowTemplate
--- PASS: TestClusterWorkflowTemplateSuite (8.84s)
    --- PASS: TestClusterWorkflowTemplateSuite/TestNestedClusterWorkflowTemplate (8.81s)
PASS
ok  	github.com/argoproj/argo-workflows/v3/test/e2e	9.557s
?   	github.com/argoproj/argo-workflows/v3/test/e2e/fixtures	[no test files]

JPZ13 avatar Aug 11 '22 18:08 JPZ13

cc @sarabala1979

terrytangyuan avatar Aug 16 '22 17:08 terrytangyuan

I was able to reproduce a stuck test locally, and what I see is that we're basically stuck in this for loop forever. The issue is that the finalizer-removal logic happens too early, before the pod exists. So what we could do is add an "else" for that "if statement" in the for loop, which removes any finalizers from objects that still exist (and I believe that can apply to any type of object since, as far as I can tell, finalizers apply to any type).
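Roughly the shape I have in mind (a sketch of the idea only; the real fixture code and names differ, and the function name and label selector here are illustrative):

```go
// Sketch of the idea only; the real e2e fixture code is structured differently
// and the label selector is an assumption.
package fixtures

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func waitForWorkflowPodsGone(ctx context.Context, kube kubernetes.Interface, namespace string) error {
	for {
		pods, err := kube.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
			LabelSelector: "workflows.argoproj.io/workflow",
		})
		if err != nil {
			return err
		}
		if len(pods.Items) == 0 {
			return nil // cleanup finished; the next test can start
		}
		// Previously the loop only waited here, so a pod held back by a finalizer
		// (e.g. one the controller never got to remove) kept it spinning forever.
		// The proposed "else": strip finalizers from whatever still exists so the
		// deletion that is already in progress can complete.
		for _, p := range pods.Items {
			if len(p.Finalizers) == 0 {
				continue
			}
			patch := []byte(`{"metadata":{"finalizers":null}}`)
			if _, err := kube.CoreV1().Pods(namespace).Patch(ctx, p.Name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
				return err
			}
		}
		time.Sleep(time.Second)
	}
}
```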

juliev0 avatar Sep 29 '22 17:09 juliev0

This appears to be WIP. Converted to draft.

alexec avatar Oct 07 '22 15:10 alexec

I think you're right about deleting finalizers in the e2e tests. I need to look at something else; can I pass this back to you?

alexec avatar Oct 10 '22 20:10 alexec

when is this being released?

tooptoop4 avatar Nov 10 '22 05:11 tooptoop4

🦗

tooptoop4 avatar Dec 06 '22 18:12 tooptoop4

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Dec 31 '22 22:12 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jan 21 '23 05:01 stale[bot]

This PR is in draft mode. Reviewers only look at PRs that are marked “Ready for review”. Please update PR if you would like it reviewed.

alexec avatar Feb 12 '23 16:02 alexec

@rohankmr414 is it still draft?

tooptoop4 avatar Feb 13 '23 12:02 tooptoop4

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar May 21 '23 16:05 stale[bot]

unstale

tooptoop4 avatar May 21 '23 21:05 tooptoop4

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jun 18 '23 00:06 stale[bot]

Hello! Has anyone been looking at finalizing this PR? We run Kubeflow and have been experiencing issues with pods being deleted by the pod garbage collector before the workflow controller marks the run as successful.

aubrianna-zhu avatar Jun 21 '23 22:06 aubrianna-zhu

hey @rohankmr414 , do you plan on merging this anytime soon?

nurcan-sonmez avatar Jul 07 '23 17:07 nurcan-sonmez

Hey, we also face issues with pod deletion in Kubeflow. @rohankmr414, if it is ready for review, can you mark it as ready? Otherwise, can someone else take over, so that the hard work does not go to waste?

jbgerth avatar Jul 18 '23 14:07 jbgerth

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Aug 12 '23 13:08 stale[bot]

We're also waiting for this implementation to be released. Is there anything we could do to help progress this PR?

cc @rohankmr414 , @dpadhiar , @juliev0 , @sarabala1979

e-backmark-ericsson avatar Aug 18 '23 06:08 e-backmark-ericsson

I believe the failing tests are flaky; they are not related to my change and pass in my local environment.

rohankmr414 avatar Sep 24 '23 11:09 rohankmr414