argo-workflows
feat: add finalizer to workflow pod to prevent "pod deleted". Fixes #8783
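For readers skimming the thread: the change adds a finalizer to each workflow pod so the pod cannot be fully deleted (for example by the pod garbage collector) before the controller has recorded its final status. Below is a minimal sketch of the idea, not the PR's actual diff; the finalizer name and helper are assumptions for illustration.

```go
package controller

import (
	apiv1 "k8s.io/api/core/v1"
)

// Assumed finalizer name, for illustration only.
const podStatusFinalizer = "workflows.argoproj.io/status"

// withStatusFinalizer adds the finalizer to a workflow pod before it is
// created. Kubernetes then keeps the pod object around (in Terminating)
// after a delete request until the finalizer is removed, giving the
// controller a chance to record the pod's final status first.
func withStatusFinalizer(pod *apiv1.Pod) *apiv1.Pod {
	pod.ObjectMeta.Finalizers = append(pod.ObjectMeta.Finalizers, podStatusFinalizer)
	return pod
}
```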
@rohankmr414 could you write a test case or add an example of this to your PR description?
@juliev0 @sarabala1979 the e2e tests for this PR are all failing with a 15m0s timeout; something to note, I assume.
@rohankmr414 Regarding the test failures, we should determine whether they're "real" or intermittent. Have you tried running them locally? You should be able to run "make [testname]", e.g. "make TestMarkDaemonedPodSucceeded".
@rohankmr414 This is a nice fix. Please add a unit test and an e2e test, and fix the test failures.
=== RUN TestDaemonPodSuite/TestMarkDaemonedPodSucceeded
Submitting workflow daemoned-pod-completed-
Waiting 1m0s for workflow metadata.name=daemoned-pod-completed-x6jsz
? daemoned-pod-completed-x6jsz Workflow 0s
● daemoned-pod-completed-x6jsz Workflow 0s
└ ● daemoned-pod-completed-x6jsz DAG 0s
└ ◷ daemoned Pod 0s
● daemoned-pod-completed-x6jsz Workflow 0s
└ ● daemoned-pod-completed-x6jsz DAG 0s
└ ◷ daemoned Pod 0s PodInitializing
● daemoned-pod-completed-x6jsz Workflow 0s
└ ● daemoned-pod-completed-x6jsz DAG 0s
└ ● daemoned Pod 3s
└ ◷ daemon-dependent Pod 0s
● daemoned-pod-completed-x6jsz Workflow 0s
└ ● daemoned-pod-completed-x6jsz DAG 0s
└ ● daemoned Pod 3s
└ ◷ daemon-dependent Pod 0s PodInitializing
● daemoned-pod-completed-x6jsz Workflow 0s
└ ● daemoned-pod-completed-x6jsz DAG 0s
└ ● daemoned Pod 3s
└ ● daemon-dependent Pod 0s
● daemoned-pod-completed-x6jsz Workflow 0s
└ ● daemoned-pod-completed-x6jsz DAG 0s
└ ● daemoned Pod 3s
└ ● daemon-dependent Pod 0s
✔ daemoned-pod-completed-x6jsz Workflow 9s
└ ✔ daemoned-pod-completed-x6jsz DAG 9s
└ ✔ daemoned Pod 3s
└ ✔ daemon-dependent Pod 5s
✔ daemoned-pod-completed-x6jsz Workflow 9s
└ ✔ daemoned-pod-completed-x6jsz DAG 9s
└ ✔ daemoned Pod 3s
└ ✔ daemon-dependent Pod 5s
Condition "to be Succeeded" met after 9s
Checking expectation daemoned-pod-completed-x6jsz
daemoned-pod-completed-x6jsz : Succeeded
=== PASS: DaemonPodSuite/TestMarkDaemonedPodSucceeded
--- PASS: TestDaemonPodSuite/TestMarkDaemonedPodSucceeded (9.54s)
PASS
@sarabala1979 @juliev0 For some reason, the tests are passing on my local env with the finalizer; not sure what's happening here.
Hmm, sometimes there are intermittent failures, unfortunately. We have a bug logged for that (there's an idea that the issue is caused by some overall timeout being too short). Can you push an "empty commit" (i.e. add a space or a line or something) to re-trigger the tests? That way you can see whether the same tests fail every time or only intermittently. (Sorry for the trouble)
@juliev0 actually I did a rebase and force push, which re-triggered the tests, and the same tests are failing again. I think the reason is that in some tests pods are being terminated and get stuck in a Terminating state due to the finalizer, which somehow prevents the next test from being run. It happened in my local env as well.
This is after running make test-functional E2E_SUITE_TIMEOUT=15m STATIC_FILES=false: pods left over after the test TestDaemonTemplateRef are stuck, which prevents TestMarkDaemonedPodSucceeded from being run in https://github.com/argoproj/argo-workflows/runs/7320582957?check_suite_focus=true. If I remove the finalizer manually, the stuck pods are terminated and the tests run fine.
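For reference, "removing the finalizer manually" amounts to patching metadata.finalizers away so the pending deletion can complete. A minimal client-go sketch of that patch, with assumed names; this is a debugging aid, not part of the PR:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// clearPodFinalizers strips every finalizer from the named pod with a JSON
// merge patch, allowing a pod stuck in Terminating to finish deleting.
func clearPodFinalizers(ctx context.Context, clientset kubernetes.Interface, namespace, name string) error {
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := clientset.CoreV1().Pods(namespace).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```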
Oh, so it was after you did a rebase, meaning some recent commit in master caused it to change for you? Sorry about that.
When you say "some test pods are being terminated" do you mean the Controller is not the one deleting them and therefore they're not going through your logic to remove the finalizer? Unfortunately, I'm not an expert on those tests. Can you investigate and modify the tests if needed?
@juliev0 the error is not because of the rebase; I guess it is caused by some bug in the pod deletion logic.
Okay, thanks for the clarification. If you are able to investigate the bug that would be great.
@juliev0 @sarabala1979 the issue occurs when running multiple tests: some pods from the previous test are terminated without being labelled as completed, which prevents the next test from being run. When the tests are run individually, the pods are labelled as completed successfully and cleaned up later; not really sure what's happening here.
Sorry for the lack of response. I'm new to the project, so I don't have any insider information. :) It looks like you've made some recent commits - are you able to address this test issue?
I ran these tests locally and they seem to be working. I don't know about the k8s instance that gets spun up to run the tests, but I'm wondering if there are some parameters we can play around with to eliminate the flakiness. Here are the outputs from my Docker Desktop k8s cluster:
➜ argo-workflows git:(rohankmr414-finalizer) ✗ go test -v -run=PodCleanupSuite/TestOnWorkflowCompletionLabelNotMatch --tags=functional --count=1 ./test/e2e/pod_cleanup_test.go
=== RUN TestPodCleanupSuite
=== RUN TestPodCleanupSuite/TestOnWorkflowCompletionLabelNotMatch
Submitting workflow test-pod-cleanup-on-workflow-completion-label-not-match-
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Pending
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Pending
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Pending
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Pending
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Running
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Running
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Running
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Failed
pod test-pod-cleanup-on-workflow-completion-label-not-match-tx6pv: Complete
Pod condition met
=== PASS: PodCleanupSuite/TestOnWorkflowCompletionLabelNotMatch
--- PASS: TestPodCleanupSuite (7.72s)
--- PASS: TestPodCleanupSuite/TestOnWorkflowCompletionLabelNotMatch (7.69s)
PASS
ok command-line-arguments 8.465s
➜ argo-workflows git:(rohankmr414-finalizer) ✗ go test -v -run=ClusterWorkflowTemplateSuite/TestNestedClusterWorkflowTemplate --tags=functional --count=1 ./test/e2e/...
=== RUN TestClusterWorkflowTemplateSuite
=== RUN TestClusterWorkflowTemplateSuite/TestNestedClusterWorkflowTemplate
Creating cluster workflow template cluster-workflow-template-nested-template
Creating cluster workflow template cluster-workflow-template-whalesay-template
Submitting workflow cwft-wf-
Waiting 1m0s for workflow metadata.name=cwft-wf-nk6cs
? cwft-wf-nk6cs Workflow 0s
● cwft-wf-nk6cs Workflow 0s
└ ◷ step Pod 0s
└ ● step Steps 0s
└ ● cwft-wf-nk6cs Steps 0s
└ ● [0] StepGroup 0s
└ ● call-whalesay-template Steps 0s
└ ● [0] StepGroup 0s
└ ● [0] StepGroup 0s
● cwft-wf-nk6cs Workflow 0s
└ ● [0] StepGroup 0s
└ ◷ step Pod 0s PodInitializing
└ ● step Steps 0s
└ ● cwft-wf-nk6cs Steps 0s
└ ● [0] StepGroup 0s
└ ● call-whalesay-template Steps 0s
└ ● [0] StepGroup 0s
● cwft-wf-nk6cs Workflow 0s
└ ● step Steps 0s
└ ● cwft-wf-nk6cs Steps 0s
└ ● [0] StepGroup 0s
└ ● call-whalesay-template Steps 0s
└ ● [0] StepGroup 0s
└ ● [0] StepGroup 0s
└ ● step Pod 0s
● cwft-wf-nk6cs Workflow 0s
└ ● step Pod 0s
└ ● step Steps 0s
└ ● cwft-wf-nk6cs Steps 0s
└ ● [0] StepGroup 0s
└ ● call-whalesay-template Steps 0s
└ ● [0] StepGroup 0s
└ ● [0] StepGroup 0s
✔ cwft-wf-nk6cs Workflow 7s
└ ✔ step Steps 7s
└ ✔ cwft-wf-nk6cs Steps 7s
└ ✔ [0] StepGroup 7s
└ ✔ call-whalesay-template Steps 7s
└ ✔ [0] StepGroup 7s
└ ✔ [0] StepGroup 7s
└ ✔ step Pod 3s
Condition "to be Succeeded" met after 7s
=== PASS: ClusterWorkflowTemplateSuite/TestNestedClusterWorkflowTemplate
--- PASS: TestClusterWorkflowTemplateSuite (8.84s)
--- PASS: TestClusterWorkflowTemplateSuite/TestNestedClusterWorkflowTemplate (8.81s)
PASS
ok github.com/argoproj/argo-workflows/v3/test/e2e 9.557s
? github.com/argoproj/argo-workflows/v3/test/e2e/fixtures [no test files]
cc @sarabala1979
I was able to reproduce a stuck test locally, and what I see is that we're basically stuck in this for loop forever. The issue is that the finalizer-removal logic happens too early, before the pod exists. So what we could do is add an "else" for that "if" statement in the for loop, which removes any finalizers for objects that still exist (and I believe that can apply to any type of object, since as far as I can tell finalizers apply to any type).
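To make the suggestion above concrete, here is a rough sketch of the kind of "else" branch being described, with assumed names (the real loop lives in the e2e test fixtures): if objects matching the test label still exist after deletion has been requested, strip their finalizers so cleanup can finish instead of waiting forever.

```go
package fixtures

import (
	"context"
	"time"

	apierr "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// waitForDeletionClearingFinalizers assumes deletion has already been
// requested for all objects matching labelSelector. It polls until they are
// gone; any object that still exists gets its finalizers cleared so the
// pending deletion can complete.
func waitForDeletionClearingFinalizers(ctx context.Context, client dynamic.ResourceInterface, labelSelector string) error {
	for {
		list, err := client.List(ctx, metav1.ListOptions{LabelSelector: labelSelector})
		if err != nil {
			return err
		}
		if len(list.Items) == 0 {
			return nil // everything is gone, move on to the next resource type
		}
		// Objects still exist, most likely held back by a finalizer: clear it.
		patch := []byte(`{"metadata":{"finalizers":null}}`)
		for _, item := range list.Items {
			_, err := client.Patch(ctx, item.GetName(), types.MergePatchType, patch, metav1.PatchOptions{})
			if err != nil && !apierr.IsNotFound(err) {
				return err
			}
		}
		time.Sleep(time.Second)
	}
}
```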
This appears to be WIP. Converted to draft.
I think you're right about deleting finalizers in the e2e tests. I need to look at something else; can I pass this back to you?
when is this being released?
🦗
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
This PR is in draft mode. Reviewers only look at PRs that are marked “Ready for review”. Please update PR if you would like it reviewed.
@rohankmr414 is it still draft?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
unstale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
Hello! Has anyone been looking at finalizing this PR? We run Kubeflow and have been experiencing some issues with pods being deleted by the pod garbage collector before the workflow controller marks the run as successful.
Hey @rohankmr414, do you plan on merging this anytime soon?
Hey, we also face issues with pod deletion in Kubeflow. @rohankmr414, if it is ready for review, can you mark it as ready? Otherwise, can someone else take over, so that the hard work does not go to waste?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
We're also waiting for this implementation to be released. Is there anything we could do to help progress this PR?
cc @rohankmr414, @dpadhiar, @juliev0, @sarabala1979
I believe the failing tests are flaky; they are not related to my change and pass in my local environment.