argo-workflows
Intermittent failures in CI E2E tests
The CI end-to-end (E2E) tests often fail, then pass after an empty commit is pushed. For each failure, we need to determine whether the problem is in the test itself or is an actual race condition in the code that makes it behave differently from run to run.
We can add new occurrences over time here:
- test-functional, minimal
  - test: TestSubmitWorkflowTemplateWithEnum
    - what happened: panic: test timed out after 15m0s
    - link: https://github.com/argoproj/argo-workflows/runs/7011970939?check_suite_focus=true
  - test: TestParametrizableAds
    - what happened: Error: "" does not contain "Pod was active on the node longer than the specified deadline"
    - link: https://github.com/argoproj/argo-workflows/runs/7332294820?check_suite_focus=true
  - test: AgentSuite/TestParallel
    - what happened: line 67: "Should be true"
    - link: https://github.com/argoproj/argo-workflows/runs/7698020976?check_suite_focus=true
    - link: https://github.com/argoproj/argo-workflows/runs/7736452278?check_suite_focus=true
- test-cli, mysql
  - test: TestCLISuite/TestNodeSuspendResume
    - what happened: timeout after 1m at WaitForWorkflow()
    - link: https://github.com/argoproj/argo-workflows/runs/7365382562?check_suite_focus=true
  - test: TestCLISuite/TestWorkflowRetry
    - what happened: failure at: assert.True(t, retryTime.Before(&innerStepsPodNode.FinishedAt)), line 866
    - link: https://github.com/argoproj/argo-workflows/runs/7434857876?check_suite_focus=true
- test-executor, minimal
  - test: N/A
    - what happened: no test ever got run; timed out after 24m in the "actions/cache@v3" step
    - link: https://github.com/argoproj/argo-workflows/runs/7435503575?check_suite_focus=true
    - official issue: https://github.com/actions/cache/issues/810
  - test: SignalsSuite/TestStopBehavior
    - what happened: signals_test.go:34: timeout after 1m40s waiting for condition
    - link: https://github.com/argoproj/argo-workflows/runs/7459797029?check_suite_focus=true
- test-examples, minimal
  - test: examples/arguments-parameters-from-configmap.yaml
    - what happened: error: timed out waiting for the condition on workflows/conditional-artifacts-svhsv
    - link: https://github.com/argoproj/argo-workflows/runs/7643995231?check_suite_focus=true
- test-api, example
  - step: "make wait"
    - what happened: "the action has timed out"
    - link: https://github.com/argoproj/argo-workflows/runs/7660935659?check_suite_focus=true
I have experienced this recently, so here's a data point:
- test-functional, minimal
- test: TestSubmitWorkflowTemplateWithEnum
- what happened: panic: test timed out after 15m0s
- link: https://github.com/argoproj/argo-workflows/runs/6969660302?check_suite_focus=true#step:16:1
The timeout seems to hit at around 15m into the "Run make test-functional E2E_TIMEOUT=1m STATIC_FILES=false" step. When the step passes, it seems to have run for just a tiny bit less time (about a 2-second difference).
It may or may not be significant but the whole CI job that passed was 30s under the 20m mark, whereas the run that failed was 3s over 20m.
@ezk84 Thanks for the data point!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
still hoping to address this when I have time if nobody else does, so keeping it alive
As an update, the overall GitHub Action timeout for each e2e test was increased from 20m to 25m today, and the timeout passed into "go test" for the Test Suite run by the Action was increased from 15m to 20m (this PR). This should take care of some of the failures, although ultimately we need to address the issue of why the build is so slow.
Also, it appears that individual tests can sometimes time out as well (like this one).
Here is one more test case: https://github.com/argoproj/argo-workflows/runs/7816710269?check_suite_focus=true
FAIL: TestCLISuite/TestLogProblems (27.56s)
=== RUN TestCLISuite/TestLogProblems
Submitting workflow log-problems-
Waiting 1m0s for workflow metadata.name=log-problems-4d4nk
? log-problems-4d4nk Workflow 0s
● log-problems-4d4nk Workflow 0s
└ ● [0] StepGroup 0s
└ ● log-problems-4d4nk Steps 0s
└ ◷ report-1 Pod 0s
Condition "to start" met after 0s
../../dist/argo -n argo logs @latest --follow
Going to take a look at each of these test cases and see if there is any common cause or otherwise.
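As a rough first pass at looking for a common cause, the occurrences listed at the top of this issue can be grouped by symptom. The sketch below copies the "what happened" strings from that list and counts how many mention a timeout. The failure type and the isTimeout heuristic (a substring match on "timed out" or "timeout") are assumptions made for illustration, not part of the project's tooling.

```go
package main

import (
	"fmt"
	"strings"
)

// failure is a hypothetical record for one reported occurrence.
type failure struct {
	job, test, symptom string
}

// isTimeout is a crude heuristic: it only checks the wording of the symptom.
func isTimeout(symptom string) bool {
	s := strings.ToLower(symptom)
	return strings.Contains(s, "timed out") || strings.Contains(s, "timeout")
}

// countTimeouts splits the reports into timeout-like and other failures.
func countTimeouts(reports []failure) (timeouts, others int) {
	for _, r := range reports {
		if isTimeout(r.symptom) {
			timeouts++
		} else {
			others++
		}
	}
	return timeouts, others
}

func main() {
	// Symptoms copied from the occurrence list in this issue.
	reports := []failure{
		{"test-functional", "TestSubmitWorkflowTemplateWithEnum", "panic: test timed out after 15m0s"},
		{"test-functional", "TestParametrizableAds", `Error: "" does not contain "Pod was active on the node longer than the specified deadline"`},
		{"test-functional", "AgentSuite/TestParallel", `line 67: "Should be true"`},
		{"test-cli", "TestCLISuite/TestNodeSuspendResume", "timeout after 1m at WaitForWorkflow()"},
		{"test-cli", "TestCLISuite/TestWorkflowRetry", "failure at: assert.True(t, retryTime.Before(&innerStepsPodNode.FinishedAt))"},
		{"test-executor", "N/A", `no test ever got run; timed out after 24m in the "actions/cache@v3" step`},
		{"test-executor", "SignalsSuite/TestStopBehavior", "signals_test.go:34: timeout after 1m40s waiting for condition"},
		{"test-examples", "examples/arguments-parameters-from-configmap.yaml", "error: timed out waiting for the condition on workflows/conditional-artifacts-svhsv"},
		{"test-api", "make wait", "the action has timed out"},
	}
	timeouts, others := countTimeouts(reports)
	fmt.Printf("timeouts: %d, other failures: %d\n", timeouts, others)
}
```

A count like this only suggests where to look first. It does not say whether a given timeout comes from a slow build, a slow cluster, or a real race.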
TestParametrizableAds should have been addressed in https://github.com/argoproj/argo-workflows/commit/57bac335afac2c28a4eb5ccf1fa97bb5bba63e97 with an increase in time for WaitForWorkflow()
Hmm, but it looks like that commit occurred on 7/11, while the test occurred on 7/13 so unfortunately I don't think that fixed it, right?
That looks correct unfortunately. Will have to investigate this test case again.
I started a document (accessible by anyone at Intuit) which starts to go into some root causes.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
Not stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
not stale
Yeah, this is not stale. What is? Also: https://drewdevault.com/2021/10/26/stalebot.html
TestArtifactGC is apparently flaky. If somebody sees this fail, please include a link to the CI run.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
Not stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
Not stale
Not stale