
Intermittent failures in CI E2E tests

Open juliev0 opened this issue 2 years ago • 10 comments

The CI end-to-end tests often fail but then pass after an empty commit is pushed to re-trigger them. For each failure, we need to determine whether the problem lies in the test itself or in an actual race condition in the code that makes it behave differently from run to run.

We can add new occurrences over time here:

  • test-functional, minimal

    • test: TestSubmitWorkflowTemplateWithEnum

    • what happened: panic: test timed out after 15m0s

    • link: https://github.com/argoproj/argo-workflows/runs/7011970939?check_suite_focus=true

    • test: TestParametrizableAds

    • what happened: Error: "" does not contain "Pod was active on the node longer than the specified deadline"

    • link: https://github.com/argoproj/argo-workflows/runs/7332294820?check_suite_focus=true

    • test: AgentSuite/TestParallel

    • what happened: line 67: "Should be true"

    • link: https://github.com/argoproj/argo-workflows/runs/7698020976?check_suite_focus=true

    • link: https://github.com/argoproj/argo-workflows/runs/7736452278?check_suite_focus=true

  • test-cli, mysql

    • test: TestCLISuite/TestNodeSuspendResume

    • what happened: timeout after 1m at WaitForWorkflow()

    • link: https://github.com/argoproj/argo-workflows/runs/7365382562?check_suite_focus=true

    • test: TestCLISuite/TestWorkflowRetry

    • what happened: failure at: assert.True(t, retryTime.Before(&innerStepsPodNode.FinishedAt)), line 866

    • link: https://github.com/argoproj/argo-workflows/runs/7434857876?check_suite_focus=true

  • test-executor, minimal

    • test: N/A

    • what happened: no test ever got run; timed out after 24m in the "actions/cache@v3" step

    • link: https://github.com/argoproj/argo-workflows/runs/7435503575?check_suite_focus=true

    • official issue - https://github.com/actions/cache/issues/810

    • test: SignalsSuite/TestStopBehavior

    • what happened: signals_test.go:34: timeout after 1m40s waiting for condition

    • link: https://github.com/argoproj/argo-workflows/runs/7459797029?check_suite_focus=true

  • test-examples, minimal

    • test: examples/arguments-parameters-from-configmap.yaml

    • what happened: error: timed out waiting for the condition on workflows/conditional-artifacts-svhsv

    • link: https://github.com/argoproj/argo-workflows/runs/7643995231?check_suite_focus=true

  • test-api, example

    • step: "make wait"

    • what happened: "the action has timed out"

    • link: https://github.com/argoproj/argo-workflows/runs/7660935659?check_suite_focus=true

juliev0 avatar Jun 22 '22 21:06 juliev0

I have experienced this recently, so here's a data point.

The timeout seems to be set at around 15m into the "Run make test-functional E2E_TIMEOUT=1m STATIC_FILES=false" step. When the step passes, it seems to have run for only slightly less time (about a 2-second difference).

It may or may not be significant, but the whole CI job that passed finished 30s under the 20m mark, whereas the run that failed was 3s over 20m.

ezk84 avatar Jun 24 '22 10:06 ezk84

@ezk84 Thanks for the data point!

juliev0 avatar Jun 24 '22 17:06 juliev0

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jul 10 '22 06:07 stale[bot]

still hoping to address this when I have time if nobody else does, so keeping it alive

juliev0 avatar Jul 10 '22 14:07 juliev0

As an update, the overall GitHub Action timeout for each e2e test was increased from 20m to 25m today, and the timeout passed into "go test" for the Test Suite run by the Action was increased from 15m to 20m (this PR). This should take care of some of the failures, although ultimately we need to address the issue of why the build is so slow.

Also, it appears that individual tests can sometimes time out as well (like this one).

juliev0 avatar Jul 16 '22 04:07 juliev0

Here is one more test case: https://github.com/argoproj/argo-workflows/runs/7816710269?check_suite_focus=true

FAIL: TestCLISuite/TestLogProblems (27.56s)
=== RUN TestCLISuite/TestLogProblems
Submitting workflow log-problems-
Waiting 1m0s for workflow metadata.name=log-problems-4d4nk
? log-problems-4d4nk Workflow 0s

● log-problems-4d4nk Workflow 0s
└ ● [0] StepGroup 0s
└ ● log-problems-4d4nk Steps 0s
└ ◷ report-1 Pod 0s

Condition "to start" met after 0s
../../dist/argo -n argo logs @latest --follow

sarabala1979 avatar Aug 15 '22 14:08 sarabala1979

Going to take a look at each of these test cases and see if there is any common cause or otherwise.

TestParametrizableAds should have been addressed in https://github.com/argoproj/argo-workflows/commit/57bac335afac2c28a4eb5ccf1fa97bb5bba63e97 with an increase in time for WaitForWorkflow()

dpadhiar avatar Sep 07 '22 23:09 dpadhiar

> TestParametrizableAds should have been addressed in https://github.com/argoproj/argo-workflows/commit/57bac335afac2c28a4eb5ccf1fa97bb5bba63e97 with an increase in time for WaitForWorkflow()

Hmm, but it looks like that commit occurred on 7/11, while the test occurred on 7/13 so unfortunately I don't think that fixed it, right?

juliev0 avatar Sep 08 '22 04:09 juliev0

> > TestParametrizableAds should have been addressed in 57bac33 with an increase in time for WaitForWorkflow()

> Hmm, but it looks like that commit occurred on 7/11, while the test occurred on 7/13 so unfortunately I don't think that fixed it, right?

That looks correct unfortunately. Will have to investigate this test case again.

dpadhiar avatar Sep 08 '22 16:09 dpadhiar

I started a document (accessible by anyone at Intuit) which starts to go into some root causes.

juliev0 avatar Sep 30 '22 22:09 juliev0

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Oct 29 '22 05:10 stale[bot]

Not stale

juliev0 avatar Oct 29 '22 17:10 juliev0

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Nov 13 '22 04:11 stale[bot]

not stale

juliev0 avatar Nov 13 '22 15:11 juliev0

Yeah, this is not stale. What is? Also: https://drewdevault.com/2021/10/26/stalebot.html

scravy avatar Nov 14 '22 01:11 scravy

TestArtifactGC is apparently flaky. If anybody sees it fail, please include a link to the CI run.

juliev0 avatar Dec 14 '22 21:12 juliev0

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Dec 31 '22 22:12 stale[bot]

Not stale

juliev0 avatar Jan 01 '23 01:01 juliev0

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jan 21 '23 20:01 stale[bot]

Not stale

juliev0 avatar Jan 21 '23 23:01 juliev0

Not stale

vosferatu avatar Jun 15 '23 07:06 vosferatu