argo-workflows
Flakey tests
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with :latest
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
Unit tests failed: https://github.com/argoproj/argo-workflows/actions/runs/4592799925/jobs/8110154390
Version
latest
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
See recent CI builds
Logs from the workflow controller
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
Logs from your workflow's wait container
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
This is caused by #10768. Until kit is fixed or reverted, this will continue to happen.
From your unit test logs:
curl -q https://raw.githubusercontent.com/kitproj/kit/main/install.sh | sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 510 100 510 0 0 990 0 --:--:-- --:--:-- --:--:-- 992
+ curl --retry 99 -vfsL https://api.github.com/repos/kitproj/kit/releases/latest -o /tmp/latest
* Trying 192.30.255.116:443...
* Connected to api.github.com (192.30.255.116) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* CAfile: /etc/ssl/certs/ca-certificates.crt
...
< date: Mon, 03 Apr 2023 04:15:11 GMT
< server: Varnish
< strict-transport-security: max-age=31536000; includeSubdomains; preload
< x-content-type-options: nosniff
< x-frame-options: deny
< x-xss-protection: 1; mode=block
< content-security-policy: default-src 'none'; style-src 'unsafe-inline'
< access-control-allow-origin: *
< access-control-expose-headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
< content-type: application/json; charset=utf-8
< referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
< x-github-media-type: github.v3; format=json
< x-ratelimit-limit: 60
< x-ratelimit-remaining: 0
< x-ratelimit-reset: 1680495768
< x-ratelimit-resource: core
< x-ratelimit-used: 60
< content-length: 278
< x-github-request-id: F380:89D7:380A122:3A25002:642A52CF
* The requested URL returned error: 403
* stopped the pause stream!
* Connection #0 to host api.github.com left intact
make: *** [Makefile:464: kit] Error 22
Error: Process completed with exit code 2.
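The 403 above is GitHub's unauthenticated API rate limit being hit (`x-ratelimit-limit: 60`, `x-ratelimit-remaining: 0`): unauthenticated requests are limited per IP, and Actions runners share IPs. A minimal sketch of a workaround, assuming a `GITHUB_TOKEN` is available to the job (this is not the actual install.sh or Makefile fix; the command is printed rather than executed so the sketch stays network-free):

```shell
# Sketch only: build an authenticated API request when GITHUB_TOKEN is set,
# which moves the request from the per-IP unauthenticated quota to a
# per-token quota. Prints the command instead of running it.
github_api_cmd() {
  url=$1
  if [ -n "${GITHUB_TOKEN:-}" ]; then
    echo "curl --retry 99 -fsL -H 'Authorization: Bearer ${GITHUB_TOKEN}' ${url}"
  else
    echo "curl --retry 99 -fsL ${url}"
  fi
}
github_api_cmd https://api.github.com/repos/kitproj/kit/releases/latest
```

Pinning kit to a known release tag instead of querying `/releases/latest` would avoid the API call entirely.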
Here's the log: https://pipelines.actions.githubusercontent.com/serviceHosts/49efa180-38ba-4f73-8389-f407aa841894/_apis/pipelines/1/runs/34951/signedlogcontent/2?urlExpires=2023-04-04T12%3A52%3A27.4837908Z&urlSigningMethod=HMACV1&urlSignature=7YLqPxOPpolP4cnkttw6DW97ZgvYY6ui%2Fs438NRwSMY%3D
https://github.com/argoproj/argo-workflows/actions/runs/4592799925/jobs/8110154390
It's not related to kit. Are you looking somewhere else?
I was looking at the failed e2e test, apologies.
- [x] https://github.com/argoproj/argo-workflows/pull/11064
Also flaky e2e test test-executor https://github.com/argoproj/argo-workflows/actions/runs/4608012677/jobs/8143253843:
Condition "to have running pod" met after 6s
Waiting 1m40s for workflow metadata.name=stop-terminate-dktw8
● stop-terminate-dktw8 Workflow 0s
└ ● stop-terminate-dktw8 DAG 0s
└ ● A Pod 0s
signals_test.go:78: timeout after 1m40s waiting for condition
Checking expectation stop-terminate-dktw8
stop-terminate-dktw8 : Running
signals_test.go:81:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:81
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:80
Error: []v1alpha1.WorkflowPhase{"Failed", "Error"} does not contain "Running"
Test: TestSignalsSuite/TestTerminateBehavior
signals_test.go:84:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:84
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:80
Error: []v1alpha1.NodePhase{"Failed", "Error"} does not contain "Running"
Test: TestSignalsSuite/TestTerminateBehavior
=== FAIL: SignalsSuite/TestTerminateBehavior
FAIL github.com/argoproj/argo-workflows/v3/test/e2e 535.374s
- [ ] TODO
Flaky test-cli https://github.com/argoproj/argo-workflows/actions/runs/4639608340/jobs/8210973276
../../dist/argo -n argo resume @latest --node-field-selector inputs.parameters.tag.value=suspend1-tag1
exit status 1
2023/04/07 16:47:26 Failed to resume @latest: rpc error: code = Internal desc = currently, set only targets suspend nodes: no suspend nodes matching nodeFieldSelector: inputs.parameters.tag.value=suspend1-tag1
cli_test.go:531:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:531
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:450
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:459
/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:530
Error: Received unexpected error:
exit status 1
Test: TestCLISuite/TestNodeSuspendResume
Waiting 1m0s for workflow metadata.name=node-suspend-q4m95
● node-suspend-q4m95 Workflow 0s
└ ● node-suspend-q4m95 Steps 0s
└ ✔ step1 Pod 14s
└ ✔ [0] StepGroup 18s
└ ● [1] StepGroup 0s
└ ● suspend1 Suspend 0s
Condition "suspended node" met after 0s
../../dist/argo -n argo stop @latest --node-field-selector inputs.parameters.tag.value=suspend2-tag1 --message because
exit status 1
time="2023-04-07T16:47:26.732Z" level=fatal msg="rpc error: code = Internal desc = currently, set only targets suspend nodes: no suspend nodes matching nodeFieldSelector: inputs.parameters.tag.value=suspend2-tag1"
cli_test.go:539:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:539
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:450
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:459
/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:538
Error: Received unexpected error:
exit status 1
Test: TestCLISuite/TestNodeSuspendResume
Waiting 1m0s for workflow metadata.name=node-suspend-q4m95
● node-suspend-q4m95 Workflow 0s
└ ✔ step1 Pod 14s
└ ✔ [0] StepGroup 18s
└ ● node-suspend-q4m95 Steps 0s
└ ● suspend1 Suspend 0s
└ ● [1] StepGroup 0s
cli_test.go:543: timeout after 1m0s waiting for condition
Checking expectation node-suspend-q4m95
node-suspend-q4m95 : Running
cli_test.go:546:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:546
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:545
Error: Expect "" to match "child 'node-suspend-.*' failed"
Test: TestCLISuite/TestNodeSuspendResume
- [ ] TODO
test-cli: https://github.com/argoproj/argo-workflows/actions/runs/4663229967/jobs/8254446014
../../dist/argo -n argo retry retry-with-recreated-pvc
exit status 1
time="2023-04-11T02:56:07.436Z" level=fatal msg="rpc error: code = InvalidArgument desc = workflow must be Failed/Error to retry"
cli_test.go:901:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:901
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:265
/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:900
Error: Received unexpected error:
exit status 1
Test: TestCLISuite/TestWorkflowRetryWithRecreatedPVC
Messages: time="2023-04-11T02:56:07.436Z" level=fatal msg="rpc error: code = InvalidArgument desc = workflow must be Failed/Error to retry"
Waiting 1m0s for workflow metadata.name=retry-with-recreated-pvc
● retry-with-recreated-pvc Workflow 0s
└ ● retry-with-recreated-pvc Steps 0s
└ ● [0] StepGroup 0s
└ ◷ generate Pod 0s
cli_test.go:907: timeout after 1m0s waiting for condition
Checking expectation retry-with-recreated-pvc
retry-with-recreated-pvc : Running
suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference
goroutine 2333 [running]:
runtime/debug.Stack()
/opt/hostedtoolcache/go/1.19.7/x64/src/runtime/debug/stack.go:24 +0x65
github.com/stretchr/testify/suite.failOnPanic(0xc000901520, {0x19d8620, 0x2cc8ab0})
/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:87 +0x3b
github.com/stretchr/testify/suite.Run.func1.1()
/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:183 +0x252
panic({0x19d8620, 0x2cc8ab0})
/opt/hostedtoolcache/go/1.19.7/x64/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-workflows/v3/test/e2e.(*CLISuite).TestWorkflowRetryWithRecreatedPVC.func2(0x1ef6880?, 0xc00011a008?, 0xc0000ce328)
/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:910 +0x3a
github.com/argoproj/argo-workflows/v3/test/e2e/fixtures.(*Then).expectWorkflow(0xc00089f448, {0xc00080e3a8, 0x18}, 0x1d4d648)
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68 +0x31f
github.com/argoproj/argo-workflows/v3/test/e2e/fixtures.(*Then).ExpectWorkflow(0xc00089f448, 0xc00089f418?)
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43 +0x4f
github.com/argoproj/argo-workflows/v3/test/e2e.(*CLISuite).TestWorkflowRetryWithRecreatedPVC(0x0?)
/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:909 +0x508
reflect.Value.call({0xc0003dec60?, 0xc00033b458?, 0xc0002b9c00?}, {0x1c3ed66, 0x4}, {0xc00089fe70, 0x1, 0x17cde05?})
/opt/hostedtoolcache/go/1.19.7/x64/src/reflect/value.go:584 +0x8c5
reflect.Value.Call({0xc0003dec60?, 0xc00033b458?, 0xc00021af00?}, {0xc00089fe70?, 0x7fa4e4912258?, 0xd0?})
/opt/hostedtoolcache/go/1.19.7/x64/src/reflect/value.go:368 +0xbc
github.com/stretchr/testify/suite.Run.func1(0xc000901520)
/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:197 +0x4b6
testing.tRunner(0xc000901520, 0xc0004bce10)
/opt/hostedtoolcache/go/1.19.7/x64/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
/opt/hostedtoolcache/go/1.19.7/x64/src/testing/testing.go:1493 +0x35f
- [x] Hooks tests are failing: https://github.com/argoproj/argo-workflows/actions/runs/4670044854/jobs/8269390253?pr=10879
=== PASS: HooksSuite/TestTemplateLevelHooksStepSuccessVersion
suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference
goroutine 627 [running]:
runtime/debug.Stack()
/opt/hostedtoolcache/go/1.19.8/x64/src/runtime/debug/stack.go:24 +0x65
github.com/stretchr/testify/suite.failOnPanic(0xc000701040, {0x1a5f9a0, 0x2dfeab0})
/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:87 +0x3b
github.com/stretchr/testify/suite.Run.func1.1()
/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:183 +0x252
panic({0x1a5f9a0, 0x2dfeab0})
/opt/hostedtoolcache/go/1.19.8/x64/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-workflows/v3/test/e2e.(*HooksSuite).TestTemplateLevelHooksStepSuccessVersion.func9(0x1f8f620?, 0xc00011a008?, 0xc0004b4f40?)
/home/runner/work/argo-workflows/argo-workflows/test/e2e/hooks_test.go:168 +0x19
github.com/argoproj/argo-workflows/v3/test/e2e/fixtures.(*Then).ExpectWorkflowNode.func1(0x1f8f620?, 0xc0007a8da0, 0xc0004b54b0?)
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:110 +0x4c9
github.com/argoproj/argo-workflows/v3/test/e2e/fixtures.(*Then).expectWorkflow(0xc0004b5580, {0xc0007fe3e0, 0x1f}, 0xc0004b5520)
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68 +0x31f
github.com/argoproj/argo-workflows/v3/test/e2e/fixtures.(*Then).ExpectWorkflowNode(0xc000347580?, 0xc000347570?, 0x1?)
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:85 +0x51
github.com/argoproj/argo-workflows/v3/test/e2e.(*HooksSuite).TestTemplateLevelHooksStepSuccessVersion(0x0?)
/home/runner/work/argo-workflows/argo-workflows/test/e2e/hooks_test.go:165 +0x325
reflect.Value.call({0xc0000a6fc0?, 0xc00011beb0?, 0x46bdf9?}, {0x1cd5aca, 0x4}, {0xc000347e70, 0x1, 0xc000666580?})
/opt/hostedtoolcache/go/1.19.8/x64/src/reflect/value.go:584 +0x8c5
reflect.Value.Call({0xc0000a6fc0?, 0xc00011beb0?, 0xc0004f6d00?}, {0xc000347e70?, 0x7f1cd42bdd00?, 0xd0?})
/opt/hostedtoolcache/go/1.19.8/x64/src/reflect/value.go:368 +0xbc
github.com/stretchr/testify/suite.Run.func1(0xc000701040)
/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:197 +0x4b6
testing.tRunner(0xc000701040, 0xc000778240)
/opt/hostedtoolcache/go/1.19.8/x64/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
/opt/hostedtoolcache/go/1.19.8/x64/src/testing/testing.go:1493 +0x35f
=== SLOW TEST: HooksSuite/TestTemplateLevelHooksDagFailVersion took 16s
=== SLOW TEST: HooksSuite/TestTemplateLevelHooksDagSuccessVersion took 30s
=== SLOW TEST: HooksSuite/TestTemplateLevelHooksStepFailVersion took 18s
=== SLOW TEST: HooksSuite/TestTemplateLevelHooksStepSuccessVersion took 41s
=== CONT TestHooksSuite
e2e_suite.go:86: to learn how to diagnose failed tests: https://argoproj.github.io/argo-workflows/running-locally/#running-e2e-tests-locally
--- FAIL: TestHooksSuite (107.05s)
--- PASS: TestHooksSuite/TestTemplateLevelHooksDagFailVersion (16.67s)
--- PASS: TestHooksSuite/TestTemplateLevelHooksDagSuccessVersion (30.26s)
--- PASS: TestHooksSuite/TestTemplateLevelHooksStepFailVersion (18.62s)
--- FAIL: TestHooksSuite/TestTemplateLevelHooksStepSuccessVersion (41.50s)
FAIL
FAIL github.com/argoproj/argo-workflows/v3/test/e2e 235.744s
FAIL
@GeunSam2 Would you like to take a look at this one? It seems pretty consistent. Another example https://github.com/argoproj/argo-workflows/actions/runs/4670440612/jobs/8270253757
Okay, I'll look into why the hooks tests are failing.
- [ ] TODO
../../dist/argo -n argo stop @latest --node-field-selector inputs.parameters.tag.value=suspend2-tag1 --message because
exit status 1
time="2023-04-15T12:49:21.333Z" level=fatal msg="rpc error: code = Internal desc = currently, set only targets suspend nodes: no suspend nodes matching nodeFieldSelector: inputs.parameters.tag.value=suspend2-tag1"
cli_test.go:539:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:539
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:450
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/when.go:459
/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:538
Error: Received unexpected error:
exit status 1
Test: TestCLISuite/TestNodeSuspendResume
Waiting 1m0s for workflow metadata.name=node-suspend-l88wm
● node-suspend-l88wm Workflow 0s
└ ● node-suspend-l88wm Steps 0s
└ ✔ step1 Pod 6s
└ ✔ [0] StepGroup 13s
└ ● suspend1 Suspend 0s
└ ● [1] StepGroup 0s
cli_test.go:543: timeout after 1m0s waiting for condition
Checking expectation node-suspend-l88wm
node-suspend-l88wm : Running
cli_test.go:546:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:546
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:545
Error: Expect "" to match "child 'node-suspend-.*' failed"
Test: TestCLISuite/TestNodeSuspendResume
- [x] Fixed in https://github.com/argoproj/argo-workflows/pull/11056
--- FAIL: Test_createSecretVolumesFromArtifactLocations_SSECUsed (0.01s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x118 pc=0x1c983f3]
goroutine 22671 [running]:
testing.tRunner.func1.2({0x1ee10a0, 0x356a460})
/opt/hostedtoolcache/go/1.20.3/x64/src/testing/testing.go:1526 +0x24e
testing.tRunner.func1()
/opt/hostedtoolcache/go/1.20.3/x64/src/testing/testing.go:1529 +0x39f
panic({0x1ee10a0, 0x356a460})
/opt/hostedtoolcache/go/1.20.3/x64/src/runtime/panic.go:884 +0x213
github.com/argoproj/argo-workflows/v3/workflow/controller.Test_createSecretVolumesFromArtifactLocations_SSECUsed(0xc002a8c640?)
/home/runner/work/argo-workflows/argo-workflows/workflow/controller/workflowpod_test.go:1250 +0x773
testing.tRunner(0xc001d3d860, 0x231d1f0)
/opt/hostedtoolcache/go/1.20.3/x64/src/testing/testing.go:1576 +0x10b
created by testing.(*T).Run
/opt/hostedtoolcache/go/1.20.3/x64/src/testing/testing.go:1629 +0x3ea
FAIL github.com/argoproj/argo-workflows/v3/workflow/controller 30.340s
- [x] https://github.com/argoproj/argo-workflows/pull/11060
functional_test.go:736:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/functional_test.go:736
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
/home/runner/work/argo-workflows/argo-workflows/test/e2e/functional_test.go:735
Error: Not equal:
expected: "Failed"
actual : "Running"
Diff:
--- Expected
+++ Actual
@@ -1,2 +1,2 @@
-(v1alpha1.WorkflowPhase) (len=6) "Failed"
+(v1alpha1.WorkflowPhase) (len=7) "Running"
Test: TestFunctionalSuite/TestParametrizableAds
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
Keep open
- [x] https://github.com/argoproj/argo-workflows/pull/11346
hooks_test.go:419:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/hooks_test.go:419
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
/home/runner/work/argo-workflows/argo-workflows/test/e2e/hooks_test.go:417
Error: Not equal:
expected: "1/1"
actual : "2/2"
Diff:
--- Expected
+++ Actual
@@ -1,2 +1,2 @@
-(v1alpha1.Progress) (len=3) "1/1"
+(v1alpha1.Progress) (len=3) "2/2"
Test: TestHooksSuite/TestTemplateLevelHooksWaitForTriggeredHook
- [x] https://github.com/argoproj/argo-workflows/pull/11378
cli_test.go:360:
Error Trace: /home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:360
/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:265
/home/runner/work/argo-workflows/argo-workflows/test/e2e/cli_test.go:357
Error: "[log-problems-n4r8n-report-4285119875: one log-problems-n4r8n-report-3816278739: three log-problems-n4r8n-report-4205137589: four log-problems-n4r8n-report-803032099: five]" should have 5 item(s), but has 4
Test: TestCLISuite/TestLogProblems
Another one: https://github.com/argoproj/argo-workflows/pull/11384
Hooks tests are very flaky. Disabled them for now. Need to investigate potential bugs:
- [ ] https://github.com/argoproj/argo-workflows/pull/11406
- [ ] https://github.com/argoproj/argo-workflows/pull/11384
cc @toyamagu-2021 Would you like to help us debug these since you added these tests? (after you wrap up with the UI issues)
Yes, of course. At first glance, it might be because the workflow is marked as completed before the running hook is triggered. (Truly an edge case?) I will try to address that later.
Adding TestTemplateLevelHooksDagSuccessVersion from https://github.com/argoproj/argo-workflows/pull/10307#issuecomment-1720028813 to the list. It was mentioned above too, but only 1/2 failing tests in that comment were fixed. That one may actually be a bug (as it's getting a nil pointer, not just a failed test), not just a flake, not sure.
Also this is technically a duplicate issue of #9027. They've got different flakes listed in each, but could consolidate into one issue
I think we should really prevent these flaky tests from being merged in the first place. @terrytangyuan and @agilgur5, what are your opinions on running the test suite in parallel 10 (or so) times and only allowing merging when the tests pass in all runs? If we can launch the jobs in parallel, we shouldn't suffer any wait time increases.
We probably would need to pay for the extra compute, but I suspect it'd be cheaper than the person hours that go into dealing with flakey tests.
If we can launch the jobs in parallel, we shouldn't suffer any wait time increases.
Ostensibly yes, but the average wait time would increase since some jobs queue longer than others and some wait on network longer etc. This would probably put us over the limit of parallel jobs more frequently, causing more queueing as well
@terrytangyuan and @agilgur5 what are your opinions on running the test suite in parallel 10 (or so) times and only allowing merging when for all runs the tests passed?
I don't think this would actually help solve the problem. We'd be taking somewhat inaccurate flakey tests, usually caused by race conditions, and applying an even more inaccurate approach: "run all tests more times".
Most PRs don't even change the tests much, if at all, but they will fail more often with a change like this.
but I suspect it'd be cheaper than the person hours that go into dealing with flakey tests.
Which would cause a lot of hours of investigation or confusion due to the existing flakes that were not caused by new code. That's the current biggest issue, and this would increase it.
I think we should be more precise in our approach.
So if we wanted to take an approach like this I would recommend one of:
- only running the new tests from a PR several times -- EDIT: this is not quite correct, see below
- run flake detection on a schedule, e.g. nightly or weekly
  - we'd also want to run with `go test -race`
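A minimal sketch of what such a scheduled flake-detection job could do (the run count, and invoking the suites via `go test -race -count=1`, are assumptions for illustration):

```shell
# Hypothetical nightly flake hunt: run a test command N times and report
# how often it fails. With go test, -count=1 would disable result caching
# so every iteration actually re-runs the tests.
flake_hunt() {
  runs=$1; shift
  fails=0
  i=1
  while [ "$i" -le "$runs" ]; do
    "$@" || fails=$((fails + 1))
    i=$((i + 1))
  done
  echo "$fails/$runs runs failed"
}

# Assumed invocation: flake_hunt 10 go test -race -count=1 ./test/e2e/...
# Demo with a command that always succeeds:
flake_hunt 3 true   # prints: 0/3 runs failed
```

Any nonzero failure count on an unchanged main branch would flag a flake without blaming an unrelated PR.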
Ostensibly yes, but the average wait time would increase since some jobs queue longer than others and some wait on network longer etc. This would probably put us over the limit of parallel jobs more frequently, causing more queueing as well
I presume we'd be paying for more capacity here, but I can't see the time increasing by that much; sure, some pipelines will take a bit longer, but that's fine as long as wait times stay generally similar to what we have now.
I don't think this would actually help solve the problem. We're taking somewhat inaccurate flakey tests usually caused by race conditions and taking an even more inaccurate approach to it of "run all tests more times".
I see where you are coming from; I kind of elided the fact that when we implement this, we should have no more flakey tests.
Most PRs don't even change the tests much, if at all, but they will fail more often with a change like this.
only running the new tests from a PR several times
This is effectively what I am saying, I suppose, but to be more precise my suggestion is: a) fix all current flakey tests and do not accept any more PRs that introduce tests (delay feature PRs as well); b) then run the tests from each new PR several times. Perhaps it is enough to only do this when new tests are introduced?
Some kind of flake test detection would be nice to have as well.
I presume we are paying more capacity here
as long as the wait times generally are similar to what we have now.
As far as I know, we're not currently paying anything and are on the free plan. There are concurrency limits that apply per plan (and I believe they apply to the entire GH org, not per repo). If we run 140+ (10 * (13 E2Es + 1 unit tests)) more jobs per run, we will almost certainly hit that limit, which will cause queueing, i.e. some parallel jobs will end up running sequentially, which will definitely increase wait times. It also may increase wait times across the argoproj GH org.
I kind of elided the fact that when we implement this, we should have no more flakey tests.
Tall ask -- will this ever be true? 😅
perhaps it is enough to only do this when new tests are introduced?
I wrote to only run the new tests themselves multiple times. But actually, rethinking this, neither of these would be correct; a source code change can cause a test flake in an existing test. E.g. a new unhandled race was introduced. That exact scenario has happened multiple times already
I'm still thinking a nightly or weekly job would make more sense than on each PR.