argo-workflows
argo-workflows copied to clipboard
fix: test if e2e tests are failing due to flakiness
Fixes #TODO
Motivation
Modifications
Verification
Rebased the branch to an earlier version of main
before #12736 was merged to see if that's the cause.
See also Slack thread
CI is still failing with the same errors, so my next guess would be #12741 , although that one's pretty small & targeted
And it's still failing with the same error on a version of main
prior to #12741 being merged. I'm even more confused now....
Hmm I thought it was https://github.com/argoproj/argo-workflows/pull/12736 as well due to a git blame I came upon.
I suppose the quickest way to find out what the first offending commit is to git bisect.
That uh assumes knowing a "good" commit, and I'm not sure which one is good 😬
Alan and I were thinking on Slack that it could potentially be an unpinned dep somewhere, though I couldn't pinpoint one that added up in the manifests. I was looking at the debug logs and wondering if the tests might be pulling argocli:latest
from the web instead of the locally built one
The last rebase I tried was on top of / after #12972, and it's still failing with the same errors
Ok rebased once more to after #12917 annnd still same error. I chose that one as it was what #12926 was based on, which just merged (i.e. passed E2E on its own branch).
So pretty sure it's not due to a commit at this point
Noting here that the failing API test does result in a nil pointer dereference error in the Server logs:
server: time="2024-05-05T02:31:39.703Z" level=info msg="Get artifact file" artifactName=local-artifact namespace=argo nodeId=wf-stopped-8j2gw-1075147760 uid=3b15ead1-612c-4820-8d62-e72018069605
server: 2024/05/05 02:31:39 http: panic serving [::1]:54866: runtime error: invalid memory address or nil pointer dereference
server: goroutine 295 [running]:
server: net/http.(*conn).serve.func1()
server: /opt/hostedtoolcache/go/1.21.9/x64/src/net/http/server.go:1868 +0xb9
server: panic({0x2994660?, 0x4b2cd40?})
server: /opt/hostedtoolcache/go/1.21.9/x64/src/runtime/panic.go:920 +0x270
server: github.com/argoproj/argo-workflows/v3/server/artifacts.(*ArtifactServer).GetArtifactFile(0xc000716960, {0x3303b00, 0xc0002d80e0}, 0xc000c3c000)
server: /home/runner/work/argo-workflows/argo-workflows/server/artifacts/artifact_server.go:141 +0x8b8
I've managed to produce ErrImageNeverPull: Container image "argoproj/argocli:latest" is not present with pull policy of Never
locally by adding a Never image pull policy for artgc-dag-workflow-stopper
which is used in artgc-dag-wf-self-delete
, one of the failing tests.
I can't see where the argocli image gets built and pushed during e2e, so I'll see if I can find a satisfactory fix for this.
This is fixed in #13018
Alan and I were thinking on Slack that it could potentially be an unpinned dep somewhere, though I couldn't pinpoint one that added up in the manifests. I was looking at the debug logs and wondering if the tests might be pulling
argocli:latest
from the web instead of the locally built one
I've managed to produce
ErrImageNeverPull: Container image "argoproj/argocli:latest" is not present with pull policy of Never
locally by adding a Never image pull policy forartgc-dag-workflow-stopper
which is used inartgc-dag-wf-self-delete
, one of the failing tests.
This is fixed in #13018
Thanks Alan for confirming my hypothesis and fixing the imagePullPolicy
so that we don't run into that issue again!
We should still pin all the other images in the test manifests too, to prevent a similar type of bug (or supply chain attack via tests)