argo-workflows icon indicating copy to clipboard operation
argo-workflows copied to clipboard

fix: test if e2e tests are failing due to flakiness

Open isubasinghe opened this issue 9 months ago • 8 comments

Fixes #TODO

Motivation

Modifications

Verification

isubasinghe avatar May 02 '24 06:05 isubasinghe

Rebased the branch to an earlier version of main before #12736 was merged to see if that's the cause. See also Slack thread

agilgur5 avatar May 04 '24 16:05 agilgur5

CI is still failing with the same errors, so my next guess would be #12741 , although that one's pretty small & targeted

agilgur5 avatar May 04 '24 16:05 agilgur5

And it's still failing with the same error on a version of main prior to #12741 being merged. I'm even more confused now....

agilgur5 avatar May 04 '24 16:05 agilgur5

Hmm I thought it was https://github.com/argoproj/argo-workflows/pull/12736 as well due to a git blame I came upon.

I suppose the quickest way to find out what the first offending commit is to git bisect.

isubasinghe avatar May 05 '24 00:05 isubasinghe

That uh assumes knowing a "good" commit, and I'm not sure which one is good 😬

Alan and I were thinking on Slack that it could potentially be an unpinned dep somewhere, though I couldn't pinpoint one that added up in the manifests. I was looking at the debug logs and wondering if the tests might be pulling argocli:latest from the web instead of the locally built one

agilgur5 avatar May 05 '24 02:05 agilgur5

The last rebase I tried was on top of / after #12972, and it's still failing with the same errors

agilgur5 avatar May 05 '24 02:05 agilgur5

Ok rebased once more to after #12917 annnd still same error. I chose that one as it was what #12926 was based on, which just merged (i.e. passed E2E on its own branch).

So pretty sure it's not due to a commit at this point

agilgur5 avatar May 05 '24 02:05 agilgur5

Noting here that the failing API test does result in a nil pointer dereference error in the Server logs:

server: time="2024-05-05T02:31:39.703Z" level=info msg="Get artifact file" artifactName=local-artifact namespace=argo nodeId=wf-stopped-8j2gw-1075147760 uid=3b15ead1-612c-4820-8d62-e72018069605
server: 2024/05/05 02:31:39 http: panic serving [::1]:54866: runtime error: invalid memory address or nil pointer dereference
server: goroutine 295 [running]:
server: net/http.(*conn).serve.func1()
server: 	/opt/hostedtoolcache/go/1.21.9/x64/src/net/http/server.go:1868 +0xb9
server: panic({0x2994660?, 0x4b2cd40?})
server: 	/opt/hostedtoolcache/go/1.21.9/x64/src/runtime/panic.go:920 +0x270
server: github.com/argoproj/argo-workflows/v3/server/artifacts.(*ArtifactServer).GetArtifactFile(0xc000716960, {0x3303b00, 0xc0002d80e0}, 0xc000c3c000)
server: 	/home/runner/work/argo-workflows/argo-workflows/server/artifacts/artifact_server.go:141 +0x8b8

agilgur5 avatar May 06 '24 15:05 agilgur5

I've managed to produce ErrImageNeverPull: Container image "argoproj/argocli:latest" is not present with pull policy of Never locally by adding a Never image pull policy for artgc-dag-workflow-stopper which is used in artgc-dag-wf-self-delete, one of the failing tests.

I can't see where the argocli image gets built and pushed during e2e, so I'll see if I can find a satisfactory fix for this.

Joibel avatar May 07 '24 21:05 Joibel

This is fixed in #13018

Joibel avatar May 08 '24 08:05 Joibel

Alan and I were thinking on Slack that it could potentially be an unpinned dep somewhere, though I couldn't pinpoint one that added up in the manifests. I was looking at the debug logs and wondering if the tests might be pulling argocli:latest from the web instead of the locally built one

I've managed to produce ErrImageNeverPull: Container image "argoproj/argocli:latest" is not present with pull policy of Never locally by adding a Never image pull policy for artgc-dag-workflow-stopper which is used in artgc-dag-wf-self-delete, one of the failing tests.

This is fixed in #13018

Thanks Alan for confirming my hypothesis and fixing the imagePullPolicy so that we don't run into that issue again!

We should still pin all the other images in the test manifests too, to prevent a similar type of bug (or supply chain attack via tests)

agilgur5 avatar May 08 '24 13:05 agilgur5