gpuCI broken
This has already been raised in https://github.com/dask/dask/pull/11242, but I always have difficulty finding that draft PR, and from what I can tell the failures are not related to a version update.
gpuCI has been pretty consistently failing for a while now.
Logs show something like the following (from https://github.com/dask/dask/pull/11310 / https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/6185/console):
15:05:47 GitHub pull request #11310 of commit 355b76fc0632708894cfc1c17ce55b80cef8bbbb, no merge conflicts.
15:05:47 Running as SYSTEM
15:05:47 Setting status of 355b76fc0632708894cfc1c17ce55b80cef8bbbb to PENDING with url https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/6185/ and message: 'Running'
15:05:47 Using context: gpuCI/dask/pr-builder
15:10:13 FATAL: java.io.IOException: Unexpected EOF
15:10:13 java.io.IOException: Unexpected EOF
15:10:13 at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:101)
15:10:13 at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
15:10:13 at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
15:10:13 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
15:10:13 Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to EC2 (aws-b) - runner-m5d2xl (i-00c57ce783f1c62db)
15:10:13 at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1787)
15:10:13 at hudson.remoting.Request.call(Request.java:199)
15:10:13 at hudson.remoting.Channel.call(Channel.java:1002)
15:10:13 at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1121)
15:10:13 at hudson.Launcher$ProcStarter.start(Launcher.java:506)
15:10:13 at hudson.Launcher$ProcStarter.join(Launcher.java:517)
15:10:13 at com.gpuopenanalytics.jenkins.remotedocker.AbstractDockerLauncher.parseVersion(AbstractDockerLauncher.java:193)
15:10:13 at com.gpuopenanalytics.jenkins.remotedocker.AbstractDockerLauncher.<init>(AbstractDockerLauncher.java:54)
15:10:13 at com.gpuopenanalytics.jenkins.remotedocker.DockerLauncher.<init>(DockerLauncher.java:54)
15:10:13 at com.gpuopenanalytics.jenkins.remotedocker.RemoteDockerBuildWrapper.decorateLauncher(RemoteDockerBuildWrapper.java:164)
15:10:13 at hudson.model.AbstractBuild$AbstractBuildExecution.createLauncher(AbstractBuild.java:613)
15:10:13 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:485)
15:10:13 at hudson.model.Run.execute(Run.java:1894)
15:10:13 at hudson.matrix.MatrixBuild.run(MatrixBuild.java:323)
15:10:13 at hudson.model.ResourceController.execute(ResourceController.java:101)
15:10:13 at hudson.model.Executor.run(Executor.java:442)
15:10:13 Caused: hudson.remoting.RequestAbortedException
15:10:13 at hudson.remoting.Request.abort(Request.java:346)
15:10:13 at hudson.remoting.Channel.terminate(Channel.java:1083)
15:10:13 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:90)
15:10:13 Setting status of 355b76fc0632708894cfc1c17ce55b80cef8bbbb to FAILURE with url https://gpuci.gpuopenanalytics.com/job/dask/job/dask/job/prb/job/dask-prb/6185/ and message: 'Build failure
15:10:13 '
15:10:13 Using context: gpuCI/dask/pr-builder
15:10:13 Finished: FAILURE
cc @dask/gpu
Thanks for raising an issue @fjetter - I'll definitely work on getting gpuCI back in a passing state today.
I'm not sure what is causing the build failure, but I do know some recent dask/array work has definitely broken cupy support.
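For context, this is the kind of cupy-backed dask.array usage that "cupy support" refers to - a minimal smoke-test sketch, assuming cupy and a working GPU are available (the test name and the specific operation are illustrative, not taken from the dask test suite):

```python
# Minimal sketch of a cupy-backed dask.array check (illustrative, not from
# the dask test suite). Assumes cupy and a working GPU are available.
import cupy
import dask.array as da


def test_cupy_backed_reduction():
    x = cupy.random.random((1000, 1000))
    d = da.from_array(x, chunks=(250, 250))
    result = (d + 1).sum(axis=0).compute()
    # The computation should stay on the GPU: the result must be a cupy
    # array, not a numpy array silently materialized on the host.
    assert isinstance(result, cupy.ndarray)
```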
I'm really sorry there has been so much unwanted gpuCI noise lately. It looks like gpuCI is now "fixed" in the sense that the pytests should all pass. However, the java.io.IOException described at the top of this issue does still happen intermittently for some reason.
We have not figured out how to fix this intermittent failure yet. However, if you do happen to see this failure in the wild, members of the dask org can re-run the gpuCI check (and only that check) by commenting "Rerun tests" (e.g. https://github.com/dask/dask/pull/11294#issuecomment-2307391540).
cc @fjetter @phofl @jrbourbeau @hendrikmakait (just to make sure you know about Rerun tests)
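For anyone who prefers to script it, here is a rough sketch of posting that "Rerun tests" comment through the GitHub REST API rather than the web UI. The GITHUB_TOKEN environment variable, the requests dependency, and the helper name are assumptions for illustration; the comment only retriggers gpuCI when it comes from a dask org member.

```python
# Sketch: post a "Rerun tests" comment on a dask/dask PR via the GitHub REST
# API. GITHUB_TOKEN, the requests dependency, and the function name are
# illustrative assumptions; gpuCI only honors comments from dask org members.
import os

import requests


def rerun_gpuci(pr_number: int) -> None:
    """Comment "Rerun tests" on a dask/dask pull request to retrigger gpuCI."""
    url = f"https://api.github.com/repos/dask/dask/issues/{pr_number}/comments"
    response = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": "Rerun tests"},
    )
    response.raise_for_status()


if __name__ == "__main__":
    rerun_gpuci(11294)  # the PR linked above
```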
@dask/gpu gpuCI appears to be broken again. One example is https://github.com/dask/dask/pull/11354, but there are other failures, and it looks quite intermittent. Looking at Jenkins, this almost feels like a gpuCI-internal problem.
Right, the failures are intermittent, and the check can always be re-run with a "Rerun tests" comment (it typically takes a few minutes for gpuCI to turn green after you make the comment).
Our ops team is currently working on a replacement for our Jenkins infrastructure. I'm sorry again for the noise.
How long will it take to replace the Jenkins infra? I currently feel gpuCI is not delivering a lot of value and is just noise. Would you mind if we disabled this until it is reliable again?
We are discussing this internally to figure out the best way to proceed, but I do have a strong preference to keep gpuCI turned on for now if you/others are willing.
Our team obviously finds gpuCI valuable, but I do understand why you would see things a different way. When gpuCI was actually broken a few weeks ago (not just flaky the way it is now), changes were merged into main that broke cupy support. In theory, gpuCI is a convenient way for contributors/maintainers to know right away if a new change is likely to break GPU compatibility.
The alternative is of course that we (RAPIDS) run our own nightly tests against main, and raise an issue when something breaks. In some cases, the fix will be simple. In others, the change could be a nightmare to roll back or fix. What would be an ideal developer experience on your end? I'm hoping we can work toward something that makes everyone "happy enough".
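As a rough sketch of what such a nightly check against main could look like (the operation list, the reporting, and the GPU/cupy setup are illustrative assumptions, not an existing RAPIDS job):

```python
# Rough sketch of a nightly cupy-compatibility check against dask main.
# The operations and reporting are illustrative assumptions, not an
# existing RAPIDS job. Assumes cupy and a working GPU are available.
import cupy
import dask.array as da


def check_cupy_support() -> list[str]:
    x = cupy.random.random((500, 500))
    d = da.from_array(x, chunks=(100, 100))
    operations = {
        "reduction": lambda a: a.sum(axis=0),
        "elementwise": lambda a: a * 2 + 1,
        "transpose": lambda a: a.T,
        "slicing": lambda a: a[::2, ::2],
    }
    failures = []
    for name, op in operations.items():
        try:
            result = op(d).compute()
            # A regression usually shows up as an exception or as the result
            # silently falling back to a host (numpy) array.
            if not isinstance(result, cupy.ndarray):
                failures.append(f"{name}: result is {type(result).__name__}, not cupy.ndarray")
        except Exception as exc:
            failures.append(f"{name}: raised {exc!r}")
    return failures


if __name__ == "__main__":
    problems = check_cupy_support()
    if problems:
        raise SystemExit("cupy support regressions:\n" + "\n".join(problems))
    print("cupy-backed dask.array checks passed")
```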
Roughly a year ago we proposed moving Dask to a GitHub Actions-based system for GPU CI in this issue: https://github.com/dask/community/issues/348
We didn't hear much from other maintainers there (admittedly, there could have been offline discussion I'm unaware of).
Perhaps it is worth reading that issue and sharing your thoughts on that approach? 🙂