SCS.jl

GPU tests failing

Open odow opened this issue 1 year ago • 6 comments

cc @kalmarek

Some examples return NaN:

image

odow avatar Apr 18 '24 08:04 odow

This is the last build that doesn't produce NaNs: https://buildkite.com/julialang/scs-dot-jl/builds/281#018ccf8a-baeb-409b-9fe5-fbde2f42e4bc

And this is the first one that does: https://buildkite.com/julialang/scs-dot-jl/builds/283#018ea03d-8b93-40f3-8e1b-6e6a37dec3c8

But this CI run came just after a change to the README (and before enabling OpenMP). Smells like something in the CUDA toolchain?!

kalmarek avatar Apr 18 '24 09:04 kalmarek

There are quite a few version changes, so I'm not sure what the culprit is.

odow avatar Apr 18 '24 09:04 odow

The successful one uses

   Installed CUDA_Driver_jll ── v0.7.0+1
   Installed CUDA_Runtime_jll ─ v0.11.1+0
   Installed SCS_GPU_jll ────── v3.2.4+0

The failing one does

   Installed SCS_GPU_jll ────── v3.2.4+0
   Installed CUDA_Driver_jll ── v0.8.0+0
   Installed CUDA_Runtime_jll ─ v0.12.0+1

So this seems to be a problem with CUDA 12? @maleadt (sorry if you get too many pings)
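
The comparison above can also be done mechanically when the install logs are longer. Below is a minimal sketch in Python, assuming the `Installed <name> ─ v<version>` line shape seen in the two excerpts; the helper names `parse_installed` and `changed_versions` are made up for illustration and are not part of any tool used in this thread.

```python
import re

# Log excerpts copied from the two CI runs quoted above.
PASSING_LOG = """
   Installed CUDA_Driver_jll ── v0.7.0+1
   Installed CUDA_Runtime_jll ─ v0.11.1+0
   Installed SCS_GPU_jll ────── v3.2.4+0
"""
FAILING_LOG = """
   Installed SCS_GPU_jll ────── v3.2.4+0
   Installed CUDA_Driver_jll ── v0.8.0+0
   Installed CUDA_Runtime_jll ─ v0.12.0+1
"""

# Assumed line shape: "Installed <name> <optional padding> v<version>".
INSTALLED = re.compile(r"Installed\s+(\S+)\s+\S*\s*v(\d\S*)")


def parse_installed(log):
    """Map each package name to the version string found in the log."""
    return {m.group(1): m.group(2) for m in INSTALLED.finditer(log)}


def changed_versions(old_log, new_log):
    """Return {name: (old_version, new_version)} for packages that differ."""
    old, new = parse_installed(old_log), parse_installed(new_log)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


if __name__ == "__main__":
    for name, (before, after) in changed_versions(PASSING_LOG, FAILING_LOG).items():
        print(f"{name}: {before} -> {after}")
```

On these two excerpts only the CUDA driver and runtime packages differ; `SCS_GPU_jll` is the same in both, which is what points the suspicion at the CUDA side rather than at the SCS build.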

kalmarek avatar Jul 06 '24 09:07 kalmarek

Upgrading CUDA_Runtime_jll only updates the underlying CUDA toolkit. Maybe your package is incompatible with CUDA toolkit v12.4, as introduced by CUDA_Runtime_jll v0.12, or it needs a rebuild.

maleadt avatar Jul 08 '24 08:07 maleadt

@maleadt It seems that the newest SCS was already built against CUDA toolkit 12.4/5: https://buildkite.com/julialang/yggdrasil/builds/11739#01908495-78c0-45ae-8bf6-28205badd6b6

@bodono did you test SCS with CUDA 12? Some examples here run just fine (so I think we're interacting with the library correctly), but some end with a bunch of NaNs.

kalmarek avatar Jul 14 '24 16:07 kalmarek

Unfortunately, if CUDA 12 is newish then it's likely that I have never tested with it, since I no longer have access to a GPU machine. The GitHub Action I have for GPUs only compiles it.

bodono avatar Jul 15 '24 16:07 bodono