[BUG] Nightly CI issue: CUDA 11.4 jobs were running with CUDA 11.8 when nccl wasn't available

Open dantegd opened this issue 1 year ago • 0 comments

NCCL 2.22.3.1 in conda-forge was not available for CUDA < 11.8 until yesterday, so all CUDA 11.4 jobs in cuML's CI failed until today. But RAFT's CUDA 11.4 CI was passing regardless (which confused me for a while).

Checking the jobs, I found they were installing cuda-version 11.8 and the corresponding packages. For example, the following snippet from a CUDA 11.4 job log shows the issue when installing the downloaded artifacts:

  Upgrade:
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

  - cuda-version                              11.4  hfb901f2_3                       conda-forge             Cached
  + cuda-version                              11.8  h70ddcb2_3                       conda-forge               21kB
  - cudatoolkit                             11.4.3  h39f8164_13                      conda-forge             Cached
  + cudatoolkit                             11.8.0  h4ba93d1_13                      conda-forge              716MB

This should not be happening on CUDA 11.4 jobs, of course. I think this shouldn't be an issue with nccl anymore, but any other package could cause a similar situation. That could make things fail silently in the future and catch us by surprise, defeating the point of having 11.4 jobs in nightly CI.
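One way to catch this kind of silent upgrade might be to scan the install log for CUDA packages that are being added at a version other than the one the job is meant to test. The sketch below is only an illustration, not what RAFT's CI does: the function name `find_cuda_mismatches` is hypothetical, and it assumes the transaction table keeps the `+ name version build channel size` layout shown in the log excerpt above.

```python
import re

# Matches transaction rows such as:
#   "+ cuda-version    11.8  h70ddcb2_3   conda-forge   21kB"
# Capture groups: sign (+ or -), package name, version.
_ROW = re.compile(r"^\s*([+-])\s+(\S+)\s+(\S+)")


def find_cuda_mismatches(log_text, expected,
                         packages=("cuda-version", "cudatoolkit")):
    """Return (package, version) pairs being installed ("+" rows) whose
    major.minor version differs from `expected`, e.g. "11.4".

    Hypothetical helper: the row layout is assumed from the log excerpt.
    """
    mismatches = []
    for line in log_text.splitlines():
        match = _ROW.match(line)
        if match is None:
            continue  # headers, separators, blank lines
        sign, name, version = match.groups()
        if sign != "+" or name not in packages:
            continue  # only check CUDA packages that are being added
        major_minor = ".".join(version.split(".")[:2])
        if major_minor != expected:
            mismatches.append((name, version))
    return mismatches
```

In a CUDA 11.4 job, a non-empty result from `find_cuda_mismatches(log_text, "11.4")` could be turned into a hard failure, so an unexpected toolkit upgrade stops the job instead of passing unnoticed.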

dantegd, Jul 30 '24 19:07