
Rebuild GPU software for all supported combinations of CPU and CUDA compute capabilities

Open ocaisa opened this issue 8 months ago • 38 comments

We're not going to be able to test all possible CPU/GPU combinations, so we need a general approach that lets us move forward by testing only a subset while still providing builds for more combinations. This PR is intended to put this workflow in place and begin rebuilding all GPU packages to reflect the changes.

  • [ ] Target a specific set of CPU/compute-capability combinations
    • [ ] 7.0, 8.0, 9.0 for all CPU architectures
    • [ ] Device code built for a major compute capability (CC) will run on all minor versions of that major version (e.g., 8.0 device code on an 8.6-capable GPU), so allow for this fallback
    • [ ] Allow for a specific set of additional CPU/cc combinations for known hardware
    • [ ] Document these combinations in the EasyBuild hook and complain if they are not being respected
  • [ ] Modify EasyBuild hooks:
    • [ ] to reflect whether a module has been tested or not: check for an available device matching the target cc; if none exists, add an Lmod footer or module description explaining this
    • [ ] to fail when attempting to install a package that is not CUDA or has no CUDA dependency into an accel subdir (with advice about what to do)
    • [ ] to enforce compilation for device code (with ptx) by setting NVCC_PREPEND_FLAGS='-arch=sm_XX' for the build (this probably has fallout, so it should be considered a nice-to-have)
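
The fallback rule in the checklist above can be sketched as follows. This is a minimal illustration, not the EasyBuild hook itself: `SUPPORTED_CCS`, `parse_cc` and `select_build_cc` are hypothetical names, and the target list simply mirrors the 7.0, 8.0, 9.0 set named above.

```python
from typing import Optional, Tuple

# Assumed set of build targets, mirroring the 7.0/8.0/9.0 list in the checklist.
SUPPORTED_CCS = [(7, 0), (8, 0), (9, 0)]


def parse_cc(cc: str) -> Tuple[int, int]:
    """Turn a compute capability string such as '8.6' into (8, 6)."""
    major, minor = cc.split(".")
    return int(major), int(minor)


def select_build_cc(device_cc: str, supported=SUPPORTED_CCS) -> Optional[str]:
    """Pick the highest supported target with the same major version and a
    minor version not above the device's; return None if there is none."""
    major, minor = parse_cc(device_cc)
    candidates = [cc for cc in supported if cc[0] == major and cc[1] <= minor]
    if not candidates:
        return None
    best = max(candidates)
    return f"{best[0]}.{best[1]}"
```

Under these assumptions, an 8.6-capable GPU falls back to the 8.0 build, while a device with no supported target of the same major version (for example 6.1) gets no match.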

ocaisa avatar Mar 17 '25 16:03 ocaisa

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphirerapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot[bot] avatar Mar 17 '25 16:03 eessi-bot[bot]

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

eessi-bot[bot] avatar Mar 17 '25 16:03 eessi-bot[bot]

Instance trz42-GH200-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

eessi-bot-trz42[bot] avatar Mar 17 '25 16:03 eessi-bot-trz42[bot]

Instance rt-Grace-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75

ocaisa avatar Mar 17 '25 16:03 ocaisa

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • submitted job 50733, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730174706
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

    • submitted job 50734, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730175003
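
The "expanded format" lines above suggest a simple keyword expansion. The sketch below reproduces it under that assumption; `ALIASES` and `expand_build_command` are hypothetical names, and the bot's actual parser is not shown in this thread.

```python
# Short-to-long keyword mapping, read off the "expanded format" lines above.
ALIASES = {"repo": "repository", "arch": "architecture", "accel": "accelerator"}


def expand_build_command(command: str) -> str:
    """Expand short filter keywords in a 'build key:value ...' command.

    Keywords without a known alias (such as 'instance') are kept as-is.
    """
    action, *filters = command.split()
    expanded = []
    for item in filters:
        # Split on the first ':' only, so the value is preserved verbatim.
        key, _, value = item.partition(":")
        expanded.append(f"{ALIASES.get(key, key)}:{value}")
    return " ".join([action, *expanded])
```

Applied to the first command in this comment, the function returns the same string the bot reports as its expanded format.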

eessi-bot[bot] avatar Mar 17 '25 16:03 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar Mar 17 '25 16:03 eessi-bot[bot]

Updates by the bot instance rt-Grace-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

Updates by the bot instance trz42-GH200-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

eessi-bot-trz42[bot] avatar Mar 17 '25 16:03 eessi-bot-trz42[bot]

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50733

date | job status | comment
Mar 17 16:37:55 UTC 2025 | submitted | job id 50733 awaits release by job manager
Mar 17 16:38:56 UTC 2025 | released | job awaits launch by Slurm scheduler
Mar 17 16:45:02 UTC 2025 | running | job 50733 is running
Mar 17 17:27:54 UTC 2025 | finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-50733.out
:white_check_mark: no message matching FATAL:
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742231250.tar.gz
size: 0 MiB (3706 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 17:27:54 UTC 2025 | test result
:cry: FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
:white_check_mark: job output file slurm-50733.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
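
The "Details" checklist in this report reads like a set of pattern checks on the job output file. A minimal sketch of such a check is given below. It assumes the patterns are literal substrings and that a job passes only when every check passes; the bot's real matching rules are not shown in this thread, and all names here are hypothetical.

```python
import re

# Patterns copied from the "Details" list above; a match on any of these
# counts against the job.
BAD_PATTERNS = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
# Messages that must be present for the job to count as successful.
REQUIRED_PATTERNS = ["No missing installations", ".tar.gz created!"]


def check_build_log(text: str) -> dict:
    """Return {pattern: ok} for each check, where ok means the check passed."""
    results = {}
    for pattern in BAD_PATTERNS:
        # re.escape makes the pattern match literally (assumption).
        results[pattern] = re.search(re.escape(pattern), text) is None
    for pattern in REQUIRED_PATTERNS:
        results[pattern] = re.search(re.escape(pattern), text) is not None
    return results


def build_succeeded(text: str) -> bool:
    """A job passes only if every individual check passed."""
    return all(check_build_log(text).values())
```

With this logic, a log that contains `.tar.gz created!` still counts as a failure when it also contains `ERROR:` or lacks `No missing installations`, which matches the mix of check marks and crosses reported above.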

eessi-bot[bot] avatar Mar 17 '25 16:03 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc75 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50734

date | job status | comment
Mar 17 16:38:00 UTC 2025 | submitted | job id 50734 awaits release by job manager
Mar 17 16:38:54 UTC 2025 | released | job awaits launch by Slurm scheduler
Mar 17 16:43:59 UTC 2025 | running | job 50734 is running
Mar 17 16:48:08 UTC 2025 | finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-50734.out
:white_check_mark: no message matching FATAL:
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742229971.tar.gz
size: 0 MiB (3705 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 16:48:08 UTC 2025 | test result
:cry: FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
:white_check_mark: job output file slurm-50734.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar Mar 17 '25 16:03 eessi-bot[bot]

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75

ocaisa avatar Mar 17 '25 18:03 ocaisa

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • submitted job 50735, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730491195
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

    • submitted job 50736, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730491339

eessi-bot[bot] avatar Mar 17 '25 18:03 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar Mar 17 '25 18:03 eessi-bot[bot]

Updates by the bot instance rt-Grace-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

Updates by the bot instance trz42-GH200-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

eessi-bot-trz42[bot] avatar Mar 17 '25 18:03 eessi-bot-trz42[bot]

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50735

date | job status | comment
Mar 17 18:34:05 UTC 2025 | submitted | job id 50735 awaits release by job manager
Mar 17 18:35:04 UTC 2025 | released | job awaits launch by Slurm scheduler
Mar 17 18:41:09 UTC 2025 | running | job 50735 is running
Mar 17 18:44:16 UTC 2025 | finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-50735.out
:white_check_mark: no message matching FATAL:
:x: found message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742236924.tar.gz
size: 0 MiB (3704 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 18:44:16 UTC 2025 | test result
:cry: FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
:white_check_mark: job output file slurm-50735.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar Mar 17 '25 18:03 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc75 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50736

date | job status | comment
Mar 17 18:34:10 UTC 2025 | submitted | job id 50736 awaits release by job manager
Mar 17 18:35:02 UTC 2025 | released | job awaits launch by Slurm scheduler
Mar 17 18:41:07 UTC 2025 | running | job 50736 is running
Mar 17 18:44:14 UTC 2025 | finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-50736.out
:white_check_mark: no message matching FATAL:
:x: found message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742236952.tar.gz
size: 0 MiB (3704 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 18:44:14 UTC 2025 | test result
:cry: FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
:white_check_mark: job output file slurm-50736.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar Mar 17 '25 18:03 eessi-bot[bot]

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75

ocaisa avatar Mar 17 '25 19:03 ocaisa

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • submitted job 50737, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730557907
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

    • submitted job 50738, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730558084

eessi-bot[bot] avatar Mar 17 '25 19:03 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar Mar 17 '25 19:03 eessi-bot[bot]

Updates by the bot instance trz42-GH200-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

eessi-bot-trz42[bot] avatar Mar 17 '25 19:03 eessi-bot-trz42[bot]

Updates by the bot instance rt-Grace-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50737

date | job status | comment
Mar 17 19:03:59 UTC 2025 | submitted | job id 50737 awaits release by job manager
Mar 17 19:04:22 UTC 2025 | released | job awaits launch by Slurm scheduler
Mar 17 19:05:26 UTC 2025 | running | job 50737 is running
Mar 17 21:48:09 UTC 2025 | finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-50737.out
:white_check_mark: no message matching FATAL:
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742245284.tar.gz
size: 5342 MiB (5602332191 bytes)
entries: 16543
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
CUDA/12.1.1.lua
CUDA/12.4.0.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
ESPResSo/4.2.2-foss-2023a-CUDA-12.1.1.lua
LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1.lua
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0.lua
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
CUDA/12.1.1
CUDA/12.4.0
cuDNN/8.9.2.26-CUDA-12.1.1
ESPResSo/4.2.2-foss-2023a-CUDA-12.1.1
LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 21:48:09 UTC 2025 | test result
:cry: FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
:white_check_mark: job output file slurm-50737.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
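
The artefact listing above follows a regular layout: modules and software for an accelerator target live under an `accel/<vendor>/<cc>` subdirectory of the CPU target. The sketch below rebuilds those paths and the tarball name from their parts. The function names are hypothetical, and the layout is inferred only from the paths shown in this report, not from EESSI's own scripts.

```python
from pathlib import PurePosixPath


def accel_prefix(version: str, cpu_target: str, accel: str,
                 os_type: str = "linux") -> PurePosixPath:
    """Build the accelerator-specific prefix seen in the artefact listing."""
    return PurePosixPath(version, "software", os_type, cpu_target, "accel", accel)


def tarball_name(version: str, cpu_target: str, suffix: int,
                 os_type: str = "linux") -> str:
    """Build a tarball name in the pattern of the artefact names above.

    `suffix` is the trailing number in the file name (it looks like a
    Unix timestamp, but that is an assumption).
    """
    cpu = cpu_target.replace("/", "-")
    return f"eessi-{version}-software-{os_type}-{cpu}-{suffix}.tar.gz"


prefix = accel_prefix("2023.06", "x86_64/amd/zen2", "nvidia/cc80")
modules_dir = prefix / "modules" / "all"
software_dir = prefix / "software"
```

Note that the tarball name carries only the CPU target, so in this thread the cc80 and cc75 artefacts are told apart by the trailing number alone.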

eessi-bot[bot] avatar Mar 17 '25 19:03 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc75 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50738

date | job status | comment
Mar 17 19:04:04 UTC 2025 | submitted | job id 50738 awaits release by job manager
Mar 17 19:04:20 UTC 2025 | released | job awaits launch by Slurm scheduler
Mar 17 19:05:24 UTC 2025 | running | job 50738 is running
Mar 17 21:34:54 UTC 2025 | finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-50738.out
:white_check_mark: no message matching FATAL:
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742244793.tar.gz
size: 5453 MiB (5718189739 bytes)
entries: 16639
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/modules/all
CUDA/12.1.1.lua
CUDA/12.4.0.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
ESPResSo/4.2.2-foss-2023a-CUDA-12.1.1.lua
LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1.lua
NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1.lua
NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0.lua
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0.lua
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/software
CUDA/12.1.1
CUDA/12.4.0
cuDNN/8.9.2.26-CUDA-12.1.1
ESPResSo/4.2.2-foss-2023a-CUDA-12.1.1
LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1
NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1
NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 21:34:54 UTC 2025 | test result
:cry: FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
:white_check_mark: job output file slurm-50738.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar Mar 17 '25 19:03 eessi-bot[bot]

Not sure how this is possible, but EasyBuild is failing to apply the patch to the GROMACS sources:

== FAILED: Installation ended unsuccessfully (build directory: /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0): build failed (first 300 chars): Can't determine patch level for patch /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/EasyBuild/4.9.4/easybuild/easyconfigs/g/GROMACS/GROMACS-2023.1_set_omp_num_threads_env_for_ntomp_tests.patch from directory /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA- (took 12 mins 15 secs)

ocaisa avatar Mar 18 '25 13:03 ocaisa

I don't get it: if I do it manually, it works just fine.

ocaisa avatar Mar 18 '25 13:03 ocaisa

@ocaisa Can you clarify in the PR description why these rebuilds are necessary? What has changed that requires us to rebuild all of this?

boegel avatar Mar 20 '25 09:03 boegel

> Not sure how this is possible, but EasyBuild is failing to apply the patch to the GROMACS sources:
>
> == FAILED: Installation ended unsuccessfully (build directory: /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0): build failed (first 300 chars): Can't determine patch level for patch /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/EasyBuild/4.9.4/easybuild/easyconfigs/g/GROMACS/GROMACS-2023.1_set_omp_num_threads_env_for_ntomp_tests.patch from directory /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA- (took 12 mins 15 secs)

I think it's because the "start" dir is wrong somehow, it should be /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0/gromacs-2024.4 rather than /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0?

boegel avatar Mar 20 '25 09:03 boegel

== 2025-03-17 20:37:47,651 filetools.py:461 DEBUG Unpacking /project/def-users/bot/shared/easybuild/sources/g/GROMACS/gromacs-2024.4.tar.gz in directory /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0
...
== 2025-03-17 20:37:47,651 run.py:222 DEBUG run_cmd: running cmd tar xzf /project/def-users/bot/shared/easybuild/sources/g/GROMACS/gromacs-2024.4.tar.gz (in /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0)
...
== 2025-03-17 20:37:48,806 filetools.py:1373 DEBUG Last dir list ['gromacs-2024.4', 'easybuild_obj']
== 2025-03-17 20:37:48,806 filetools.py:1374 DEBUG Possible new dir /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0 found

The easybuild_obj directory is causing trouble: it breaks the detection of which directory was unpacked from the source tarball, so gromacs-2024.4 is not marked as the start dir as it should be.
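
The failure mode can be reproduced with a simplified "single new directory" heuristic. The sketch below is not EasyBuild's code: `guess_start_dir` is a hypothetical stand-in that descends into a directory only while it holds exactly one entry and that entry is a subdirectory, which is the behaviour the debug lines above suggest.

```python
import os
import tempfile


def guess_start_dir(build_dir: str) -> str:
    """Descend while a directory holds exactly one entry and that entry is a
    subdirectory; otherwise stop and return the current directory."""
    current = build_dir
    while True:
        entries = sorted(os.listdir(current))
        if len(entries) == 1 and os.path.isdir(os.path.join(current, entries[0])):
            current = os.path.join(current, entries[0])
        else:
            return current


with tempfile.TemporaryDirectory() as build_dir:
    # Unpacked sources only: the heuristic finds gromacs-2024.4.
    os.makedirs(os.path.join(build_dir, "gromacs-2024.4", "src"))
    os.makedirs(os.path.join(build_dir, "gromacs-2024.4", "cmake"))
    only_sources = guess_start_dir(build_dir)

    # A second directory next to the sources makes the guess ambiguous,
    # so the heuristic stays in the build directory itself.
    os.mkdir(os.path.join(build_dir, "easybuild_obj"))
    with_obj_dir = guess_start_dir(build_dir)
```

In the first case the guess ends in gromacs-2024.4; once easybuild_obj exists alongside it, the guess stays at the build directory, which is consistent with the patch being applied from the wrong location.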

boegel avatar Mar 20 '25 09:03 boegel