software-layer
Rebuild GPU software for all supported combinations of CPU and CUDA compute capabilities
We're not going to be able to test all possible CPU/GPU combinations, but we need a general approach that lets us move forward while testing only a subset and still supporting more combinations. This PR is intended to put this workflow in place and begin rebuilding all GPU packages to reflect the changes.
- [ ] Target a specific set of CPU/compute-capability (CC) combinations
  - [ ] 7.0, 8.0, 9.0 for all CPU architectures
  - [ ] Device code built for a major-version CC runs on all minor versions of that major (e.g., 8.0 device code on an 8.6-capable GPU), so allow for this fallback
  - [ ] Allow for a specific set of additional CPU/CC combinations for known hardware
  - [ ] Document these combinations in the EasyBuild hook and complain if they are not being respected
- [ ] Modify EasyBuild hooks:
  - [ ] to reflect whether a module has been tested or not: check for an available device matching the target CC; if none exists, add an Lmod footer or module description explaining this
  - [ ] to fail when attempting to install a package that is not CUDA or has no CUDA dependency into an `accel` subdir (with advice about what to do)
  - [ ] to enforce compilation for device code (with PTX) by setting `NVCC_PREPEND_FLAGS='-arch=sm_XX'` for the build (this probably has fallout, so it should be considered a nice-to-have)
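The fallback and naming rules in the list above can be sketched as follows. This is only a minimal illustration, not the actual EESSI hook code: the function names, the `SUPPORTED_CCS` tuple, and the assumption that accelerator-specific install paths contain an `accel` path component are hypothetical. Only the 7.0/8.0/9.0 targets, the `nvidia/ccXX` labels, and the `-arch=sm_XX` flag form come from this PR.

```python
# Hypothetical sketch; not the actual EESSI EasyBuild hook code.
SUPPORTED_CCS = ("7.0", "8.0", "9.0")  # generic targets named in this PR


def fallback_cc(device_cc, supported=SUPPORTED_CCS):
    """Return the best supported CC for a device, or None if there is none.

    Device code for CC X.y can run on a GPU with CC X.z when z >= y,
    so pick the highest supported minor version within the same major.
    """
    major, minor = (int(part) for part in device_cc.split("."))
    best = None
    for cc in supported:
        s_major, s_minor = (int(part) for part in cc.split("."))
        if s_major == major and s_minor <= minor:
            if best is None or s_minor > best[0]:
                best = (s_minor, cc)
    return best[1] if best else None


def accel_subdir(cc):
    """Map a CC such as '8.0' to the accelerator label used by the bot ('nvidia/cc80')."""
    return "nvidia/cc" + cc.replace(".", "")


def nvcc_arch_flag(cc):
    """Build the value proposed for NVCC_PREPEND_FLAGS, e.g. '-arch=sm_80'."""
    return "-arch=sm_" + cc.replace(".", "")


def check_accel_install(name, dependencies, installpath):
    """Refuse to install non-CUDA software under an accelerator subdirectory.

    Assumes accelerator-specific install paths contain an 'accel' path component.
    """
    uses_cuda = name == "CUDA" or "CUDA" in dependencies
    if "accel" in installpath.split("/") and not uses_cuda:
        raise ValueError(
            f"{name} has no CUDA dependency; install it in the CPU-only prefix instead"
        )
```

With these definitions, `fallback_cc("8.6")` selects `"8.0"`, while a device with CC 6.1 has no supported target and yields `None`.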
Instance `eessi-bot-mc-aws` is configured to build for:
- architectures: `x86_64/generic`, `x86_64/intel/haswell`, `x86_64/intel/sapphirerapids`, `x86_64/intel/skylake_avx512`, `x86_64/amd/zen2`, `x86_64/amd/zen3`, `aarch64/generic`, `aarch64/neoverse_n1`, `aarch64/neoverse_v1`
- repositories: `eessi.io-2023.06-software`, `eessi.io-2023.06-compat`
Instance `eessi-bot-mc-azure` is configured to build for:
- architectures: `x86_64/amd/zen4`
- repositories: `eessi.io-2023.06-compat`, `eessi.io-2023.06-software`
Instance `trz42-GH200-jr` is configured to build for:
- architectures: `aarch64/nvidia/grace`
- repositories: `eessi.io-2023.06-software`
Instance `rt-Grace-jr` is configured to build for:
- architectures: `aarch64/nvidia/grace`
- repositories: `eessi.io-2023.06-software`
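As a small illustration, the instance configuration above can be written as a lookup table to see which instances list a given CPU target. The dictionary contents are copied from the bot's report; the names `BOT_ARCHITECTURES` and `instances_for` are hypothetical and not part of the bot.

```python
# Architectures per bot instance, as reported in this PR.
BOT_ARCHITECTURES = {
    "eessi-bot-mc-aws": [
        "x86_64/generic",
        "x86_64/intel/haswell",
        "x86_64/intel/sapphirerapids",
        "x86_64/intel/skylake_avx512",
        "x86_64/amd/zen2",
        "x86_64/amd/zen3",
        "aarch64/generic",
        "aarch64/neoverse_n1",
        "aarch64/neoverse_v1",
    ],
    "eessi-bot-mc-azure": ["x86_64/amd/zen4"],
    "trz42-GH200-jr": ["aarch64/nvidia/grace"],
    "rt-Grace-jr": ["aarch64/nvidia/grace"],
}


def instances_for(arch):
    """Return the names of all instances configured for the given architecture."""
    return [name for name, archs in BOT_ARCHITECTURES.items() if arch in archs]
```

For the `x86_64/amd/zen2` target used in the build commands in this PR, only `eessi-bot-mc-aws` is listed.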
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75
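Each `bot: build` command is a list of `key:value` arguments, and the bot's replies echo it in an expanded form (`repo` becomes `repository`, `arch` becomes `architecture`, `accel` becomes `accelerator`). A minimal sketch of that expansion, assuming nothing about the bot's real parser; `parse_build_command` and `expanded_format` are hypothetical names:

```python
# Short-to-long key mapping, as seen in the bot's "expanded format" replies.
EXPANDED_KEYS = {
    "repo": "repository",
    "arch": "architecture",
    "accel": "accelerator",
}


def parse_build_command(line):
    """Parse 'bot: build key:value ...' into a dict keyed by expanded names."""
    prefix = "bot: build "
    if not line.startswith(prefix):
        raise ValueError("not a bot build command: " + line)
    args = {}
    for token in line[len(prefix):].split():
        # Split on the first ':' only; keys never contain one.
        key, _, value = token.partition(":")
        args[EXPANDED_KEYS.get(key, key)] = value
    return args


def expanded_format(args):
    """Render parsed arguments in the expanded form, keeping argument order."""
    return "build " + " ".join(f"{key}:{value}" for key, value in args.items())
```

Applied to the first command above, `expanded_format(parse_build_command(...))` reproduces the expanded line the bot reports.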
Updates by the bot instance `eessi-bot-mc-aws`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75`
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - submitted job `50733`, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730174706
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75` resulted in:
  - submitted job `50734`, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730175003
Updates by the bot instance `eessi-bot-mc-azure`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75`
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - no jobs were submitted
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75` resulted in:
  - no jobs were submitted
Updates by the bot instance `rt-Grace-jr`
- account `ocaisa` has NO permission to send commands to the bot
Updates by the bot instance `trz42-GH200-jr`
- account `ocaisa` has NO permission to send commands to the bot
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50733
| date | job status | comment |
|---|---|---|
| Mar 17 16:37:55 UTC 2025 | submitted | job id 50733 awaits release by job manager |
| Mar 17 16:38:56 UTC 2025 | released | job awaits launch by Slurm scheduler |
| Mar 17 16:45:02 UTC 2025 | running | job 50733 is running |
| Mar 17 17:27:54 UTC 2025 | finished | :cry: FAILURE |
| Mar 17 17:27:54 UTC 2025 | test result | :cry: FAILURE |
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc75 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50734
| date | job status | comment |
|---|---|---|
| Mar 17 16:38:00 UTC 2025 | submitted | job id 50734 awaits release by job manager |
| Mar 17 16:38:54 UTC 2025 | released | job awaits launch by Slurm scheduler |
| Mar 17 16:43:59 UTC 2025 | running | job 50734 is running |
| Mar 17 16:48:08 UTC 2025 | finished | :cry: FAILURE |
| Mar 17 16:48:08 UTC 2025 | test result | :cry: FAILURE |
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75
Updates by the bot instance `eessi-bot-mc-aws`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75`
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - submitted job `50735`, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730491195
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75` resulted in:
  - submitted job `50736`, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730491339
Updates by the bot instance `eessi-bot-mc-azure`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75`
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - no jobs were submitted
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75` resulted in:
  - no jobs were submitted
Updates by the bot instance `rt-Grace-jr`
- account `ocaisa` has NO permission to send commands to the bot
Updates by the bot instance `trz42-GH200-jr`
- account `ocaisa` has NO permission to send commands to the bot
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50735
| date | job status | comment |
|---|---|---|
| Mar 17 18:34:05 UTC 2025 | submitted | job id 50735 awaits release by job manager |
| Mar 17 18:35:04 UTC 2025 | released | job awaits launch by Slurm scheduler |
| Mar 17 18:41:09 UTC 2025 | running | job 50735 is running |
| Mar 17 18:44:16 UTC 2025 | finished | :cry: FAILURE |
| Mar 17 18:44:16 UTC 2025 | test result | :cry: FAILURE |
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc75 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50736
| date | job status | comment |
|---|---|---|
| Mar 17 18:34:10 UTC 2025 | submitted | job id 50736 awaits release by job manager |
| Mar 17 18:35:02 UTC 2025 | released | job awaits launch by Slurm scheduler |
| Mar 17 18:41:07 UTC 2025 | running | job 50736 is running |
| Mar 17 18:44:14 UTC 2025 | finished | :cry: FAILURE |
| Mar 17 18:44:14 UTC 2025 | test result | :cry: FAILURE |
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75
Updates by the bot instance `eessi-bot-mc-aws`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75`
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - submitted job `50737`, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730557907
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75` resulted in:
  - submitted job `50738`, for details & status see https://github.com/EESSI/software-layer/pull/969#issuecomment-2730558084
Updates by the bot instance `eessi-bot-mc-azure`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- received bot command `build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75` from `ocaisa`
  - expanded format: `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75`
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - no jobs were submitted
- handling command `build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75` resulted in:
  - no jobs were submitted
Updates by the bot instance `trz42-GH200-jr`
- account `ocaisa` has NO permission to send commands to the bot
Updates by the bot instance `rt-Grace-jr`
- account `ocaisa` has NO permission to send commands to the bot
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50737
| date | job status | comment |
|---|---|---|
| Mar 17 19:03:59 UTC 2025 | submitted | job id 50737 awaits release by job manager |
| Mar 17 19:04:22 UTC 2025 | released | job awaits launch by Slurm scheduler |
| Mar 17 19:05:26 UTC 2025 | running | job 50737 is running |
| Mar 17 21:48:09 UTC 2025 | finished | :cry: FAILURE |
| Mar 17 21:48:09 UTC 2025 | test result | :cry: FAILURE |
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc75 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50738
| date | job status | comment |
|---|---|---|
| Mar 17 19:04:04 UTC 2025 | submitted | job id 50738 awaits release by job manager |
| Mar 17 19:04:20 UTC 2025 | released | job awaits launch by Slurm scheduler |
| Mar 17 19:05:24 UTC 2025 | running | job 50738 is running |
| Mar 17 21:34:54 UTC 2025 | finished | :cry: FAILURE |
| Mar 17 21:34:54 UTC 2025 | test result | :cry: FAILURE |
Not sure how this is possible, but EasyBuild is failing to apply the patch to the GROMACS sources:
== FAILED: Installation ended unsuccessfully (build directory: /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0): build failed (first 300 chars): Can't determine patch level for patch /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/EasyBuild/4.9.4/easybuild/easyconfigs/g/GROMACS/GROMACS-2023.1_set_omp_num_threads_env_for_ntomp_tests.patch from directory /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA- (took 12 mins 15 secs)
I don't get it: if I apply the patch manually, it works just fine.
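For context, the "patch level" is the number of leading path components to strip from the file names in the patch (as in `patch -pN`), and whether any level works depends on the directory the patch is applied from. A rough sketch of such a search, assuming nothing about how EasyBuild actually implements it; `guess_patch_level` and the example paths in the note below are hypothetical:

```python
import os


def guess_patch_level(patched_path, base_dir, max_level=5):
    """Find how many leading path components to strip from a patch's file name.

    Returns the smallest N such that `patched_path` with its first N
    components removed names an existing file under `base_dir`,
    or None when no level up to `max_level` matches.
    """
    parts = patched_path.split("/")
    # Never strip the final component: that is the file name itself.
    highest = min(max_level, len(parts) - 1)
    for level in range(highest + 1):
        candidate = os.path.join(base_dir, *parts[level:])
        if os.path.isfile(candidate):
            return level
    return None
```

A patch that refers to `a/src/file.c` resolves at level 1 when applied from the top of the unpacked source tree, but no level matches from a directory that does not contain `src/` at all, which is one way a patch can apply cleanly by hand and still fail in an automated build.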
@ocaisa Can you clarify in the PR description why these rebuilds are necessary? What has changed to require us rebuilding all of this?
> Not sure how this is possible, but EasyBuild is failing to apply the patch to the GROMACS sources:
> == FAILED: Installation ended unsuccessfully (build directory: /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0): build failed (first 300 chars): Can't determine patch level for patch /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/EasyBuild/4.9.4/easybuild/easyconfigs/g/GROMACS/GROMACS-2023.1_set_omp_num_threads_env_for_ntomp_tests.patch from directory /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA- (took 12 mins 15 secs)
I think it's because the "start" dir is wrong somehow: it should be `/tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0/gromacs-2024.4` rather than `/tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0`?
== 2025-03-17 20:37:47,651 filetools.py:461 DEBUG Unpacking /project/def-users/bot/shared/easybuild/sources/g/GROMACS/gromacs-2024.4.tar.gz in directory /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0
...
== 2025-03-17 20:37:47,651 run.py:222 DEBUG run_cmd: running cmd tar xzf /project/def-users/bot/shared/easybuild/sources/g/GROMACS/gromacs-2024.4.tar.gz (in /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0)
...
== 2025-03-17 20:37:48,806 filetools.py:1373 DEBUG Last dir list ['gromacs-2024.4', 'easybuild_obj']
== 2025-03-17 20:37:48,806 filetools.py:1374 DEBUG Possible new dir /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0 found
The `easybuild_obj` directory is causing trouble: it breaks the detection of which directory got unpacked from the source tarball, so `gromacs-2024.4` is not marked as the start dir as it should be.
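The debug output above ("Last dir list", "Possible new dir") suggests a heuristic along these lines: descend into the build directory for as long as it contains exactly one subdirectory. The following is a simplified sketch of that idea, inferred from the log lines rather than taken from EasyBuild's code; `guess_unpacked_dir` is a hypothetical name.

```python
import os


def guess_unpacked_dir(parent):
    """Guess where a tarball was unpacked by following single-entry directories.

    Starting at `parent`, keep descending while the current directory holds
    exactly one (non-hidden) entry and that entry is a directory. As soon as
    there are zero or several entries, stop and return the current directory.
    """
    current = parent
    entries = [e for e in os.listdir(current) if not e.startswith(".")]
    while len(entries) == 1:
        candidate = os.path.join(current, entries[0])
        if not os.path.isdir(candidate):
            break
        current = candidate
        entries = [e for e in os.listdir(current) if not e.startswith(".")]
    return current
```

With only `gromacs-2024.4` present, this returns the source directory; once a second entry such as `easybuild_obj` sits next to it, the listing has two entries and the parent build directory is returned instead, matching the behaviour seen in the log.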