software-layer
{2023.06}[foss/2023a] PyTorch v2.1.2 w/ CUDA 12.1.1
Builds
- magma/2.7.2-foss-2023a-CUDA-12.1.1
- PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
Supersedes #718
Instance eessi-bot-mc-aws is configured to build for:
- architectures: `x86_64/generic`, `x86_64/intel/haswell`, `x86_64/intel/skylake_avx512`, `x86_64/amd/zen2`, `x86_64/amd/zen3`, `aarch64/generic`, `aarch64/neoverse_n1`, `aarch64/neoverse_v1`
- repositories: `eessi.io-2023.06-compat`, `eessi-hpc.org-2023.06-software`, `eessi-hpc.org-2023.06-compat`, `eessi.io-2023.06-software`

Instance eessi-bot-riscv is configured to build for:
- architectures: `riscv64/generic`
- repositories: `riscv.eessi.io-20240402`
Instance eessi-bot-mc-azure is configured to build for:
- architectures: `x86_64/amd/zen4`
- repositories: `eessi.io-2023.06-compat`, `eessi.io-2023.06-software`
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Updates by the bot instance eessi-bot-riscv
- account `trz42` has NO permission to send commands to the bot
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80` from `trz42`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - submitted job `30162`, for details & status see https://github.com/EESSI/software-layer/pull/825#issuecomment-2492210018
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80` from `trz42`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - no jobs were submitted
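The bot replies show each abbreviated key being expanded (`repo` to `repository`, `arch` to `architecture`, `accel` to `accelerator`). A minimal Python sketch of that expansion follows, assuming a simple whitespace-separated `key:value` syntax. The names `ABBREVIATIONS` and `expand_build_command` are hypothetical and nothing here is taken from the bot's source code.

```python
# Hypothetical sketch of the key expansion seen in the bot replies;
# this is NOT the EESSI bot's actual implementation.
ABBREVIATIONS = {
    "repo": "repository",
    "arch": "architecture",
    "accel": "accelerator",
}


def expand_build_command(command: str) -> str:
    """Expand abbreviated keys in a 'build key:value ...' command string."""
    action, *args = command.split()
    expanded = []
    for arg in args:
        # Split on the first ':' only, so values may contain further colons.
        key, sep, value = arg.partition(":")
        if not sep:
            raise ValueError(f"expected key:value, got {arg!r}")
        expanded.append(f"{ABBREVIATIONS.get(key, key)}:{value}")
    return " ".join([action, *expanded])


print(expand_build_command(
    "build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80"
))
# prints: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
```

Keys that are already in long form pass through unchanged, which matches the fact that the expanded command is what the bot reports handling.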
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.11/pr_825/30162
| date | job status | comment |
|---|---|---|
| Nov 21 20:27:08 UTC 2024 | submitted | job id 30162 awaits release by job manager |
| Nov 21 20:27:13 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Nov 21 20:28:17 UTC 2024 | running | job 30162 is running |
| Nov 21 22:32:31 UTC 2024 | finished | :cry: FAILURE |
| Nov 21 22:32:31 UTC 2024 | test result | :grin: SUCCESS |
- some `/tmp/eb-m470fsqz/eb-r9zeygi0/tmpb1cr2ofk/rpath_wrappers/gxx_wrapper/g++` run failed with:
  - ``/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/bin/ld: warning: libcupti.so.12, needed by lib/libtorch_cpu.so, not found (try using -rpath or -rpath-link)``
  - ``/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/bin/ld: lib/libtorch_cpu.so: undefined reference to `[email protected]'``
  - ``/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/bin/ld: lib/libtorch_cpu.so: undefined reference to `[email protected]'``
  - ...
  - ``/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/bin/ld: lib/libtorch_cpu.so: undefined reference to `[email protected]'``
  - `collect2: error: ld returned 1 exit status`
- we should be able to fix this by adding the directory that contains libcupti to `$LIBRARY_PATH` in a `pre_configure` hook (see https://github.com/NorESSI/software-layer/pull/369)
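As a rough illustration of the proposed fix, here is a minimal sketch of an EasyBuild-style `pre_configure_hook` that puts the libcupti directory first in `$LIBRARY_PATH`. It rests on assumptions not stated in this thread: that `$EBROOTCUDA` points at the CUDA installation, and that libcupti sits in `extras/CUPTI/lib64` below it. The helper `prepend_path` is hypothetical, and the hook actually used in the referenced pull request may look different.

```python
import os
from types import SimpleNamespace

# Assumption: libcupti is under extras/CUPTI/lib64 of the CUDA installation,
# and $EBROOTCUDA points at that installation.
CUPTI_SUBDIR = os.path.join("extras", "CUPTI", "lib64")


def prepend_path(env, var, directory):
    """Put `directory` first in the os.pathsep-separated variable `var` of `env`."""
    entries = [p for p in env.get(var, "").split(os.pathsep) if p and p != directory]
    env[var] = os.pathsep.join([directory] + entries)


def pre_configure_hook(self, *args, **kwargs):
    """EasyBuild-style hook; `self` stands for the easyblock being run."""
    cuda_root = os.environ.get("EBROOTCUDA")
    # Only touch the environment for PyTorch builds that have CUDA loaded.
    if self.name == "PyTorch" and cuda_root:
        prepend_path(os.environ, "LIBRARY_PATH", os.path.join(cuda_root, CUPTI_SUBDIR))


if __name__ == "__main__":
    # Stand-in for an easyblock instance, only for trying the hook locally.
    os.environ.setdefault("EBROOTCUDA", "/opt/cuda")  # hypothetical location
    pre_configure_hook(SimpleNamespace(name="PyTorch"))
    print(os.environ["LIBRARY_PATH"])
```

Removing an existing occurrence of the directory before prepending keeps the variable from growing if the hook runs more than once.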
Build again after applying fix to find libcupti...
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Updates by the bot instance eessi-bot-riscv
- account `trz42` has NO permission to send commands to the bot
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80` from `trz42`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - submitted job `30339`, for details & status see https://github.com/EESSI/software-layer/pull/825#issuecomment-2493133310
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80` from `trz42`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.11/pr_825/30339
| date | job status | comment |
|---|---|---|
| Nov 22 08:13:18 UTC 2024 | submitted | job id 30339 awaits release by job manager |
| Nov 22 08:13:44 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Nov 22 08:19:50 UTC 2024 | running | job 30339 is running |
| Nov 22 18:13:32 UTC 2024 | finished | :grin: SUCCESS |
| Nov 22 18:13:32 UTC 2024 | test result | :grin: SUCCESS |
Also build for zen3…
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80
Updates by the bot instance eessi-bot-riscv
- account `trz42` has NO permission to send commands to the bot
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80` from `trz42`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80` resulted in:
  - submitted job `30343`, for details & status see https://github.com/EESSI/software-layer/pull/825#issuecomment-2494484088
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80` from `trz42`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80` resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.11/pr_825/30343
| date | job status | comment |
|---|---|---|
| Nov 22 18:21:24 UTC 2024 | submitted | job id 30343 awaits release by job manager |
| Nov 22 18:21:35 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Nov 22 18:27:37 UTC 2024 | running | job 30343 is running |
| Nov 23 02:24:10 UTC 2024 | finished | :grin: SUCCESS |
| Nov 23 02:24:10 UTC 2024 | test result | :grin: SUCCESS |
Try a different approach: rebuild the CUDA module so that it prepends the directory containing the libcupti library to LIBRARY_PATH, and drop the hook used in the previous builds...
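One way a module can carry such a path itself is EasyBuild's `modextrapaths` easyconfig parameter, which maps an environment variable to a path relative to the installation directory. The fragment below is only a sketch of that idea. Whether this pull request uses `modextrapaths` or changes the CUDA easyblock instead is not stated here, and the `extras/CUPTI/lib64` location, the helper `resolved_paths`, and the installation path are assumptions made for illustration.

```python
import os

# Easyconfig-style fragment (easyconfigs use Python syntax).
# Assumption: the CUPTI libraries live in extras/CUPTI/lib64, relative to
# the CUDA installation directory.
modextrapaths = {
    "LIBRARY_PATH": "extras/CUPTI/lib64",
}


def resolved_paths(install_dir, extra_paths):
    """Return, per variable, the absolute directory the module would prepend."""
    return {var: os.path.join(install_dir, rel) for var, rel in extra_paths.items()}


# Hypothetical installation directory, used only to show the resulting path.
print(resolved_paths("/path/to/CUDA/12.1.1", modextrapaths))
```

With the path set by the module, every build that loads CUDA would see libcupti at link time, so no PyTorch-specific hook would be needed.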
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80` from `trz42`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - submitted job `30521`, for details & status see https://github.com/EESSI/software-layer/pull/825#issuecomment-2495433367
Updates by the bot instance eessi-bot-riscv
- account `trz42` has NO permission to send commands to the bot
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80` from `trz42`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.11/pr_825/30521
| date | job status | comment |
|---|---|---|
| Nov 23 10:37:14 UTC 2024 | submitted | job id 30521 awaits release by job manager |
| Nov 23 10:37:46 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Nov 23 10:42:48 UTC 2024 | running | job 30521 is running |
| Nov 23 12:52:40 UTC 2024 | finished | :cry: FAILURE |
| Nov 23 12:52:40 UTC 2024 | test result | :grin: SUCCESS |
Use force to rebuild CUDA...
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Updates by the bot instance eessi-bot-riscv
- account `trz42` has NO permission to send commands to the bot
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80` from `trz42`
  - expanded format: `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80`
- handling command `build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80` resulted in:
  - submitted job `30522`, for details & status see https://github.com/EESSI/software-layer/pull/825#issuecomment-2495441487