[WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1
WORK IN PROGRESS
The eventual aim of this PR is to add PyTorch/2.1.2 with CUDA/12.1.1. However, building it may not work out of the box, so this PR also documents the progress, the issues we hit, and the workarounds applied.
PyTorch with CUDA requires cuDNN, hence this PR also builds it using the same changes provided by #581 and #579 (however, the changes by the latter would have to be ingested, hence we need additional changes here; we try to document well what we do, and why).
Initially, we only build for compute capability 7.0. Later, we plan to build for all architectures from Pascal onwards, excluding architectures for embedded GPUs and very special compute capabilities such as 9.0a. That is, the list of compute capabilities could be 6.0,6.1,7.0,7.5,8.0,8.6,8.9,9.0.
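The selection rule described above (Pascal or newer, no suffixed variants such as 9.0a) can be sketched in a few lines of Python. This is only an illustration, not part of the PR: the helper name `select_capabilities` and the candidate list are hypothetical. The sketch assumes the result would be handed to EasyBuild via its `--cuda-compute-capabilities` option and to the PyTorch build via the `TORCH_CUDA_ARCH_LIST` environment variable. Embedded-GPU capabilities are not detected here and would have to be left out of the candidate list by hand.

```python
import re

# Matches values such as "7.0" or "9.0a" (major.minor plus optional letter suffix).
CAPABILITY_RE = re.compile(r"^(\d+)\.(\d+)([a-z]?)$")


def select_capabilities(candidates, minimum=(6, 0)):
    """Keep capabilities >= minimum and drop suffixed variants like 9.0a.

    Hypothetical helper for illustration only.
    """
    selected = []
    for cc in candidates:
        match = CAPABILITY_RE.match(cc.strip())
        if match is None:
            raise ValueError(f"not a compute capability: {cc!r}")
        major, minor, suffix = int(match.group(1)), int(match.group(2)), match.group(3)
        if suffix:
            continue  # special variants such as 9.0a are excluded
        if (major, minor) < minimum:
            continue  # older than Pascal (6.0)
        selected.append(f"{major}.{minor}")
    return selected


# Hypothetical candidate list; embedded-GPU capabilities were left out by hand.
candidates = "5.0,6.0,6.1,7.0,7.5,8.0,8.6,8.9,9.0,9.0a".split(",")
caps = select_capabilities(candidates)

# Comma-separated form for EasyBuild, semicolon-separated form for PyTorch.
easybuild_option = "--cuda-compute-capabilities=" + ",".join(caps)
torch_arch_list = ";".join(caps)

print(easybuild_option)
print(torch_arch_list)
```

With the candidate list above, the selection reproduces the eight capabilities named in the description.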
Instance eessi-bot-mc-aws is configured to build:
- arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
- arch x86_64/generic for repo eessi-hpc.org-2023.06-software
- arch x86_64/generic for repo eessi.io-2023.06-compat
- arch x86_64/generic for repo eessi.io-2023.06-software
- arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
- arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
- arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
- arch x86_64/intel/haswell for repo eessi.io-2023.06-software
- arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
- arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
- arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
- arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
- arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
- arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
- arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
- arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
- arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
- arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
- arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
- arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
- arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
- arch aarch64/generic for repo eessi-hpc.org-2023.06-software
- arch aarch64/generic for repo eessi.io-2023.06-compat
- arch aarch64/generic for repo eessi.io-2023.06-software
- arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
- arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
- arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
- arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
- arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
- arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
- arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
- arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software
Instance eessi-bot-mc-azure is configured to build:
- arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-compat
- arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-software
- arch x86_64/amd/zen4 for repo eessi.io-2023.06-compat
- arch x86_64/amd/zen4 for repo eessi.io-2023.06-software
We run a first attempt without doing any modifications (e.g., to work around issues)...
bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2
Updates by the bot instance eessi-bot-mc-aws
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - submitted job 11348, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2128783470
Updates by the bot instance eessi-bot-mc-azure
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - no jobs were submitted
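The short and expanded forms of the bot command in the updates above differ only in their keyword names (inst/instance, repo/repository, arch/architecture). A minimal sketch of such an expansion follows. It is not the bot's actual implementation; the function name `expand_bot_command` and the abbreviation table are assumptions made for illustration.

```python
# Hypothetical abbreviation table, inferred from the short and expanded
# command forms shown in the bot updates.
ABBREVIATIONS = {
    "inst": "instance",
    "repo": "repository",
    "arch": "architecture",
}


def expand_bot_command(command):
    """Expand abbreviated filter keywords in a command such as 'build inst:aws'."""
    action, *filters = command.split()
    expanded = [action]
    for item in filters:
        key, sep, value = item.partition(":")
        if not sep or not value:
            raise ValueError(f"expected key:value, got {item!r}")
        # Unknown keys are passed through unchanged.
        expanded.append(f"{ABBREVIATIONS.get(key, key)}:{value}")
    return " ".join(expanded)


short = "build inst:aws repo:eessi.io-2023.06-software arch:zen2"
print(expand_bot_command(short))
```

Splitting each filter with `partition(":")` keeps any further colons inside the value intact.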
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11348
- failed with
You requested to load UCX-CUDA which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1 /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
- we also need the changes from #579
| date | job status | comment |
|---|---|---|
| May 24 07:24:07 UTC 2024 | submitted | job id 11348 awaits release by job manager |
| May 24 07:24:09 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 24 07:25:11 UTC 2024 | running | job 11348 is running |
| May 24 07:39:24 UTC 2024 | finished | :cry: FAILURE |
| May 24 07:39:25 UTC 2024 | test result | :grin: SUCCESS |
Building after applying the changes provided by #579...
bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2
Updates by the bot instance eessi-bot-mc-aws
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - submitted job 11349, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2128864440
Updates by the bot instance eessi-bot-mc-azure
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11349
- failed with the same error (possibly because the environment variable EESSI_OVERRIDE_GPU_CHECK is not set or not passed through to the Prefix shell)
You requested to load UCX-CUDA which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1 /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
- need to add some code for passing that environment variable into the Prefix shell (see https://github.com/EESSI/software-layer/pull/586/commits/58120d2891afd2126ac737d59ea915c2c7472c74)
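Forwarding a variable into a freshly started shell can be done by prefixing the command with an explicit export. The sketch below shows only that idea. It is not the change made in the linked commit; the helper name `with_forwarded_env` is hypothetical, and the value `1` for EESSI_OVERRIDE_GPU_CHECK is an assumption.

```python
import shlex


def with_forwarded_env(command, names, environ):
    """Prefix a shell command with exports for the selected variables.

    Hypothetical helper: variables that are not set in 'environ' are skipped,
    so the command is returned unchanged when there is nothing to forward.
    """
    exports = [
        f"export {name}={shlex.quote(environ[name])}"
        for name in names
        if name in environ
    ]
    return "; ".join(exports + [command])


# Stand-in for the host environment (os.environ in a real script);
# the value "1" is an assumption.
host_env = {"EESSI_OVERRIDE_GPU_CHECK": "1"}

wrapped = with_forwarded_env(
    "module load UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1",
    ["EESSI_OVERRIDE_GPU_CHECK"],
    host_env,
)
print(wrapped)
```

Quoting the value with `shlex.quote` keeps the export safe if the variable ever contains spaces or shell metacharacters.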
| date | job status | comment |
|---|---|---|
| May 24 08:07:29 UTC 2024 | submitted | job id 11349 awaits release by job manager |
| May 24 08:08:30 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 24 08:09:32 UTC 2024 | running | job 11349 is running |
| May 24 08:23:46 UTC 2024 | finished | :cry: FAILURE |
| May 24 08:23:46 UTC 2024 | test result | :grin: SUCCESS |
Trying again...
bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2
Updates by the bot instance eessi-bot-mc-aws
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - submitted job 11357, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2129011708
Updates by the bot instance eessi-bot-mc-azure
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11357
- still the same error
You requested to load UCX-CUDA which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1 /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
- we need to make sure that the environment variable is actually set (does https://github.com/EESSI/software-layer/pull/579/commits/f788ca3ab94ab384ee2e4a98e5b76e2a9317102f solve the issue?)
| date | job status | comment |
|---|---|---|
| May 24 09:04:09 UTC 2024 | submitted | job id 11357 awaits release by job manager |
| May 24 09:04:52 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 24 09:05:54 UTC 2024 | running | job 11357 is running |
| May 24 09:20:10 UTC 2024 | finished | :cry: FAILURE |
| May 24 09:20:10 UTC 2024 | test result | :grin: SUCCESS |
One more time...
bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2
Updates by the bot instance eessi-bot-mc-aws
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - submitted job 11368, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2129082424
Updates by the bot instance eessi-bot-mc-azure
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11368
- same result as before
You requested to load UCX-CUDA which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1 /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
| date | job status | comment |
|---|---|---|
| May 24 09:37:14 UTC 2024 | submitted | job id 11368 awaits release by job manager |
| May 24 09:37:18 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 24 09:38:20 UTC 2024 | running | job 11368 is running |
| May 24 09:52:46 UTC 2024 | finished | :cry: FAILURE |
| May 24 09:52:46 UTC 2024 | test result | :grin: SUCCESS |
And now trying to run the build step with --nvidia install instead of --nvidia all...
bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2
Updates by the bot instance eessi-bot-mc-aws
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - submitted job 11369, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2129089305
Updates by the bot instance eessi-bot-mc-azure
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11369
- also here ... same issue as every job before
You requested to load UCX-CUDA which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1 /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
| date | job status | comment |
|---|---|---|
| May 24 09:41:18 UTC 2024 | submitted | job id 11369 awaits release by job manager |
| May 24 09:41:24 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 24 09:46:33 UTC 2024 | running | job 11369 is running |
| May 24 10:00:55 UTC 2024 | finished | :cry: FAILURE |
| May 24 10:00:55 UTC 2024 | test result | :grin: SUCCESS |
How "fat" is this PyTorch installation? Since it uses CUDA/12, it should really support all compute capabilities from 5.0 to 9.0 if we want to keep our "same software everywhere" promise...
Another try...
bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2
Updates by the bot instance eessi-bot-mc-aws
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - submitted job 11370, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2129236479
Updates by the bot instance eessi-bot-mc-azure
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11370
| date | job status | comment |
|---|---|---|
| May 24 10:53:13 UTC 2024 | submitted | job id 11370 awaits release by job manager |
| May 24 10:54:09 UTC 2024 | released | job awaits launch by Slurm scheduler |
| May 24 10:55:11 UTC 2024 | running | job 11370 is running |
| May 24 11:08:26 UTC 2024 | finished | :cry: FAILURE |
| May 24 11:08:26 UTC 2024 | test result | :grin: SUCCESS |
> How "fat" is this PyTorch installation? Since it uses CUDA/12, it should really support all compute capabilities from 5.0 to 9.0 if we want to keep our "same software everywhere" promise...
Not fat at all. It's more an attempt to get something built, to see what problems we hit (possibly the same as in https://github.com/NorESSI/software-layer/pull/369), and to check whether any fixes applied in that PR also solve the issues here.
Does it work now?
bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2
Updates by the bot instance eessi-bot-mc-aws
- received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42
  - expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
- handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:
  - submitted job 11371, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2129305747