
[WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1

Open trz42 opened this issue 1 year ago • 67 comments

WORK IN PROGRESS

The eventual goal of this PR is to add PyTorch/2.1.2 with CUDA/12.1.1. Building it may not work out of the box, so the PR also documents the progress, the issues we hit, and the workarounds applied.

PyTorch with CUDA requires cuDNN, so this PR also builds cuDNN, using the same changes provided by #581 and #579. The changes from #579 would first have to be ingested, so we need additional changes here; we try to document clearly what we do and why.

Initially, we only build for compute capability 7.0. Later, we will build for all architectures from Pascal onwards, excluding architectures for embedded GPUs and very special compute capabilities such as 9.0a. That is, the list of compute capabilities could be 6.0,6.1,7.0,7.5,8.0,8.6,8.9,9.0
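The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the table of compute capabilities and the "embedded" flags are assumptions based on NVIDIA's public documentation (5.3, 6.2, 7.2 and 8.7 being the Jetson/Tegra parts), and `select_capabilities` is a hypothetical helper, not part of EESSI or EasyBuild.

```python
# Hypothetical table of CUDA compute capabilities from Maxwell to Hopper.
# The second field marks embedded (Jetson/Tegra) parts; this is an assumption
# for illustration and should be checked against NVIDIA's documentation.
CAPABILITIES = [
    ("5.0", False), ("5.2", False), ("5.3", True),
    ("6.0", False), ("6.1", False), ("6.2", True),
    ("7.0", False), ("7.2", True), ("7.5", False),
    ("8.0", False), ("8.6", False), ("8.7", True), ("8.9", False),
    ("9.0", False), ("9.0a", False),
]


def as_tuple(cc):
    """Turn '8.6' into (8, 6) so that versions compare numerically."""
    major, minor = cc.split(".")
    return int(major), int(minor)


def select_capabilities(minimum="6.0"):
    """Keep non-embedded, non-special capabilities at or above `minimum`."""
    selected = []
    for cc, embedded in CAPABILITIES:
        # Skip embedded GPUs and suffixed variants such as '9.0a'.
        if embedded or not cc[-1].isdigit():
            continue
        if as_tuple(cc) < as_tuple(minimum):
            continue
        selected.append(cc)
    return selected


if __name__ == "__main__":
    # Comma-separated form, as used in the list above.
    print(",".join(select_capabilities()))
```

Starting from Pascal (6.0), the filter yields the same eight capabilities as listed in the description.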

trz42 avatar May 24 '24 07:05 trz42

Instance eessi-bot-mc-aws is configured to build:

  • arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/generic for repo eessi-hpc.org-2023.06-software
  • arch x86_64/generic for repo eessi.io-2023.06-compat
  • arch x86_64/generic for repo eessi.io-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-software
  • arch aarch64/generic for repo eessi.io-2023.06-compat
  • arch aarch64/generic for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software

eessi-bot[bot] avatar May 24 '24 07:05 eessi-bot[bot]

Instance eessi-bot-mc-azure is configured to build:

  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-software

eessi-bot[bot] avatar May 24 '24 07:05 eessi-bot[bot]

We run a first attempt without doing any modifications (e.g., to work around issues)...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

trz42 avatar May 24 '24 07:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • submitted job 11348, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2128783470

eessi-bot[bot] avatar May 24 '24 07:05 eessi-bot[bot]
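For readers unfamiliar with the bot syntax, the expansion from the short form (`inst:`, `repo:`, `arch:`) to the expanded format reported above could look roughly like the following. This is a minimal sketch written for this thread, not the bot's actual implementation; `SHORT_TO_LONG` and `expand_bot_command` are hypothetical names.

```python
# Mapping from the short filter keys used in the comment to the long keys
# shown in the bot's "expanded format" line (taken from the reports above).
SHORT_TO_LONG = {
    "inst": "instance",
    "repo": "repository",
    "arch": "architecture",
}


def expand_bot_command(comment_line):
    """Expand e.g. 'bot: build inst:aws arch:zen2' to its long form."""
    prefix = "bot:"
    if not comment_line.startswith(prefix):
        raise ValueError("not a bot command: %r" % comment_line)
    # First word after the prefix is the action, the rest are key:value filters.
    action, *filters = comment_line[len(prefix):].split()
    expanded = []
    for item in filters:
        key, _, value = item.partition(":")
        # Unknown keys are passed through unchanged.
        expanded.append(f"{SHORT_TO_LONG.get(key, key)}:{value}")
    return " ".join([action, *expanded])


if __name__ == "__main__":
    print(expand_bot_command(
        "bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2"
    ))
```

Both bot instances receive the same command; only the instance matching the `inst:` filter (here `aws`) submits a job, which is why the Azure instance reports that no jobs were submitted.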

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 24 '24 07:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11348

  • failed with
You requested to load UCX-CUDA  which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
  • we also need the changes from #579
date | job status | comment
May 24 07:24:07 UTC 2024 | submitted | job id 11348 awaits release by job manager
May 24 07:24:09 UTC 2024 | released | job awaits launch by Slurm scheduler
May 24 07:25:11 UTC 2024 | running | job 11348 is running
May 24 07:39:24 UTC 2024 | finished |
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-11348.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716535927.tar.gz
size: 698 MiB (732486169 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 24 07:39:25 UTC 2024 | test result |
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11348.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 24 '24 07:05 eessi-bot[bot]
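The FAILURE verdict in the report above appears to come from scanning the job output for a fixed set of messages. A rough sketch of such a check is shown below, assuming the patterns listed in the report and assuming the build counts as successful only if the "bad" messages are absent and the "good" ones are present. `CHECKS`, `check_log` and `build_succeeded` are hypothetical names and this is not the bot's real code.

```python
import re

# (pattern, should_be_present) pairs, taken from the report's detail lines.
CHECKS = [
    (r"ERROR:", False),
    (r"FAILED:", False),
    (r"required modules missing:", False),
    (r"No missing installations", True),
    (r"\.tar\.gz created!", True),
]


def check_log(text):
    """Return (pattern, found, ok) for every check against the log text."""
    results = []
    for pattern, should_match in CHECKS:
        found = re.search(pattern, text) is not None
        results.append((pattern, found, found == should_match))
    return results


def build_succeeded(text):
    """A build passes only if every individual check is ok."""
    return all(ok for _, _, ok in check_log(text))


if __name__ == "__main__":
    sample = "ERROR: module load failed\nNo missing installations\nfoo.tar.gz created!\n"
    for pattern, found, ok in check_log(sample):
        print("ok  " if ok else "FAIL", pattern, "(found)" if found else "(not found)")
```

Under this reading, a tarball can still be created (here containing only cuDNN) while the job as a whole is flagged as failed, which matches what the reports show.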

Building after applying the changes provided by #579...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

trz42 avatar May 24 '24 08:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • submitted job 11349, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2128864440

eessi-bot[bot] avatar May 24 '24 08:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 24 '24 08:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11349

  • failed with the same error (possibly because the environment variable EESSI_OVERRIDE_GPU_CHECK is not set or not passed through to the Prefix shell)
You requested to load UCX-CUDA  which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
  • need to add some code for passing that environment variable into the Prefix shell (see https://github.com/EESSI/software-layer/pull/586/commits/58120d2891afd2126ac737d59ea915c2c7472c74)
date | job status | comment
May 24 08:07:29 UTC 2024 | submitted | job id 11349 awaits release by job manager
May 24 08:08:30 UTC 2024 | released | job awaits launch by Slurm scheduler
May 24 08:09:32 UTC 2024 | running | job 11349 is running
May 24 08:23:46 UTC 2024 | finished |
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-11349.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716538578.tar.gz
size: 698 MiB (732497279 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 24 08:23:46 UTC 2024 | test result |
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11349.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 24 '24 08:05 eessi-bot[bot]
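The suspected cause above is that `EESSI_OVERRIDE_GPU_CHECK` never reaches the Prefix shell. The general mechanism can be illustrated with a small Python sketch, under the assumption that the inner shell is started from a cleaned environment so that only explicitly listed variables survive. The value `"1"` used in the usage note and the helper names are assumptions for illustration; the actual fix is in the commit linked in the report.

```python
import subprocess

# Variables that must survive into the inner shell (name taken from the thread).
PASS_THROUGH = ["EESSI_OVERRIDE_GPU_CHECK"]


def child_environment(parent_env, keep=PASS_THROUGH):
    """Build a minimal environment that only forwards the listed variables."""
    env = {"PATH": parent_env.get("PATH", "/usr/bin:/bin")}
    for name in keep:
        if name in parent_env:
            env[name] = parent_env[name]
    return env


def variable_in_child(name, parent_env, keep=PASS_THROUGH):
    """Start /bin/sh with the cleaned environment and report what it sees."""
    script = 'printf %s "${' + name + ':-<unset>}"'
    result = subprocess.run(
        ["/bin/sh", "-c", script],
        env=child_environment(parent_env, keep),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `variable_in_child("EESSI_OVERRIDE_GPU_CHECK", {"PATH": "/usr/bin:/bin", "EESSI_OVERRIDE_GPU_CHECK": "1"})` shows the value arriving in the child, while calling it with `keep=[]` shows it as `<unset>`. A probe like this is also a quick way to confirm whether the variable is set at all before the module is loaded.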

Trying again...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

trz42 avatar May 24 '24 09:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • submitted job 11357, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2129011708

eessi-bot[bot] avatar May 24 '24 09:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 24 '24 09:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11357

  • still the same error
You requested to load UCX-CUDA  which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
  • we need to make sure that the environment variable is actually set (does https://github.com/EESSI/software-layer/pull/579/commits/f788ca3ab94ab384ee2e4a98e5b76e2a9317102f solve the issue?)
date | job status | comment
May 24 09:04:09 UTC 2024 | submitted | job id 11357 awaits release by job manager
May 24 09:04:52 UTC 2024 | released | job awaits launch by Slurm scheduler
May 24 09:05:54 UTC 2024 | running | job 11357 is running
May 24 09:20:10 UTC 2024 | finished |
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-11357.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716541955.tar.gz
size: 698 MiB (732480493 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 24 09:20:10 UTC 2024 | test result |
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11357.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 24 '24 09:05 eessi-bot[bot]
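The artefact names in these reports carry some useful information. The sketch below pulls it apart, assuming the trailing number is a Unix timestamp (it falls inside the window in which the corresponding job was running) and that the name follows the pattern `eessi-<version>-<layer>-<os>-<arch>-<timestamp>.tar.gz`. The regular expression and helper names are hypothetical and only meant for reading these reports.

```python
import re
from datetime import datetime, timezone

# Assumed naming scheme, inferred from the artefact names in this thread.
ARTEFACT_RE = re.compile(
    r"^eessi-(?P<version>[\d.]+)-(?P<layer>[a-z]+)-(?P<os>[a-z]+)-"
    r"(?P<arch>.+)-(?P<stamp>\d+)\.tar\.gz$"
)


def parse_artefact(name):
    """Split an artefact file name into its components."""
    match = ARTEFACT_RE.match(name)
    if match is None:
        raise ValueError("unexpected artefact name: %r" % name)
    info = match.groupdict()
    # 'x86_64-amd-zen2' in the file name corresponds to 'x86_64/amd/zen2'.
    info["arch"] = info["arch"].replace("-", "/")
    # Interpreting the trailing number as seconds since the epoch (assumption).
    info["created"] = datetime.fromtimestamp(int(info["stamp"]), tz=timezone.utc)
    return info


def size_in_mib(size_bytes):
    """Whole mebibytes, as shown next to the byte count in the report."""
    return size_bytes // 2**20


if __name__ == "__main__":
    info = parse_artefact(
        "eessi-2023.06-software-linux-x86_64-amd-zen2-1716541955.tar.gz"
    )
    print(info["arch"], info["created"].isoformat(), size_in_mib(732480493), "MiB")
```

Comparing the decoded creation time with the job timeline is a cheap sanity check that a tarball really belongs to the job that reports it.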

One more time...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

trz42 avatar May 24 '24 09:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • submitted job 11368, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2129082424

eessi-bot[bot] avatar May 24 '24 09:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 24 '24 09:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11368

  • same result as before
You requested to load UCX-CUDA  which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
date | job status | comment
May 24 09:37:14 UTC 2024 | submitted | job id 11368 awaits release by job manager
May 24 09:37:18 UTC 2024 | released | job awaits launch by Slurm scheduler
May 24 09:38:20 UTC 2024 | running | job 11368 is running
May 24 09:52:46 UTC 2024 | finished |
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-11368.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716543905.tar.gz
size: 698 MiB (732479970 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 24 09:52:46 UTC 2024 | test result |
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11368.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 24 '24 09:05 eessi-bot[bot]

And now trying to run the build step with --nvidia install instead of --nvidia all...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

trz42 avatar May 24 '24 09:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • submitted job 11369, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2129089305

eessi-bot[bot] avatar May 24 '24 09:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 24 '24 09:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11369

  • also here ... same issue as every job before
You requested to load UCX-CUDA  which relies on the CUDA runtime environment
and driver libraries. In order to be able to use the module, you will need to
make sure EESSI can find the GPU driver libraries on your host system.
For more information on how to do this, see https://www.eessi.io/docs/gpu/.

While processing the following module(s):
    Module fullname                             Module Filename
    ---------------                             ---------------
    UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
date | job status | comment
May 24 09:41:18 UTC 2024 | submitted | job id 11369 awaits release by job manager
May 24 09:41:24 UTC 2024 | released | job awaits launch by Slurm scheduler
May 24 09:46:33 UTC 2024 | running | job 11369 is running
May 24 10:00:55 UTC 2024 | finished |
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-11369.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716544458.tar.gz
size: 698 MiB (732496619 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 24 10:00:55 UTC 2024 | test result |
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11369.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 24 '24 09:05 eessi-bot[bot]

How "fat" is this PyTorch installation? Since it is using CUDA/12, it should really support all compute capabilities from 5.0 to 9.0 if we want to keep our "same software everywhere" promise...

ocaisa avatar May 24 '24 09:05 ocaisa

Another try...

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

trz42 avatar May 24 '24 10:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • submitted job 11370, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2129236479

eessi-bot[bot] avatar May 24 '24 10:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 24 '24 10:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_586/11370

date | job status | comment
May 24 10:53:13 UTC 2024 | submitted | job id 11370 awaits release by job manager
May 24 10:54:09 UTC 2024 | released | job awaits launch by Slurm scheduler
May 24 10:55:11 UTC 2024 | running | job 11370 is running
May 24 11:08:26 UTC 2024 | finished |
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-11370.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716548506.tar.gz
size: 698 MiB (732494522 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 24 11:08:26 UTC 2024 | test result |
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11370.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 24 '24 10:05 eessi-bot[bot]

> How "fat" is this PyTorch installation? Since it is using CUDA/12, it should really support all compute capabilities from 5.0 to 9.0 if we want to keep our "same software everywhere" promise...

Not fat at all. It's more an attempt to get something built, to see what problems we hit (possibly the same as in https://github.com/NorESSI/software-layer/pull/369), and to check whether any fixes applied in that PR also solve issues here.

trz42 avatar May 24 '24 10:05 trz42

Does it work now?

bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2

trz42 avatar May 24 '24 11:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:zen2 resulted in:

    • submitted job 11371, for details & status see https://github.com/EESSI/software-layer/pull/586#issuecomment-2129305747

eessi-bot[bot] avatar May 24 '24 11:05 eessi-bot[bot]