
{2023.06}[system] cuDNN/8.9.2.26-CUDA-12.1.1

Open trz42 opened this issue 1 year ago • 27 comments

requires:

  • #720

Attempt to add cuDNN, which is a dependency of other packages such as TensorFlow and PyTorch.

Major additions/changes:

  • scripts/gpu_support/nvidia/install_cuda_and_libraries.sh with scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
    • script to install CUDA and cuDNN packages under .../host_injections
  • EESSI-install-software.sh
    • use scripts/gpu_support/nvidia/install_cuda_and_libraries.sh with scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml to install CUDA, cuDNN under .../host_injections
  • eb_hooks.py
    • move the code that iterates over all files and replaces non-distributable ones with symlinks into host_injections into a common function (replace_non_distributable_files_with_symlinks)
    • add a post_sanitycheck_hook that replaces all files that cannot be redistributed with symlinks to the corresponding paths under .../host_injections
    • downgrade the dependency on cuDNN to a build dependency (see inject_gpu_property)
  • create_lmodsitepackage.py
    • consolidate eessi_{cuda,cudnn}_enabled_load_hook functions in a single one (eessi_cuda_and_libraries_enabled_load_hook)
    • the remaining hook is prepared to easily add new modules, e.g., cuTENSOR
  • install_scripts.sh
    • add files to copy to CVMFS (see nvidia_files)
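
The symlink replacement described for eb_hooks.py can be illustrated with a minimal sketch. This is not the code from this PR: the function name `replace_with_symlinks_sketch`, its parameters, and the allowlist of file names that may stay in place are all hypothetical, and the real hook may decide redistributability differently.

```python
import os


def replace_with_symlinks_sketch(install_dir, host_injections_dir, allowlist):
    """Replace every regular file whose name is not in `allowlist` with a
    symlink to the same relative path under `host_injections_dir`.

    All three parameters are assumptions made for this illustration.
    Returns the sorted list of relative paths that were replaced.
    """
    replaced = []
    for root, _dirs, files in os.walk(install_dir):
        for name in files:
            full_path = os.path.join(root, name)
            # Keep redistributable files and anything that is already a symlink.
            if name in allowlist or os.path.islink(full_path):
                continue
            rel_path = os.path.relpath(full_path, install_dir)
            target = os.path.join(host_injections_dir, rel_path)
            os.remove(full_path)
            # The target may not exist yet; it is expected to be provided
            # later by an installation under host_injections.
            os.symlink(target, full_path)
            replaced.append(rel_path)
    return sorted(replaced)
```

With this approach the shipped installation keeps its directory layout, while the actual library files only need to exist on hosts that have installed them under host_injections.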

trz42 avatar May 17 '24 09:05 trz42

Instance eessi-bot-mc-aws is configured to build:

  • arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/generic for repo eessi-hpc.org-2023.06-software
  • arch x86_64/generic for repo eessi.io-2023.06-compat
  • arch x86_64/generic for repo eessi.io-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-software
  • arch aarch64/generic for repo eessi.io-2023.06-compat
  • arch aarch64/generic for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software

eessi-bot[bot] avatar May 17 '24 09:05 eessi-bot[bot]

Instance eessi-bot-mc-azure is configured to build:

  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-software

eessi-bot[bot] avatar May 17 '24 09:05 eessi-bot[bot]

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

trz42 avatar May 17 '24 09:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • submitted job 10940, for details & status see https://github.com/EESSI/software-layer/pull/581#issuecomment-2117129261
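
The expansion from the short command form to the expanded form shown above can be sketched as follows. This is only an illustration of the mapping visible in this comment (inst, repo, arch), not the bot's actual parser; the function name `expand_bot_command` is hypothetical.

```python
# Short-to-long keyword mapping, taken from the expanded format shown above.
ALIASES = {"inst": "instance", "repo": "repository", "arch": "architecture"}


def expand_bot_command(line):
    """Turn 'bot: build inst:aws ...' into 'build instance:aws ...'.

    Hypothetical helper; unknown keys are passed through unchanged.
    """
    prefix = "bot:"
    if not line.startswith(prefix):
        raise ValueError("not a bot command: %r" % line)
    words = line[len(prefix):].split()
    if not words:
        raise ValueError("bot command without an action")
    action, args = words[0], words[1:]
    expanded = [action]
    for arg in args:
        # Split only on the first colon so values may contain further colons.
        key, sep, value = arg.partition(":")
        expanded.append(ALIASES.get(key, key) + sep + value)
    return " ".join(expanded)
```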

eessi-bot[bot] avatar May 17 '24 09:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 17 '24 09:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10940

date job status comment
May 17 09:26:27 UTC 2024 submitted job id 10940 awaits release by job manager
May 17 09:27:22 UTC 2024 released job awaits launch by Slurm scheduler
May 17 09:32:24 UTC 2024 running job 10940 is running
May 17 09:40:32 UTC 2024 finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-10940.out
:x: found message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715938433.tar.gz
size: 698 MiB (732495131 bytes)
entries: 74
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
May 17 09:40:32 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-10940.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
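
The build verdict in reports like this one appears to combine several presence and absence checks on the job output file. The sketch below reproduces that logic for the five build checks listed under Details; plain substring matching and the name `evaluate_job_output` are assumptions, and the bot's real implementation may differ.

```python
# (message, must_be_present) pairs, copied from the checks listed in the report.
CHECKS = [
    ("ERROR:", False),
    ("FAILED:", False),
    ("required modules missing:", False),
    ("No missing installations", True),
    (".tar.gz created!", True),
]


def evaluate_job_output(text):
    """Return (success, found), where found maps each message to whether
    it occurs in `text`. Success requires every check to be satisfied."""
    found = {message: (message in text) for message, _ in CHECKS}
    success = all(found[message] == must for message, must in CHECKS)
    return success, found
```

Under these assumptions, a single line containing ERROR: is enough to mark a job as failed even when a tarball was created, which matches the FAILURE verdict for job 10940.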

eessi-bot[bot] avatar May 17 '24 09:05 eessi-bot[bot]

Retry after fixing args to cuDNN install script...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

trz42 avatar May 17 '24 10:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • submitted job 10941, for details & status see https://github.com/EESSI/software-layer/pull/581#issuecomment-2117292658

eessi-bot[bot] avatar May 17 '24 10:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 17 '24 10:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10941

date job status comment
May 17 10:45:01 UTC 2024 submitted job id 10941 awaits release by job manager
May 17 10:45:40 UTC 2024 released job awaits launch by Slurm scheduler
May 17 10:49:42 UTC 2024 running job 10941 is running
May 17 10:59:52 UTC 2024 finished
:grin: SUCCESS (click triangle for details)
Details
:white_check_mark: job output file slurm-10941.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715943174.tar.gz
size: 698 MiB (732493432 bytes)
entries: 74
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
May 17 10:59:52 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-10941.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 17 '24 10:05 eessi-bot[bot]

@trz42 The installation looks suspiciously large at 700MB, are you sure your hook is cleaning out the files it should?

ocaisa avatar May 17 '24 11:05 ocaisa

> @trz42 The installation looks suspiciously large at 700MB, are you sure your hook is cleaning out the files it should?

Full package is 1.4 GB.
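
One way to check a concern like this is to count regular files and symlinks in the build tarball and sum the bytes still held by regular files. The sketch below uses only Python's standard tarfile module; the function name `summarize_tarball` is hypothetical and nothing here was run against the actual artefact.

```python
import tarfile


def summarize_tarball(path):
    """Count regular files and symlinks in a tarball and sum the size of the
    regular files, to see how much real content is still shipped."""
    summary = {"regular_files": 0, "symlinks": 0, "regular_bytes": 0}
    # "r:*" lets tarfile detect the compression (gzip for .tar.gz) itself.
    with tarfile.open(path, "r:*") as tar:
        for member in tar:
            if member.issym():
                summary["symlinks"] += 1
            elif member.isfile():
                summary["regular_files"] += 1
                summary["regular_bytes"] += member.size
    return summary
```

A large `regular_bytes` value would point at files that were kept as-is rather than replaced by symlinks.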

trz42 avatar May 17 '24 11:05 trz42

Rebuild after changing hook function that handles dependencies and creates modluafooter entries...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

trz42 avatar May 17 '24 12:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • submitted job 10942, for details & status see https://github.com/EESSI/software-layer/pull/581#issuecomment-2117540885

eessi-bot[bot] avatar May 17 '24 12:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 17 '24 12:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10942

date job status comment
May 17 12:54:38 UTC 2024 submitted job id 10942 awaits release by job manager
May 17 12:55:03 UTC 2024 released job awaits launch by Slurm scheduler
May 17 13:00:06 UTC 2024 running job 10942 is running
May 17 13:05:11 UTC 2024 finished
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-10942.out
:x: found message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715950816.tar.gz
size: 0 MiB (15041 bytes)
entries: 3
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
May 17 13:05:11 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-10942.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 17 '24 12:05 eessi-bot[bot]

One more time...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

trz42 avatar May 17 '24 13:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • submitted job 10943, for details & status see https://github.com/EESSI/software-layer/pull/581#issuecomment-2117581012

eessi-bot[bot] avatar May 17 '24 13:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 17 '24 13:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10943

date job status comment
May 17 13:14:32 UTC 2024 submitted job id 10943 awaits release by job manager
May 17 13:15:15 UTC 2024 released job awaits launch by Slurm scheduler
May 17 13:16:17 UTC 2024 running job 10943 is running
May 17 13:24:26 UTC 2024 finished
:grin: SUCCESS (click triangle for details)
Details
:white_check_mark: job output file slurm-10943.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715951838.tar.gz
size: 698 MiB (732495999 bytes)
entries: 74
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
May 17 13:24:26 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-10943.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 17 '24 13:05 eessi-bot[bot]

@trz42 I will take your updated host_injections script for a test drive tomorrow, I think I have a few suggestions there and will open a PR to your branch

ocaisa avatar May 20 '24 14:05 ocaisa

I also get the feeling that if we are going to move to easystack files (a good idea), then we should probably ship the ones we expect people to use.

ocaisa avatar May 20 '24 14:05 ocaisa

> @trz42 I will take your updated host_injections script for a test drive tomorrow, I think I have a few suggestions there and will open a PR to your branch

Just updated the script with some improvements/fixes after my own testing.

trz42 avatar May 21 '24 07:05 trz42

Run another build after several changes...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

trz42 avatar May 23 '24 09:05 trz42

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • submitted job 11284, for details & status see https://github.com/EESSI/software-layer/pull/581#issuecomment-2126650177

eessi-bot[bot] avatar May 23 '24 09:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar May 23 '24 09:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/11284

date job status comment
May 23 09:28:36 UTC 2024 submitted job id 11284 awaits release by job manager
May 23 09:29:06 UTC 2024 released job awaits launch by Slurm scheduler
May 23 09:30:09 UTC 2024 running job 11284 is running
May 23 09:42:29 UTC 2024 finished
:grin: SUCCESS (click triangle for details)
Details
:white_check_mark: job output file slurm-11284.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching FAILED:
:white_check_mark: no message matching required modules missing:
:white_check_mark: found message(s) matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716456951.tar.gz
size: 698 MiB (732492073 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 23 09:42:29 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11284.out
:white_check_mark: no message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 23 '24 09:05 eessi-bot[bot]

@trz42 Can we close this now?

ocaisa avatar Nov 07 '24 10:11 ocaisa

This has been reimplemented in #772 and #798, so I am closing this PR (if I'm wrong, @trz42 can reopen it).

ocaisa avatar Nov 07 '24 10:11 ocaisa

PR merged! Moved ['/project/def-users/SHARED/jobs/2024.05/pr_581/10940', '/project/def-users/SHARED/jobs/2024.05/pr_581/10941', '/project/def-users/SHARED/jobs/2024.05/pr_581/10942', '/project/def-users/SHARED/jobs/2024.05/pr_581/10943', '/project/def-users/SHARED/jobs/2024.05/pr_581/11284'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2024.11.07

eessi-bot[bot] avatar Nov 07 '24 10:11 eessi-bot[bot]