{2023.06}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1

Open casparvl opened this issue 1 year ago • 12 comments

casparvl avatar Sep 18 '24 19:09 casparvl

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot[bot] avatar Sep 18 '24 19:09 eessi-bot[bot]

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot[bot] avatar Sep 18 '24 19:09 eessi-bot[bot]

Instance boegel-bot-deucalion is configured to build for:

  • architectures: aarch64/a64fx
  • repositories: eessi.io-2023.06-software

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

casparvl avatar Sep 27 '24 08:09 casparvl

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • submitted job 20153, for details & status see https://github.com/EESSI/software-layer/pull/717#issuecomment-2378742619
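The expansion the bot reports above (`repo` to `repository`, `arch` to `architecture`, `accel` to `accelerator`) can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the function name `expand_bot_command` and the `ABBREVIATIONS` table are hypothetical, and only the three abbreviations visible in this thread are covered.

```python
# Abbreviations observed in this thread; the real bot may accept others (assumption).
ABBREVIATIONS = {
    "repo": "repository",
    "arch": "architecture",
    "accel": "accelerator",
}


def expand_bot_command(line: str) -> str:
    """Expand abbreviated filter keys in a 'bot: <command> key:value ...' line."""
    prefix = "bot:"
    if not line.startswith(prefix):
        raise ValueError("not a bot command")
    # First token is the command (e.g. 'build'); the rest are key:value filters.
    command, *filters = line[len(prefix):].split()
    expanded = []
    for item in filters:
        key, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"filter without ':' separator: {item!r}")
        # Unknown keys are passed through unchanged.
        expanded.append(f"{ABBREVIATIONS.get(key, key)}:{value}")
    return " ".join([command, *expanded])


print(expand_bot_command(
    "bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80"
))
```

Applied to the command in this PR, the sketch prints the same expanded form that the bot logged.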

eessi-bot[bot] avatar Sep 27 '24 08:09 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar Sep 27 '24 08:09 eessi-bot[bot]

Updates by the bot instance boegel-bot-deucalion (click for details)
  • account casparvl has NO permission to send commands to the bot

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.09/pr_717/20153

| date | job status | comment |
| --- | --- | --- |
| Sep 27 08:31:27 UTC 2024 | submitted | job id 20153 awaits release by job manager |
| Sep 27 08:31:37 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Sep 27 08:36:39 UTC 2024 | running | job 20153 is running |
| Sep 27 16:44:00 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |

Details
:white_check_mark: job output file slurm-20153.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1727450321.tar.gz
size: 1466 MiB (1537246599 bytes)
entries: 395
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
Bazel/6.1.0-GCCcore-12.3.0.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
ml_dtypes/0.3.2-gfbf-2023a.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
Bazel/6.1.0-GCCcore-12.3.0
cuDNN/8.9.2.26-CUDA-12.1.1
ml_dtypes/0.3.2-gfbf-2023a
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
no other files in tarball
Sep 27 16:44:00 UTC 2024 test result
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 9/9 test case(s) from 9 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-20153.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
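
For anyone wanting to reproduce the build checks locally against `slurm-20153.out`, the pattern scan in the build report can be approximated as below. This is a sketch under assumptions: the names `BUILD_CHECKS`, `evaluate` and `overall` are hypothetical, the patterns are copied from the report, and the rule that every check must pass is a guess at how the bot combines them. The test report above shows SUCCESS despite an `ERROR:` match, so the bot's real logic for the test phase clearly differs and is not modelled here.

```python
import re

# (label, regex, must_be_present): patterns copied from the build report;
# whether the bot uses exactly these regexes is an assumption.
BUILD_CHECKS = [
    ("ERROR:", r"ERROR:", False),
    ("FAILED:", r"FAILED:", False),
    ("required modules missing:", r"required modules missing:", False),
    ("No missing installations", r"No missing installations", True),
    (".tar.gz created!", r"\.tar\.gz created!", True),
]


def evaluate(log_text, checks):
    """Return a list of (label, found, ok) tuples for each check."""
    results = []
    for label, pattern, must_be_present in checks:
        found = re.search(pattern, log_text) is not None
        # A check passes when presence matches the expectation.
        results.append((label, found, found == must_be_present))
    return results


def overall(results):
    """Assumed rule: the build counts as a success only if every check passes."""
    return "SUCCESS" if all(ok for _, _, ok in results) else "FAILURE"


if __name__ == "__main__":
    # Illustrative log text, not the real job output.
    sample = "ERROR: something went wrong\n/tmp/example.tar.gz created!\n"
    for label, found, ok in evaluate(sample, BUILD_CHECKS):
        print(f"{'ok ' if ok else 'BAD'} found={found} {label}")
    print(overall(evaluate(sample, BUILD_CHECKS)))
```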

eessi-bot[bot] avatar Sep 27 '24 08:09 eessi-bot[bot]

Let's see how this goes. Note that we need a proper cuDNN deployment that strips the necessary files first... So we will need to rebuild in any case.

casparvl avatar Sep 27 '24 08:09 casparvl

> Let's see how this goes. Note that we need a proper cuDNN deployment that strips the necessary files first... So we will need to rebuild in any case.

I've marked this as a draft; we definitely don't want to deploy with a full cuDNN installation.

boegel avatar Sep 27 '24 08:09 boegel

The build succeeded, but many tests failed due to:

ImportError: libnccl.so.2: cannot open shared object file: No such file or directory

This is already available in the CPU-only stack, so I'm not sure why it didn't pick up the library from that module.
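
A quick way to check, from the same environment the tests run in, whether the dynamic loader can resolve the library at all is sketched below. The helper names `try_load` and `find_on_library_path` are hypothetical, and scanning `LD_LIBRARY_PATH` is only one possible lookup route: this thread does not say whether the stack relies on that variable or on RPATH, so an empty result from the second helper proves nothing on its own.

```python
import ctypes
import os


def try_load(soname):
    """Return (True, None) if the dynamic loader can open soname, else (False, message)."""
    try:
        ctypes.CDLL(soname)
    except OSError as err:
        return False, str(err)
    return True, None


def find_on_library_path(soname, env_var="LD_LIBRARY_PATH"):
    """List candidate files named soname in the directories of env_var (may be empty)."""
    dirs = [d for d in os.environ.get(env_var, "").split(os.pathsep) if d]
    candidates = (os.path.join(d, soname) for d in dirs)
    return [path for path in candidates if os.path.exists(path)]


if __name__ == "__main__":
    ok, message = try_load("libnccl.so.2")
    print("loadable:", ok, message or "")
    print("candidates:", find_on_library_path("libnccl.so.2"))
```

Running this with and without the NCCL module loaded should show whether the module makes the library visible to the loader.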

bedroge avatar Oct 01 '24 07:10 bedroge

> The build succeeded, but many tests failed due to:
>
> ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
>
> This is already available in the CPU-only stack, so I'm not sure why it didn't pick up the library from that module.

Just opened https://github.com/easybuilders/easybuild-easyblocks/pull/3497 which may fix the libnccl.so.2 error.

trz42 avatar Oct 29 '24 12:10 trz42

@casparvl Can you retarget this PR?

laraPPr avatar Jun 27 '25 13:06 laraPPr