Test if non-CUDA builds are not added to accelerator path with jax

Open laraPPr opened this issue 9 months ago • 21 comments

5 out of 86 required modules missing:

* absl-py/2.1.0-GCCcore-12.3.0 (absl-py-2.1.0-GCCcore-12.3.0.eb)
* pytest/7.4.2-GCCcore-12.3.0 (pytest-7.4.2-GCCcore-12.3.0.eb)
* pytest-xdist/3.3.1-GCCcore-12.3.0 (pytest-xdist-3.3.1-GCCcore-12.3.0.eb)
* ml_dtypes/0.3.2-gfbf-2023a (ml_dtypes-0.3.2-gfbf-2023a.eb)
* jax/0.4.25-gfbf-2023a-CUDA-12.1.1 (jax-0.4.25-gfbf-2023a-CUDA-12.1.1.eb)

laraPPr avatar Feb 13 '25 13:02 laraPPr

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphire_rapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot[bot] avatar Feb 13 '25 13:02 eessi-bot[bot]

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot[bot] avatar Feb 13 '25 13:02 eessi-bot[bot]

Instance eessi-bot-casparvl is configured to build for:

  • architectures: x86_64/amd/zen4, x86_64/amd/zen2
  • repositories: eessi.io-2023.06-software, eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-compat

Instance eessi-bot-vsc-ugent is configured to build for:

  • architectures: x86_64/amd/zen3
  • repositories: eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi-hpc.org-2023.06-software, eessi.io-2023.06-compat

gpu-bot-ugent[bot] avatar Feb 13 '25 13:02 gpu-bot-ugent[bot]

Instance trz42-GH200-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

eessi-bot-trz42[bot] avatar Feb 13 '25 13:02 eessi-bot-trz42[bot]

bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80

laraPPr avatar Feb 13 '25 13:02 laraPPr

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar Feb 13 '25 13:02 eessi-bot[bot]
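
The "expanded format" lines in the bot replies suggest that abbreviated filter keys are rewritten to their long names before a command is handled. A minimal sketch of that expansion follows, assuming only the three abbreviations visible in this thread (`repo`, `arch`, `accel`). The function name `expand_build_command` and the lookup table are hypothetical and are not taken from the bot's source code.

```python
# Hypothetical sketch: the abbreviation table is inferred from the
# "expanded format" lines in this thread, not from the bot's source code.
ABBREVIATIONS = {
    "repo": "repository",
    "arch": "architecture",
    "accel": "accelerator",
}


def expand_build_command(command: str) -> str:
    """Expand abbreviated filter keys in a 'build key:value ...' command."""
    action, *filters = command.split()
    expanded = [action]
    for item in filters:
        # Split on the first colon only; values such as 'nvidia/cc80' stay intact.
        key, sep, value = item.partition(":")
        # Keys without a known abbreviation (e.g. 'instance') pass through unchanged.
        expanded.append(f"{ABBREVIATIONS.get(key, key)}{sep}{value}")
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_build_command(
        "build instance:eessi-bot-vsc-ugent "
        "repo:eessi.io-2023.06-software accel:nvidia/cc80"
    ))
```

Run on the command from this thread, the sketch reproduces the expanded form the bots report: `build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80`.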

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar Feb 13 '25 13:02 eessi-bot[bot]

Updates by the bot instance eessi-bot-casparvl (click for details)
  • account laraPPr has NO permission to send commands to the bot

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • submitted job 15445297, for details & status see https://github.com/EESSI/software-layer/pull/917#issuecomment-2656598059

gpu-bot-ugent[bot] avatar Feb 13 '25 13:02 gpu-bot-ugent[bot]

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot

eessi-bot-trz42[bot] avatar Feb 13 '25 13:02 eessi-bot-trz42[bot]

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2025.02/pr_917/15445297

date | job status | comment
Feb 13 13:25:54 UTC 2025 | submitted | job id 15445297 awaits release by job manager
Feb 13 13:26:58 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 13 13:29:02 UTC 2025 | running | job 15445297 is running
Feb 13 15:05:06 UTC 2025 | finished |
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-15445297.out
:white_check_mark: no message matching FATAL:
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1739456837.tar.gz
size: 6 MiB (6667595 bytes)
entries: 1191
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
absl-py/2.1.0-GCCcore-12.3.0.lua
ml_dtypes/0.3.2-gfbf-2023a.lua
pytest/7.4.2-GCCcore-12.3.0.lua
pytest-xdist/3.3.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
absl-py/2.1.0-GCCcore-12.3.0
ml_dtypes/0.3.2-gfbf-2023a
pytest/7.4.2-GCCcore-12.3.0
pytest-xdist/3.3.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
no other files in tarball
Feb 13 15:05:06 UTC 2025 | test result |
:grin: SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/1) EESSI_LAMMPS_lj %device_type=gpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1 %scale=1_4_node /497af4b1 @BotBuildTests:x86_64_amd_zen3_nvidia_cc80+default
P: perf: 4447.069 timesteps/s (r:0, l:None, u:None)
[ PASSED ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-15445297.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

gpu-bot-ugent[bot] avatar Feb 13 '25 13:02 gpu-bot-ugent[bot]
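
The "Details" checklist in the build result above reads like a set of substring checks against the job output file. A minimal sketch of such a check follows, assuming plain substring matching. The pattern strings are copied from the checklist; which ones must be present and which must be absent is inferred from the check marks in this thread, and the function names and the sample log are hypothetical.

```python
# Patterns copied from the "Details" checklist in this thread. Whether each
# one must be absent or present is inferred from the check marks shown there.
MUST_BE_ABSENT = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
MUST_BE_PRESENT = ["No missing installations", ".tar.gz created!"]


def check_build_log(log_text: str) -> dict:
    """Map each pattern to True if its check passes for the given log text."""
    results = {}
    for pattern in MUST_BE_ABSENT:
        results[pattern] = pattern not in log_text
    for pattern in MUST_BE_PRESENT:
        results[pattern] = pattern in log_text
    return results


def build_succeeded(log_text: str) -> bool:
    """A build counts as successful only if every individual check passes."""
    return all(check_build_log(log_text).values())


if __name__ == "__main__":
    # Hypothetical log excerpt, shaped like the failure reported in this thread.
    sample_log = "\n".join([
        "ERROR: build of jax failed",
        "FAILED: Installation ended unsuccessfully",
        "5 out of 86 required modules missing:",
        "/tmp/example.tar.gz created!",
    ])
    for pattern, passed in check_build_log(sample_log).items():
        print("ok  " if passed else "FAIL", pattern)
    print("overall success:", build_succeeded(sample_log))
```

With a log shaped like the one in this job, four of the six checks fail, which is consistent with the FAILURE status even though a tarball was created.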

@trz42 @ocaisa this is not doing what we expect: it appears to be installing pytest-xdist in the accelerator path.

/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/bin/python -m pip install --prefix=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software/pytest-xdist/3.3.1-GCCcore-12.3.0  --verbose  --no-deps  --ignore-installed  --no-index  --no-build-isolation  .

laraPPr avatar Feb 13 '25 13:02 laraPPr

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

laraPPr avatar Feb 13 '25 13:02 laraPPr

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from laraPPr

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • submitted job 45927, for details & status see https://github.com/EESSI/software-layer/pull/917#issuecomment-2656643474

eessi-bot[bot] avatar Feb 13 '25 13:02 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from laraPPr

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar Feb 13 '25 13:02 eessi-bot[bot]

Updates by the bot instance eessi-bot-casparvl (click for details)
  • account laraPPr has NO permission to send commands to the bot

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from laraPPr

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

gpu-bot-ugent[bot] avatar Feb 13 '25 13:02 gpu-bot-ugent[bot]

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot

eessi-bot-trz42[bot] avatar Feb 13 '25 13:02 eessi-bot-trz42[bot]

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.02/pr_917/45927

date | job status | comment
Feb 13 13:43:54 UTC 2025 | submitted | job id 45927 awaits release by job manager
Feb 13 13:44:41 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 13 13:53:28 UTC 2025 | running | job 45927 is running
Feb 13 14:18:15 UTC 2025 | finished |
:cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-45927.out
:white_check_mark: no message matching FATAL:
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1739455531.tar.gz
size: 6 MiB (6661749 bytes)
entries: 1191
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
absl-py/2.1.0-GCCcore-12.3.0.lua
ml_dtypes/0.3.2-gfbf-2023a.lua
pytest/7.4.2-GCCcore-12.3.0.lua
pytest-xdist/3.3.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
absl-py/2.1.0-GCCcore-12.3.0
ml_dtypes/0.3.2-gfbf-2023a
pytest/7.4.2-GCCcore-12.3.0
pytest-xdist/3.3.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
no other files in tarball
Feb 13 14:18:15 UTC 2025 | test result |
:cry: FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
:white_check_mark: job output file slurm-45927.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar Feb 13 '25 13:02 eessi-bot[bot]

It failed building jax with this error:

FAILED: Installation ended unsuccessfully (build directory: /tmp/vsc48506/easybuild/build/jax/0.4.25/gfbf-2023a-CUDA-12.1.1): build failed (first 300 chars): Failed to determine installation prefix for binutils (took 39 mins 48 secs

As you can see in the artefacts, the builds that are not CUDA-enabled were built in the accelerator path.

laraPPr avatar Feb 13 '25 15:02 laraPPr
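
The problem described above can be spotted mechanically from a tarball listing. A minimal sketch follows, assuming that a CUDA-enabled build can be recognised by the string `CUDA` in its module file name (as in `jax/0.4.25-gfbf-2023a-CUDA-12.1.1`). That naming heuristic and the function name are assumptions made for illustration; this is not the check used by the bot or by the filtering action mentioned later in the thread.

```python
ACCEL_MARKER = "/accel/"
MODULES_MARKER = "/modules/all/"


def non_cuda_modules_in_accel_path(tarball_paths):
    """Return module files under an accel/ subtree whose name lacks 'CUDA'.

    Assumption: CUDA-enabled builds carry 'CUDA' in their module file name.
    """
    flagged = []
    for path in tarball_paths:
        if ACCEL_MARKER not in path or MODULES_MARKER not in path:
            continue  # not a module file in an accelerator subtree
        module_name = path.split(MODULES_MARKER, 1)[1]
        if "CUDA" not in module_name:
            flagged.append(module_name)
    return flagged


if __name__ == "__main__":
    prefix = "2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all/"
    listing = [
        # The four module files reported in the artefact listing of this thread.
        prefix + "absl-py/2.1.0-GCCcore-12.3.0.lua",
        prefix + "ml_dtypes/0.3.2-gfbf-2023a.lua",
        prefix + "pytest/7.4.2-GCCcore-12.3.0.lua",
        prefix + "pytest-xdist/3.3.1-GCCcore-12.3.0.lua",
        # Hypothetical entry: jax was not actually built in this job.
        prefix + "jax/0.4.25-gfbf-2023a-CUDA-12.1.1.lua",
    ]
    for name in non_cuda_modules_in_accel_path(listing):
        print("unexpected in accelerator path:", name)
```

On the listing from this job, the sketch flags the four non-CUDA modules and leaves the hypothetical jax entry alone.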

Closing so I can test a new action for filtering

laraPPr avatar Jun 04 '25 10:06 laraPPr

Reopening to test a replacement for dorny

laraPPr avatar Jun 06 '25 14:06 laraPPr

Filtering works and the checks fail as expected

laraPPr avatar Jun 06 '25 15:06 laraPPr