{2023.06}[2023a] PyTorch-Bundle v2.1.2

Open casparvl opened this issue 1 year ago • 86 comments

15 out of 137 required modules missing:

* parameterized/0.9.0-GCCcore-12.3.0 (parameterized-0.9.0-GCCcore-12.3.0.eb)
* tqdm/4.66.1-GCCcore-12.3.0 (tqdm-4.66.1-GCCcore-12.3.0.eb)
* LLVM/14.0.6-GCCcore-12.3.0-llvmlite (LLVM-14.0.6-GCCcore-12.3.0-llvmlite.eb)
* Scalene/1.5.26-GCCcore-12.3.0 (Scalene-1.5.26-GCCcore-12.3.0.eb)
* gperftools/2.12-GCCcore-12.3.0 (gperftools-2.12-GCCcore-12.3.0.eb)
* SentencePiece/0.2.0-GCC-12.3.0 (SentencePiece-0.2.0-GCC-12.3.0.eb)
* tensorboard/2.15.1-gfbf-2023a (tensorboard-2.15.1-gfbf-2023a.eb)
* imageio/2.33.1-gfbf-2023a (imageio-2.33.1-gfbf-2023a.eb)
* libmad/0.15.1b-GCCcore-12.3.0 (libmad-0.15.1b-GCCcore-12.3.0.eb)
* SoX/14.4.2-GCCcore-12.3.0 (SoX-14.4.2-GCCcore-12.3.0.eb)
* NLTK/3.8.1-foss-2023a (NLTK-3.8.1-foss-2023a.eb)
* numba/0.58.1-foss-2023a (numba-0.58.1-foss-2023a.eb)
* scikit-image/0.22.0-foss-2023a (scikit-image-0.22.0-foss-2023a.eb)
* librosa/0.10.1-foss-2023a (librosa-0.10.1-foss-2023a.eb)
* PyTorch-bundle/2.1.2-foss-2023a (PyTorch-bundle-2.1.2-foss-2023a.eb)

casparvl avatar May 23 '24 09:05 casparvl

Instance eessi-bot-mc-aws is configured to build:

  • arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/generic for repo eessi-hpc.org-2023.06-software
  • arch x86_64/generic for repo eessi.io-2023.06-compat
  • arch x86_64/generic for repo eessi.io-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-software
  • arch aarch64/generic for repo eessi.io-2023.06-compat
  • arch aarch64/generic for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software

eessi-bot[bot] avatar May 23 '24 09:05 eessi-bot[bot]

Instance eessi-bot-mc-azure is configured to build:

  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-software

eessi-bot[bot] avatar May 23 '24 09:05 eessi-bot[bot]

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

casparvl avatar May 23 '24 09:05 casparvl

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 resulted in:

    • submitted job 11283, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2126640524

eessi-bot[bot] avatar May 23 '24 09:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • account casparvl has NO permission to send commands to the bot

eessi-bot[bot] avatar May 23 '24 09:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_585/11283

date | job status | comment
May 23 09:23:40 UTC 2024 | submitted | job id 11283 awaits release by job manager
May 23 09:24:02 UTC 2024 | released | job awaits launch by Slurm scheduler
May 23 09:28:04 UTC 2024 | running | job 11283 is running
May 23 09:33:17 UTC 2024 | finished | :cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-11283.out
:x: found message matching ERROR:
:white_check_mark: no message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
May 23 09:33:17 UTC 2024 | test result | :grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11283.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case
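
The build verdict above is driven by a handful of pattern checks on the job output file. As a rough illustration only, the same checks can be reproduced locally on a downloaded `slurm-*.out` file. The pattern strings below are copied from the status comment; the function name `check_build_log` and the choice to treat every pattern as a literal substring are assumptions, not the bot's actual implementation.

```python
# Patterns copied from the bot's status comment. How the bot applies them
# internally is an assumption; here they are plain substring checks.
FORBIDDEN = ["ERROR:", "FAILED:", "required modules missing:"]
REQUIRED = ["No missing installations", ".tar.gz created!"]


def check_build_log(text):
    """Return (ok, report); report maps each pattern to whether it was found."""
    report = {}
    ok = True
    for pattern in FORBIDDEN:
        found = pattern in text
        report[pattern] = found
        ok = ok and not found  # a forbidden pattern must be absent
    for pattern in REQUIRED:
        found = pattern in text
        report[pattern] = found
        ok = ok and found  # a required pattern must be present
    return ok, report
```

Reading the log with `open(path).read()` and passing it to `check_build_log` gives a quick local view of which check tripped before digging through the full output.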

eessi-bot[bot] avatar May 23 '24 09:05 eessi-bot[bot]

== No easyconfigs left to be built.
ERROR: Missing dependencies: SentencePiece/0.2.0-foss-2023a, SoX/14.4.2-foss-2023a (no easyconfig file or existing module found)
== Build succeeded for 0 out of 0
  >> download succeeded: https://github.com/easybuilders/easybuild-easyconfigs/archive/7124863ed588066e5a988b4073d91381497a7f64.tar.gz
  >> running command:
        [started at: 2024-05-23 09:28:34]
        [working dir: /tmp/eb-dlj1ws2x/eb-9tn8fu3_/tmpp3me5uio/easybuilders]
        [output logged in /tmp/eb-dlj1ws2x/eb-9tn8fu3_/easybuild-run_cmd-t6inmlw4.log]
        tar xzf /tmp/eb-dlj1ws2x/eb-9tn8fu3_/tmpp3me5uio/easybuilders/7124863ed588066e5a988b4073d91381497a7f64.tar.gz
  >> command completed: exit 0, ran in 00h00m01s
== found valid index for /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/EasyBuild/4.9.1/easybuild/easyconfigs, so using it...
== Running parse hook for PyTorch-bundle-2.1.2-foss-2023a.eb...
== Running parse hook for foss-2023a.eb...
== resolving dependencies ...
== Running parse hook for parameterized-0.9.0-GCCcore-12.3.0.eb...
== Running parse hook for GCCcore-12.3.0.eb...
== Running parse hook for GCCcore-12.3.0.eb...
== Running parse hook for scikit-image-0.22.0-foss-2023a.eb...
== Running parse hook for librosa-0.10.1-foss-2023a.eb...
== Running parse hook for imageio-2.33.1-gfbf-2023a.eb...
== Running parse hook for gfbf-2023a.eb...
== Running parse hook for gfbf-2023a.eb...
== Running parse hook for GCC-12.3.0.eb...
== Running parse hook for FlexiBLAS-3.3.1-GCC-12.3.0.eb...
== Running parse hook for GCC-12.3.0.eb...
== Running parse hook for FFTW-3.3.10-GCC-12.3.0.eb...
== Running parse hook for NLTK-3.8.1-foss-2023a.eb...
== Running parse hook for numba-0.58.1-foss-2023a.eb...
== Running parse hook for Scalene-1.5.26-GCCcore-12.3.0.eb...
== Running parse hook for tqdm-4.66.1-GCCcore-12.3.0.eb...
== Running parse hook for LLVM-14.0.6-GCCcore-12.3.0-llvmlite.eb...
== Running parse hook for tensorboard-2.15.1-gfbf-2023a.eb...

I guess that with --from-pr we got SentencePiece and SoX correctly since they were already in develop, but with --from-commit we don't? Should I combine multiple --from-commit options, one for each of those (i.e. look up the commit that provided the required SentencePiece, etc.)?

casparvl avatar May 23 '24 09:05 casparvl

I (and @trz42 and @ocaisa ) also saw issues with using --from-commit, see for instance https://github.com/EESSI/software-layer/pull/558#issuecomment-2090836084.

bedroge avatar May 23 '24 09:05 bedroge

Could you try using the merge commit (see bottom of the PR: 04ccd901a613631b00ccbe504d6d66d6a6c2febb) and check if that does work?

bedroge avatar May 23 '24 09:05 bedroge

I tried manually

eb -D PyTorch-bundle-2.1.2-foss-2023a-CUDA-12.1.1.eb --from-commit 04ccd901a613631b00ccbe504d6d66d6a6c2febb

But that still shows missing EasyConfigs.

casparvl avatar May 23 '24 10:05 casparvl

Guess we need to stick to --from-pr then until we find a solution for this...

bedroge avatar May 23 '24 10:05 bedroge

I was being stupid. I made a mistake in what I ran manually: that's with CUDA. That's not included in that PR/commit for sure... :P However,

eb -D PyTorch-bundle-2.1.2-foss-2023a.eb --from-commit 04ccd901a613631b00ccbe504d6d66d6a6c2febb

shows the same missing easyconfigs. I've switched to --from-pr for now. I'll try to create an upstream issue on EasyBuild later (if there isn't any yet).

casparvl avatar May 23 '24 11:05 casparvl

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

casparvl avatar May 23 '24 11:05 casparvl

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 from casparvl

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 resulted in:

    • submitted job 11288, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2126916521

eessi-bot[bot] avatar May 23 '24 11:05 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • account casparvl has NO permission to send commands to the bot

eessi-bot[bot] avatar May 23 '24 11:05 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_585/11288

date | job status | comment
May 23 11:50:20 UTC 2024 | submitted | job id 11288 awaits release by job manager
May 23 11:50:42 UTC 2024 | released | job awaits launch by Slurm scheduler
May 23 11:55:44 UTC 2024 | running | job 11288 is running
May 23 12:23:21 UTC 2024 | finished | :cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-11288.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1716466678.tar.gz
size: 162 MiB (170601270 bytes)
entries: 6321
modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
imageio/2.33.1-gfbf-2023a.lua
LLVM/14.0.6-GCCcore-12.3.0-llvmlite.lua
NLTK/3.8.1-foss-2023a.lua
numba/0.58.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen3/software
imageio/2.33.1-gfbf-2023a
LLVM/14.0.6-GCCcore-12.3.0-llvmlite
NLTK/3.8.1-foss-2023a
numba/0.58.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen3
no other files in tarball
May 23 12:23:21 UTC 2024 | test result | :grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-11288.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar May 23 '24 11:05 eessi-bot[bot]

This is the actual failure:

== 2024-05-23 12:17:16,011 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: extensions sanity check failed for 1 extensions: soundfile
failing sanity check for 'soundfile' extension: command "python -c "import soundfile"" failed; output:
Traceback (most recent call last):
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 161, in <module>
    import _soundfile_data  # ImportError if this doesn't exist
    ^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named '_soundfile_data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 171, in <module>
    _snd = _ffi.dlopen(_libname)
           ^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so.1': libsndfile.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 192, in <module>
    _snd = _ffi.dlopen(_explicit_libname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so': libsndfile.so: cannot open shared object file: No such file or directory,  (at easybuild/framework/easyblock.py:3669 in _sanity_check_step)

I guess this should be provided by the module libsndfile/1.2.2-GCCcore-12.3.0, but I'm not sure which paths get searched by this dlopen call. I think it searches LD_LIBRARY_PATH, which we don't set in EESSI.

I guess this is a pretty fundamental question: how do we make dlopen calls successfully find libs from the EESSI software prefix?

casparvl avatar May 23 '24 13:05 casparvl

See https://github.com/EESSI/software-layer/issues/192 , the Alliance have a solution for this

ocaisa avatar May 23 '24 13:05 ocaisa

Spot on, it is indeed the issue of ctypes.util's find_library only returning the filename, not the full path. Or at least: I see that it is using find_library here to get the _libname, which is then used as the dlopen argument. I.e. I expect that if find_library had returned the full path, the dlopen call would have succeeded.
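
The distinction matters because a bare name leaves the search to the dynamic loader, while an absolute path is opened directly. A small diagnostic sketch for telling the two cases apart follows. `classify_lookup` is a hypothetical helper, not part of ctypes, and the `finder` parameter is there only so the logic can be exercised without depending on the host system.

```python
import ctypes.util
import os


def classify_lookup(name, finder=ctypes.util.find_library):
    """Report what kind of value the finder returns for a library name."""
    result = finder(name)
    if result is None:
        return "not found"
    if os.path.isabs(result):
        return "absolute path"
    # A bare name such as 'libsndfile.so.1' leaves the directory search
    # to the dynamic loader when it is later passed to dlopen.
    return "bare name"
```

Calling `classify_lookup("sndfile")` inside the build environment would show which of the three cases applies there.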

The downside is that the Alliance's solution looks quite involved... The upside is we can probably use their shadowing lib from https://github.com/ComputeCanada/custom_ctypes/tree/main/lib . What I don't fully understand is the sitecustomize and ebpythonprefixes stuff they do. Also, they seem to make a separate module out of it; I'm not entirely sure why (do they only load it when they need to?).

I guess my main consideration would be if we shouldn't just always have this patched find_library function in place. In that case, a simple patch to the installation that normally contains ctypes (I guess that's in the standard Python installation?) would then be enough...

casparvl avatar May 23 '24 14:05 casparvl

I was also thinking that maybe a patch on ctypes is enough, I don't fully understand all the other stuff going on with them

ocaisa avatar May 23 '24 15:05 ocaisa

The changes they apply to ctypes are quite small. See below for Python/3.11.3. Maybe we could apply these changes "in-place" in a build container to test if they solve the issue?

diff -u /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py custom_ctypes/lib/python3.11/site-packages/ctypes/util.py
--- /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py      2024-04-30 16:38:09.000000000 +0200
+++ custom_ctypes/lib/python3.11/site-packages/ctypes/util.py   2024-05-30 16:17:44.000000000 +0200
@@ -326,7 +326,10 @@

         def find_library(name):
             # See issue #9998
+            lib = _findLib_gcc(name)
+            # return absolute path
             return _findSoname_ldconfig(name) or \
+                    os.path.join(os.path.dirname(lib), _get_soname(lib)) or \
                    _get_soname(_findLib_gcc(name)) or _get_soname(_findLib_ld(name))

 ################################################################

trz42 avatar May 30 '24 14:05 trz42

I tried to replace util.py globally (for all installations in https://github.com/NorESSI/software-layer/pull/387), but that already leads to a failure when building/installing scikit-image (the third package). See below for details. When I don't use the modified util.py, it fails with the same error @casparvl hit when building librosa.

    File "/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/ctypes/util.py", line 332, in find_library
      os.path.join(os.path.dirname(lib), _get_soname(lib)) or \
                   ^^^^^^^^^^^^^^^^^^^^
    File "<frozen posixpath>", line 152, in dirname
  TypeError: expected str, bytes or os.PathLike object, not NoneType
  error: subprocess-exited-with-error

Will try to use that modified file only when building/using librosa.
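
The traceback shows `os.path.dirname(lib)` receiving `None`, presumably because the gcc-based lookup found nothing for that library. A minimal sketch of a guard for that case follows. `absolute_soname` and its `get_soname` parameter are hypothetical names standing in for the logic of the patched line; this is neither the Alliance's code nor CPython's.

```python
import os


def absolute_soname(lib, get_soname):
    """Return dirname(lib)/soname, or None when the library was not found.

    lib        -- path returned by a lookup such as the gcc-based one, or None
    get_soname -- callable mapping a library path to its soname (or None)
    """
    if not lib:
        return None  # nothing found: let the caller fall through to the next lookup
    soname = get_soname(lib)
    if not soname:
        return None
    return os.path.join(os.path.dirname(lib), soname)
```

Because the function returns `None` instead of raising, it can sit in an `or` chain like the one in the diff without breaking packages whose libraries are not found by that particular lookup.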

trz42 avatar May 31 '24 13:05 trz42

I've worked out a fix for the import soundfile issue. See https://github.com/NorESSI/software-layer/pull/391

If it works out there, I'll test it with PyTorch-bundle. We can discuss how we should employ this fix (maybe it's better to ship the custom ctypes with EESSI, but for lack of a better idea of where to put it, the above PR puts it under host_injections).

trz42 avatar Jun 04 '24 12:06 trz42

I updated https://github.com/NorESSI/software-layer/pull/387 with the fixes in https://github.com/NorESSI/software-layer/pull/391 to work around the failing sanity check (python -c 'import soundfile'). PyTorch (with CUDA) builds for x86_64/{generic,intel/skylake_avx512,amd/zen2}. It fails for aarch64/generic and x86_64/intel/broadwell with a different issue. It could be worth applying the fixes here as well, to see which builds work (and which don't).

trz42 avatar Jun 06 '24 18:06 trz42

@trz42 I remember you said in a meeting that simply patching ctypes caused issues in other packages. I think the idea was then to pick up a 'patched' ctypes only for a specific phase of the build (the test step? I don't fully remember...). However, it was also brought up in that meeting that this fix would make the build pass, but users would still run into it at runtime, right?

I was thinking: what if we patch ctypes to add a different API call, i.e. a find_library with an extra argument full_path (which defaults to False, i.e. the current behaviour)? Then we patch librosa to call find_library(..., full_path=True). That way, you only get the full path back if you intentionally patch an application that depends on this find_library call. That should have no unintended fallout (because the default function call retains its prior behaviour of only returning the library name, not the full library path), while giving us an easy way to fix future similar issues (simply patch the calls to find_library to add the full_path=True argument). It would also mean it is solved for these packages at runtime as well (we simply patched the package).

Now, this would be super annoying if there are packages that do a lot of find_library calls, since it means a lot of patching. But I assume that should be pretty limited (I mean... how many external libraries can a single package use, right...? Or did I just jinx it :P)
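
A rough sketch of what such an opt-in argument could look like is given here as a standalone wrapper rather than a patch to ctypes itself. The `search_dirs` and `finder` parameters are assumptions added for illustration: the proposal does not say how the directory would be located, and `finder` only makes the wrapper testable without relying on the host system.

```python
import ctypes.util
import os


def find_library(name, full_path=False, search_dirs=(),
                 finder=ctypes.util.find_library):
    """Look up a library; return an absolute path only when asked to."""
    soname = finder(name)
    if not full_path or soname is None or os.path.isabs(soname):
        return soname  # default behaviour is unchanged
    for directory in search_dirs:
        candidate = os.path.join(directory, soname)
        if os.path.isfile(candidate):
            return candidate
    return soname  # nothing matched: fall back to the bare name
```

A patched package would then call `find_library("sndfile", full_path=True, search_dirs=[...])`, while every unpatched caller keeps getting the bare name it gets today.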

casparvl avatar Jun 10 '24 11:06 casparvl

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3

ocaisa avatar Aug 07 '24 11:08 ocaisa

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 resulted in:

    • submitted job 15837, for details & status see https://github.com/EESSI/software-layer/pull/585#issuecomment-2273252775

eessi-bot[bot] avatar Aug 07 '24 11:08 eessi-bot[bot]

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 resulted in:

    • no jobs were submitted

eessi-bot[bot] avatar Aug 07 '24 11:08 eessi-bot[bot]

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_585/15837

date | job status | comment
Aug 07 11:30:23 UTC 2024 | submitted | job id 15837 awaits release by job manager
Aug 07 11:30:57 UTC 2024 | released | job awaits launch by Slurm scheduler
Aug 07 11:36:00 UTC 2024 | running | job 15837 is running
Aug 07 12:38:08 UTC 2024 | finished | :cry: FAILURE (click triangle for details)
Details
:white_check_mark: job output file slurm-15837.out
:x: found message matching ERROR:
:x: found message matching FAILED:
:x: found message matching required modules missing:
:x: no message matching No missing installations
:white_check_mark: found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1723033285.tar.gz
size: 144 MiB (151384274 bytes)
entries: 4814
modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
gperftools/2.12-GCCcore-12.3.0.lua
imageio/2.33.1-gfbf-2023a.lua
libmad/0.15.1b-GCCcore-12.3.0.lua
NLTK/3.8.1-foss-2023a.lua
parameterized/0.9.0-GCCcore-12.3.0.lua
Scalene/1.5.26-GCCcore-12.3.0.lua
scikit-image/0.22.0-foss-2023a.lua
SentencePiece/0.2.0-GCC-12.3.0.lua
SoX/14.4.2-GCCcore-12.3.0.lua
tensorboard/2.15.1-gfbf-2023a.lua
tqdm/4.66.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen3/software
gperftools/2.12-GCCcore-12.3.0
imageio/2.33.1-gfbf-2023a
libmad/0.15.1b-GCCcore-12.3.0
NLTK/3.8.1-foss-2023a
parameterized/0.9.0-GCCcore-12.3.0
Scalene/1.5.26-GCCcore-12.3.0
scikit-image/0.22.0-foss-2023a
SentencePiece/0.2.0-GCC-12.3.0
SoX/14.4.2-GCCcore-12.3.0
tensorboard/2.15.1-gfbf-2023a
tqdm/4.66.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen3
no other files in tarball
Aug 07 12:38:08 UTC 2024 | test result | :grin: SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 17/17 test case(s) from 17 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
:white_check_mark: job output file slurm-15837.out
:x: found message matching ERROR:
:white_check_mark: no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot[bot] avatar Aug 07 '24 11:08 eessi-bot[bot]

=========================== short test summary info ============================
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[None-ImageReadMode.UNCHANGED-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[L-ImageReadMode.GRAY-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-grace_hopper_517x606.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-cmyk_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-gray_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg[RGB-ImageReadMode.RGB-rgb_pytorch.jpg]
FAILED test/test_image.py::test_decode_jpeg_errors - AssertionError: Regex pa...
FAILED test/test_image.py::test_decode_bad_huffman_images - RuntimeError: dec...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt.jpg] - Asserti...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_2.jpg] - Ass...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_3.jpg] - Ass...
FAILED test/test_image.py::test_damaged_corrupt_images[corrupt34_4.jpg] - Ass...
FAILED test/test_image.py::test_encode_jpeg_errors - AssertionError: Regex pa...
FAILED test/test_image.py::test_encode_jpeg[grace_hopper_517x606.jpg] - Runti...
FAILED test/test_image.py::test_write_jpeg[grace_hopper_517x606.jpg] - Runtim...
= 21 failed, 48811 passed, 50354 skipped, 2503 deselected, 2220 warnings in 965.82s (0:16:05) =

All of the failures look something like this:

=================================== FAILURES ===================================
___ test_decode_jpeg[None-ImageReadMode.UNCHANGED-grace_hopper_517x606.jpg] ____
test/test_image.py:94: in test_decode_jpeg
    img_ljpeg = decode_image(data, mode=mode)
/tmp/eb-fwlstir4/eb-ghhapv8m/tmpxrxoma_b/lib/python3.11/site-packages/torchvision/io/image.py:236: in decode_image
    output = torch.ops.image.decode_image(input, mode.value)
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/PyTorch/2.1.2-foss-2023a/lib/python3.11/site-packages/torch/_ops.py:692: in __call__
    return self._op(*args, **kwargs or {})
E   RuntimeError: decode_jpeg: torchvision not compiled with libjpeg support

casparvl avatar Aug 08 '24 08:08 casparvl