software-layer
[WIP] DEBUG only {2023.06,2023a} PyTorch-bundle v2.1.2
The main purpose of this PR is to facilitate debugging various issues that arise when building PyTorch-bundle and to demonstrate approaches that could solve them. The fixes provided here are not expected to be final.
- ~includes a fix for `find_library` provided by `ctypes.util` which prevented importing `soundfile`~
  - superseded by fixing it in the Python installations
- includes a fix for `aarch64/{generic,neoverse_n1,neoverse_v1}` where importing `sentencepiece` led to the error `libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block`
- ~includes a fix for the extension `torchvision` where some library was not compiled with `jpeg` support, hence some tests failed~ $\rightarrow$
  - was fixed by https://github.com/easybuilders/easybuild-easyblocks/pull/3322
  - we switch to EasyBuild/4.9.2 for building this PR because the updated easyblock for torchvision (PR 3322) was released with EasyBuild/4.9.2
Initially we will disable all fixes, build for selected architectures and document the errors. We then enable fixes one-by-one and document the results (some error fixed, some new errors, ...).
Note, see the original PR for PyTorch-bundle (https://github.com/EESSI/software-layer/pull/585) for additional discussion about some of the issues listed above.
Instance eessi-bot-mc-aws is configured to build:
- arch `x86_64/generic` for repo `eessi-hpc.org-2023.06-compat`
- arch `x86_64/generic` for repo `eessi-hpc.org-2023.06-software`
- arch `x86_64/generic` for repo `eessi.io-2023.06-compat`
- arch `x86_64/generic` for repo `eessi.io-2023.06-software`
- arch `x86_64/intel/haswell` for repo `eessi-hpc.org-2023.06-compat`
- arch `x86_64/intel/haswell` for repo `eessi-hpc.org-2023.06-software`
- arch `x86_64/intel/haswell` for repo `eessi.io-2023.06-compat`
- arch `x86_64/intel/haswell` for repo `eessi.io-2023.06-software`
- arch `x86_64/intel/skylake_avx512` for repo `eessi-hpc.org-2023.06-compat`
- arch `x86_64/intel/skylake_avx512` for repo `eessi-hpc.org-2023.06-software`
- arch `x86_64/intel/skylake_avx512` for repo `eessi.io-2023.06-compat`
- arch `x86_64/intel/skylake_avx512` for repo `eessi.io-2023.06-software`
- arch `x86_64/amd/zen2` for repo `eessi-hpc.org-2023.06-compat`
- arch `x86_64/amd/zen2` for repo `eessi-hpc.org-2023.06-software`
- arch `x86_64/amd/zen2` for repo `eessi.io-2023.06-compat`
- arch `x86_64/amd/zen2` for repo `eessi.io-2023.06-software`
- arch `x86_64/amd/zen3` for repo `eessi-hpc.org-2023.06-compat`
- arch `x86_64/amd/zen3` for repo `eessi-hpc.org-2023.06-software`
- arch `x86_64/amd/zen3` for repo `eessi.io-2023.06-compat`
- arch `x86_64/amd/zen3` for repo `eessi.io-2023.06-software`
- arch `aarch64/generic` for repo `eessi-hpc.org-2023.06-compat`
- arch `aarch64/generic` for repo `eessi-hpc.org-2023.06-software`
- arch `aarch64/generic` for repo `eessi.io-2023.06-compat`
- arch `aarch64/generic` for repo `eessi.io-2023.06-software`
- arch `aarch64/neoverse_n1` for repo `eessi-hpc.org-2023.06-compat`
- arch `aarch64/neoverse_n1` for repo `eessi-hpc.org-2023.06-software`
- arch `aarch64/neoverse_n1` for repo `eessi.io-2023.06-compat`
- arch `aarch64/neoverse_n1` for repo `eessi.io-2023.06-software`
- arch `aarch64/neoverse_v1` for repo `eessi-hpc.org-2023.06-compat`
- arch `aarch64/neoverse_v1` for repo `eessi-hpc.org-2023.06-software`
- arch `aarch64/neoverse_v1` for repo `eessi.io-2023.06-compat`
- arch `aarch64/neoverse_v1` for repo `eessi.io-2023.06-software`
Instance eessi-bot-mc-azure is configured to build:
- arch `x86_64/amd/zen4` for repo `eessi-hpc.org-2023.06-compat`
- arch `x86_64/amd/zen4` for repo `eessi-hpc.org-2023.06-software`
- arch `x86_64/amd/zen4` for repo `eessi.io-2023.06-compat`
- arch `x86_64/amd/zen4` for repo `eessi.io-2023.06-software`
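
The two instance configurations listed above decide which bot instance can act on a given build request. The following is a minimal sketch of that matching, assuming a plain dictionary as a stand-in for the bots' real configuration. The names `INSTANCE_TARGETS` and `instances_for` are hypothetical and not part of the bot.

```python
from itertools import product

# The four repositories listed for every architecture above.
REPOS = [
    "eessi-hpc.org-2023.06-compat",
    "eessi-hpc.org-2023.06-software",
    "eessi.io-2023.06-compat",
    "eessi.io-2023.06-software",
]

# Architectures per instance, copied from the lists above.
AWS_ARCHS = [
    "x86_64/generic",
    "x86_64/intel/haswell",
    "x86_64/intel/skylake_avx512",
    "x86_64/amd/zen2",
    "x86_64/amd/zen3",
    "aarch64/generic",
    "aarch64/neoverse_n1",
    "aarch64/neoverse_v1",
]
AZURE_ARCHS = ["x86_64/amd/zen4"]

# Hypothetical stand-in for the real bot configuration.
INSTANCE_TARGETS = {
    "eessi-bot-mc-aws": set(product(AWS_ARCHS, REPOS)),
    "eessi-bot-mc-azure": set(product(AZURE_ARCHS, REPOS)),
}


def instances_for(arch, repo):
    """Return the names of the instances configured for (arch, repo)."""
    return sorted(
        name
        for name, targets in INSTANCE_TARGETS.items()
        if (arch, repo) in targets
    )


if __name__ == "__main__":
    print(instances_for("x86_64/amd/zen2", "eessi.io-2023.06-software"))
```

With this model, a request for `x86_64/amd/zen2` matches only `eessi-bot-mc-aws`, so an instance that is not configured for the requested architecture has nothing to submit.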
Initially we'll build only for zen2 and aarch64/generic...
bot: build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software
bot: build arch:aarch64/generic repo:eessi.io-2023.06-software
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software`
- received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software`
- handling command `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software` resulted in:
  - submitted job `12607`, for details & status see https://github.com/EESSI/software-layer/pull/603#issuecomment-2162771800
- handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in:
  - submitted job `12608`, for details & status see https://github.com/EESSI/software-layer/pull/603#issuecomment-2162771906
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software`
- received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software`
- handling command `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software` resulted in:
  - no jobs were submitted
- handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in:
  - no jobs were submitted
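
The bot updates above show each received command next to its "expanded format", in which `arch:` becomes `architecture:` and `repo:` becomes `repository:`. The sketch below reproduces only that visible mapping. It is not the bot's implementation, and the names `ABBREVIATIONS` and `expand_bot_command` are hypothetical.

```python
# Abbreviations visible in the bot updates above; other keys pass through unchanged.
ABBREVIATIONS = {"arch": "architecture", "repo": "repository"}


def expand_bot_command(command):
    """Expand abbreviated `key:value` arguments of a bot command."""
    expanded = []
    for word in command.split():
        key, sep, value = word.partition(":")
        if sep and key in ABBREVIATIONS:
            word = f"{ABBREVIATIONS[key]}:{value}"
        expanded.append(word)
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_bot_command(
        "build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software"
    ))
```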
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.06/pr_603/12607
- fails in the sanity check for `librosa/0.10.1-foss-2023a` when running `python -c "import soundfile"` with the log messages
== 2024-06-12 12:00:43,829 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: extensions sanity check failed for 1 extensions: soundfile
failing sanity check for 'soundfile' extension: command "python -c "import soundfile"" failed; output:
Traceback (most recent call last):
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 161, in <module>
import _soundfile_data # ImportError if this doesn't exist
^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named '_soundfile_data'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 171, in <module>
_snd = _ffi.dlopen(_libname)
^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so.1': libsndfile.so.1: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 192, in <module>
_snd = _ffi.dlopen(_explicit_libname)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so': libsndfile.so: cannot open shared object file: No such file or directory, (at easybuild/framework/easyblock.py:3669 in _sanity_check_step)
- to work around this error we need a custom `ctypes`
| date | job status | comment |
|---|---|---|
| Jun 12 11:27:18 UTC 2024 | submitted | job id 12607 awaits release by job manager |
| Jun 12 11:28:21 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Jun 12 11:35:26 UTC 2024 | running | job 12607 is running |
| Jun 12 12:08:26 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |
| Jun 12 12:08:26 UTC 2024 | test result | :cry: FAILURE (click triangle for details) |
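
The traceback in the job report above shows `dlopen` failing on `libsndfile.so.1`, which suggests a library name was resolved but could not be opened. The sketch below separates those two steps for any library name. It assumes, as the PR description indicates, that the lookup goes through `ctypes.util.find_library`. The helper name `check_library` is hypothetical.

```python
import ctypes
from ctypes.util import find_library


def check_library(name):
    """Return (resolved_name, loadable) for a library such as 'sndfile'.

    resolved_name is whatever ctypes.util.find_library reports (or None);
    loadable tells whether the dynamic loader can actually open that name.
    """
    resolved = find_library(name)
    if resolved is None:
        return None, False
    try:
        ctypes.CDLL(resolved)
    except OSError:
        # The name was resolved, but the loader cannot open it.
        return resolved, False
    return resolved, True


if __name__ == "__main__":
    print(check_library("sndfile"))
```

A result with a resolved name but `loadable == False` would correspond to the situation in the log, where a name is found but the loader cannot open the file.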
New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.06/pr_603/12608
- fails in the sanity check for `librosa/0.10.1-foss-2023a` when running `python -c "import soundfile"` with the log messages
== 2024-06-12 11:55:32,669 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: extensions sanity check failed for 1 extensions: soundfile
failing sanity check for 'soundfile' extension: command "python -c "import soundfile"" failed; output:
Traceback (most recent call last):
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 161, in <module>
import _soundfile_data # ImportError if this doesn't exist
^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named '_soundfile_data'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 171, in <module>
_snd = _ffi.dlopen(_libname)
^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so.1': libsndfile.so.1: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/librosa/0.10.1-foss-2023a/lib/python3.11/site-packages/soundfile.py", line 192, in <module>
_snd = _ffi.dlopen(_explicit_libname)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: cannot load library 'libsndfile.so': libsndfile.so: cannot open shared object file: No such file or directory, (at easybuild/framework/easyblock.py:3669 in _sanity_check_step)
- to work around this error we need a custom `ctypes`
| date | job status | comment |
|---|---|---|
| Jun 12 11:27:22 UTC 2024 | submitted | job id 12608 awaits release by job manager |
| Jun 12 11:28:19 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Jun 12 11:34:23 UTC 2024 | running | job 12608 is running |
| Jun 12 12:04:20 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |
| Jun 12 12:04:20 UTC 2024 | test result | :cry: FAILURE (click triangle for details) |
The two jobs that did not include any fixes (12607 and 12608) both failed in the sanity check for librosa. After enabling the fixes for that by
- installing a custom `ctypes` library;
- adding a `parse_hook` to use the custom `ctypes` library in the sanity check; and
- adding a `pre_module_hook` that adds a setting to use this custom `ctypes` library when the module for `librosa` is loaded;

we repeat the build for the same architectures, zen2 and aarch64/generic...
bot: build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software
bot: build arch:aarch64/generic repo:eessi.io-2023.06-software
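
The `parse_hook` and `pre_module_hook` mentioned above are EasyBuild hooks, i.e. Python functions whose names identify when they run. The sketch below only illustrates the shape of such a pair and is not the code used in this PR. Plain dictionaries stand in for EasyBuild's easyconfig objects. The path, the environment variable name and the `sanity_check_env` key are all hypothetical placeholders.

```python
# Hypothetical install location of the custom ctypes library (assumption).
CUSTOM_CTYPES_DIR = "/path/to/custom_ctypes"
# Hypothetical environment variable that makes Python pick it up (assumption).
ENV_VAR = "CUSTOM_CTYPES_DIR"


def parse_hook(ec, *args, **kwargs):
    """After parsing: make the sanity check for librosa use the custom ctypes."""
    if ec.get("name") == "librosa":
        # 'sanity_check_env' is an illustrative key, not an EasyBuild parameter.
        ec["sanity_check_env"] = {ENV_VAR: CUSTOM_CTYPES_DIR}


def pre_module_hook(ec, *args, **kwargs):
    """Before module generation: export the setting when librosa is loaded."""
    if ec.get("name") == "librosa":
        extra = dict(ec.get("modextravars") or {})
        extra[ENV_VAR] = CUSTOM_CTYPES_DIR
        ec["modextravars"] = extra


if __name__ == "__main__":
    ec = {"name": "librosa", "version": "0.10.1"}
    parse_hook(ec)
    pre_module_hook(ec)
    print(ec)
```

Both functions return early for any other software, so the workaround stays limited to `librosa`.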
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software`
- received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software`
- handling command `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software` resulted in:
  - submitted job `12808`, for details & status see https://github.com/EESSI/software-layer/pull/603#issuecomment-2169398033
- handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in:
  - submitted job `12809`, for details & status see https://github.com/EESSI/software-layer/pull/603#issuecomment-2169398074
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software`
- received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software`
- handling command `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software` resulted in:
  - no jobs were submitted
- handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.06/pr_603/12808
- failed with errors when testing the extension `torchvision` of `PyTorch-bundle`...
=================================== FAILURES ===================================
___ test_decode_jpeg[None-ImageReadMode.UNCHANGED-grace_hopper_517x606.jpg] ____
test/test_image.py:94: in test_decode_jpeg
img_ljpeg = decode_image(data, mode=mode)
/tmp/eb-7t6okia0/eb-js7oqjgv/tmpjpww4km2/lib/python3.11/site-packages/torchvision/io/image.py:236: in decode_image
output = torch.ops.image.decode_image(input, mode.value)
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/PyTorch/2.1.2-foss-2023a/lib/python3.11/site-packages/torch/_ops.py:692: in __call__
return self._op(*args, **kwargs or {})
E RuntimeError: decode_jpeg: torchvision not compiled with libjpeg support
- inspecting the job's individual build step logs (via `bot/inspect.sh --resume previous_tmp/build_step/eessi.io-2023.06-software-1718457554.tgz` run in the job's working directory `/project/def-users/SHARED/jobs/2024.06/pr_603/12808` on the same type of node, e.g., via an interactive job submitted with `srun --partition x86-64-amd-zen2-node --time=60 --pty bash`), we find the following messages in `/tmp/eb-7t6okia0/eb-js7oqjgv/easybuild-run_cmd-9b5lqisq.log` (log file for building the extension `torchvision`)
Compiling extensions with following flags:
FORCE_CUDA: False
FORCE_MPS: False
DEBUG: False
TORCHVISION_USE_PNG: True
TORCHVISION_USE_JPEG: True
TORCHVISION_USE_NVJPEG: True
TORCHVISION_USE_FFMPEG: True
TORCHVISION_USE_VIDEO_CODEC: True
NVCC_FLAGS:
Compiling with debug mode OFF
Found PNG library
Building torchvision with PNG image support
libpng version: 1.6.39
libpng include path: /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/libpng/1.6.39-GCCcore-12.3.0/include/libpng16
Running build on conda-build: False
Running build on conda: False
Building torchvision without JPEG image support
Building torchvision without NVJPEG image support
- it looks like it doesn't find the `jpeg` library and hence builds without `JPEG` support
  - consequently, it later fails in the test step
- the `setup.py` in `/tmp/bot/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchvision/vision-0.16.2` that produces the above messages (showing that `torchvision` is compiled without `JPEG` support) includes a function `find_library` with the following code

      def find_library(name, vision_include):
          this_dir = os.path.dirname(os.path.abspath(__file__))
          build_prefix = os.environ.get("BUILD_PREFIX", None)
          is_conda_build = build_prefix is not None

          library_found = False
          conda_installed = False
          lib_folder = None
          include_folder = None
          library_header = f"{name}.h"

          # Lookup in TORCHVISION_INCLUDE or in the package file
          package_path = [os.path.join(this_dir, "torchvision")]
          for folder in vision_include + package_path:
              candidate_path = os.path.join(folder, library_header)
              library_found = os.path.exists(candidate_path)
              if library_found:
                  break

- running the build script (`setup.py`) manually in an "inspect" session revealed that the second parameter to `find_library` was an empty list `[]`
  - the suspicion is that `TORCHVISION_INCLUDE` was not set although it should have been if the easyblock for `torchvision` is used, see https://github.com/easybuilders/easybuild-easyblocks/blob/10e9a62d44d653e04f735962620a33bc22225477/easybuild/easyblocks/t/torchvision.py#L83-L85
| date | job status | comment |
|---|---|---|
| Jun 15 12:04:28 UTC 2024 | submitted | job id 12808 awaits release by job manager |
| Jun 15 12:04:32 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Jun 15 12:10:36 UTC 2024 | running | job 12808 is running |
| Jun 15 13:47:58 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |
| Jun 15 13:47:58 UTC 2024 | test result | :grin: SUCCESS (click triangle for details) |
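
The effect of an empty include list on the header lookup quoted in the job report above can be reproduced in isolation. The sketch below keeps only the loop over candidate directories and omits the package-path fallback of the original. The helper names `header_found` and `demo` are hypothetical, and passing `"jpeglib"` as the library name is an assumption based on the name of the libjpeg header (`jpeglib.h`).

```python
import os
import tempfile


def header_found(name, include_dirs):
    """Return True if <name>.h exists in any of the given directories."""
    header = f"{name}.h"
    return any(
        os.path.exists(os.path.join(folder, header)) for folder in include_dirs
    )


def demo():
    """Compare a lookup with a populated include list against an empty one."""
    with tempfile.TemporaryDirectory() as include_dir:
        # Create an empty placeholder header in a temporary include directory.
        open(os.path.join(include_dir, "jpeglib.h"), "w").close()
        with_include = header_found("jpeglib", [include_dir])
    without_include = header_found("jpeglib", [])
    return with_include, without_include


if __name__ == "__main__":
    print(demo())
```

With an empty list there is no directory to search, which would be consistent with the "Building torchvision without JPEG image support" message in the build log.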
New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.06/pr_603/12809
- failed in the sanity check for `SentencePiece/0.2.0-GCC-12.3.0` with the following log messages
== 2024-06-15 12:40:44,834 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: sanity check command python -c 'import sentencepiece' exited with code 1 (output: Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <module>
from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block
) (at easybuild/framework/easyblock.py:3669 in _sanity_check_step)
| date | job status | comment |
|---|---|---|
| Jun 15 12:04:32 UTC 2024 | submitted | job id 12809 awaits release by job manager |
| Jun 15 12:05:34 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Jun 15 12:11:38 UTC 2024 | running | job 12809 is running |
| Jun 15 13:04:14 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |
| Jun 15 13:04:14 UTC 2024 | test result | :grin: SUCCESS (click triangle for details) |
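
The sanity check in the log above boils down to running `python -c 'import sentencepiece'` and inspecting the exit code. The sketch below runs the same kind of check for any module name. It is not EasyBuild's implementation, and the helper name `import_check` is hypothetical.

```python
import subprocess
import sys


def import_check(module):
    """Run `python -c "import <module>"` in a fresh interpreter.

    Returns (exit_code, stderr) so the caller can inspect the traceback.
    """
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr.strip()


if __name__ == "__main__":
    code, err = import_check("sentencepiece")
    print(code)
    print(err)
```

Each check runs in a fresh interpreter, so libraries already loaded in the calling process cannot mask a failure that occurs while the module's shared libraries are loaded.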
The two jobs (12808 // zen2 and 12809 // aarch64/generic) no longer failed for the earlier reason (the failed import of soundfile). However, they failed for different reasons (for details see above). We first fix the issue for aarch64/generic (because the build for that architecture failed earlier than the build for zen2). The fix disables the use of the TC_MALLOC library. Because the fix is made for aarch64/generic only, we also check whether builds for the other aarch64 targets are affected by the issue.
bot: build arch:aarch64/generic repo:eessi.io-2023.06-software
bot: build arch:aarch64/neoverse_n1 repo:eessi.io-2023.06-software
bot: build arch:aarch64/neoverse_v1 repo:eessi.io-2023.06-software
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software`
- received bot command `build arch:aarch64/neoverse_n1 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/neoverse_n1 repository:eessi.io-2023.06-software`
- received bot command `build arch:aarch64/neoverse_v1 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/neoverse_v1 repository:eessi.io-2023.06-software`
- handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in:
  - submitted job `12813`, for details & status see https://github.com/EESSI/software-layer/pull/603#issuecomment-2170460546
- handling command `build architecture:aarch64/neoverse_n1 repository:eessi.io-2023.06-software` resulted in:
  - submitted job `12814`, for details & status see https://github.com/EESSI/software-layer/pull/603#issuecomment-2170460599
- handling command `build architecture:aarch64/neoverse_v1 repository:eessi.io-2023.06-software` resulted in:
  - submitted job `12815`, for details & status see https://github.com/EESSI/software-layer/pull/603#issuecomment-2170460654
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build arch:aarch64/generic repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/generic repository:eessi.io-2023.06-software`
- received bot command `build arch:aarch64/neoverse_n1 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/neoverse_n1 repository:eessi.io-2023.06-software`
- received bot command `build arch:aarch64/neoverse_v1 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/neoverse_v1 repository:eessi.io-2023.06-software`
- handling command `build architecture:aarch64/generic repository:eessi.io-2023.06-software` resulted in:
  - no jobs were submitted
- handling command `build architecture:aarch64/neoverse_n1 repository:eessi.io-2023.06-software` resulted in:
  - no jobs were submitted
- handling command `build architecture:aarch64/neoverse_v1 repository:eessi.io-2023.06-software` resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.06/pr_603/12813
- fails with a new error for extension `torchtext`
== 2024-06-15 18:44:56,282 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): cmd "export PYTHONPATH=/tmp/eb-4o0di9ui/eb-qo9jlvzo/tmp0g004
oib/lib/python3.11/site-packages:$PYTHONPATH && pytest test/torchtext_unittest -k "not test_vocab_from_raw_text_file"" and not test_get_tokenizer_moses"" and not test_get_tokenizer_spacy"" and no
t test_download_charngram_vectors" " exited with exit code -11 and output:
Fatal Python error: Segmentation fault
Current thread 0x000040002a9e5a00 (most recent call first):
File "/tmp/bot/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/test/torchtext_unittest/test_transforms.py", line 1268 in TestMaskTransform
File "/tmp/bot/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/test/torchtext_unittest/test_transforms.py", line 1255 in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/assertion/rewrite.py", line
178 in exec_module
File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
File "<frozen importlib._bootstrap>", line 1206 in _gcd_import
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/importlib/__init__.py", line 126 in import_module
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/pathlib.py", line 565 in im
port_path
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/python.py", line 617 in _im
porttestmodule
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/python.py", line 528 in _ge
tobj
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/python.py", line 310 in obj
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/python.py", line 545 in _in
ject_setup_module_fixture
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/python.py", line 531 in col
lect
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/runner.py", line 372 in <la
mbda>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/runner.py", line 341 in fro
m_call
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/runner.py", line 372 in pyt
est_make_collect_report
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/hatchling/1.18.0-GCCcore-12.3.0/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/hatchling/1.18.0-GCCcore-12.3.0/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/hatchling/1.18.0-GCCcore-12.3.0/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/runner.py", line 547 in col
lect_one_node
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/main.py", line 836 in genit
ems
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/main.py", line 839 in genit
ems
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/main.py", line 669 in perfo
rm_collect
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/main.py", line 334 in pytes
t_collection
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/hatchling/1.18.0-GCCcore-12.3.0/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/hatchling/1.18.0-GCCcore-12.3.0/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/hatchling/1.18.0-GCCcore-12.3.0/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/main.py", line 323 in _main
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/main.py", line 270 in wrap_
session
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/main.py", line 317 in pytes
t_cmdline_main
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/hatchling/1.18.0-GCCcore-12.3.0/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/hatchling/1.18.0-GCCcore-12.3.0/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/hatchling/1.18.0-GCCcore-12.3.0/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/config/__init__.py", line 1
66 in main
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/config/__init__.py", line 1
89 in console_main
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/generic/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/bin/pytest", line 8 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.ra
ndom._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._lina
lg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, gmpy2.gmpy2, simplejson._speedups (total: 22)
- it may be that we have seen that earlier when building for NESSI ... we didn't have a fix for that there, so this requires more investigation
| date | job status | comment |
|---|---|---|
| Jun 15 18:07:39 UTC 2024 | submitted | job id 12813 awaits release by job manager |
| Jun 15 18:08:23 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Jun 15 18:13:30 UTC 2024 | running | job 12813 is running |
| Jun 15 19:09:48 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |
| Jun 15 19:09:48 UTC 2024 | test result | :grin: SUCCESS (click triangle for details) |
New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_n1 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.06/pr_603/12814
- failed with the same error as seen earlier on `aarch64/generic` (job 12809)
== 2024-06-15 18:42:59,199 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: sanity check command python -c 'import
sentencepiece' exited with code 1 (output: Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <m
odule>
from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static T
LS block
| date | job status | comment |
|---|---|---|
| Jun 15 18:07:43 UTC 2024 | submitted | job id 12814 awaits release by job manager |
| Jun 15 18:08:25 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Jun 15 18:14:32 UTC 2024 | running | job 12814 is running |
| Jun 15 19:06:45 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |
| Jun 15 19:06:45 UTC 2024 | test result | :grin: SUCCESS (click triangle for details) |
New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_v1 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.06/pr_603/12815
- failed with the same error as seen earlier on `aarch64/generic` (job 12809)
== 2024-06-15 18:36:00,141 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: sanity check command python -c 'import sentencepiece' exited with code 1 (output: Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_v1/software/SentencePiece/0.2.0-GCC-12.3.0/lib/python3.11/site-packages/sentencepiece/__init__.py", line 10, in <module>
from . import _sentencepiece
ImportError: /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_v1/software/gperftools/2.12-GCCcore-12.3.0/lib64/libtcmalloc_minimal.so.4: cannot allocate memory in static TLS block
| date | job status | comment |
|---|---|---|
| Jun 15 18:07:47 UTC 2024 | submitted | job id 12815 awaits release by job manager |
| Jun 15 18:08:27 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Jun 15 18:14:34 UTC 2024 | running | job 12815 is running |
| Jun 15 18:52:16 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |
| Jun 15 18:52:16 UTC 2024 | test result | :grin: SUCCESS (click triangle for details) |
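The sanity-check failure reported for these jobs can be recognized from the command output alone. The following is a minimal sketch, not part of EasyBuild or the EESSI bot: the helper names `is_static_tls_error` and `check_import` are hypothetical, and the marker string is copied from the `ImportError` in the logs above.

```python
import subprocess
import sys

# Message fragment taken from the ImportError shown in the build logs.
STATIC_TLS_MARKER = "cannot allocate memory in static TLS block"


def is_static_tls_error(output: str) -> bool:
    """Return True if the output contains the static TLS allocation error."""
    return STATIC_TLS_MARKER in output


def check_import(module: str) -> tuple[bool, bool]:
    """Run `python -c 'import <module>'` in a fresh interpreter.

    Returns (import_succeeded, failed_with_static_tls_error).
    Hypothetical helper for illustration only.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, is_static_tls_error(result.stderr)
```

Calling `check_import("sentencepiece")` in an affected environment would be expected to report a failed import together with the static TLS flag; running the import in a separate interpreter keeps the failure from affecting the calling process.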
Rebuilding for aarch64/neoverse_n1 and aarch64/neoverse_v1 after the fix for SentencePiece was extended to these architectures...
bot: build arch:aarch64/neoverse_n1 repo:eessi.io-2023.06-software
bot: build arch:aarch64/neoverse_v1 repo:eessi.io-2023.06-software
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build arch:aarch64/neoverse_n1 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/neoverse_n1 repository:eessi.io-2023.06-software`
- received bot command `build arch:aarch64/neoverse_v1 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/neoverse_v1 repository:eessi.io-2023.06-software`
- handling command `build architecture:aarch64/neoverse_n1 repository:eessi.io-2023.06-software` resulted in:
  - submitted job `12816`, for details & status see https://github.com/EESSI/software-layer/pull/603#issuecomment-2170583100
- handling command `build architecture:aarch64/neoverse_v1 repository:eessi.io-2023.06-software` resulted in:
  - submitted job `12817`, for details & status see https://github.com/EESSI/software-layer/pull/603#issuecomment-2170583352
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build arch:aarch64/neoverse_n1 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/neoverse_n1 repository:eessi.io-2023.06-software`
- received bot command `build arch:aarch64/neoverse_v1 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:aarch64/neoverse_v1 repository:eessi.io-2023.06-software`
- handling command `build architecture:aarch64/neoverse_n1 repository:eessi.io-2023.06-software` resulted in:
  - no jobs were submitted
- handling command `build architecture:aarch64/neoverse_v1 repository:eessi.io-2023.06-software` resulted in:
  - no jobs were submitted
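The bot updates above show each command twice: once as typed (`arch:`, `repo:`) and once in an "expanded format" (`architecture:`, `repository:`). The sketch below illustrates that expansion under stated assumptions. `expand_bot_command` is a hypothetical helper, not the bot's actual code, and the abbreviation table only contains the two pairs visible in this thread.

```python
# Abbreviations observed in the bot updates of this thread; the real bot
# may accept more keys than these two.
ABBREVIATIONS = {"arch": "architecture", "repo": "repository"}


def expand_bot_command(command: str) -> str:
    """Expand abbreviated filter keys in a command such as `build arch:... repo:...`.

    Hypothetical helper for illustration; assumes a non-empty command whose
    first word is the action and whose remaining words are key:value filters.
    """
    action, *filters = command.split()
    expanded = []
    for item in filters:
        key, sep, value = item.partition(":")
        # Unknown keys are passed through unchanged.
        expanded.append(f"{ABBREVIATIONS.get(key, key)}{sep}{value}")
    return " ".join([action, *expanded])
```

For example, passing `build arch:aarch64/neoverse_n1 repo:eessi.io-2023.06-software` yields the expanded form that the bot instances report.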
New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_n1 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.06/pr_603/12816
- now fails with the same error as the build for `aarch64/generic`
== 2024-06-15 20:08:01,404 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): cmd "export PYTHONPATH=/tmp/eb-t17gza4h/eb-ul5a_hbb/tmpr1l71y06/lib/python3.11/site-packages:$PYTHONPATH && pytest test/torchtext_unittest -k "not test_vocab_from_raw_text_file"" and not test_get_tokenizer_moses"" and not test_get_tokenizer_spacy"" and not test_download_charngram_vectors" " exited with exit code -11 and output:
Fatal Python error: Segmentation fault
Current thread 0x000040003d3e5a80 (most recent call first):
File "/tmp/bot/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/test/torchtext_unittest/test_transforms.py", line 1268 in TestMaskTransform
File "/tmp/bot/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/test/torchtext_unittest/test_transforms.py", line 1255 in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/assertion/rewrite.py", line 178 in exec_module
File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
File "<frozen importlib._bootstrap>", line 1206 in _gcd_import
...
| date | job status | comment |
|---|---|---|
| Jun 15 19:34:52 UTC 2024 | submitted | job id 12816 awaits release by job manager |
| Jun 15 19:35:52 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Jun 15 19:36:56 UTC 2024 | running | job 12816 is running |
| Jun 15 20:35:33 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |
| Jun 15 20:35:33 UTC 2024 | test result | :grin: SUCCESS (click triangle for details) |
New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_v1 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.06/pr_603/12817
- now fails with the same error as the build for `aarch64/generic`
== 2024-06-15 20:00:37,536 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): cmd "export PYTHONPATH=/tmp/eb-663ngo7q/eb-6zm49he7/tmph9cftg0x/lib/python3.11/site-packages:$PYTHONPATH && pytest test/torchtext_unittest -k "not test_vocab_from_raw_text_file"" and not test_get_tokenizer_moses"" and not test_get_tokenizer_spacy"" and not test_download_charngram_vectors" " exited with exit code -11 and output:
Fatal Python error: Segmentation fault
Current thread 0x000040003cc75a80 (most recent call first):
File "/tmp/bot/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/test/torchtext_unittest/test_transforms.py", line 1268 in TestMaskTransform
File "/tmp/bot/easybuild/build/PyTorchbundle/2.1.2/foss-2023a/torchtext/text-0.16.2/test/torchtext_unittest/test_transforms.py", line 1255 in <module>
File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_v1/software/Python-bundle-PyPI/2023.06-GCCcore-12.3.0/lib/python3.11/site-packages/_pytest/assertion/rewrite.py", line 178 in exec_module
File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
File "<frozen importlib._bootstrap>", line 1206 in _gcd_import
| date | job status | comment |
|---|---|---|
| Jun 15 19:34:56 UTC 2024 | submitted | job id 12817 awaits release by job manager |
| Jun 15 19:35:54 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Jun 15 19:36:58 UTC 2024 | running | job 12817 is running |
| Jun 15 20:18:15 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |
| Jun 15 20:18:15 UTC 2024 | test result | :grin: SUCCESS (click triangle for details) |
Rebuilding for zen2 to verify whether a new easyblock for torchvision fixes the issue that libjpeg could not be found...
bot: build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software
Updates by the bot instance boegel-bot-deucalion
- account `trz42` has NO permission to send commands to the bot
Updates by the bot instance eessi-bot-mc-aws
- received bot command `build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software`
- handling command `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software` resulted in:
  - submitted job `13549`, for details & status see https://github.com/EESSI/software-layer/pull/603#issuecomment-2198332951
Updates by the bot instance eessi-bot-mc-azure
- received bot command `build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software` from `trz42`
  - expanded format: `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software`
- handling command `build architecture:x86_64/amd/zen2 repository:eessi.io-2023.06-software` resulted in:
  - no jobs were submitted
New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.06/pr_603/13549
- the installation of `PyTorch-bundle` succeeded, so the updated easyblock for `torchvision` works! :tada:
- however, the build failed when checking for missing installations with
  1 out of 138 required modules missing:
  * grpcio/1.57.0-GCCcore-12.3.0 (grpcio-1.57.0-GCCcore-12.3.0.eb)
- that should be easy to fix, see https://github.com/NorESSI/software-layer/pull/408
| date | job status | comment |
|---|---|---|
| Jun 29 20:55:20 UTC 2024 | submitted | job id 13549 awaits release by job manager |
| Jun 29 20:55:26 UTC 2024 | released | job awaits launch by Slurm scheduler |
| Jun 29 21:00:28 UTC 2024 | running | job 13549 is running |
| Jun 29 23:04:35 UTC 2024 | finished | :cry: FAILURE (click triangle for details) |
| Jun 29 23:04:35 UTC 2024 | test result | :grin: SUCCESS (click triangle for details) |
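The missing-installations check for job 13549 prints one `* module (easyconfig.eb)` line per missing module. As a rough sketch, assuming the output always follows the shape quoted above (the helper name `parse_missing_modules` and the regular expression are illustrative, not taken from EasyBuild), such a report could be turned into a list for follow-up builds:

```python
import re

# Matches lines such as:
#   * grpcio/1.57.0-GCCcore-12.3.0 (grpcio-1.57.0-GCCcore-12.3.0.eb)
# The pattern is an assumption based on the single example in this thread.
MISSING_LINE = re.compile(r"^\*\s+(?P<module>\S+)\s+\((?P<easyconfig>\S+\.eb)\)$")


def parse_missing_modules(report: str) -> list[tuple[str, str]]:
    """Return (module, easyconfig) pairs found in a missing-modules report."""
    missing = []
    for line in report.splitlines():
        match = MISSING_LINE.match(line.strip())
        if match:
            missing.append((match["module"], match["easyconfig"]))
    return missing
```

Lines that do not match the pattern, such as the `1 out of 138 required modules missing:` summary, are simply skipped.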
Rebuilding for `zen2` to verify whether a new easyblock for torchvision fixes the issue that `libjpeg` could not be found...
Maybe related to:
- https://github.com/easybuilders/easybuild-easyblocks/pull/3322
Rebuilding after #655 got merged to verify whether `import soundfile` in librosa's sanity check succeeds...
bot: build arch:x86_64/amd/zen2 repo:eessi.io-2023.06-software
bot: build arch:aarch64/generic repo:eessi.io-2023.06-software