RF native failed to compile / load; inconsistent behavior vs. pure Python when dividing a tensor of type int
PyExtModCompiler call: g++ -shared -O2 -std=c++11 -fno-strict-overflow -Wsign-compare -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -I /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/_native -I /usr/include/python3.12 -fPIC -v -D_GLIBCXX_USE_CXX11_ABI=0 -g /w0/tmp/slurm_az668407.60282320/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/b20035631a/_returnn_frontend_native.cc -o /w0/tmp/slurm_az668407.60282320/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/b20035631a/_returnn_frontend_native.so
RETURNN frontend _native backend: Error while getting module:
/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /w0/tmp/slurm_az668407.60282320/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/b20035631a/_returnn_frontend_native.so)
This is optional (although very recommended), so we continue without it.
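To check which GLIBCXX symbol versions a given libstdc++.so.6 actually provides (the equivalent of `strings /lib64/libstdc++.so.6 | grep GLIBCXX`), a small helper like this can be used (illustrative only, not part of RETURNN):

```python
import re


def glibcxx_versions(data: bytes) -> list:
    """Extract and sort the GLIBCXX_x.y[.z] version strings embedded in a
    shared library image (pass the raw bytes of e.g. /lib64/libstdc++.so.6)."""
    found = {m.decode() for m in re.findall(rb"GLIBCXX_[0-9]+(?:\.[0-9]+)*", data)}
    return sorted(found, key=lambda v: tuple(int(x) for x in v.split("_", 1)[1].split(".")))


# Example: check whether the system libstdc++ is new enough for the native module:
# with open("/lib64/libstdc++.so.6", "rb") as f:
#     print("GLIBCXX_3.4.30" in glibcxx_versions(f.read()))
```

If `GLIBCXX_3.4.30` is missing from the list, the `.so` built by the newer g++ cannot be loaded against that libstdc++.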
So the compilation (or rather just the load of the native module) fails with ``/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found``.
That then causes the following error:
...
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/frontend/_backend.py", line 1489, in TorchBackend.reduce
line: correction_factor = rf.masked_fraction_of_shape(axis, inverse=True)
locals:
correction_factor = <local> None
axis = <local> [Dim{B}, Dim{'⌈((-199+time)+-200)/160⌉'[B]}]
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/dims.py", line 283, in masked_fraction_of_shape
line: return (num_elems_masked / num_elems_total) if not inverse else (num_elems_total / num_elems_masked)
locals:
num_elems_masked = <local> Tensor{'reduce_sum', [], dtype='int64'}
num_elems_total = <local> Tensor{'mul', [], dtype='int32'}
inverse = <local> True
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/tensor/_tensor_op_overloads.py", line 84, in _TensorOpOverloadsMixin.__truediv__
line: return _rf().combine(self, "/", other)
locals:
self = <local> Tensor{'mul', [], dtype='int32'}
other = <local> Tensor{'reduce_sum', [], dtype='int64'}
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/math_.py", line 211, in combine
line: raise ValueError(
"Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float."
)
ValueError: Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float.
...
Module call stack:
(Model.__call__) (root)
(BatchNorm.__call__) feature_batch_norm
(BatchNorm.__call__.<locals>.<lambda>) feature_batch_norm
This particular symptom / error was also described in https://github.com/rwth-i6/returnn/pull/1637#issuecomment-2426030386. The issue is that the optimized native RF code behaves differently from the pure Python RF code: the native code allows such a division, while the pure Python code rejects it.
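The error message asks for an explicit float conversion before the division. A minimal sketch of that fix, with plain Python ints standing in for the scalar int32/int64 tensors (this is not the actual RETURNN code):

```python
def masked_fraction(num_elems_masked: int, num_elems_total: int, inverse: bool = False) -> float:
    """Stand-in for the division in rf.masked_fraction_of_shape: cast the
    integer element counts to float first, so an int/int true division never
    reaches the dtype check that raises the ValueError above."""
    masked = float(num_elems_masked)
    total = float(num_elems_total)
    return total / masked if inverse else masked / total


print(masked_fraction(50, 200))                # fraction of masked elements
print(masked_fraction(50, 200, inverse=True))  # correction factor, as used in reduce()
```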
Some side remark: I also wonder why I get this GLIBCXX_3.4.30 error now. I think I had it before and somehow resolved it (though I forgot what I did...), but ~now~ I get it again (without having changed anything in my env)?
Edit: "Now" was wrong. Actually, I see that I have apparently been getting this warning for a while already, but in other setups this did not cause problems (except that things were slower than they could be).
This is surely not RETURNN-related, but I want to discuss it here anyway, for future reference.
This should be resolved in any case, as using the native RF helpers has quite a big impact on speed.
I also wonder a bit about "This is optional (although very recommended), so we continue without it.". Actually, I don't want it to just ignore this and continue; I want it to abort when this happens. But I also see that ignoring it might be fine for some users or use cases, so I guess this should be configurable. But what should be the default?
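One way such a setting could look (a hypothetical sketch: the env var name `RETURNN_NATIVE_REQUIRED` and the function are made up for illustration; this is not an existing RETURNN option):

```python
import os


def handle_native_module_error(exc: Exception) -> None:
    """Decide what to do when the native RF module fails to compile or load:
    abort if the (hypothetical) RETURNN_NATIVE_REQUIRED env var is set,
    otherwise keep the current warn-and-continue behavior."""
    if os.environ.get("RETURNN_NATIVE_REQUIRED", "0") == "1":
        raise exc  # hard failure: the user explicitly requires the native helpers
    print(f"Warning: native RF module unavailable ({exc}); continuing without it.")
```

A sensible default might remain warn-and-continue for general users, while performance-critical setups opt in to the hard failure.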
In my current interactive session, I can reproduce the compile error when running the tests. In this session, g++ points to /cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/GCCcore/13.3.0/bin/g++.
Traceback (most recent call last):
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/tests/test_rf_base.py", line 21, in <module>
line: _setup()
locals:
_setup = <local> <function __main__._setup>
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/tests/test_rf_base.py", line 18, in _setup
line: rf.select_backend_torch() # enables some of the native optimizations
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/_backend.py", line 1452, in select_backend_torch
line: _native.setup()
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/_native/__init__.py", line 98, in setup
line: mod = get_module()
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/_native/__init__.py", line 56, in get_module
line: module = compiler.load_py_module()
locals:
compiler = <local> <PyExtModCompiler '_returnn_frontend_native' in '/tmp/az668407/login23-4_2471572/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/16cd4b86db'>
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/util/py_ext_mod_compiler.py", line 55, in PyExtModCompiler.load_py_module
line: mod = module_from_spec(spec)
locals:
module_from_spec = <local> <function _frozen_importlib.module_from_spec>
spec = <local> ModuleSpec(name='_returnn_frontend_native', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x152c66196fc0>, origin='/tmp/az668407/login23-4_2471572/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/16cd4b86db/_returnn_frontend_native.so')
File "<frozen importlib._bootstrap>", line 813, in module_from_spec
-- code not available --
File "<frozen importlib._bootstrap_external>", line 1293, in ExtensionFileLoader.create_module
-- code not available --
File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
-- code not available --
ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /tmp/az668407/login23-4_2471572/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/16cd4b86db/_returnn_frontend_native.so)
Weird. So I did module load GCCcore/13.3.0. But g++ was actually already pointing to GCC 13.3.0 before (see above). module load GCCcore/13.3.0 produced this output:
[INFO] Module zlib/1.3.1 loaded.
[INFO] Module binutils/2.42 loaded.
[INFO] Module GCCcore/13.3.0 loaded.
[INFO] Module zlib/1.3.1 loaded.
[INFO] Module binutils/2.42 loaded.
And now, at least the test seems to pass, and it can load the native RF helpers?
(Note, module here on the RWTH HPC cluster is using Lmod.)
Edit: But when I start a fresh session, it does not work. I always need to rerun module load GCCcore/13.3.0. Not sure if I can make this permanent?
Note: I did a diff of the env between a broken env and a working env. Here are some potentially relevant parts:
-LD_LIBRARY_PATH=/lib64:/home/az668407/libs/claix2023:/lib64:/home/az668407/libs/claix2023:/lib64:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl-FFTW/2024.2.0-iimpi-2024a/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/mkl/2024.2/lib/intel64:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/2021.13/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/binutils/2.42-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/zlib/1.3.1-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/GCCcore/13.3.0/lib64
+LD_LIBRARY_PATH=/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/binutils/2.42-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/zlib/1.3.1-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/GCCcore/13.3.0/lib64:/lib64:/home/az668407/libs/claix2023:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl-FFTW/2024.2.0-iimpi-2024a/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/mkl/2024.2/lib/intel64:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/2021.13/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/2024.2/lib
-LIBRARY_PATH=/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl-FFTW/2024.2.0-iimpi-2024a/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/mkl/2024.2/lib/intel64:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/latest/linux/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/latest/linux/compiler/lib/intel64_lin:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/lib/release:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/latest/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/2021.13/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/binutils/2.42-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/zlib/1.3.1-GCCcore-13.3.0/lib
+LIBRARY_PATH=/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/binutils/2.42-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/zlib/1.3.1-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl-FFTW/2024.2.0-iimpi-2024a/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/mkl/2024.2/lib/intel64:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/latest/linux/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/latest/linux/compiler/lib/intel64_lin:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/lib/release:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/latest/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/2021.13/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/2024.2/lib
Edit: Now I did module save. Let's see if this makes it persistent.
Edit: It is still bad when I do another srun. But maybe that's because the parent env is also broken in the same way.
Edit: Hm, doing module restore in the parent (which now also fixes the env, since I did module save before) and then doing srun still gives me a bad env. I guess srun ignores the parent env.
So what do I have in a fresh env?
When I do module purge, i.e. no modules are loaded (module list says: "No modules loaded"), it also doesn't work.
After module restore, it works.
This is what module list shows in a working env:
Currently Loaded Modules:
1) intel-compilers/2024.2.0 (C) 3) imkl/2024.2.0 (m) 5) intel/2024a (TC) 7) zlib/1.3.1
2) impi/2021.13.0 (M) 4) imkl-FFTW/2024.2.0 6) GCCcore/13.3.0 (C) 8) binutils/2.42
This is what module list shows in a fresh env (after srun):
Currently Loaded Modules:
1) intel-compilers/2024.2.0 (C) 3) imkl/2024.2.0 (m) 5) intel/2024a (TC) 7) zlib/1.3.1
2) impi/2021.13.0 (M) 4) imkl-FFTW/2024.2.0 6) GCCcore/13.3.0 (C) 8) binutils/2.42
Edit: I noticed that I have export LD_LIBRARY_PATH="/lib64:$LD_LIBRARY_PATH" in my ~/.bashrc. (I also have export LD_LIBRARY_PATH="/home/az668407/libs/claix2023:/lib64:$LD_LIBRARY_PATH" in my ~/.zshenv...)
This seems to be it! When I remove the leading /lib64, it works fine (in a fresh env).
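In path terms, the fix amounts to dropping the /lib64 entry from the search path before anything loads libstdc++ (an illustrative helper, not something RETURNN or the shell does for you):

```python
def strip_path_entry(search_path: str, entry: str = "/lib64") -> str:
    """Drop every occurrence of an entry from a colon-separated search path,
    mimicking the removal of the leading /lib64 from LD_LIBRARY_PATH."""
    return ":".join(d for d in search_path.split(":") if d != entry)
```

The dynamic loader then falls back to its default search (including /lib64) only after the module-provided GCCcore directories, so the newer libstdc++ wins.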
Edit: Just to confirm: in a fresh srun env (where it runs bash, so I guess it uses ~/.bashrc), it is working fine now.
Edit: Jobs scheduled by Sisyphus via sbatch now work fine as well. (Initially, for some reason, I still had the same problem, but after restarting Sisyphus, maybe also a module restore in that parent env, and some waiting, it worked.)