
RF native failed to compile / load, inconsistent behavior compared to pure Python, dividing a tensor of type int

Open albertz opened this issue 4 months ago • 3 comments

PyExtModCompiler call: g++ -shared -O2 -std=c++11 -fno-strict-overflow -Wsign-compare -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -I /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/_native -I /usr/include/python3.12 -fPIC -v -D_GLIBCXX_USE_CXX11_ABI=0 -g /w0/tmp/slurm_az668407.60282320/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/b20035631a/_returnn_frontend_native.cc -o /w0/tmp/slurm_az668407.60282320/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/b20035631a/_returnn_frontend_native.so
RETURNN frontend _native backend: Error while getting module:
/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /w0/tmp/slurm_az668407.60282320/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/b20035631a/_returnn_frontend_native.so)
This is optional (although very recommended), so we continue without it.

So the compilation (or just the load of the native module) fails with /lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found.

That then causes the following error:

...
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/frontend/_backend.py", line 1489, in TorchBackend.reduce
    line: correction_factor = rf.masked_fraction_of_shape(axis, inverse=True)
    locals:
      correction_factor = <local> None
      axis = <local> [Dim{B}, Dim{'⌈((-199+time)+-200)/160⌉'[B]}]
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/dims.py", line 283, in masked_fraction_of_shape
    line: return (num_elems_masked / num_elems_total) if not inverse else (num_elems_total / num_elems_masked)
    locals:
      num_elems_masked = <local> Tensor{'reduce_sum', [], dtype='int64'}
      num_elems_total = <local> Tensor{'mul', [], dtype='int32'}
      inverse = <local> True
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/tensor/_tensor_op_overloads.py", line 84, in _TensorOpOverloadsMixin.__truediv__
    line: return _rf().combine(self, "/", other)
    locals:
      self = <local> Tensor{'mul', [], dtype='int32'}
      other = <local> Tensor{'reduce_sum', [], dtype='int64'}
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/math_.py", line 211, in combine
    line: raise ValueError(
              "Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float."
          )
ValueError: Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float.

...
Module call stack:
(Model.__call__) (root)
(BatchNorm.__call__) feature_batch_norm
(BatchNorm.__call__.<locals>.<lambda>) feature_batch_norm

This particular symptom / error was also described in https://github.com/rwth-i6/returnn/pull/1637#issuecomment-2426030386. The issue is that the optimized native RF code behaves differently (it allows such code) from the pure Python RF code.
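
To illustrate the divergence (and the fix that the error message asks for), here is a minimal sketch. It assumes rf.convert_to_tensor and rf.cast with their usual (value/tensor, dtype) signatures and that PyTorch is available for select_backend_torch; whether the native path really lets the int/int division through is exactly the inconsistency described above.

import returnn.frontend as rf

rf.select_backend_torch()  # as in the setup above; loads the native helpers if possible

# Scalar int tensors, mimicking num_elems_total / num_elems_masked from the traceback.
num_elems_total = rf.convert_to_tensor(320, dtype="int32")
num_elems_masked = rf.convert_to_tensor(37, dtype="int64")

try:
    frac = num_elems_total / num_elems_masked  # int Tensor / int Tensor
    print("native path (apparently): division went through:", frac)
except ValueError as exc:
    print("pure-Python path: rejected:", exc)

# Casting to float first, as the error message suggests, works in both paths:
frac = rf.cast(num_elems_total, "float32") / rf.cast(num_elems_masked, "float32")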

albertz avatar Aug 14 '25 15:08 albertz

Some side remark: I also wonder why I get this GLIBCXX_3.4.30 error now. I think I had this before and somehow resolved it (though I forgot what I did...), but ~now~ I get it again (without having changed anything in my env)?

Edit: "Now" was wrong. Actually, I see that I seem to get this warning already since a while, but in other setups, this did not cause problems (except that it was slower as it could be).

This is surely not RETURNN related, but I want to discuss it here anyway, for future reference.

This should be resolved in any case, as using the native RF helpers has quite a big impact on speed.

I also wonder a bit about "This is optional (although very recommended), so we continue without it.". Actually, I don't want it to just ignore this and continue; I want it to abort when this happens. But I also see that ignoring it might be fine for some users or use cases. So I guess this should be configurable. But what should be the default?
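
For illustration only, a purely hypothetical config entry (no such option exists at the time of writing) could express this choice; the open question is just its default value.

# Hypothetical RETURNN config setting, purely for illustration -- it does not exist.
# "abort" would stop when the native RF helpers fail to compile/load,
# "warn" would keep the current behavior of continuing with pure Python.
rf_native_setup_on_error = "abort"  # or "warn"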

albertz avatar Aug 14 '25 15:08 albertz

In my current interactive session, I can reproduce the compile error when running the tests. In this session, g++ points to /cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/GCCcore/13.3.0/bin/g++.

Traceback (most recent call last):
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/tests/test_rf_base.py", line 21, in <module>
    line: _setup()
    locals:
      _setup = <local> <function __main__._setup>
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/tests/test_rf_base.py", line 18, in _setup
    line: rf.select_backend_torch()  # enables some of the native optimizations
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/_backend.py", line 1452, in select_backend_torch
    line: _native.setup()
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/_native/__init__.py", line 98, in setup
    line: mod = get_module()
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/_native/__init__.py", line 56, in get_module
    line: module = compiler.load_py_module()
    locals:
      compiler = <local> <PyExtModCompiler '_returnn_frontend_native' in '/tmp/az668407/login23-4_2471572/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/16cd4b86db'>
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/util/py_ext_mod_compiler.py", line 55, in PyExtModCompiler.load_py_module
    line: mod = module_from_spec(spec)
    locals:
      module_from_spec = <local> <function _frozen_importlib.module_from_spec>
      spec = <local> ModuleSpec(name='_returnn_frontend_native', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x152c66196fc0>, origin='/tmp/az668407/login23-4_2471572/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/16cd4b86db/_returnn_frontend_native.so')
  File "<frozen importlib._bootstrap>", line 813, in module_from_spec
    -- code not available --
  File "<frozen importlib._bootstrap_external>", line 1293, in ExtensionFileLoader.create_module
    -- code not available --
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
    -- code not available --
ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /tmp/az668407/login23-4_2471572/az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/16cd4b86db/_returnn_frontend_native.so)

albertz avatar Aug 14 '25 15:08 albertz

Weird. So I did module load GCCcore/13.3.0. But g++ was actually already pointing to GCC 13.3.0 before (see above). module load GCCcore/13.3.0 produced this output:

[INFO] Module zlib/1.3.1 loaded.                                                                                                      
[INFO] Module binutils/2.42 loaded.
[INFO] Module GCCcore/13.3.0 loaded.                  
[INFO] Module zlib/1.3.1 loaded.                                                                                                      
[INFO] Module binutils/2.42 loaded.                  

And now, at least the test seems to pass, and it can load the native RF helpers?

(Note: module here on the RWTH HPC cluster uses Lmod.)

Edit: But when I start a fresh session, it does not work. I always need to rerun module load GCCcore/13.3.0. Not sure if I can make this permanent?

Note: I did a diff of the env between a broken env and a working env. Here are some potentially relevant parts:

-LD_LIBRARY_PATH=/lib64:/home/az668407/libs/claix2023:/lib64:/home/az668407/libs/claix2023:/lib64:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl-FFTW/2024.2.0-iimpi-2024a/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/mkl/2024.2/lib/intel64:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/2021.13/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/binutils/2.42-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/zlib/1.3.1-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/GCCcore/13.3.0/lib64
+LD_LIBRARY_PATH=/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/binutils/2.42-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/zlib/1.3.1-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/GCCcore/13.3.0/lib64:/lib64:/home/az668407/libs/claix2023:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl-FFTW/2024.2.0-iimpi-2024a/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/mkl/2024.2/lib/intel64:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/2021.13/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/2024.2/lib
-LIBRARY_PATH=/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl-FFTW/2024.2.0-iimpi-2024a/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/mkl/2024.2/lib/intel64:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/latest/linux/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/latest/linux/compiler/lib/intel64_lin:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/lib/release:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/latest/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/2021.13/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/binutils/2.42-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/zlib/1.3.1-GCCcore-13.3.0/lib
+LIBRARY_PATH=/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/binutils/2.42-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/zlib/1.3.1-GCCcore-13.3.0/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl-FFTW/2024.2.0-iimpi-2024a/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/mkl/2024.2/lib/intel64:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/imkl/2024.2.0/compiler/2024.2/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/impi/2021.13.0-intel-compilers-2024.2.0/mpi/2021.13/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/latest/linux/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/latest/linux/compiler/lib/intel64_lin:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/lib/release:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/mpi/latest/libfabric/lib:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/latest/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/tbb/2021.13/lib/intel64/gcc4.8:/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/intel-compilers/2024.2.0/compiler/2024.2/lib

Edit: Now I did module save. Let's see if this makes it persistent.

Edit: It is still bad when I do another srun. But maybe that's because the parent env is also broken in the same way.

Edit: Hm, doing module restore in the parent (which now also fixes that env, since I did module save before) and then doing srun still gives me a bad env. I guess srun ignores the parent env.

So what do I have in a fresh env?

When I do module purge, i.e. no modules are loaded (module list says: "No modules loaded"), it also doesn't work.

After module restore, it works.

This is what module list shows in a working env:

Currently Loaded Modules:
  1) intel-compilers/2024.2.0 (C)   3) imkl/2024.2.0      (m)   5) intel/2024a    (TC)   7) zlib/1.3.1
  2) impi/2021.13.0           (M)   4) imkl-FFTW/2024.2.0       6) GCCcore/13.3.0 (C)    8) binutils/2.42

This is what module list shows in a fresh env (after srun):

Currently Loaded Modules:
  1) intel-compilers/2024.2.0 (C)   3) imkl/2024.2.0      (m)   5) intel/2024a    (TC)   7) zlib/1.3.1
  2) impi/2021.13.0           (M)   4) imkl-FFTW/2024.2.0       6) GCCcore/13.3.0 (C)    8) binutils/2.42

Edit: I noticed that I have export LD_LIBRARY_PATH="/lib64:$LD_LIBRARY_PATH" in my ~/.bashrc. (I also have export LD_LIBRARY_PATH="/home/az668407/libs/claix2023:/lib64:$LD_LIBRARY_PATH" in my ~/.zshenv...)

This seems to be it! When I remove the /lib64 at the beginning, it works fine (in a fresh env).
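
For future reference, here is a small sketch (a hypothetical helper, not part of RETURNN; it only scans LD_LIBRARY_PATH and ignores the loader's built-in default paths) to check which libstdc++.so.6 is found first and whether it provides GLIBCXX_3.4.30. With the old /lib64 prepended, the first hit is the system library, which lacks that symbol version.

import os
import subprocess

def libstdcxx_candidates():
    # Yield libstdc++.so.6 files along LD_LIBRARY_PATH, in search order.
    for p in os.environ.get("LD_LIBRARY_PATH", "").split(":"):
        f = os.path.join(p, "libstdc++.so.6")
        if p and os.path.exists(f):
            yield f

for f in libstdcxx_candidates():
    # `strings` lists the GLIBCXX version tags provided by the library.
    out = subprocess.run(["strings", f], capture_output=True, text=True).stdout
    status = "available" if "GLIBCXX_3.4.30" in out else "MISSING"
    print(f, "-> GLIBCXX_3.4.30", status)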

Edit: Just to confirm, in a fresh srun env (where it runs bash, so I guess it uses ~/.bashrc), it is working fine now.

Edit: Jobs scheduled by Sisyphus via sbatch now work fine as well. (Initially, for some reason, I still had the same problem, but after a restart of Sisyphus, maybe also a module restore in that parent env, and some waiting, it worked.)

albertz avatar Aug 14 '25 15:08 albertz