RF combine inconsistent between native and pure Python

Open albertz opened this issue 7 months ago • 2 comments

...
PyExtModCompiler call: g++ -shared -O2 -std=c++11 -fno-strict-overflow -Wsign-compare -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -I /rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/frontend/_native -I /usr/include/python3.12 -fPIC -v -D_GLIBCXX_USE_CXX11_ABI=0 -g /w0/tmp/slurm_tt201262.56034450/tt201262/returnn_py_ext_mod_cache/_returnn_frontend_native/e8867c33af/_returnn_frontend_native.cc -o /w0/tmp/slurm_tt201262.56034450/tt201262/returnn_py_ext_mod_cache/_returnn_frontend_native/e8867c33af/_returnn_frontend_native.so
PyExtModCompiler: g++ failed.
...
  File "/rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/torch/frontend/_backend.py", line 1488, in TorchBackend.reduce
    line: correction_factor = rf.masked_fraction_of_shape(axis, inverse=True)
    locals:
      correction_factor = <local> None
      rf = <global> <module 'returnn.frontend' from '/rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/frontend/__init__.py'>
      rf.masked_fraction_of_shape = <global> <function masked_fraction_of_shape at 0x14a2eef8a8e0>
      axis = <local> [Dim{B}, Dim{'⌈((-199+wave)+-200)/160⌉'[B]}]
      inverse = <not found>
  File "/rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/frontend/dims.py", line 283, in masked_fraction_of_shape
    line: return (num_elems_masked / num_elems_total) if not inverse else (num_elems_total / num_elems_masked)
    locals:
      num_elems_masked = <local> Tensor{'reduce_sum', [], dtype='int64'}
      num_elems_total = <local> Tensor{'mul', [], dtype='int32'}
      inverse = <local> True
  File "/rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/tensor/_tensor_op_overloads.py", line 85, in _TensorOpOverloadsMixin.__truediv__
    line: return _rf().combine(self, "/", other)
    locals:
      _rf = <global> <function _rf at 0x14a2f41be3e0>
      combine = <not found>
      self = <local> Tensor{'mul', [], dtype='int32'}
      other = <local> Tensor{'reduce_sum', [], dtype='int64'}
  File "/rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/frontend/math_.py", line 209, in combine
    line: raise ValueError(
              "Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float."
          )
    locals:
      ValueError = <builtin> <class 'ValueError'>
ValueError: Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float.

As the log shows, g++ fails, so the native RF helpers (_returnn_frontend_native) are not used, and the pure Python implementation of rf.combine runs instead. The pure Python rf.combine is stricter: it does not allow int / int. The native RF helpers do not check this at all. They leave the operation to PyTorch and then take over the resulting dtype from PyTorch.
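To make the difference concrete, here is a minimal toy sketch of the two behaviours for true division. It is not the RETURNN code: the function names `true_div_dtype_strict` and `true_div_dtype_permissive` are hypothetical, dtypes are plain strings, and the float promotion rule is simplified. The only backend fact it relies on is that PyTorch's true division of two integer tensors returns the default float dtype (float32 unless changed).

```python
# Toy model only: dtypes are strings, names are hypothetical, not RETURNN API.
INT_DTYPES = ("int8", "int16", "int32", "int64", "uint8")
FLOAT_ORDER = ("float16", "float32", "float64")  # simplified: narrow to wide


def true_div_dtype_permissive(a_dtype: str, b_dtype: str) -> str:
    """Like the native path: no check, just report what the backend would return."""
    floats = [d for d in (a_dtype, b_dtype) if d in FLOAT_ORDER]
    if not floats:
        # int / int: PyTorch true division yields its default float dtype.
        return "float32"
    # At least one float operand: take the widest float (simplified promotion).
    return max(floats, key=FLOAT_ORDER.index)


def true_div_dtype_strict(a_dtype: str, b_dtype: str) -> str:
    """Like the pure Python path: reject int / int before calling the backend."""
    if a_dtype in INT_DTYPES and b_dtype in INT_DTYPES:
        raise ValueError(
            "Dividing a Tensor of type int by an integer is disallowed. "
            "Please convert the Tensor to float."
        )
    return true_div_dtype_permissive(a_dtype, b_dtype)


if __name__ == "__main__":
    # The operands from the traceback: int32 ('mul') / int64 ('reduce_sum').
    print(true_div_dtype_permissive("int32", "int64"))
    try:
        true_div_dtype_strict("int32", "int64")
    except ValueError as exc:
        print("strict path:", exc)
```

With the operands from the traceback (int32 / int64), the permissive path silently produces a float result while the strict path raises. Code that converts one operand to float before dividing, as the error message suggests, behaves the same on both paths.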

albertz • Apr 09 '25 17:04

Btw, I think I saw some similar problems before, where the native RF helpers would always behave as if allow_broadcast_all_sources=True, but the pure Python logic does not. I thought I filed an issue on this somewhere, but I can't find it now...
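For readers unfamiliar with the flag, the following toy sketch shows the kind of check meant here. It assumes the usual meaning of allow_broadcast_all_sources: when it is False, at least one operand must already have all dims of the result. The helper `combined_dims` and the use of plain strings for dims are hypothetical, not the RETURNN implementation.

```python
# Toy model only: dims are plain strings, combined_dims is a hypothetical helper.
from typing import List, Sequence


def combined_dims(
    a_dims: Sequence[str],
    b_dims: Sequence[str],
    *,
    allow_broadcast_all_sources: bool = False,
) -> List[str]:
    """Return the result dims of a binary op, optionally rejecting the case
    where every operand would have to be broadcast."""
    # Result dims: all dims of a, then the dims of b that a does not have.
    out = list(a_dims) + [d for d in b_dims if d not in a_dims]
    if not allow_broadcast_all_sources:
        a_covers_all = set(a_dims) == set(out)
        b_covers_all = set(b_dims) == set(out)
        if not (a_covers_all or b_covers_all):
            raise ValueError(
                f"all sources would be broadcast: {list(a_dims)} vs {list(b_dims)}"
            )
    return out


if __name__ == "__main__":
    print(combined_dims(["B", "T"], ["T"]))  # one operand has all result dims
    print(combined_dims(["B"], ["T"], allow_broadcast_all_sources=True))
    try:
        combined_dims(["B"], ["T"])  # neither operand has both B and T
    except ValueError as exc:
        print("strict path:", exc)
```

In this model, a code path that never performs the check accepts `["B"]` combined with `["T"]`, while the checking path raises unless the flag is set explicitly, which is the kind of mismatch described above.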

albertz • Apr 09 '25 17:04

Note on why g++ did not work here: Python was not found. One solution was module load Python/3.12.3; putting the right Python into $PATH would probably also work. But why g++ failed is not really the point of this issue. The point is that the native RF helpers behave differently from the pure Python code.

albertz • Apr 09 '25 19:04