RF combine inconsistent between native and pure Python
...
PyExtModCompiler call: g++ -shared -O2 -std=c++11 -fno-strict-overflow -Wsign-compare -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -I /rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/frontend/_native -I /usr/include/python3.12 -fPIC -v -D_GLIBCXX_USE_CXX11_ABI=0 -g /w0/tmp/slurm_tt201262.56034450/tt201262/returnn_py_ext_mod_cache/_returnn_frontend_native/e8867c33af/_returnn_frontend_native.cc -o /w0/tmp/slurm_tt201262.56034450/tt201262/returnn_py_ext_mod_cache/_returnn_frontend_native/e8867c33af/_returnn_frontend_native.so
PyExtModCompiler: g++ failed.
...
File "/rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/torch/frontend/_backend.py", line 1488, in TorchBackend.reduce
line: correction_factor = rf.masked_fraction_of_shape(axis, inverse=True)
locals:
correction_factor = <local> None
rf = <global> <module 'returnn.frontend' from '/rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/frontend/__init__.py'>
rf.masked_fraction_of_shape = <global> <function masked_fraction_of_shape at 0x14a2eef8a8e0>
axis = <local> [Dim{B}, Dim{'⌈((-199+wave)+-200)/160⌉'[B]}]
inverse = <not found>
File "/rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/frontend/dims.py", line 283, in masked_fraction_of_shape
line: return (num_elems_masked / num_elems_total) if not inverse else (num_elems_total / num_elems_masked)
locals:
num_elems_masked = <local> Tensor{'reduce_sum', [], dtype='int64'}
num_elems_total = <local> Tensor{'mul', [], dtype='int32'}
inverse = <local> True
File "/rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/tensor/_tensor_op_overloads.py", line 85, in _TensorOpOverloadsMixin.__truediv__
line: return _rf().combine(self, "/", other)
locals:
_rf = <global> <function _rf at 0x14a2f41be3e0>
combine = <not found>
self = <local> Tensor{'mul', [], dtype='int32'}
other = <local> Tensor{'reduce_sum', [], dtype='int64'}
File "/rwthfs/rz/cluster/home/tt201262/setups/2024-10-11-denosing_lm/recipe/returnn/returnn/frontend/math_.py", line 209, in combine
line: raise ValueError(
"Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float."
)
locals:
ValueError = <builtin> <class 'ValueError'>
ValueError: Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float.
As you can see in this log, g++ fails, so the native RF helpers (_returnn_frontend_native) are not used, and the pure Python logic for rf.combine is used instead. The pure Python rf.combine is somewhat stricter and does not allow int / int, while the native RF helpers do not check this: they just let PyTorch perform the division and then take over the resulting dtype from PyTorch.
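To illustrate the difference, here is a minimal stdlib-only sketch with hypothetical stand-ins for the two code paths (this is not the actual RETURNN code): the strict path rejects int / int up front, while the permissive path just performs the division and adopts whatever type comes out, analogous to how PyTorch's true division promotes integer inputs to float.

```python
# Hypothetical stand-ins for the two combine code paths, NOT the actual
# returnn implementations. Plain Python ints/floats play the role of tensors.

def combine_strict(a, b):
    """Pure-Python-style path: explicitly reject int / int."""
    if isinstance(a, int) and isinstance(b, int):
        raise ValueError(
            "Dividing a Tensor of type int by an integer is disallowed. "
            "Please convert the Tensor to float."
        )
    return a / b

def combine_permissive(a, b):
    """Native-helper-style path: just divide and take over the result type.
    Python's true division, like PyTorch's, promotes int / int to float."""
    return a / b

print(combine_permissive(3, 2))  # division succeeds, result is a float
try:
    combine_strict(3, 2)         # the same inputs are rejected here
except ValueError as exc:
    print(exc)
```

The same inputs thus either succeed (native-style) or raise (pure-Python-style), which is exactly the inconsistency this issue is about.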
Btw, I think I have seen similar problems before, where the native RF helpers would always behave as if allow_broadcast_all_sources=True, but the pure Python logic does not. I thought I filed an issue on this somewhere, but I cannot find it now...
Note, the reason g++ did not work here is that Python was not found. One solution was module load Python/3.12.3, or probably also putting the right Python into the $PATH. But the reason that g++ failed is not really the point of this issue. The point is that the native RF helpers behave differently than the pure Python code.
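One way to make the two paths consistent would be a single dtype check that both the native and the pure Python code call before dividing. The following is a hypothetical sketch only (the helper name and the use of dtype strings like 'int32'/'int64', as they appear in the traceback above, are assumptions, not RETURNN API):

```python
def check_div_dtypes(dtype_a: str, dtype_b: str) -> None:
    """Hypothetical shared check: disallow true division when both operand
    dtypes are integer. Dtype strings like 'int32'/'int64'/'float32' follow
    the tensors shown in the traceback above."""
    if dtype_a.startswith("int") and dtype_b.startswith("int"):
        raise ValueError(
            "Dividing a Tensor of type int by an integer is disallowed. "
            "Please convert the Tensor to float."
        )

check_div_dtypes("float32", "int64")     # ok: one operand is float
try:
    check_div_dtypes("int32", "int64")   # the failing case from the traceback
except ValueError as exc:
    print(exc)
```

Whether the shared behavior should be strict (raise) or permissive (let PyTorch promote to float) is a separate design decision; the point is that both paths should agree.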