returnn
OpCodeCompiler/TFNativeUtilCompiler, better C++ standard version detection, use get_compile_flags
RETURNN now has TF 2.10 support and we also use that for CI (#1160).
TF 2.10 now uses C++17 by default (https://github.com/tensorflow/tensorflow/commit/8ea5ed0c392b329a3e0481a3f1f7b0ca86821b84), and thus the official binary pip package is built with it as well. Earlier versions used C++14 (and even earlier versions C++11).
We need to use the right C++ standard version for compiling ops (see e.g. https://github.com/abseil/abseil-cpp/issues/211, https://github.com/abseil/abseil-cpp/issues/606, https://github.com/tensorflow/serving/issues/1935).
Up to (and including) TF 2.10, TF does not provide this information. So in #1160 we just guess based on the TF version: if it is TF 2.10, we use C++17. This is incorrect if a user compiled TF from source with a different C++ standard version, but I am not sure we can do much about that. I asked on StackOverflow how to detect the C++ standard version directly from the TF lib, but it seems there is no good way. Maybe someone has an idea?
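A minimal sketch of such a version-based guess, for illustration only. The function name guess_cxx_std_flag is hypothetical and not taken from the RETURNN code, and treating every version below 2.10 as C++14 is a simplifying assumption (the even older C++11 releases are not distinguished here).

```python
import re


def guess_cxx_std_flag(tf_version: str) -> str:
    """Guess the -std compiler flag from a TF version string such as "2.10.0"."""
    m = re.match(r"(\d+)\.(\d+)", tf_version)
    if not m:
        raise ValueError("cannot parse TF version %r" % tf_version)
    major_minor = (int(m.group(1)), int(m.group(2)))
    # Compare as integer tuples: as plain strings, "2.10" would sort before "2.9".
    if major_minor >= (2, 10):
        return "-std=c++17"
    # Assumption: everything older is treated as C++14.
    return "-std=c++14"
```

In practice this would be called with tf.__version__. As said above, the guess is wrong for a TF build from source that used a different C++ standard version.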
Actually, in the current TF master, TF does provide this information now. This was added here: https://github.com/tensorflow/tensorflow/pull/57468. Now tf.sysconfig.get_compile_flags() also returns this flag (e.g. --std=c++17). So probably any upcoming TF release will have this information.
This is actually the other aspect I wanted to address in this issue: We currently don't use tf.sysconfig.get_compile_flags() at all. We manually add "-D_GLIBCXX_USE_CXX11_ABI=%i" % (1 if self.use_cxx11_abi else 0). However, we should probably use get_compile_flags instead. It might not be available in all older TF versions, though, so we still need the fallback.
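A rough sketch of how get_compile_flags could be combined with the manual fallback. The helper names get_tf_compile_flags and build_compile_flags and the fallback_std_flag parameter are hypothetical, and include paths for the fallback case are assumed to be handled elsewhere. Only tf.sysconfig.get_compile_flags() itself is an actual TF API.

```python
import re


def get_tf_compile_flags():
    """Return tf.sysconfig.get_compile_flags(), or None if it is not available."""
    try:
        import tensorflow as tf
        return list(tf.sysconfig.get_compile_flags())
    except (ImportError, AttributeError):
        return None


def build_compile_flags(tf_flags, use_cxx11_abi, fallback_std_flag):
    """Prefer the flags reported by TF; fall back to manually built flags."""
    if tf_flags is None:
        # Fallback: the manual ABI flag, as it is added currently.
        flags = ["-D_GLIBCXX_USE_CXX11_ABI=%i" % (1 if use_cxx11_abi else 0)]
    else:
        flags = list(tf_flags)
    # TF versions that do not report the C++ standard leave out the std flag,
    # so add the guessed one only if none is present (-std=... or --std=...).
    if not any(re.match(r"--?std=", flag) for flag in flags):
        flags.append(fallback_std_flag)
    return flags
```

Usage would be something like build_compile_flags(get_tf_compile_flags(), use_cxx11_abi=True, fallback_std_flag="-std=c++14"). With a TF that already reports --std=c++17, the flags are passed through unchanged.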
Hi @albertz, this is still not working for me. I still get a symbol error on the demos. I started by deleting my existing repos, then did a pip install. Is there a compiled code cache somewhere?
1070 10/20/22 10:15:49 git clone https://github.com/rwth-i6/returnn_common.git
1071 10/20/22 10:16:00 git clone https://github.com/rwth-i6/returnn.git
1072 10/20/22 10:16:14 git clone https://github.com/rwth-i6/returnn.wiki.git
1073 10/20/22 10:16:44 cd ~/venv/
1076 10/20/22 10:17:13 python -m venv ~/venv/returnn
1077 10/20/22 10:17:33 cd ~/tmp/repo/
1079 10/20/22 10:17:51 . ~/venv/returnn/bin/activate
1080 10/20/22 10:17:56 pip install -U pip
1081 10/20/22 10:18:13 pip install returnn
1083 10/20/22 10:18:33 cd returnn
1086 10/20/22 10:18:39 pip install -r requirements.txt
1087 10/20/22 10:18:58 pip install tensorflow
Then:
./rnn.py demos/demo-tf-native-lstm2.12ax.config
EXCEPTION
Traceback (most recent call last):
File "/home/braddock/tmp/repo/returnn/./rnn.py", line 11, in <module>
line: main()
locals:
main = <local> <function main at 0x7f87b0929750>
File "/home/braddock/tmp/repo/returnn/returnn/__main__.py", line 669, in main
line: init(command_line_options=argv[1:])
locals:
init = <global> <function init at 0x7f87b0929480>
command_line_options = <not found>
argv = <local> ['./rnn.py', 'demos/demo-tf-native-lstm2.12ax.config'], _[0]: {len = 8}
File "/home/braddock/tmp/repo/returnn/returnn/__main__.py", line 405, in init
line: init_backend_engine()
locals:
init_backend_engine = <global> <function init_backend_engine at 0x7f87b09293f0>
File "/home/braddock/tmp/repo/returnn/returnn/__main__.py", line 374, in init_backend_engine
line: tf_util.print_available_devices(tf_session_opts=tf_session_opts, file=log.v2)
locals:
tf_util = <local> <module 'returnn.tf.util.basic' from '/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py'>
tf_util.print_available_devices = <local> <function print_available_devices at 0x7f876c88c3a0>
tf_session_opts = <local> {'gpu_options': {'per_process_gpu_memory_fraction': 0.1}}
file = <not found>
log = <global> <returnn.log.Log object at 0x7f87c29a7520>
log.v2 = <global> <returnn.log.Stream object at 0x7f87b0936800>
File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 1129, in print_available_devices
line: devs = get_tf_list_local_devices(tf_session_opts=tf_session_opts, file=file)
locals:
devs = <not found>
get_tf_list_local_devices = <global> <function get_tf_list_local_devices at 0x7f876c88c040>
tf_session_opts = <local> {'gpu_options': {'per_process_gpu_memory_fraction': 0.1}}
file = <local> <returnn.log.Stream object at 0x7f87b0936800>
File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 1086, in get_tf_list_local_devices
line: dev.set_physical_device_desc(session=session)
locals:
dev = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
dev.set_physical_device_desc = <local> <bound method _DeviceAttributes.set_physical_device_desc of <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>>
session = <local> <tensorflow.python.client.session.Session object at 0x7f876e3c3f70>
File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 1028, in _DeviceAttributes.set_physical_device_desc
line: physical_device_desc = session.run(get_device_attr(self.name))
locals:
physical_device_desc = <not found>
session = <local> <tensorflow.python.client.session.Session object at 0x7f876e3c3f70>
session.run = <local> <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7f876e3c3f70>>
get_device_attr = <global> <function get_device_attr at 0x7f876c8a80d0>
self = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
self.name = <local> '/job:localhost/replica:0/task:0/device:CPU:0', len = 44
File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 6126, in get_device_attr
line: return _DeviceAttrMod.get_device_attr()
locals:
_DeviceAttrMod = <global> <class 'returnn.tf.util.basic._DeviceAttrMod'>
_DeviceAttrMod.get_device_attr = <global> <bound method _DeviceAttrMod.get_device_attr of <class 'returnn.tf.util.basic._DeviceAttrMod'>>
File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 6114, in _DeviceAttrMod.get_device_attr
line: return cls.get_mod().get_device_attr()
locals:
cls = <local> <class 'returnn.tf.util.basic._DeviceAttrMod'>
cls.get_mod = <local> <bound method _DeviceAttrMod.get_mod of <class 'returnn.tf.util.basic._DeviceAttrMod'>>
get_device_attr = <global> <function get_device_attr at 0x7f876c8a80d0>
File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 6103, in _DeviceAttrMod.get_mod
line: tf_mod = compiler.load_tf_module()
locals:
tf_mod = <not found>
compiler = <local> <OpCodeCompiler 'GetDeviceAttr' in '/var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6'>
compiler.load_tf_module = <local> <bound method OpCodeCompiler.load_tf_module of <OpCodeCompiler 'GetDeviceAttr' in '/var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6'>>
File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 2921, in OpCodeCompiler.load_tf_module
line: self._tf_mod = tf.load_op_library(self._so_filename)
locals:
self = <local> <OpCodeCompiler 'GetDeviceAttr' in '/var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6'>
self._tf_mod = <local> None
tf = <global> <module 'tensorflow' from '/home/braddock/venv/returnn/lib/python3.10/site-packages/tensorflow/__init__.py'>
tf.load_op_library = <global> <function load_op_library at 0x7f876e408dc0>
self._so_filename = <local> '/var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6/GetDeviceAttr.so', len = 80
File "/home/braddock/venv/returnn/lib/python3.10/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
line: lib_handle = py_tf.TF_LoadLibrary(library_filename)
locals:
lib_handle = <not found>
py_tf = <global> <module 'tensorflow.python.client.pywrap_tf_session' from '/home/braddock/venv/returnn/lib/python3.10/site-packages/tensorflow/python/client/pywrap_tf_session.py'>
py_tf.TF_LoadLibrary = <global> <built-in method TF_LoadLibrary of PyCapsule object at 0x7f87b01cd8c0>
library_filename = <local> '/var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6/GetDeviceAttr.so', len = 80
NotFoundError: /var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6/GetDeviceAttr.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl12lts_2022062311string_viewESt10unique_ptrINS0_15OpKernelFactoryESt14default_deleteIS9_EE
I also tried pip uninstall returnn, with the same result.
Ah, removal of
rm -rvf /var/tmp/braddock/returnn_*
fixed it. It appears to be working now. Sorry for the distraction.
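For reference, the undefined symbol in the traceback above mentions absl::lts_20220623::string_view. By default, Abseil makes absl::string_view an alias of std::string_view when compiling with C++17, but uses its own class with C++14, so a reference to the Abseil class in an op library points to a C++ standard mismatch with the TF lib. A small sketch to spot this pattern in an error message (the function name looks_like_cxx_std_mismatch is made up for illustration, and this is only a heuristic):

```python
import re

# Itanium-mangled absl::string_view, optionally inside an inline lts_* namespace.
_ABSL_STRING_VIEW = re.compile(r"4absl(?:\d+lts_\d+)?11string_view")


def looks_like_cxx_std_mismatch(error_message: str) -> bool:
    """Heuristic: undefined symbol that references Abseil's own string_view class."""
    return (
        "undefined symbol" in error_message
        and _ABSL_STRING_VIEW.search(error_message) is not None)
```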
Ah yea, many things such as the code itself and other flags are part of the hash for these cache files, but this particular flag was not, so you needed to remove the cache to force the rebuild. Maybe we should add these flags to the hash as well.
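A sketch of what including the flags in the hash could look like. This is not the actual RETURNN hashing code: the function name cache_key, the use of SHA-1, and the 10-character key length are assumptions for illustration.

```python
import hashlib


def cache_key(code: str, compile_flags, length: int = 10) -> str:
    """Hash the op source code together with all compile flags."""
    h = hashlib.sha1()
    h.update(code.encode("utf8"))
    for flag in compile_flags:
        # Separator, so that ["-a", "b"] and ["-ab"] give different keys.
        h.update(b"\0")
        h.update(flag.encode("utf8"))
    return h.hexdigest()[:length]
```

With this, switching from -std=c++14 to -std=c++17 gives a different key and thus a different cache directory, so the op is rebuilt without having to remove the cache manually.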
Have you considered ccache? It will do the caching right and is completely unintrusive.
We don't really want to have any other dependencies except GCC itself. I also don't think that ccache would really give us any benefit over what we currently have.
ccache's only dependency is gcc as well. It will save hundreds of lines of code and complexity. It will also detect TF header version changes.
Well, ccache is still a separate dependency and usually not installed by default. We really want to avoid further dependencies, and we also want RETURNN to basically "just work" on most standard developer systems, e.g. some Ubuntu with GCC installed, or MacOSX with its dev tools.
Besides, our code for this is really simple, maybe something like 50 lines of code or less, and basically just works. I don't really think that ccache would make it simpler, maybe even the opposite.
It's also an advantage that we explicitly define what's part of the hash. That may have led to your problem here (which was really a rare case), but on the other hand, this makes it fast. We don't want to check all the hundreds of header files again and again, especially because many users of RETURNN have a slow filesystem.
We also never really had problems with the caching logic so far, except maybe your case now, but this was simple to fix and usually does not happen.
#901 is related to this