returnn OpCodeCompiler/TFNativeUtilCompiler, better C++ standard version detection, use get_compile_flags

RETURNN now has TF 2.10 support and we also use that for CI (#1160).

TF 2.10 now uses C++17 by default (https://github.com/tensorflow/tensorflow/commit/8ea5ed0c392b329a3e0481a3f1f7b0ca86821b84), so the official binary pip package is built with C++17 as well. Earlier versions used C++14 (and even earlier versions C++11).

We need to use the right C++ standard version for compiling ops (see e.g. https://github.com/abseil/abseil-cpp/issues/211, https://github.com/abseil/abseil-cpp/issues/606, https://github.com/tensorflow/serving/issues/1935).

Up to and including TF 2.10, TF does not provide this information. So in #1160 we just guess based on the TF version: if it is TF 2.10, we use C++17. This guess is wrong if a user compiled TF from source with a different C++ standard version, but I am not sure we can do much about that. I asked on StackOverflow how to detect the C++ standard version directly from the TF lib, but it seems there is no good way. Maybe someone has an idea?
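A minimal sketch of such a version-based guess, for illustration only. The function name `guess_cxx_std` is hypothetical and this is not the actual code from #1160. It also assumes that releases after 2.10 keep using C++17, and it does not encode the cutoff between C++11 and C++14 because that cutoff is not stated here.

```python
def guess_cxx_std(tf_version: str) -> str:
    """Guess the C++ standard from a TF version string such as "2.10.0".

    Hypothetical helper: this only reflects the defaults of the official
    binary packages. A TF built from source may use another standard.
    """
    # Only major and minor matter; suffixes like "-rc1" sit in the patch part.
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    if (major, minor) >= (2, 10):
        # Assumption: versions after 2.10 also use C++17.
        return "c++17"
    # Earlier versions used C++14. Even earlier ones used C++11, but the
    # exact cutoff is not given in this thread, so it is not modelled.
    return "c++14"


if __name__ == "__main__":
    for version in ("2.9.1", "2.10.0"):
        print(version, "->", "-std=" + guess_cxx_std(version))
```

The weakness described above is visible here: the function never looks at the installed TF library itself, only at its version string.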

Actually, the current TF master does provide this information. It was added here: https://github.com/tensorflow/tensorflow/pull/57468. With that change, tf.sysconfig.get_compile_flags() includes this flag (e.g. --std=c++17). So probably any upcoming TF release will have this information.

This is the other aspect I wanted to address in this issue: we currently don't use tf.sysconfig.get_compile_flags() at all. We manually add "-D_GLIBCXX_USE_CXX11_ABI=%i" % (1 if self.use_cxx11_abi else 0). We probably should use get_compile_flags instead. It might not be available in all older TF versions, though, so we still need the manual flags as a fallback.
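A rough sketch of how the preferred path and the fallback could fit together. All helper names here (`find_std_flag`, `build_compile_flags`, `get_tf_compile_flags`) are hypothetical and not part of RETURNN. The sketch only covers the ABI define and the standard flag, and a real fallback would also need include paths and other options.

```python
from typing import List, Optional


def find_std_flag(flags: List[str]) -> Optional[str]:
    """Return the first C++ standard flag (-std=... or --std=...), if any."""
    for flag in flags:
        if flag.startswith("-std=") or flag.startswith("--std="):
            return flag
    return None


def build_compile_flags(
    tf_flags: Optional[List[str]], use_cxx11_abi: bool, fallback_std: str
) -> List[str]:
    """Prefer the flags reported by TF; otherwise fall back to manual flags."""
    if tf_flags is not None:
        flags = list(tf_flags)
    else:
        # Manual fallback, as currently done for the ABI define.
        flags = ["-D_GLIBCXX_USE_CXX11_ABI=%i" % (1 if use_cxx11_abi else 0)]
    if find_std_flag(flags) is None:
        # TF did not report a standard (up to and including TF 2.10): guess.
        flags.append("-std=" + fallback_std)
    return flags


def get_tf_compile_flags() -> Optional[List[str]]:
    """Query TF for its compile flags; None if TF or the API is missing."""
    try:
        import tensorflow as tf  # imported lazily so the sketch runs without TF

        return list(tf.sysconfig.get_compile_flags())
    except (ImportError, AttributeError):
        return None
```

A caller would then use something like `build_compile_flags(get_tf_compile_flags(), use_cxx11_abi=True, fallback_std="c++14")`, where the fallback standard comes from the version-based guess.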

Oct 20 '22 13:10 albertz

Hi @albertz, this is still not working for me. I still get a symbol error on the demos. I started by deleting my existing repos, then did a pip install. Is there a compiled code cache somewhere?

    git clone https://github.com/rwth-i6/returnn_common.git
    git clone https://github.com/rwth-i6/returnn.git
    git clone https://github.com/rwth-i6/returnn.wiki.git
    cd ~/venv/
    python -m venv ~/venv/returnn
    cd ~/tmp/repo/
    . ~/venv/returnn/bin/activate
    pip install -U pip
    pip install returnn
    cd returnn
    pip install -r requirements.txt
    pip install tensorflow

Then:

    ./rnn.py demos/demo-tf-native-lstm2.12ax.config 
EXCEPTION
Traceback (most recent call last):
  File "/home/braddock/tmp/repo/returnn/./rnn.py", line 11, in <module>
    line: main()
    locals:
      main = <local> <function main at 0x7f87b0929750>
  File "/home/braddock/tmp/repo/returnn/returnn/__main__.py", line 669, in main
    line: init(command_line_options=argv[1:])
    locals:
      init = <global> <function init at 0x7f87b0929480>
      command_line_options = <not found>
      argv = <local> ['./rnn.py', 'demos/demo-tf-native-lstm2.12ax.config'], _[0]: {len = 8}
  File "/home/braddock/tmp/repo/returnn/returnn/__main__.py", line 405, in init
    line: init_backend_engine()
    locals:
      init_backend_engine = <global> <function init_backend_engine at 0x7f87b09293f0>
  File "/home/braddock/tmp/repo/returnn/returnn/__main__.py", line 374, in init_backend_engine
    line: tf_util.print_available_devices(tf_session_opts=tf_session_opts, file=log.v2)
    locals:
      tf_util = <local> <module 'returnn.tf.util.basic' from '/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py'>
      tf_util.print_available_devices = <local> <function print_available_devices at 0x7f876c88c3a0>
      tf_session_opts = <local> {'gpu_options': {'per_process_gpu_memory_fraction': 0.1}}
      file = <not found>
      log = <global> <returnn.log.Log object at 0x7f87c29a7520>
      log.v2 = <global> <returnn.log.Stream object at 0x7f87b0936800>
  File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 1129, in print_available_devices
    line: devs = get_tf_list_local_devices(tf_session_opts=tf_session_opts, file=file)
    locals:
      devs = <not found>
      get_tf_list_local_devices = <global> <function get_tf_list_local_devices at 0x7f876c88c040>
      tf_session_opts = <local> {'gpu_options': {'per_process_gpu_memory_fraction': 0.1}}
      file = <local> <returnn.log.Stream object at 0x7f87b0936800>
  File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 1086, in get_tf_list_local_devices
    line: dev.set_physical_device_desc(session=session)
    locals:
      dev = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
      dev.set_physical_device_desc = <local> <bound method _DeviceAttributes.set_physical_device_desc of <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>>
      session = <local> <tensorflow.python.client.session.Session object at 0x7f876e3c3f70>
  File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 1028, in _DeviceAttributes.set_physical_device_desc
    line: physical_device_desc = session.run(get_device_attr(self.name))
    locals:
      physical_device_desc = <not found>
      session = <local> <tensorflow.python.client.session.Session object at 0x7f876e3c3f70>
      session.run = <local> <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7f876e3c3f70>>
      get_device_attr = <global> <function get_device_attr at 0x7f876c8a80d0>
      self = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
      self.name = <local> '/job:localhost/replica:0/task:0/device:CPU:0', len = 44
  File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 6126, in get_device_attr
    line: return _DeviceAttrMod.get_device_attr()
    locals:
      _DeviceAttrMod = <global> <class 'returnn.tf.util.basic._DeviceAttrMod'>
      _DeviceAttrMod.get_device_attr = <global> <bound method _DeviceAttrMod.get_device_attr of <class 'returnn.tf.util.basic._DeviceAttrMod'>>
  File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 6114, in _DeviceAttrMod.get_device_attr
    line: return cls.get_mod().get_device_attr()
    locals:
      cls = <local> <class 'returnn.tf.util.basic._DeviceAttrMod'>
      cls.get_mod = <local> <bound method _DeviceAttrMod.get_mod of <class 'returnn.tf.util.basic._DeviceAttrMod'>>
      get_device_attr = <global> <function get_device_attr at 0x7f876c8a80d0>
  File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 6103, in _DeviceAttrMod.get_mod
    line: tf_mod = compiler.load_tf_module()
    locals:
      tf_mod = <not found>
      compiler = <local> <OpCodeCompiler 'GetDeviceAttr' in '/var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6'>
      compiler.load_tf_module = <local> <bound method OpCodeCompiler.load_tf_module of <OpCodeCompiler 'GetDeviceAttr' in '/var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6'>>
  File "/home/braddock/tmp/repo/returnn/returnn/tf/util/basic.py", line 2921, in OpCodeCompiler.load_tf_module
    line: self._tf_mod = tf.load_op_library(self._so_filename)
    locals:
      self = <local> <OpCodeCompiler 'GetDeviceAttr' in '/var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6'>
      self._tf_mod = <local> None
      tf = <global> <module 'tensorflow' from '/home/braddock/venv/returnn/lib/python3.10/site-packages/tensorflow/__init__.py'>
      tf.load_op_library = <global> <function load_op_library at 0x7f876e408dc0>
      self._so_filename = <local> '/var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6/GetDeviceAttr.so', len = 80
  File "/home/braddock/venv/returnn/lib/python3.10/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
    line: lib_handle = py_tf.TF_LoadLibrary(library_filename)
    locals:
      lib_handle = <not found>
      py_tf = <global> <module 'tensorflow.python.client.pywrap_tf_session' from '/home/braddock/venv/returnn/lib/python3.10/site-packages/tensorflow/python/client/pywrap_tf_session.py'>
      py_tf.TF_LoadLibrary = <global> <built-in method TF_LoadLibrary of PyCapsule object at 0x7f87b01cd8c0>
      library_filename = <local> '/var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6/GetDeviceAttr.so', len = 80
NotFoundError: /var/tmp/braddock/returnn_tf_cache/ops/GetDeviceAttr/008f663ee6/GetDeviceAttr.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl12lts_2022062311string_viewESt10unique_ptrINS0_15OpKernelFactoryESt14default_deleteIS9_EE

Oct 20 '22 17:10 braddockcg

I also tried pip uninstall returnn, with the same result.

Oct 20 '22 17:10 braddockcg

Ah, removing the cache with

rm -rvf /var/tmp/braddock/returnn_*

fixed it. It appears to be working now. Sorry for the distraction.

Oct 20 '22 17:10 braddockcg

Ah yes, many things, such as the code itself and other flags, are part of the hash for these cache files, but this particular flag was not, so you needed to remove the cache to force a rebuild. Maybe we should add these flags to the hash as well.
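To illustrate the idea of making the flags part of the hash, here is a small sketch. The function `cache_key` and the choice of SHA-1 are assumptions made for illustration, not the actual RETURNN hashing code.

```python
import hashlib
from typing import Sequence


def cache_key(code: str, compile_flags: Sequence[str], length: int = 10) -> str:
    """Derive a short cache directory name from the op code and the flags.

    Hypothetical sketch: everything fed into the hash forces a rebuild
    when it changes, so including the -std flag avoids stale binaries.
    """
    hasher = hashlib.sha1()
    hasher.update(code.encode("utf8"))
    for flag in compile_flags:
        # Separate the items so that ["-a", "b"] and ["-ab"] hash differently.
        hasher.update(b"\0")
        hasher.update(flag.encode("utf8"))
    return hasher.hexdigest()[:length]


if __name__ == "__main__":
    code = "/* op source */"
    print(cache_key(code, ["-D_GLIBCXX_USE_CXX11_ABI=1", "-std=c++14"]))
    print(cache_key(code, ["-D_GLIBCXX_USE_CXX11_ABI=1", "-std=c++17"]))
```

With the standard flag included, switching from C++14 to C++17 yields a different key and thus a fresh build directory, without having to delete the cache by hand.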

Oct 20 '22 18:10 albertz

Have you considered ccache? It will do the caching right and is completely unintrusive.

Oct 20 '22 18:10 braddockcg

We don't really want to have any other dependencies except for GCC itself. I also don't think that ccache would really give us any benefit over what we currently have.

Oct 20 '22 18:10 albertz

ccache's only dependency is gcc as well. It will save hundreds of lines of code and complexity. It will also detect TF header version changes.

Oct 20 '22 19:10 braddockcg

Well, ccache is still a separate dependency and usually not installed by default. We really want to avoid further dependencies, and we want RETURNN to basically "just work" on most standard developer systems, e.g. some Ubuntu with GCC installed, or MacOSX with its dev tools.

Besides, our code for this is really simple, maybe something like 50 lines of code or less, and it basically just works. I don't think ccache would make it simpler, maybe even the opposite.

It is also an advantage that we explicitly define what is part of the hash. That may have led to your problem here (which was really a rare case), but on the other hand, it makes the check fast. We don't want to check all the hundreds of header files again and again, especially because many users of RETURNN have a slow filesystem.

We also never really had problems with the caching logic so far, except maybe your case now, but this was also simple to fix, and usually does not happen.

Oct 20 '22 21:10 albertz

#901 is related to this

Oct 26 '22 16:10 JackTemaki
