
Compiling TensorFlow graphs out of returnn_common networks

Open NeoLegends opened this issue 2 years ago • 2 comments

Hi!

I'm trying out returnn_common together with the hybrid pipeline from https://github.com/rwth-i6/i6_experiments/blob/main/common/setups/rasr/hybrid_system.py. I think I have config generation via the serializer from i6_experiments working. However, for decoding, RASR needs a compiled TF graph, which in the current hybrid setup is created by the CompileTFGraphJob: it dumps the RETURNN config to a file and runs it through https://github.com/rwth-i6/returnn/blob/master/tools/compile_tf_graph.py.

When the script sees a returnn_common-generated config, it runs into an assert, because the network is now defined via the get_network function and the config no longer contains a dict under the network key:

EXCEPTION
Traceback (most recent call last):
  File "/u/mgunz/setups/2022-07--baselines/recipe/returnn/tools/compile_tf_graph.py", line 1605, in <module>
    line: main(sys.argv)
    locals:
      main = <local> <function main at 0x7ff2820be790>
      sys = <local> <module 'sys' (built-in)>
      sys.argv = <local> ['/u/mgunz/setups/2022-07--baselines/recipe/returnn/tools/compile_tf_graph.py', 'returnn.config', '--train=0', '--eval=0', '--search=0', '--verbosity=4', '--output_file=/u/mgunz/setups/2022-07--baselines/work/i6_core/returnn/compile/CompileTFGraphJob.AZjztrn186Qt/output/graph.meta', '--output_fil..., len = 9, _[0]: {len = 75}
  File "/u/mgunz/setups/2022-07--baselines/recipe/returnn/tools/compile_tf_graph.py", line 1511, in main
    line: assert 'network' in config.typed_dict
    locals:
      config = <global> <returnn.config.Config object at 0x7ff28207ffd0>
      config.typed_dict = <global> {'use_tensorflow': True, 'log': None, 'log_verbosity': 4, 'task': '/u/mgunz/setups/2022-07--baselines/recipe/returnn/tools/compile_tf_graph.py', 'device': 'cpu', 'config': {}, '__file__': 'returnn.config', '__name__': '__returnn_config__', '__package__': 'returnn', '__builtins__': {'__name__': 'b..., len = 50
AssertionError
[2022-08-02 12:05:04,962] ERROR: Executed command failed:
[2022-08-02 12:05:04,963] ERROR: Cmd: ['/u/rossenbach/bin/returnn_tf2.3_launcher.sh', '/u/mgunz/setups/2022-07--baselines/recipe/returnn/tools/compile_tf_graph.py', 'returnn.config', '--train=0', '--eval=0', '--search=0', '--verbosity=4', '--output_file=/u/mgunz/setups/2022-07--baselines/work/i6_core/returnn/compile/CompileTFGraphJob.AZjztrn186Qt/output/graph.meta', '--output_file_model_params_list=model_params', '--output_file_state_vars_list=state_vars']
[2022-08-02 12:05:04,963] ERROR: Args: (1, ['/u/rossenbach/bin/returnn_tf2.3_launcher.sh', '/u/mgunz/setups/2022-07--baselines/recipe/returnn/tools/compile_tf_graph.py', 'returnn.config', '--train=0', '--eval=0', '--search=0', '--verbosity=4', '--output_file=/u/mgunz/setups/2022-07--baselines/work/i6_core/returnn/compile/CompileTFGraphJob.AZjztrn186Qt/output/graph.meta', '--output_file_model_params_list=model_params', '--output_file_state_vars_list=state_vars'])
[2022-08-02 12:05:04,963] ERROR: Return-Code: 1
[2022-08-02 12:05:04,966] INFO: Max resources: Run time: 0:00:08 CPU: 106.7% RSS: 316MB VMS: 1.99GB

This raises the question of whether to adapt the script to handle returnn_common-style get_network functions, or whether CompileTFGraphJob should be adapted to write the network into the config file as a classic dict, so that the script keeps working. I don't know enough about the script to make this decision myself, but this seems to be something that needs to be resolved before the hybrid pipeline can be compatible with returnn_common at all.
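For illustration, a minimal sketch of the two config styles (the layer contents are made up, only the structure matters):

```python
# Classic RETURNN config: the network is a static dict under the
# "network" key, which is what compile_tf_graph.py asserts on.
network = {
    "output": {"class": "softmax", "loss": "ce", "from": "data"},
}


# returnn_common-style config: the network is built per epoch by a
# function, so config.typed_dict has no "network" entry and the assert fails.
def get_network(epoch, **kwargs):
    ...  # construct and return the net dict for the given epoch
```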

cc @christophmluscher @Atticus1806

NeoLegends avatar Aug 02 '22 12:08 NeoLegends

Two options:

The compile_tf_graph.py script could be extended to also handle the get_network function. However, then you also need to specify the epoch.

Or, when generating the config for compile_tf_graph.py, you would directly create a network dict; see the sketch below.
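A minimal sketch of that second option (the helper name and config layout are hypothetical, not actual i6_core code):

```python
# Hypothetical: resolve get_network once at config-generation time, so the
# config that CompileTFGraphJob hands to compile_tf_graph.py contains a
# classic static "network" dict again and the script works unchanged.
def make_compile_config(get_network, epoch):
    net_dict = get_network(epoch=epoch)  # build the net for a fixed epoch
    return {
        "use_tensorflow": True,
        "network": net_dict,  # static dict, as the script's assert expects
    }
```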

albertz avatar Aug 02 '22 12:08 albertz

Copying my comment from the duplicate #1103:

The behavior should be backward compatible, i.e. epoch=None by default or so: that would always use network when available, and otherwise throw an error (just as it is right now). Only when epoch is not None would it also check get_network, and then maybe also the pretrain logic.
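A minimal sketch of that dispatch (not the actual compile_tf_graph.py code; the epoch argument and helper name are assumptions, and the pretrain handling is omitted):

```python
def get_net_dict(config, epoch=None):
    """Resolve the network dict from a loaded RETURNN config."""
    if epoch is not None and "get_network" in config.typed_dict:
        # returnn_common-style config: construct the net for this epoch.
        return config.typed_dict["get_network"](epoch=epoch)
    # Default / classic behavior: require a static "network" dict,
    # just as the current assert does.
    assert "network" in config.typed_dict
    return config.typed_dict["network"]
```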

albertz avatar Sep 23 '22 09:09 albertz