
TF to ONNX export fails with large models

Open cchan-lm opened this issue 3 years ago • 4 comments

System Info

  • transformers version: 4.21.1
  • Platform: Linux-4.15.0-187-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.5
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): 2.7.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Run `python -m transformers.onnx --model=gpt2-large --framework=tf onnx/`

See an error like the one below:

Traceback (most recent call last):
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tf2onnx/tf_loader.py", line 221, in from_trackable
    frozen_graph = from_function(concrete_func, inputs, outputs, large_model)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tf2onnx/tf_loader.py", line 280, in from_function
    raise e
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tf2onnx/tf_loader.py", line 273, in from_function
    frozen_func = convert_variables_to_constants_v2(func, lower_control_flow=False, aggressive_inlining=True)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/framework/convert_to_constants.py", line 1156, in convert_variables_to_constants_v2
    converted_input_indices)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/framework/convert_to_constants.py", line 1082, in _construct_concrete_function
    new_output_names)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/eager/wrap_function.py", line 660, in function_from_graph_def
    wrapped_import = wrap_function(_imports_graph_def, [])
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/eager/wrap_function.py", line 631, in wrap_function
    collections={}),
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 1143, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/eager/wrap_function.py", line 87, in __call__
    return self.call_with_variable_creator_scope(self._fn)(*args, **kwargs)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/eager/wrap_function.py", line 93, in wrapped
    return fn(*args, **kwargs)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/eager/wrap_function.py", line 654, in _imports_graph_def
    importer.import_graph_def(graph_def, name="")
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 552, in new_func
    return func(*args, **kwargs)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/framework/importer.py", line 412, in import_graph_def
    producer_op_list=producer_op_list)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tensorflow/python/framework/importer.py", line 501, in _import_graph_def_internal
    with c_api_util.tf_buffer(graph_def.SerializeToString()) as serialized:
ValueError: Message tensorflow.GraphDef exceeds maximum protobuf size of 2GB: 3096993336

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/craig/.pyenv/versions/3.7.5/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/craig/.pyenv/versions/3.7.5/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/transformers/onnx/__main__.py", line 107, in <module>
    main()
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/transformers/onnx/__main__.py", line 94, in main
    args.output,
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/transformers/onnx/convert.py", line 338, in export
    return export_tensorflow(preprocessor, model, config, opset, output, tokenizer=tokenizer)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/transformers/onnx/convert.py", line 265, in export_tensorflow
    onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature, opset=opset)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tf2onnx/convert.py", line 493, in from_keras
    tf_loader.from_trackable(model, concrete_func, input_names, output_names, large_model)
  File "/home/craig/.pyenv/versions/tf-hf-test/lib/python3.7/site-packages/tf2onnx/tf_loader.py", line 224, in from_trackable
    raise ValueError(err_large_model)
ValueError: model exceeds maximum protobuf size of 2GB. Try setting large_model.

Expected behavior

Export should succeed for large TF models. tf2onnx expects `large_model=True` to be passed when the protobuf would exceed 2 GB. I'm not sure whether tf2onnx's behavior will change, but perhaps transformers could account for this before calling tf2onnx?
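For what it's worth, a minimal sketch of what that could look like, based on tf2onnx's documented `large_model` flag and its `save_onnx_zip` helper (the size heuristic, the opset, and the output paths here are just my illustration, not transformers code):

```python
import tensorflow as tf
import tf2onnx
from transformers import TFGPT2LMHeadModel

model = TFGPT2LMHeadModel.from_pretrained("gpt2-large")

# Dynamic batch and sequence dims, like the transformers exporter uses.
input_signature = (
    tf.TensorSpec((None, None), tf.int32, name="input_ids"),
    tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
)

# Illustrative heuristic: gpt2-large has ~774M float32 params, i.e. ~3.1 GB
# of weights, well over protobuf's 2 GB serialization limit.
num_params = sum(int(tf.size(w)) for w in model.weights)
large_model = num_params * 4 > 2 * 1024**3

onnx_model, external_storage = tf2onnx.convert.from_keras(
    model, input_signature, opset=13, large_model=large_model
)

if large_model:
    # With large_model=True, tensors live in external storage and tf2onnx
    # packages the model plus tensors into a single zip archive.
    tf2onnx.utils.save_onnx_zip("onnx/gpt2-large.zip", onnx_model, external_storage)
else:
    import onnx
    onnx.save(onnx_model, "onnx/model.onnx")
```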

cchan-lm avatar Aug 05 '22 18:08 cchan-lm

cc @JingyaHuang @michaelbenayoun

LysandreJik avatar Aug 09 '22 07:08 LysandreJik

If there are no onnx-level solutions, it may be due to TF1 code (embeddings) in our models -- see https://github.com/tensorflow/tensorflow/issues/45041

Rewriting the embeddings as TF2 code is on our to-do list, which may fix this issue.
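For context, a rough sketch of the kind of rewrite that means, assuming the current code builds a raw weight in `build()` and indexes it with `tf.gather` (both layers below are illustrative, not the actual transformers classes):

```python
import tensorflow as tf

# TF1-flavored pattern (illustrative): a hand-built weight indexed with
# tf.gather; the linked TensorFlow issue suggests variables captured this
# way can interact badly with graph freezing.
class GatherEmbedding(tf.keras.layers.Layer):
    def __init__(self, vocab_size, hidden_size, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

    def build(self, input_shape):
        self.weight = self.add_weight(
            name="weight", shape=(self.vocab_size, self.hidden_size)
        )
        super().build(input_shape)

    def call(self, input_ids):
        return tf.gather(self.weight, input_ids)


# TF2-native equivalent: the built-in Keras embedding layer.
tf2_embedding = tf.keras.layers.Embedding(input_dim=50257, output_dim=1280)
```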

gante avatar Aug 09 '22 09:08 gante

tf2onnx supports exporting large ONNX tensors to external files; however, after adding the flag to the transformers ONNX exporter, it doesn't work correctly at the moment:

  File "/home/ubuntu/anaconda3/envs/venv_onnx_large/lib/python3.9/site-packages/transformers/onnx/convert.py", line 338, in export
    return export_tensorflow(preprocessor, model, config, opset, output, tokenizer=tokenizer)
  File "/home/ubuntu/anaconda3/envs/venv_onnx_large/lib/python3.9/site-packages/transformers/onnx/convert.py", line 265, in export_tensorflow
    onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature, opset=opset, large_model=True)
  File "/home/ubuntu/anaconda3/envs/venv_onnx_large/lib/python3.9/site-packages/tf2onnx/convert.py", line 495, in from_keras
    model_proto, external_tensor_storage = _convert_common(
  File "/home/ubuntu/anaconda3/envs/venv_onnx_large/lib/python3.9/site-packages/tf2onnx/convert.py", line 165, in _convert_common
    g = process_tf_graph(tf_graph, const_node_values=const_node_values,
  File "/home/ubuntu/anaconda3/envs/venv_onnx_large/lib/python3.9/site-packages/tf2onnx/tfonnx.py", line 459, in process_tf_graph
    main_g, subgraphs = graphs_from_tf(tf_graph, input_names, output_names, shape_override, const_node_values,
  File "/home/ubuntu/anaconda3/envs/venv_onnx_large/lib/python3.9/site-packages/tf2onnx/tfonnx.py", line 499, in graphs_from_tf
    utils.check_io(input_names, output_names, output_shapes.keys())
  File "/home/ubuntu/anaconda3/envs/venv_onnx_large/lib/python3.9/site-packages/tf2onnx/utils.py", line 316, in check_io
    raise ValueError("Inputs/Outputs Not Found")
ValueError: Inputs/Outputs Not Found

Further investigation is needed on the TensorFlow side. I will be happy to help with a PR enabling this in the transformers ONNX TF exporter once we are sure the large-proto export feature works correctly.
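For completeness, the external-file mechanism mentioned above is ONNX's standard external-data format. A minimal sketch of applying it directly, assuming a conversion that already produced an in-memory `model_proto` (obtaining that proto is exactly the step failing here, so this is not yet a workaround for this issue):

```python
import onnx

# model_proto: an in-memory ModelProto (e.g. the first value returned by
# tf2onnx.convert.from_keras). Spilling tensors to a side file keeps the
# serialized .onnx under protobuf's 2 GB limit.
onnx.save_model(
    model_proto,
    "onnx/model.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model.onnx_data",  # illustrative name, relative to the model file
    size_threshold=1024,  # only tensors above this many bytes go external
)
```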

JingyaHuang avatar Aug 09 '22 10:08 JingyaHuang

> If there are no onnx-level solutions, it may be due to TF1 code (embeddings) in our models -- see tensorflow/tensorflow#45041
>
> Rewriting the embeddings as TF2 code is on our to-do list, which may fix this issue.

I didn't know that. OK, so it seems this is not just a problem with the protobuf size limit, then.

JingyaHuang avatar Aug 09 '22 10:08 JingyaHuang

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 05 '22 15:09 github-actions[bot]