
Optimum Not Working With LongT5 Summarization

Open reelmath opened this issue 3 years ago • 4 comments

System Info

- `optimum` version: 1.2.3 (installed from GitHub)
- `transformers` version: 4.20.1
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.11.0+cu113 (False)
- Tensorflow version (GPU?): 2.8.2 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no

Who can help?

@lewtun @michaelbenayoun @JingyaHuang

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Essentially, I am unable to run LongT5 with the summarization pipeline on anything longer than roughly 25 tokens. I get an error from a LessOrEqual node ending in "Can broadcast 0 by 0 or 1. num is invalid", where num is a value that grows with the length of the sequence I pass to the pipeline.

If it helps, I get the same error when using Hugging Face transformers directly with !python -m transformers.onnx --model=my-custom-pretrained-summarizer --feature seq2seq-lm.

!pip install transformers
!pip install transformers[onnx]
!python -m pip install git+https://github.com/huggingface/optimum.git
!python -m pip install git+https://github.com/huggingface/optimum.git#egg=optimum[onnxruntime]
!pip install datasets

from optimum.onnxruntime import ORTModelForSeq2SeqLM

model = ORTModelForSeq2SeqLM.from_pretrained("my-custom-pretrained-summarizer", from_transformers=True)

Output:

/usr/local/lib/python3.7/dist-packages/transformers/models/longt5/modeling_longt5.py:180: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  global_block_ids_lower_bound = torch.tensor(-1.0, dtype=global_block_ids.dtype, device=global_block_ids.device)
/usr/local/lib/python3.7/dist-packages/transformers/models/longt5/modeling_longt5.py:188: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  num_globals = seq_len // global_block_size
/usr/local/lib/python3.7/dist-packages/transformers/models/longt5/modeling_longt5.py:190: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if num_globals > 0:
/usr/local/lib/python3.7/dist-packages/transformers/models/longt5/modeling_longt5.py:217: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  block_ids >= 0, torch.tensor(global_seq_len, dtype=block_ids.dtype, device=block_ids.device)
/usr/local/lib/python3.7/dist-packages/transformers/models/longt5/modeling_longt5.py:217: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  block_ids >= 0, torch.tensor(global_seq_len, dtype=block_ids.dtype, device=block_ids.device)
/usr/local/lib/python3.7/dist-packages/transformers/models/longt5/modeling_longt5.py:219: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  one_hot_block_ids = nn.functional.one_hot(block_ids.type(torch.int64), global_seq_len + 1)[:, :, :-1]
/usr/local/lib/python3.7/dist-packages/transformers/models/longt5/modeling_longt5.py:84: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if x.shape[dim] % block_len != 0:
/usr/local/lib/python3.7/dist-packages/transformers/models/longt5/modeling_longt5.py:67: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if not all(x.shape):
/usr/local/lib/python3.7/dist-packages/transformers/models/longt5/modeling_longt5.py:86: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  num_blocks = x.shape[dim] // block_len
/usr/local/lib/python3.7/dist-packages/transformers/models/longt5/modeling_longt5.py:89: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if 0 in output_shape:
/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:770: FutureWarning: The device argument is deprecated and will be removed in v5 of Transformers.
  "The device argument is deprecated and will be removed in v5 of Transformers.", FutureWarning
/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:781: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if causal_mask.shape[1] < attention_mask.shape[1]:

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('google/long-t5-tglobal-large')

onnx_summarization = pipeline("summarization", model=model, tokenizer=tokenizer)

text = # Something longer than ~25 tokens
pred = onnx_summarization(text)

/usr/local/lib/python3.7/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in run(self, output_names, input_feed, run_options)
    190     output_names = [output.name for output in self._outputs_meta]
    191     try:
--> 192         return self._sess.run(output_names, input_feed, run_options)
    193     except C.EPFail as err:
    194         if self._enable_fallback:

RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running LessOrEqual node. Name:'LessOrEqual_648' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:603 onnxruntime::Broadcaster::Broadcaster(gsl::span, gsl::span) largest <= 1 was false. Can broadcast 0 by 0 or 1. 16 is invalid.

Expected behavior

[{'summary_text': '<something that should look the same or very similar to what my PyTorch model outputs>'}]

reelmath avatar Jul 12 '22 15:07 reelmath

cc @echarlaix

philschmid avatar Jul 14 '22 08:07 philschmid

Hi @reelmath,

I was not able to reproduce the error you are describing by running the following code:

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

model_name = "google/long-t5-local-base"
model = ORTModelForSeq2SeqLM.from_pretrained(model_name, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
onnx_summarization = pipeline("summarization", model=model, tokenizer=tokenizer)
text = # Something longer than ~25 tokens
pred = onnx_summarization(text)

Can you confirm whether you are able to run it? (This will help determine whether the problem comes from the specific model you are exporting.)

echarlaix avatar Jul 20 '22 14:07 echarlaix

Hi @echarlaix,

I tried the following four experiments:

  • longt5-local-base: no error, just as you found.
  • longt5-tglobal-base:

RuntimeException Traceback (most recent call last)
in ()
      7
      8 text = # something longer than ~25 tokens
----> 9 pred = onnx_summarization(text)

11 frames /usr/local/lib/python3.7/dist-packages/transformers/pipelines/text2text_generation.py in call(self, *args, **kwargs) 233 ids of the summary. 234 """ --> 235 return super().call(*args, **kwargs) 236 237 def check_inputs(self, input_length: int, min_length: int, max_length: int) -> bool:

/usr/local/lib/python3.7/dist-packages/transformers/pipelines/text2text_generation.py in call(self, *args, **kwargs) 135 """ 136 --> 137 result = super().call(*args, **kwargs) 138 if ( 139 isinstance(args[0], list)

/usr/local/lib/python3.7/dist-packages/transformers/pipelines/base.py in call(self, inputs, num_workers, batch_size, *args, **kwargs) 1041 return self.iterate(inputs, preprocess_params, forward_params, postprocess_params) 1042 else: -> 1043 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params) 1044 1045 def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params):

/usr/local/lib/python3.7/dist-packages/transformers/pipelines/base.py in run_single(self, inputs, preprocess_params, forward_params, postprocess_params) 1048 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params): 1049 model_inputs = self.preprocess(inputs, **preprocess_params) -> 1050 model_outputs = self.forward(model_inputs, **forward_params) 1051 outputs = self.postprocess(model_outputs, **postprocess_params) 1052 return outputs

/usr/local/lib/python3.7/dist-packages/transformers/pipelines/base.py in forward(self, model_inputs, **forward_params) 957 with inference_context(): 958 model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device) --> 959 model_outputs = self._forward(model_inputs, **forward_params) 960 model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu")) 961 else:

/usr/local/lib/python3.7/dist-packages/transformers/pipelines/text2text_generation.py in _forward(self, model_inputs, **generate_kwargs) 157 generate_kwargs["max_length"] = generate_kwargs.get("max_length", self.model.config.max_length) 158 self.check_inputs(input_length, generate_kwargs["min_length"], generate_kwargs["max_length"]) --> 159 output_ids = self.model.generate(**model_inputs, **generate_kwargs) 160 out_b = output_ids.shape[0] 161 if self.framework == "pt":

/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs) 25 def decorate_context(*args, **kwargs): 26 with self.clone(): ---> 27 return func(*args, **kwargs) 28 return cast(F, decorate_context) 29

/usr/local/lib/python3.7/dist-packages/transformers/generation_utils.py in generate(self, inputs, max_length, min_length, do_sample, early_stopping, num_beams, temperature, top_k, top_p, typical_p, repetition_penalty, bad_words_ids, force_words_ids, bos_token_id, pad_token_id, eos_token_id, length_penalty, no_repeat_ngram_size, encoder_no_repeat_ngram_size, num_return_sequences, max_time, max_new_tokens, decoder_start_token_id, use_cache, num_beam_groups, diversity_penalty, prefix_allowed_tokens_fn, logits_processor, renormalize_logits, stopping_criteria, constraints, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, forced_bos_token_id, forced_eos_token_id, remove_invalid_values, synced_gpus, exponential_decay_length_penalty, **model_kwargs) 1180 # and added to model_kwargs 1181 model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation( -> 1182 inputs_tensor, model_kwargs, model_input_name 1183 ) 1184

/usr/local/lib/python3.7/dist-packages/transformers/generation_utils.py in _prepare_encoder_decoder_kwargs_for_generation(self, inputs_tensor, model_kwargs, model_input_name) 523 encoder_kwargs["return_dict"] = True 524 encoder_kwargs[model_input_name] = inputs_tensor --> 525 model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs) 526 527 return model_kwargs

/usr/local/lib/python3.7/dist-packages/optimum/onnxruntime/modeling_seq2seq.py in call(self, *args, **kwargs) 467 468 def call(self, *args, **kwargs): --> 469 return self.forward(*args, **kwargs) 470 471

/usr/local/lib/python3.7/dist-packages/optimum/onnxruntime/modeling_seq2seq.py in forward(self, input_ids, attention_mask, **kwargs) 461 462 # Run inference --> 463 outputs = self.session.run(None, onnx_inputs) 464 last_hidden_state = torch.from_numpy(outputs[self.output_names["last_hidden_state"]]).to(self._device) 465

/usr/local/lib/python3.7/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in run(self, output_names, input_feed, run_options) 190 output_names = [output.name for output in self._outputs_meta] 191 try: --> 192 return self._sess.run(output_names, input_feed, run_options) 193 except C.EPFail as err: 194 if self._enable_fallback:

RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running LessOrEqual node. Name:'LessOrEqual_1891' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:603 onnxruntime::Broadcaster::Broadcaster(gsl::span, gsl::span) largest <= 1 was false. Can broadcast 0 by 0 or 1. 3 is invalid.

-----------------------------------------------------------------------------------------------------------------------------------

For both longt5-local-large and longt5-tglobal-large, I get an error even earlier:


RuntimeError Traceback (most recent call last) in () 1 from optimum.onnxruntime import ORTModelForSeq2SeqLM 2 ----> 3 model = ORTModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-large", from_transformers=True) #"gdrive/MyDrive/checkpoint-280"

8 frames /usr/local/lib/python3.7/dist-packages/optimum/modeling_base.py in from_pretrained(cls, model_id, from_transformers, force_download, use_auth_token, cache_dir, **model_kwargs) 205 force_download=force_download, 206 use_auth_token=use_auth_token, --> 207 **model_kwargs, 208 ) 209 else:

/usr/local/lib/python3.7/dist-packages/optimum/onnxruntime/modeling_seq2seq.py in _from_transformers(cls, model_id, save_dir, use_auth_token, revision, force_download, cache_dir, **kwargs) 391 config=onnx_config_encoder, 392 opset=onnx_opset, --> 393 output=save_dir.joinpath(ONNX_ENCODER_NAME), 394 ) 395

/usr/local/lib/python3.7/dist-packages/transformers/onnx/convert.py in export(preprocessor, model, config, opset, output, tokenizer, device) 333 334 if is_torch_available() and issubclass(type(model), PreTrainedModel): --> 335 return export_pytorch(preprocessor, model, config, opset, output, tokenizer=tokenizer, device=device) 336 elif is_tf_available() and issubclass(type(model), TFPreTrainedModel): 337 return export_tensorflow(preprocessor, model, config, opset, output, tokenizer=tokenizer)

/usr/local/lib/python3.7/dist-packages/transformers/onnx/convert.py in export_pytorch(preprocessor, model, config, opset, output, tokenizer, device) 196 dynamic_axes={name: axes for name, axes in chain(config.inputs.items(), config.outputs.items())}, 197 do_constant_folding=True, --> 198 opset_version=opset, 199 ) 200

/usr/local/lib/python3.7/dist-packages/torch/onnx/init.py in export(model, args, f, export_params, verbose, training, input_names, output_names, operator_export_type, opset_version, do_constant_folding, dynamic_axes, keep_initializers_as_inputs, custom_opsets, export_modules_as_functions) 363 keep_initializers_as_inputs, 364 custom_opsets, --> 365 export_modules_as_functions, 366 ) 367

/usr/local/lib/python3.7/dist-packages/torch/onnx/utils.py in export(model, args, f, export_params, verbose, training, input_names, output_names, operator_export_type, opset_version, do_constant_folding, dynamic_axes, keep_initializers_as_inputs, custom_opsets, export_modules_as_functions) 176 keep_initializers_as_inputs=keep_initializers_as_inputs, 177 custom_opsets=custom_opsets, --> 178 export_modules_as_functions=export_modules_as_functions, 179 ) 180

/usr/local/lib/python3.7/dist-packages/torch/onnx/utils.py in _export(model, args, f, export_params, verbose, training, input_names, output_names, operator_export_type, export_type, opset_version, do_constant_folding, dynamic_axes, keep_initializers_as_inputs, fixed_batch_size, custom_opsets, add_node_names, onnx_shape_inference, export_modules_as_functions) 1082 fixed_batch_size=fixed_batch_size, 1083 training=training, -> 1084 dynamic_axes=dynamic_axes, 1085 ) 1086

/usr/local/lib/python3.7/dist-packages/torch/onnx/utils.py in _model_to_graph(model, args, verbose, input_names, output_names, operator_export_type, do_constant_folding, _disable_torch_constant_prop, fixed_batch_size, training, dynamic_axes) 737 dynamic_axes=dynamic_axes, 738 input_names=input_names, --> 739 module=module, 740 ) 741 except Exception as e:

/usr/local/lib/python3.7/dist-packages/torch/onnx/utils.py in _optimize_graph(graph, operator_export_type, _disable_torch_constant_prop, fixed_batch_size, params_dict, dynamic_axes, input_names, module) 307 _C._jit_pass_onnx_lint(graph) 308 graph = _C._jit_pass_onnx(graph, operator_export_type) --> 309 _C._jit_pass_onnx_lint(graph) 310 _C._jit_pass_lint(graph) 311

RuntimeError: Unable to cast from non-held to held instance (T& to Holder<T>) (compile in debug mode for type information)

reelmath avatar Jul 20 '22 15:07 reelmath

I was able to reproduce your error with the model google/long-t5-tglobal-base, and it looks like the issue comes from the ONNX export. When exporting the model, the default dummy sequence length is 8, so during tracing num_globals > 0 is converted to False, which results in errors for inputs whose sequence length exceeds global_block_size*2. If you need a LongT5 model with transient-global attention, you can try increasing the sequence length used when exporting the model (to something higher than global_block_size), but that might result in problems for short sequence lengths (< global_block_size). Because this issue concerns the ONNX export, could you open an issue on transformers so we can move the discussion there?
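For reference, here is a rough, untested sketch of how the dummy sequence length used for tracing could be increased with the transformers ONNX export API. The generate_dummy_inputs override and the value 512 are only illustrative, not an officially supported export option, and this assumes generate_dummy_inputs accepts a seq_length argument in your transformers version:

import functools
from pathlib import Path

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.onnx import FeaturesManager, export

model_id = "google/long-t5-tglobal-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Resolve the ONNX config for the seq2seq-lm feature.
_, onnx_config_ctor = FeaturesManager.check_supported_model_or_raise(model, feature="seq2seq-lm")
onnx_config = onnx_config_ctor(model.config)

# Force the dummy inputs used for tracing to be longer than global_block_size
# instead of the default sequence length of 8 (512 is an arbitrary example).
onnx_config.generate_dummy_inputs = functools.partial(
    onnx_config.generate_dummy_inputs, seq_length=512
)

# Export with the patched dummy-input generation.
export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("longt5-seq2seq-lm.onnx"),
)

As mentioned above, a model exported this way may still misbehave for inputs shorter than global_block_size, since the num_globals > 0 branch is baked into the trace.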

echarlaix avatar Jul 21 '22 12:07 echarlaix