
ONNX files for T5 model with text2text-generation-with-past task do not work

Open jbochi opened this issue 2 years ago • 5 comments

System Info

Reproduced on Mac, Python 3.11 and Google Colab / Python 3.10

optimum==1.14.0

Who can help?

@michaelbenayoun

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Full colab here

When I export the model with past key/values, generation fails:

!optimum-cli export onnx \
  --model jbochi/madlad400-3b-mt \
  --task text2text-generation-with-past \
  --optimize O3 \
  onnx/

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import T5Tokenizer

model = ORTModelForSeq2SeqLM.from_pretrained('./onnx', device="auto")
tokenizer = T5Tokenizer.from_pretrained('jbochi/madlad400-3b-mt')

text = "<2pt> I love pizza!"
inputs = tokenizer(text, return_tensors="pt", device=model.device)
outputs = model.generate(**inputs)
tokenizer.decode(outputs[0], skip_special_tokens=True)

It raises the following error:

---------------------------------------------------------------------------
InvalidArgument                           Traceback (most recent call last)
<ipython-input-14-be8c5bdea41e> in <cell line: 3>()
      1 text = "<2pt> I love pizza!"
      2 inputs = tokenizer(text, return_tensors="pt", device=model.device)
----> 3 outputs = model.generate(**inputs)
      4 tokenizer.decode(outputs[0], skip_special_tokens=True)

7 frames
/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in run(self, output_names, input_feed, run_options)
    218             output_names = [output.name for output in self._outputs_meta]
    219         try:
--> 220             return self._sess.run(output_names, input_feed, run_options)
    221         except C.EPFail as err:
    222             if self._enable_fallback:

InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: past_key_values.9.encoder.key for the following indices
 index: 3 Got: 64 Expected: 128
 Please fix either the inputs or the model.
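For reference, each encoder past key/value tensor for T5 has shape (batch, num_heads, encoder_seq_len, d_kv), so index 3 is the per-head dimension rather than the sequence length. A quick way to check what the checkpoint itself declares (attribute names are from the Transformers T5Config):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("jbochi/madlad400-3b-mt")
# index 3 of each past KV input should match d_kv
print(cfg.num_heads, cfg.d_kv)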

Expected behavior

If I export this same T5 model without past key/values (no cache), it works:

!optimum-cli export onnx \
  --model jbochi/madlad400-3b-mt \
  --task text2text-generation \
  --optimize O3 \
  onnx-no-past/

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import T5Tokenizer

model = ORTModelForSeq2SeqLM.from_pretrained('./onnx-no-past', use_cache=False)
tokenizer = T5Tokenizer.from_pretrained('jbochi/madlad400-3b-mt')
text = "<2pt> I love pizza!"
inputs = tokenizer(text, return_tensors="pt", device=model.device)
outputs = model.generate(**inputs)
tokenizer.decode(outputs[0], skip_special_tokens=True)
# Eu amo pizza!

Thank you!

jbochi avatar Nov 09 '23 01:11 jbochi

I hit the same issue; any progress?

trajepl avatar Nov 17 '23 03:11 trajepl

It seems to be related to this PR: https://github.com/huggingface/optimum/pull/1257

It worked when I removed the merged decoder.
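For others hitting this, a minimal sketch of the workaround, assuming that optimum falls back to the separate decoder graphs when no merged decoder file is present (behavior and file names inferred from the export output; verify against your optimum version):

import glob
import os

from optimum.onnxruntime import ORTModelForSeq2SeqLM

# drop the merged decoder (and its external-data files) so that the
# separate decoder / decoder_with_past graphs are loaded instead
for f in glob.glob("onnx/decoder_model_merged.onnx*"):
    os.remove(f)

model = ORTModelForSeq2SeqLM.from_pretrained("./onnx")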

trajepl avatar Nov 17 '23 04:11 trajepl

Hi, I'm facing the same issue. It doesn't work for me even when I remove the merged decoder. I used the same command as @jbochi:

optimum-cli export onnx \
  --model madlad400-3b-mt \
  --task text2text-generation-with-past \
  --optimize O3 \
  onnx/

My onnx directory contains these decoder files:

decoder_model.onnx
decoder_model.onnx.data
decoder_model.onnx_data
decoder_with_past_model.onnx
decoder_with_past_model.onnx.data 
decoder_with_past_model.onnx_data
decoder_model_merged.onnx
decoder_model_merged.onnx_data

I checked the past_key_values inputs of the original model and of the ONNX model (from the model graph) and confirmed that both expect size 128 at index 3 (the per-head dimension, d_kv) of the attention KV tensors.
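For anyone who wants to repeat that check, here is a small sketch using the onnx package to print the declared shapes of the past_key_values inputs (the file name is taken from the listing above):

import onnx

# load only the graph definition; skip the multi-GB external weights
model_proto = onnx.load("onnx/decoder_model_merged.onnx", load_external_data=False)
for inp in model_proto.graph.input:
    if "past_key_values" in inp.name:
        dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
        print(inp.name, dims)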

I'm running optimum[onnxruntime-gpu]==1.16.2. The model is the same madlad T5 exported by @jbochi above. As with @jbochi, it works if I load the model with:

model = ORTModelForSeq2SeqLM.from_pretrained(
    model_name,
    provider="CUDAExecutionProvider",
    task="text2text-generation",
    use_cache=False,
    use_io_binding=False,
)

But I want to load the version with the KV cache to get some speedup.
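In case it helps, the loader also accepts explicit file names, so it should be possible to load the cached decoders while bypassing the merged graph. The decoder_file_name / decoder_with_past_file_name kwargs below are my reading of the ORTModelForSeq2SeqLM API and should be treated as an assumption:

from optimum.onnxruntime import ORTModelForSeq2SeqLM

# assumption: pointing at the non-merged decoder files keeps the KV cache
# while avoiding decoder_model_merged.onnx
model = ORTModelForSeq2SeqLM.from_pretrained(
    "./onnx",
    provider="CUDAExecutionProvider",
    use_cache=True,
    decoder_file_name="decoder_model.onnx",
    decoder_with_past_file_name="decoder_with_past_model.onnx",
)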

zm-twitter avatar Mar 25 '24 23:03 zm-twitter