Running inference pipeline with Starcoderbase model with ONNX Optimization crashes
System Info
Optimum Version: 1.13.2
Platform: Ubuntu 22.04
Python Version: 3.10.2
Transformers Version: 4.34
Who can help?
@JingyaHuang @fxmarty @michaelbenayoun
Running the inference pipeline with an ONNX-optimized StarCoder model, or any model with multi-query attention, crashes. I have written very detailed comments in PR https://github.com/huggingface/optimum/pull/1042, which was meant to add support for this kind of model, but I think the code is not safe/robust enough. I highlighted all of the potentially problematic cases and bugs there; please have a look at the comments.
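For context, a toy sketch of why layout-agnostic cache code can break on these models (plain Python stand-ins, not Optimum's actual code; `FakeTensor` and `cached_len` are hypothetical names, and the shapes are illustrative assumptions):

```python
class FakeTensor:
    """Minimal stand-in for torch.Tensor, exposing only .size()."""
    def __init__(self, shape):
        self.shape = shape

    def size(self, dim=None):
        return self.shape if dim is None else self.shape[dim]

# GPT BigCode-style multi-query attention: one fused KV tensor per layer,
# shaped roughly (batch, seq_len, 2 * head_dim)
mqa_past = [FakeTensor((1, 10, 256))]

# Classic GPT-2-style cache: a (key, value) tuple per layer,
# each tensor shaped (batch, num_heads, seq_len, head_dim)
mha_past = [(FakeTensor((1, 12, 10, 64)), FakeTensor((1, 12, 10, 64)))]

def cached_len(past):
    # Code written for the fused MQA layout treats each entry as a tensor...
    return past[0].size(-2)

print(cached_len(mqa_past))  # -> 10
try:
    cached_len(mha_past)     # ...but a (key, value) tuple has no .size()
except AttributeError as err:
    print(err)               # -> 'tuple' object has no attribute 'size'
```

Generic generation code has to branch on which of these two layouts it is handling; assuming one layout for both is the kind of unsafe case highlighted in the PR comments.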
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
```python
import torch
import transformers
from optimum.onnxruntime import ORTModelForCausalLM

my_checkpoint = "path_to_checkpoint"
device = torch.device("cuda:0")
model = ORTModelForCausalLM.from_pretrained(
    my_checkpoint,
    provider="CUDAExecutionProvider",
    export=True,
    trust_remote_code=True,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(my_checkpoint)
pipeline = transformers.TextGenerationPipeline(tokenizer=tokenizer, model=model, device=device)
prompt = "some_input_text_to_the_model"
pipeline(prompt, num_workers=0, batch_size=1, num_return_sequences=5, num_beams=5)
```
Expected behavior
Successful generation of predictions after the call to the pipeline. Instead I get the error

`'tuple' object has no attribute 'size'`

I created a GitHub issue on the transformers side with a fuller stack trace; there is some speculation there about the root cause.
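As a sanity check on the error message itself (a minimal standalone snippet, unrelated to Optimum internals), this is exactly what Python raises when `.size()` is called on a tuple where a `torch.Tensor` was expected:

```python
past_entry = ("key", "value")  # a plain tuple where a tensor was expected
try:
    past_entry.size()
except AttributeError as err:
    print(err)  # -> 'tuple' object has no attribute 'size'
```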
@BBerabi @lidingsnyk Thank you for the details, and apologies for the late reply. It should be fixed by https://github.com/huggingface/optimum/pull/1722
```python
import transformers
from optimum.onnxruntime import ORTModelForCausalLM

my_checkpoint = "hf-internal-testing/tiny-random-GPTBigCodeModel"
model = ORTModelForCausalLM.from_pretrained(
    my_checkpoint,
    export=True,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(my_checkpoint)
pipeline = transformers.TextGenerationPipeline(tokenizer=tokenizer, model=model)
prompt = "some_input"
pipeline(prompt, num_workers=0, batch_size=1, num_return_sequences=5, num_beams=5, max_new_tokens=5)
```
This now works as expected (on the dummy model) with the above fix.
@fxmarty Thanks! Would love to verify this once it's merged. (I see it's possible to install from source.)
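For reference, installing Optimum from the main branch to pick up an unreleased fix can be done with pip's direct-from-git syntax (the branch and the `onnxruntime` extra shown here are assumptions, not something stated in this thread):

```shell
# Install Optimum (with the onnxruntime extra) straight from the main branch
pip install --upgrade "optimum[onnxruntime] @ git+https://github.com/huggingface/optimum.git"
```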
@fxmarty Thanks a lot for the fix! We are looking forward to it! :tada: