Running inference pipeline with Starcoderbase model with ONNX Optimization crashes
System Info
Optimum Version: 1.13.2
Platform: Ubuntu 22.04
Python Version: 3.10.2
Transformers Version: 4.34
Who can help?
@JingyaHuang @fxmarty @michaelbenayoun
Running the inference pipeline with an ONNX-optimized StarCoder model, or any model with multi-query attention, crashes. I have written very detailed comments in PR https://github.com/huggingface/optimum/pull/1042, which was meant to add support for this kind of model, but I think the code is not safe/robust enough. I highlighted all of the potentially problematic cases and bugs there; please have a look at the comments.
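For context, a toy sketch of why layout-agnostic cache code can break on these models (plain Python stand-ins, not Optimum's actual code; `FakeTensor` and `cached_len` are hypothetical names, and the shapes are illustrative assumptions):

```python
class FakeTensor:
    """Minimal stand-in for torch.Tensor, exposing only .size()."""
    def __init__(self, shape):
        self.shape = shape

    def size(self, dim=None):
        return self.shape if dim is None else self.shape[dim]

# GPT BigCode-style multi-query attention: one fused KV tensor per layer,
# shaped roughly (batch, seq_len, 2 * head_dim)
mqa_past = [FakeTensor((1, 10, 256))]

# Classic GPT-2-style cache: a (key, value) tuple per layer,
# each tensor shaped (batch, num_heads, seq_len, head_dim)
mha_past = [(FakeTensor((1, 12, 10, 64)), FakeTensor((1, 12, 10, 64)))]

def cached_len(past):
    # Code written for the fused MQA layout treats each entry as a tensor...
    return past[0].size(-2)

print(cached_len(mqa_past))  # -> 10
try:
    cached_len(mha_past)     # ...but a (key, value) tuple has no .size()
except AttributeError as err:
    print(err)               # -> 'tuple' object has no attribute 'size'
```

Generic generation code has to branch on which of these two layouts it is handling; assuming one layout for both is the kind of unsafe case highlighted in the PR comments.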
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
```python
import torch
import transformers
from optimum.onnxruntime import ORTModelForCausalLM

my_checkpoint = "path_to_checkpoint"
device = torch.device("cuda:0")
model = ORTModelForCausalLM.from_pretrained(
    my_checkpoint,
    provider="CUDAExecutionProvider",
    export=True,
    trust_remote_code=True,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(my_checkpoint)
pipeline = transformers.TextGenerationPipeline(tokenizer=tokenizer, model=model, device=device)
prompt = "some_input_text_to_the_model"
pipeline(prompt, num_workers=0, batch_size=1, num_return_sequences=5, num_beams=5)
```
Expected behavior
Successful generation of predictions after the call to the pipeline. Instead I get the error

`'tuple' object has no attribute 'size'`

I created a GitHub issue on the transformers side with a fuller stack trace; there is some speculation there about the root cause.
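As a sanity check on the error message itself (a minimal standalone snippet, unrelated to Optimum internals), this is exactly what Python raises when `.size()` is called on a tuple where a `torch.Tensor` was expected:

```python
past_entry = ("key", "value")  # a plain tuple where a tensor was expected
try:
    past_entry.size()
except AttributeError as err:
    print(err)  # -> 'tuple' object has no attribute 'size'
```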
@BBerabi @lidingsnyk Thank you for the details, and apologies for the late reply. It should be fixed by https://github.com/huggingface/optimum/pull/1722
```python
import transformers
from optimum.onnxruntime import ORTModelForCausalLM

my_checkpoint = "hf-internal-testing/tiny-random-GPTBigCodeModel"
model = ORTModelForCausalLM.from_pretrained(
    my_checkpoint,
    export=True,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(my_checkpoint)
pipeline = transformers.TextGenerationPipeline(tokenizer=tokenizer, model=model)
prompt = "some_input"
pipeline(prompt, num_workers=0, batch_size=1, num_return_sequences=5, num_beams=5, max_new_tokens=5)
```
This now works as expected (on the dummy model) with the above fix.
@fxmarty Thanks! Would love to verify this once it's merged. (I see it's possible to install from source.)
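For reference, installing Optimum from the main branch to pick up an unreleased fix can be done with pip's direct-from-git syntax (the branch and the `onnxruntime` extra shown here are assumptions, not something stated in this thread):

```shell
# Install Optimum (with the onnxruntime extra) straight from the main branch
pip install --upgrade "optimum[onnxruntime] @ git+https://github.com/huggingface/optimum.git"
```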
@fxmarty Thanks a lot for the fix! We are looking forward to it! :tada: