Unexpected GPU runtime behavior with ORTModelForSeq2SeqLM
System Info
```
OS: Ubuntu 20.04.4 LTS
GPU: RTX 3080
Libs:
python 3.10.4
onnx==1.12.0
onnxruntime-gpu==1.12.1
torch==1.12.1
transformers==4.21.2
```
Who can help?
@lewtun @michaelbenayoun @JingyaHuang @echarlaix
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Steps to reproduce the behavior:

1. Convert the public translation model vinai/vinai-translate-en2vi to ONNX:
```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM

save_directory = "models/en2vi_onnx"

# Load a model from transformers and export it to the ONNX format
model = ORTModelForSeq2SeqLM.from_pretrained("vinai/vinai-translate-en2vi", from_transformers=True)

# Save the ONNX model and tokenizer
model.save_pretrained(save_directory)
```
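Before comparing anything, it is worth confirming that onnxruntime-gpu can actually see the GPU; this check uses only the standard onnxruntime API:

```python
import onnxruntime as ort

# "CUDAExecutionProvider" must appear in this list; if it is missing,
# every session silently falls back to CPU.
print(ort.get_available_providers())
```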
2. Load the model with a script adapted from the original model author's example:
```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import torch
import time

device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer_en2vi = AutoTokenizer.from_pretrained("vinai/vinai-translate-en2vi", src_lang="en_XX")
model_en2vi = ORTModelForSeq2SeqLM.from_pretrained("models/en2vi_onnx")
model_en2vi.to(device)

# onnx_en2vi = pipeline("translation_en_to_vi", model=model_en2vi, tokenizer=tokenizer_en2vi, device=0)
# en_text = '''It's very cold to go out.'''
# start = time.time()
# outpt = onnx_en2vi(en_text)
# end = time.time()
# print(outpt)
# print("time: ", end - start)

def translate_en2vi(en_text: str) -> str:
    start = time.time()
    input_ids = tokenizer_en2vi(en_text, return_tensors="pt").input_ids.to(device)
    end = time.time()
    print("Tokenize time: {:.2f}s".format(end - start))
    # print(input_ids.shape)
    # print(input_ids)
    start = time.time()
    output_ids = model_en2vi.generate(
        input_ids,
        do_sample=True,
        top_k=100,
        top_p=0.8,
        decoder_start_token_id=tokenizer_en2vi.lang_code_to_id["vi_VN"],
        num_return_sequences=1,
    )
    end = time.time()
    print("Generate time: {:.2f}s".format(end - start))
    vi_text = tokenizer_en2vi.batch_decode(output_ids, skip_special_tokens=True)
    vi_text = " ".join(vi_text)
    return vi_text

en_text = '''It's very cold to go out.'''  # long paragraph
start = time.time()
result = translate_en2vi(en_text)
print(result)
end = time.time()
print('{:.2f} seconds'.format((end - start)))
```
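One caveat when timing this way: the very first GPU call pays one-time CUDA initialization and memory-arena allocation costs. A minimal sketch, reusing `translate_en2vi` from the script above, that discards a warm-up call before measuring:

```python
# Discard the first call: it includes one-time CUDA/session setup
# and is not representative of steady-state latency.
translate_en2vi("warm-up sentence")

# Measure a few steady-state runs of the same input.
timings = []
for _ in range(5):
    start = time.time()
    translate_en2vi(en_text)
    timings.append(time.time() - start)
print("steady-state avg: {:.2f}s, best: {:.2f}s".format(sum(timings) / len(timings), min(timings)))
```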
To run on GPU instead of hitting an error, I changed line 167 in `optimum/onnxruntime/utils.py` to return `"CUDAExecutionProvider"` (a non-patching alternative is sketched below).

3. Run the original author's example on GPU and compare the runtimes.
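Rather than editing `optimum/onnxruntime/utils.py` in place, newer optimum releases accept an execution provider at load time; a minimal sketch, assuming the `provider` argument is available in your optimum version:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Request the CUDA provider explicitly at load time instead of
# patching the library (requires a recent optimum release).
model_en2vi = ORTModelForSeq2SeqLM.from_pretrained(
    "models/en2vi_onnx",
    provider="CUDAExecutionProvider",
)
```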
Expected behavior
The ONNX model was expected to run faster, but the result is the opposite:
- The original PyTorch model on GPU runs in 3-5 s, using about 3.5 GB of GPU memory.
- The converted ONNX model on GPU takes 70-80 s, using about 7.7 GB of GPU memory.
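For completeness, a way to double-check which provider the exported sessions actually selected. The file name below is an assumption based on the default optimum seq2seq export layout; adjust it to whatever `save_pretrained` wrote:

```python
import onnxruntime as ort

# Load one of the exported graphs directly and ask the session which
# providers it actually selected; CUDA should be listed first.
sess = ort.InferenceSession(
    "models/en2vi_onnx/encoder_model.onnx",  # assumed default file name
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())
```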