Unexpected GPU runtime behavior with ORTModelForSeq2SeqLM
System Info
```
OS: Ubuntu 20.04.4 LTS
GPU: RTX 3080
Libs:
python 3.10.4
onnx==1.12.0
onnxruntime-gpu==1.12.1
torch==1.12.1
transformers==4.21.2
```
Who can help?
@lewtun @michaelbenayoun @JingyaHuang @echarlaix
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Steps to reproduce the behavior:

1. Convert the public translation model vinai/vinai-translate-en2vi to ONNX:
```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM

save_directory = "models/en2vi_onnx"

# Load a model from transformers and export it to the ONNX format
model = ORTModelForSeq2SeqLM.from_pretrained("vinai/vinai-translate-en2vi", from_transformers=True)

# Save the ONNX model and tokenizer
model.save_pretrained(save_directory)
```
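Before comparing anything, it is worth confirming that onnxruntime-gpu can actually see the GPU; this check uses only the standard onnxruntime API:

```python
import onnxruntime as ort

# "CUDAExecutionProvider" must appear in this list; if it is missing,
# every session silently falls back to CPU.
print(ort.get_available_providers())
```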
2. Load the model with a script adapted from the original model author's example:
```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import torch
import time

device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer_en2vi = AutoTokenizer.from_pretrained("vinai/vinai-translate-en2vi", src_lang="en_XX")
model_en2vi = ORTModelForSeq2SeqLM.from_pretrained("models/en2vi_onnx")
model_en2vi.to(device)

# onnx_en2vi = pipeline("translation_en_to_vi", model=model_en2vi, tokenizer=tokenizer_en2vi, device=0)
# en_text = '''It's very cold to go out.'''
# start = time.time()
# outpt = onnx_en2vi(en_text)
# end = time.time()
# print(outpt)
# print("time: ", end - start)

def translate_en2vi(en_text: str) -> str:
    start = time.time()
    input_ids = tokenizer_en2vi(en_text, return_tensors="pt").input_ids.to(device)
    end = time.time()
    print("Tokenize time: {:.2f}s".format(end - start))
    # print(input_ids.shape)
    # print(input_ids)
    start = time.time()
    output_ids = model_en2vi.generate(
        input_ids,
        do_sample=True,
        top_k=100,
        top_p=0.8,
        decoder_start_token_id=tokenizer_en2vi.lang_code_to_id["vi_VN"],
        num_return_sequences=1,
    )
    end = time.time()
    print("Generate time: {:.2f}s".format(end - start))
    vi_text = tokenizer_en2vi.batch_decode(output_ids, skip_special_tokens=True)
    vi_text = " ".join(vi_text)
    return vi_text

en_text = '''It's very cold to go out.'''  # long paragraph
start = time.time()
result = translate_en2vi(en_text)
print(result)
end = time.time()
print('{:.2f} seconds'.format((end - start)))
```
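One caveat when timing this way: the very first GPU call pays one-time CUDA initialization and memory-arena allocation costs. A minimal sketch, reusing `translate_en2vi` from the script above, that discards a warm-up call before measuring:

```python
# Discard the first call: it includes one-time CUDA/session setup
# and is not representative of steady-state latency.
translate_en2vi("warm-up sentence")

# Measure a few steady-state runs of the same input.
timings = []
for _ in range(5):
    start = time.time()
    translate_en2vi(en_text)
    timings.append(time.time() - start)
print("steady-state avg: {:.2f}s, best: {:.2f}s".format(sum(timings) / len(timings), min(timings)))
```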
To run on GPU instead of hitting an error, I changed line 167 in `optimum/onnxruntime/utils.py` to return `"CUDAExecutionProvider"` (a non-patching alternative is sketched below).

3. Run the original author's example on GPU and compare the runtimes.
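Rather than editing `optimum/onnxruntime/utils.py` in place, newer optimum releases accept an execution provider at load time; a minimal sketch, assuming the `provider` argument is available in your optimum version:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Request the CUDA provider explicitly at load time instead of
# patching the library (requires a recent optimum release).
model_en2vi = ORTModelForSeq2SeqLM.from_pretrained(
    "models/en2vi_onnx",
    provider="CUDAExecutionProvider",
)
```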
Expected behavior
The ONNX model was expected to run faster, but the result is the opposite:
- The original PyTorch model on GPU runs in 3-5 s, using about 3.5 GB of GPU memory.
- The converted ONNX model on GPU takes 70-80 s, using about 7.7 GB of GPU memory.
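For completeness, a way to double-check which provider the exported sessions actually selected. The file name below is an assumption based on the default optimum seq2seq export layout; adjust it to whatever `save_pretrained` wrote:

```python
import onnxruntime as ort

# Load one of the exported graphs directly and ask the session which
# providers it actually selected; CUDA should be listed first.
sess = ort.InferenceSession(
    "models/en2vi_onnx/encoder_model.onnx",  # assumed default file name
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())
```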