ONNX-converted seq2seq model cannot do batch decoding
System Info
OS: Linux
GPU: T4
python: 3.10
I converted a Hugging Face torch VisionEncoderDecoderModel to ONNX Runtime, expecting it to generate sentences faster. The converted model generates a single sample faster than the unconverted model, but it does not seem to handle batch decoding efficiently.
For example, with a batch size of 32 the converted model takes about 13 seconds, while the original model takes only about 4 seconds.
```python
# original model
import time

for i in range(1, 33):
    start = time.time()
    result = model.generate(pixel_values[:i].to('cuda'),
                            use_cache=True,
                            max_length=512)
    print(i, time.time() - start)
```
```
1 1.9602065086364746
2 2.2361714839935303
3 1.7172563076019287
4 2.9478509426116943
5 3.0988991260528564
6 2.3670575618743896
7 3.1197562217712402
8 2.704432964324951
9 2.323518753051758
10 2.2599034309387207
11 2.767979383468628
12 3.1968564987182617
13 3.7980499267578125
14 3.1969969272613525
15 3.4232337474823
16 3.5718371868133545
17 3.9069926738739014
18 3.355004072189331
19 3.3892338275909424
20 3.3768441677093506
21 3.7819128036499023
22 3.5842056274414062
23 3.311136245727539
24 3.345400810241699
25 3.6312100887298584
26 3.7876055240631104
27 3.367159843444824
28 3.330977201461792
29 3.503051996231079
30 3.994265079498291
31 3.340162754058838
32 3.370908737182617
```
```python
# converted model
for i in range(1, 33):
    start = time.time()
    result = ort_opt_model.generate(pixel_values[:i].to('cuda'),
                                    use_cache=True,
                                    max_length=512)
    print(i, time.time() - start)
```
```
1 0.70778489112854
2 0.9595208168029785
3 1.368959903717041
4 2.818580150604248
5 3.7165281772613525
6 5.1135454177856445
7 4.922013759613037
8 5.482929468154907
9 6.671945333480835
10 7.82289457321167
11 9.690032958984375
12 10.405744791030884
13 10.668415069580078
14 11.835261583328247
15 12.555593013763428
16 13.288154602050781
17 13.34635877609253
18 13.184569835662842
19 13.053982734680176
20 13.477515935897827
21 13.620061159133911
22 13.325799942016602
23 13.294811487197876
24 13.34472131729126
25 13.32916522026062
26 13.289781332015991
27 13.05787992477417
28 12.993014097213745
29 13.116510152816772
30 13.305957078933716
31 13.172485113143921
32 13.270084381103516
```
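For reference, a warmed-up and CUDA-synchronized variant of the timing loop above might look like this (a minimal sketch, assuming the same `model`, `ort_opt_model` and `pixel_values` as in the snippets; batch sizes are illustrative):

```python
# Sketch of a synchronized benchmark for both models.
import time
import torch

def timed_generate(m, batch):
    torch.cuda.synchronize()          # flush pending GPU work before timing
    start = time.time()
    m.generate(batch.to('cuda'), use_cache=True, max_length=512)
    torch.cuda.synchronize()          # wait until generation has fully finished
    return time.time() - start

for m in (model, ort_opt_model):
    timed_generate(m, pixel_values[:1])        # warm-up pass
    for bs in (1, 8, 32):
        print(type(m).__name__, bs, timed_generate(m, pixel_values[:bs]))
```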
The conversion code is here:
```python
from optimum.onnxruntime import ORTModelForVision2Seq
from transformers import PreTrainedTokenizerFast

model_checkpoint = "./checkpoint"
save_directory = "./onnx/"

# Load a model from transformers and export it to ONNX
ort_model = ORTModelForVision2Seq.from_pretrained(model_checkpoint,
                                                  export=True,
                                                  use_io_binding=True,
                                                  use_cache=True,
                                                  provider="CUDAExecutionProvider")
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_checkpoint)

# Save the ONNX model and tokenizer
ort_model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

# Optimization with onnxruntime.transformers
from onnxruntime.transformers import optimizer

optimized_model = optimizer.optimize_model('./onnx/decoder_model.onnx',
                                           model_type='gpt2', num_heads=6, hidden_size=384,
                                           use_gpu=True, opt_level=99)
# optimized_model.convert_float_to_float16()
optimized_model.save_model_to_file("./onnx_opt/decoder_model.onnx")

optimized_model = optimizer.optimize_model('./onnx/decoder_with_past_model.onnx',
                                           model_type='gpt2', num_heads=6, hidden_size=384,
                                           use_gpu=True, opt_level=99)
# optimized_model.convert_float_to_float16()
optimized_model.save_model_to_file("./onnx_opt/decoder_with_past_model.onnx")

optimized_model = optimizer.optimize_model('./onnx/encoder_model.onnx',
                                           use_gpu=True, opt_level=99)
# optimized_model.convert_float_to_float16()
optimized_model.save_model_to_file("./onnx_opt/encoder_model.onnx")
```
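The commented-out `convert_float_to_float16()` calls could also be enabled to produce fp16 variants for the T4. A minimal sketch for the decoder (`keep_io_types=True` and the `./onnx_opt_fp16/` output directory are assumptions, not part of the original setup):

```python
# Optional fp16 variant of the optimized decoder (sketch).
from onnxruntime.transformers import optimizer

optimized_model = optimizer.optimize_model('./onnx/decoder_model.onnx',
                                           model_type='gpt2', num_heads=6, hidden_size=384,
                                           use_gpu=True, opt_level=99)
# keep_io_types=True keeps float32 model inputs/outputs so the surrounding
# pipeline does not need to change (assumption: acceptable for this model).
optimized_model.convert_float_to_float16(keep_io_types=True)
optimized_model.save_model_to_file('./onnx_opt_fp16/decoder_model.onnx')
```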
And the model loading code is here:
```python
# original model
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained('./checkpoint',
                                                  torch_dtype='auto',
                                                  ).to('cuda')

# converted model
from optimum.onnxruntime import ORTModelForVision2Seq

ort_opt_model = ORTModelForVision2Seq.from_pretrained(
    model_id='./onnx',
    use_io_binding=False,
    encoder_session='./onnx_opt/encoder_model.onnx',
    decoder_session='./onnx_opt/decoder_model.onnx',
    decoder_with_past_session='./onnx_opt/decoder_with_past_model.onnx',
    generation_config='./onnx/generation_config.json',
    provider="CUDAExecutionProvider",
    use_cache=True,
    # use_cache_branch=True,
    # session_options=session_options
).to('cuda')
```
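To check whether the extra `onnxruntime.transformers` optimization pass is related to the batching problem, one could also load the plain export saved by `save_pretrained` above and compare its batched `generate()` time against the `./onnx_opt` variant (a minimal sketch):

```python
# Sketch: load the unoptimized ONNX export for comparison against the
# manually optimized one loaded above.
from optimum.onnxruntime import ORTModelForVision2Seq

ort_plain_model = ORTModelForVision2Seq.from_pretrained(
    './onnx',
    use_io_binding=False,
    use_cache=True,
    provider="CUDAExecutionProvider",
).to('cuda')
```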
Who can help?
@JingyaHuang @echarlaix @philschmid
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
- Convert the torch model to ONNX Runtime.
- Put the batch input into `ort_model.generate()`.
- It seems like it doesn't process the batch efficiently (see the sketch below).
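A self-contained sketch of these steps (the public `microsoft/trocr-small-printed` checkpoint and the random 384x384 pixel values are placeholders for the actual checkpoint and data used above):

```python
# Minimal reproduction sketch: export a VisionEncoderDecoder checkpoint to ONNX
# Runtime and time batched generation at a few batch sizes.
import time
import torch
from optimum.onnxruntime import ORTModelForVision2Seq

ort_model = ORTModelForVision2Seq.from_pretrained(
    "microsoft/trocr-small-printed",   # placeholder for the actual checkpoint
    export=True,
    use_cache=True,
    provider="CUDAExecutionProvider",
)

pixel_values = torch.randn(32, 3, 384, 384)   # dummy batch of 384x384 images

for bs in (1, 8, 32):
    start = time.time()
    ort_model.generate(pixel_values[:bs].to("cuda"), use_cache=True, max_length=64)
    print(bs, time.time() - start)    # time grows roughly linearly with batch size
```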
Expected behavior
Efficient batch decoding: batched inference with the converted model should be comparable to (or faster than) the original PyTorch model.
Thank you, is this issue about speed or about logits matching with PyTorch?
For speed, I'm quite sure IO Binding would help. By the way,
```python
# converted model
ort_opt_model = ORTModelForVision2Seq.from_pretrained('path/to/model',
                                                      use_io_binding=False,
                                                      provider="CUDAExecutionProvider")
```
is enough to load the model.
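For example, a minimal sketch with IO binding enabled (the path and batch size are placeholders, and `pixel_values` is assumed from the snippets above):

```python
# Sketch: load with IO binding on the CUDA provider and time a batched generate.
import time
from optimum.onnxruntime import ORTModelForVision2Seq

ort_opt_model = ORTModelForVision2Seq.from_pretrained(
    'path/to/model',
    use_io_binding=True,               # keep inputs/outputs on the GPU between steps
    provider="CUDAExecutionProvider",
)

start = time.time()
ort_opt_model.generate(pixel_values[:32].to("cuda"), use_cache=True, max_length=512)
print(time.time() - start)
```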
Have you tried without the KV cache to see if there's the same issue? And on a CPU device?
Yes, it is an issue about speed compared with PyTorch: the ORT model is slower than the PyTorch model. I have also tried without the KV cache. Regardless of whether the KV cache is used, and regardless of CPU or GPU, the ORT model doesn't process batch decoding efficiently.
Is there a way to decrease inference time using ONNX? I'm trying with a TrOCR model.