
Use OnnxRuntime IO Binding to improve GPU inference performance

Open tianleiwu opened this issue 4 years ago • 3 comments

In the current benchmark results, ONNX is slower than PyTorch above 500 words. I think the cause is the OnnxRuntime API used for inference: https://github.com/abelriboulot/onnxt5/blob/284474952bcb10521a0b0132c677f61981ab2a1c/onnxt5/models.py#L121

For GPU inference, that API needs extra memory copies (from CPU to GPU for input tensors, and from GPU to CPU for output tensors). When the sequence length is large, this I/O latency can be significant.

I suggest trying OnnxRuntime IO Binding to avoid the extra memory copies.
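For reference, here is a minimal sketch of what IO Binding can look like with the onnxruntime Python API. The model path, tensor names, and shapes below are hypothetical placeholders, not onnxt5's actual ones:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical encoder model and tensor names, for illustration only.
session = ort.InferenceSession("t5_encoder.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

input_ids = np.ones((1, 512), dtype=np.int64)

# Copy the input to the GPU once and bind it, instead of letting
# session.run() copy it again on every call.
input_ids_gpu = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)
binding.bind_ortvalue_input("input_ids", input_ids_gpu)

# Let ORT allocate the output on the GPU so it is not copied back
# to the host unless we explicitly ask for it.
binding.bind_output("hidden_states", device_type="cuda", device_id=0)

session.run_with_iobinding(binding)

# Only copy to CPU when the result is actually needed on the host.
hidden_states = binding.copy_outputs_to_cpu()[0]
```

The key difference from plain `session.run()` is that inputs are copied to the device once and outputs stay on the device until explicitly copied back.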

tianleiwu avatar Dec 08 '20 07:12 tianleiwu

The link at the bottom of the issue is dead; I think the appropriate link is now ONNX Runtime IOBinding.

@tianleiwu did you ever successfully do this?

sam-writer avatar Aug 27 '21 16:08 sam-writer

Hello! Is T5 inference on GPU currently about the same speed with onnxruntime and with PyTorch? During decoding, the past values produced at every decode step are very large; how can I reduce the I/O when running inference with onnxruntime? @tianleiwu

shiqingzhangCSU avatar Feb 23 '23 11:02 shiqingzhangCSU

@shiqingzhangCSU, to reduce that I/O, it requires specially designed CUDA kernels (integrated with the BeamSearch operator) to handle the past state. In ONNX Runtime, @wangyems is working on optimizations for T5; they are very close to being finished.

You can try out the current optimizations (although work on them is still ongoing): https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py
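For example, an invocation along these lines should export a T5 model with the fused generation (beam search) operator. The flag names are from my reading of the script and may change; check the script's `--help` for the current options:

```
python -m onnxruntime.transformers.convert_generation \
    -m t5-small \
    --model_type t5 \
    --output t5_small_beam_search.onnx \
    --num_beams 4
```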

tianleiwu avatar Feb 24 '23 18:02 tianleiwu