TurboTransformers
What is variable-length, and how does it compare with onnxruntime?
Could you explain the support for variable-length a little more? Does it mean the runtime can handle inputs with different sequence lengths in a single session, such as [batch, 8], [batch, 32], etc.? Or can it actually support different sequence lengths within one input, e.g., for an input with a batch size of 2 like the one below, it runs sequence length 128 for the first item and sequence length 3 for the second for better performance? [ [1,2,3,4, ...,128], [5,6,7], ]
Variable-length means Turbo can accept inputs with different shapes. You can feed it a stream like [1, 10], [1, 15], [2, 30], ...; no padding or truncation is required. In your second case, where the two sequences have lengths 3 and 128, a better approach is to split them into two independent inferences. A batch scheduler can be customized for your serving process on top of Turbo.
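For illustration, here is a minimal sketch of feeding differently shaped inputs to Turbo, assuming the `BertModel.from_torch` wrapper shown in the project's examples; the model, shapes, and names below are illustrative only, not a prescribed setup:

```python
import torch
import transformers
import turbo_transformers

# Assumption: wrap a HuggingFace BERT with Turbo, as in the project's examples.
torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()
turbo_model = turbo_transformers.BertModel.from_torch(torch_model)

# A stream of inputs with different shapes -- no padding or truncation needed.
for batch, seq_len in [(1, 10), (1, 15), (2, 30)]:
    input_ids = torch.randint(low=0, high=torch_model.config.vocab_size,
                              size=(batch, seq_len), dtype=torch.long)
    output = turbo_model(input_ids)
```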
Thanks! If so, onnxruntime also supports the variable-length behavior you describe here. You can add dynamic_axes in the torch.onnx.export call [https://github.com/Tencent/TurboTransformers/blob/f2d66bc12f0b904328372f472f6379aba50007cc/benchmark/benchmark_helper.py#L92]. The API doc is here: [https://pytorch.org/docs/stable/onnx.html#torch.onnx.export]
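For reference, a minimal sketch of exporting a BERT-like model with dynamic batch and sequence dimensions; the model choice, tensor names, and opset here are assumptions, not the exact benchmark setup:

```python
import torch
import transformers

model = transformers.BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Dummy input; the concrete size only serves as a tracing example.
input_ids = torch.ones(1, 16, dtype=torch.long)

# Mark batch and sequence dimensions as dynamic so the exported graph
# accepts any [batch, seq_len] at inference time.
torch.onnx.export(
    model,
    (input_ids,),
    "bert.onnx",
    input_names=["input_ids"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "last_hidden_state": {0: "batch", 1: "seq_len"},
        "pooler_output": {0: "batch"},
    },
    opset_version=11,
)
```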
Cool, onnxrt really did a very good job.
Thanks! Could you please update your table after your verification? I'm also curious why you use onnxruntime-mkldnn instead of the default MLAS backend. Do you see better performance with it?
I will update the performance numbers.
We built onnxrt v1.0.0 with the following command and benchmarked it with two different backends, mkldnn and cpu:
./build.sh --config=Release --update --build --build_wheel --use_mkldnn --use_mklml --parallel
After we noticed onnxrt had been updated, we built onnxrt v1.2.0 with the following command and benchmarked it with cpu as the backend:
./build.sh --config=Release --update --build --build_wheel --use_mklml --parallel
We presented the best results we observed.
In fact, I noticed onnxruntime achieved the best results on an AuthenticAMD CPU recently. However, on the Intel Xeon Gold 6133, a customized CPU widely deployed at Tencent, I did not observe the same result.
At Tencent, onnxrt has been used in many online serving scenarios. We would appreciate it if you could share some insights on using onnxrt more wisely.
I have compared onnxruntime performance with a dynamic axis vs. a fixed axis. When using 8 threads, the dynamic axis introduces significant performance degradation.
BTW, the performance figures illustrated in the README do not consider the impact of variable sequence length; they just average the time of running the same model 150 times.
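For context, a minimal sketch of that kind of measurement (8 intra-op threads, averaging 150 runs of a fixed-shape input); the model path and input name are placeholders, not the actual benchmark script:

```python
import time
import numpy as np
import onnxruntime as ort

# Assumption: a BERT-style ONNX model whose only required input is "input_ids".
opts = ort.SessionOptions()
opts.intra_op_num_threads = 8
sess = ort.InferenceSession("bert.onnx", opts)

input_ids = np.ones((1, 128), dtype=np.int64)
n_runs = 150

start = time.time()
for _ in range(n_runs):
    sess.run(None, {"input_ids": input_ids})
print("average latency: %.2f ms" % ((time.time() - start) / n_runs * 1000))
```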
@feifeibear, some models with dynamic inputs cannot be fused at runtime. Could you try this offline tool to optimize the model before running it and see whether the performance with dynamic_axes matches the fixed-axes case? https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert#model-optimization
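In recent onnxruntime releases a similar offline optimizer also ships as the onnxruntime.transformers.optimizer module; a hedged sketch of offline fusion, assuming a bert-base model exported as above (file names, head count, and hidden size are assumptions):

```python
from onnxruntime.transformers import optimizer

# Assumption: bert.onnx was exported with dynamic_axes;
# num_heads/hidden_size correspond to bert-base.
optimized = optimizer.optimize_model(
    "bert.onnx", model_type="bert", num_heads=12, hidden_size=768)
optimized.save_model_to_file("bert_opt.onnx")
```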
After onnxruntime was upgraded to v1.4.0, Turbo started to use onnxruntime as the default CPU backend; it has fully met our needs.
That's great! We will keep improving the performance. We also support quantization for transformer-based models on CPU now.
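For example, a minimal sketch of onnxruntime dynamic quantization for a transformer model on CPU; the file names are placeholders, and the onnxruntime quantization docs should be consulted for the recommended settings:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Assumption: quantize the weights of the optimized FP32 model to INT8.
quantize_dynamic("bert_opt.onnx", "bert_opt_int8.onnx",
                 weight_type=QuantType.QInt8)
```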
Thanks yufeng. Could you please give me some references on quantization?