
What does variable-length support mean, and how does it compare with onnxruntime?

yufenglee opened this issue 4 years ago • 10 comments

Could you explain a little bit more about the support for variable-length? Does it mean the runtime can accept inputs with different sequence lengths within a single session, like [batch, 8], [batch, 32], etc.? Or can it actually handle different sequence lengths within one input? E.g., for an input with batch size 2 like the one below, it would run the first sequence at length 128 and the second at length 3 for better performance? [ [1,2,3,4, ...,128], [5,6,7], ]

yufenglee avatar May 03 '20 00:05 yufenglee

Variable-length means turbo can accept inputs with different shapes. You can feed it a stream like [1, 10], [1, 15], [2, 30], ...; no padding or truncation is required. In your second case, where you batch two sequences of lengths 3 and 128, a better approach is to split them into two independent inferences. A batch scheduler can be customized for your serving process on top of turbo.
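A minimal sketch of the usage described above, assuming the `turbo_transformers.BertModel.from_torch` conversion shown in the project README (the exact API may differ between versions):

```python
import torch
import transformers
import turbo_transformers

# Load a PyTorch BERT and convert it to a Turbo model.
torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()
tt_model = turbo_transformers.BertModel.from_torch(torch_model)

# Feed a stream of inputs with different shapes; no padding or truncation needed.
for shape in [(1, 10), (1, 15), (2, 30)]:
    input_ids = torch.randint(
        low=0, high=torch_model.config.vocab_size, size=shape, dtype=torch.long
    )
    output = tt_model(input_ids)
```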

feifeibear avatar May 03 '20 02:05 feifeibear

Thanks! If so, onnxruntime also supports the variable-length behavior you describe here. You can add dynamic_axes in the torch.onnx.export call [https://github.com/Tencent/TurboTransformers/blob/f2d66bc12f0b904328372f472f6379aba50007cc/benchmark/benchmark_helper.py#L92]. The API doc is here: [https://pytorch.org/docs/stable/onnx.html#torch.onnx.export]
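For reference, a minimal sketch of an export with `dynamic_axes` so both the batch and sequence dimensions stay symbolic (the axis names below are illustrative):

```python
import torch
import transformers

model = transformers.BertModel.from_pretrained("bert-base-uncased")
model.eval()
dummy_input = torch.randint(0, model.config.vocab_size, (1, 8), dtype=torch.long)

# Mark batch (dim 0) and sequence (dim 1) as dynamic so the exported graph
# accepts inputs such as [1, 10], [2, 30], ... without re-exporting.
torch.onnx.export(
    model,
    (dummy_input,),
    "bert.onnx",
    input_names=["input_ids"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "last_hidden_state": {0: "batch", 1: "seq_len"},
    },
)
```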

yufenglee avatar May 03 '20 04:05 yufenglee

Cool, onnxrt really did a very good job.

feifeibear avatar May 03 '20 06:05 feifeibear

Thanks! Could you please update your table after you verify this? I'm also curious why you use onnxruntime-mkldnn over the default with mlas. Do you see better performance with it?

yufenglee avatar May 03 '20 14:05 yufenglee

I will update the performance numbers. We built onnxrt v1.0.0 with the following command and benchmarked it with two different backends, mkldnn and cpu:

./build.sh --config=Release --update --build --build_wheel --use_mkldnn --use_mklml --parallel

After we noticed onnxrt had been updated, we built onnxrt v1.2.0 with the following command and benchmarked it with cpu as the backend:

./build.sh --config=Release --update --build --build_wheel --use_mklml --parallel

We presented the best results we observed. In fact, I noticed onnxruntime recently achieved the best results on an AuthenticAMD CPU. However, on the Intel 6133, a customized CPU widely deployed in Tencent, I did not observe the same result.

At Tencent, onnxrt has been used in many online serving scenarios. We would appreciate any insights on how to use onnxrt more effectively.

feifeibear avatar May 04 '20 03:05 feifeibear

I have compared onnxruntime's performance with dynamic axes vs. fixed axes. When using 8 threads, the dynamic axes introduce significant performance degradation.

(screenshot: dynamic-axes vs. fixed-axes latency comparison with 8 threads)

BTW, the performance figures shown in the README do not consider the impact of variable sequence lengths; they simply average the time of running the same model 150 times.
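For context, a rough sketch of that measurement style (fixed input, average over 150 runs; the model handle here is a placeholder):

```python
import time
import torch

def average_latency(model, input_ids, n_runs=150):
    """Run the same input n_runs times and return the mean latency in seconds."""
    with torch.no_grad():
        model(input_ids)  # warm-up run, excluded from timing
        start = time.time()
        for _ in range(n_runs):
            model(input_ids)
    return (time.time() - start) / n_runs
```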

feifeibear avatar May 07 '20 09:05 feifeibear

@feifeibear, some models with dynamic inputs cannot be fused at runtime. Could you try this offline tool to optimize the model before running it and see whether the dynamic-axes and fixed-axes performance is the same? https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert#model-optimization
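A minimal sketch of that offline optimization step, assuming the optimizer entry point that later shipped as `onnxruntime.transformers.optimizer` (the script under the linked tools/bert path provides the same fusion passes, so the exact invocation may differ):

```python
from onnxruntime.transformers import optimizer

# Fuse attention/LayerNorm/GELU subgraphs offline so that a model exported
# with dynamic axes does not miss the fusions normally applied at load time.
optimized = optimizer.optimize_model(
    "bert.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)
optimized.save_model_to_file("bert_opt.onnx")
```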

yufenglee avatar May 11 '20 15:05 yufenglee

After onnxruntime was upgraded to v1.4.0, Turbo started to use onnxruntime as the default backend for CPU, which has fully met our needs.

feifeibear avatar Jul 23 '20 08:07 feifeibear

That's great! We will keep improving the performance. We also support quantization for transformer-based models on CPU now.
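As an illustration of the CPU quantization support mentioned above, a minimal sketch of dynamic (weight-only) INT8 quantization with `onnxruntime.quantization`; file names are placeholders:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the weights of a transformer ONNX model to INT8 for CPU inference.
quantize_dynamic(
    model_input="bert_opt.onnx",
    model_output="bert_opt_int8.onnx",
    weight_type=QuantType.QInt8,
)
```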

yufenglee avatar Jul 23 '20 22:07 yufenglee

Thanks Yufeng! Could you please give me some references on quantization?

feifeibear avatar Jul 24 '20 02:07 feifeibear