TurboTransformers
What is variable-length, and how does it compare with onnxruntime?
Could you explain the support for variable-length a little more? Does it mean the runtime can handle inputs with different sequence lengths in a single session, such as [batch, 8], [batch, 32], etc.? Or can it actually support different sequence lengths within one input, e.g., for an input with a batch size of 2 like the one below, it runs sequence length 128 for the first item and sequence length 3 for the second for better performance? [ [1,2,3,4, ...,128], [5,6,7], ]
Variable-length means Turbo can accept inputs with different shapes. You can feed it a stream like [1, 10], [1, 15], [2, 30], ...; no padding or truncation is required. In your second case, where the two sequences have lengths 3 and 128, a better approach is to split them into two independent inferences. A batch scheduler can be customized for your serving process on top of Turbo.
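For illustration, here is a minimal sketch of feeding differently shaped inputs to Turbo, assuming the `BertModel.from_torch` wrapper shown in the project's examples; the model, shapes, and names below are illustrative only, not a prescribed setup:

```python
import torch
import transformers
import turbo_transformers

# Assumption: wrap a HuggingFace BERT with Turbo, as in the project's examples.
torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()
turbo_model = turbo_transformers.BertModel.from_torch(torch_model)

# A stream of inputs with different shapes -- no padding or truncation needed.
for batch, seq_len in [(1, 10), (1, 15), (2, 30)]:
    input_ids = torch.randint(low=0, high=torch_model.config.vocab_size,
                              size=(batch, seq_len), dtype=torch.long)
    output = turbo_model(input_ids)
```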
Thanks! If so, onnxruntime also supports the variable-length behavior you describe here. You can add dynamic_axes in the torch.onnx.export call [https://github.com/Tencent/TurboTransformers/blob/f2d66bc12f0b904328372f472f6379aba50007cc/benchmark/benchmark_helper.py#L92]. The API doc is here: [https://pytorch.org/docs/stable/onnx.html#torch.onnx.export]
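For reference, a minimal sketch of exporting a BERT-like model with dynamic batch and sequence dimensions; the model choice, tensor names, and opset here are assumptions, not the exact benchmark setup:

```python
import torch
import transformers

model = transformers.BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Dummy input; the concrete size only serves as a tracing example.
input_ids = torch.ones(1, 16, dtype=torch.long)

# Mark batch and sequence dimensions as dynamic so the exported graph
# accepts any [batch, seq_len] at inference time.
torch.onnx.export(
    model,
    (input_ids,),
    "bert.onnx",
    input_names=["input_ids"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "last_hidden_state": {0: "batch", 1: "seq_len"},
        "pooler_output": {0: "batch"},
    },
    opset_version=11,
)
```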
Cool, onnxrt really did a very good job.
Thanks! Could you please update your table after your verification? I'm also curious why you use onnxruntime-mkldnn instead of the default MLAS backend. Do you see better performance with it?
I will update the performance numbers.
We built onnxrt v1.0.0 with the following command and benchmarked it with two different backends, mkldnn and cpu:
./build.sh --config=Release --update --build --build_wheel --use_mkldnn --use_mklml --parallel
After we noticed onnxrt had been updated, we built onnxrt v1.2.0 with the following command and benchmarked it with cpu as the backend:
./build.sh --config=Release --update --build --build_wheel --use_mklml --parallel
We presented the best results we observed.
In fact, I noticed onnxruntime achieved the best results on an AuthenticAMD CPU recently. However, on the Intel Xeon Gold 6133, a customized CPU widely deployed at Tencent, I did not observe the same result.
At Tencent, onnxrt has been used in many online serving scenarios. We would appreciate it if you could share some insights on using onnxrt more wisely.
I have compared onnxruntime performance with a dynamic axis vs. a fixed axis. When using 8 threads, the dynamic axis introduces significant performance degradation.
BTW, the performance figures illustrated in the README do not consider the impact of variable sequence length; they just average the time of running the same model 150 times.
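For context, a minimal sketch of that kind of measurement (8 intra-op threads, averaging 150 runs of a fixed-shape input); the model path and input name are placeholders, not the actual benchmark script:

```python
import time
import numpy as np
import onnxruntime as ort

# Assumption: a BERT-style ONNX model whose only required input is "input_ids".
opts = ort.SessionOptions()
opts.intra_op_num_threads = 8
sess = ort.InferenceSession("bert.onnx", opts)

input_ids = np.ones((1, 128), dtype=np.int64)
n_runs = 150

start = time.time()
for _ in range(n_runs):
    sess.run(None, {"input_ids": input_ids})
print("average latency: %.2f ms" % ((time.time() - start) / n_runs * 1000))
```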
@feifeibear, some models with dynamic inputs cannot be fused at runtime. Could you try this offline tool to optimize the model before running it and see whether the performance with dynamic_axes matches the fixed-axes case? https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert#model-optimization
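In recent onnxruntime releases a similar offline optimizer also ships as the onnxruntime.transformers.optimizer module; a hedged sketch of offline fusion, assuming a bert-base model exported as above (file names, head count, and hidden size are assumptions):

```python
from onnxruntime.transformers import optimizer

# Assumption: bert.onnx was exported with dynamic_axes;
# num_heads/hidden_size correspond to bert-base.
optimized = optimizer.optimize_model(
    "bert.onnx", model_type="bert", num_heads=12, hidden_size=768)
optimized.save_model_to_file("bert_opt.onnx")
```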
After onnxruntime was upgraded to v1.4.0, Turbo started to use onnxruntime as the default CPU backend; it has fully met our needs.
That's great! We will keep improving the performance. We also support quantization for transformer-based models on CPU now.
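For example, a minimal sketch of onnxruntime dynamic quantization for a transformer model on CPU; the file names are placeholders, and the onnxruntime quantization docs should be consulted for the recommended settings:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Assumption: quantize the weights of the optimized FP32 model to INT8.
quantize_dynamic("bert_opt.onnx", "bert_opt_int8.onnx",
                 weight_type=QuantType.QInt8)
```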
Thanks yufeng. Could you please give me some references on quantization?