Yufeng Li
ONNX doesn't have a direct quantized tensor definition. Essentially, it uses QDQ (a QuantizeLinear/DequantizeLinear pair) to represent a quantized tensor. Thus we can limit the change to the Q/DQ operators only, as @daquexian and...
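For illustration, a minimal sketch built with the onnx Python helpers of what that QDQ representation looks like around a single tensor (the tensor name, scale, and zero point below are made up):

```python
from onnx import helper, TensorProto

# Hypothetical tensor "x" held in QDQ form: a QuantizeLinear / DequantizeLinear
# pair carrying the quantization parameters (scale and zero point) as initializers.
x_scale = helper.make_tensor("x_scale", TensorProto.FLOAT, [], [0.02])
x_zero_point = helper.make_tensor("x_zero_point", TensorProto.INT8, [], [0])

q_node = helper.make_node(
    "QuantizeLinear", ["x", "x_scale", "x_zero_point"], ["x_quantized"]
)
dq_node = helper.make_node(
    "DequantizeLinear", ["x_quantized", "x_scale", "x_zero_point"], ["x_dequantized"]
)
```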
Thanks! If so, onnxruntime also supports the variable-length inputs you mean here. You can add dynamic_axes in torch.onnx.export [https://github.com/Tencent/TurboTransformers/blob/f2d66bc12f0b904328372f472f6379aba50007cc/benchmark/benchmark_helper.py#L92]. The API doc is here: [https://pytorch.org/docs/stable/onnx.html#torch.onnx.export]
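For reference, a minimal sketch of what that looks like (the model, shapes, and input/output names below are placeholders, not taken from the linked benchmark script):

```python
import torch

# Placeholder input of shape (batch, seq_len); the concrete model is assumed
# to be a torch.nn.Module that accepts this single tensor of token ids.
dummy_input = torch.ones(1, 128, dtype=torch.long)

torch.onnx.export(
    model,                      # assumed torch.nn.Module
    (dummy_input,),
    "model.onnx",
    input_names=["input_ids"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "seq_len"},  # keep both axes variable
        "output": {0: "batch_size"},
    },
    opset_version=12,
)
```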
Thanks! Could you please update your table after your verification? I'm also curious why you use onnxruntime-mkldnn over the default build with MLAS. Do you see better performance with it?
@feifeibear, some models with dynamic inputs cannot be fused at runtime. Could you try this offline tool to optimize the model before running it and see if the performance...
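A rough sketch of that offline optimization with the optimizer shipped in the onnxruntime.transformers package (the file names and the head/hidden-size values are assumptions to adapt to the actual model):

```python
from onnxruntime.transformers import optimizer

# Fuse transformer subgraphs ahead of time instead of relying on runtime fusion,
# which can fail for models exported with dynamic input shapes.
opt_model = optimizer.optimize_model(
    "bert.onnx",          # exported model with dynamic axes
    model_type="bert",    # selects the BERT fusion patterns
    num_heads=12,         # set to the model's attention head count
    hidden_size=768,      # set to the model's hidden size
)
opt_model.save_model_to_file("bert_opt.onnx")
```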
It's great! We will keep improving the performance. We also support quantization for transformer-based models on CPU now.
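As a pointer, dynamic quantization of an already-optimized transformer model can be done roughly like this (file names are placeholders; which weight type to pick depends on the target CPU kernels):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Weight-only (dynamic) quantization: weights are stored as int8, while
# activations stay in float and are quantized on the fly at inference time.
quantize_dynamic(
    "bert_opt.onnx",
    "bert_opt_int8.onnx",
    weight_type=QuantType.QInt8,
)
```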
The issue was resolved in the latest PyTorch. Please make sure to use ONNX opset 12 when exporting: https://github.com/pytorch/pytorch/issues/26893
/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline,...
/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline