onnxruntime Enable AVX NE CONVERT for FP16 to FP32 cast

Description

Implements a new cast assembly kernel that uses AVX_NE_CONVERT instructions to accelerate casting from FP16 to FP32. Adds CPUID checks to determine whether the ISA is supported.
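
For illustration only (this is not the code from this PR): a minimal sketch of what a CPUID check for this ISA could look like. It assumes the feature is reported in CPUID leaf 7, sub-leaf 1, EDX bit 5, as documented in Intel's instruction set extensions reference. The names `TestBit`, `CpuReportsAvxNeConvert` and `SKETCH_HAVE_CPUID` are hypothetical, and the sketch leaves out the separate check that the OS has enabled AVX state.

```cpp
#include <cstdint>

// Pick the CPUID intrinsic available for this compiler and architecture.
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define SKETCH_HAVE_CPUID 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#else
#define SKETCH_HAVE_CPUID 0
#endif

// Returns true when bit `bit` of `value` is set.
constexpr bool TestBit(std::uint32_t value, unsigned bit) {
    return ((value >> bit) & 1u) != 0;
}

// Assumption: AVX-NE-CONVERT is reported in CPUID.(EAX=7,ECX=1):EDX[5].
// Always returns false on architectures without CPUID.
inline bool CpuReportsAvxNeConvert() {
#if SKETCH_HAVE_CPUID
    std::uint32_t edx = 0;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 0);                 // regs[0] = highest supported basic leaf
    if (regs[0] < 7) return false;
    __cpuidex(regs, 7, 1);            // leaf 7, sub-leaf 1
    edx = static_cast<std::uint32_t>(regs[3]);
#else
    unsigned int a = 0, b = 0, c = 0, d = 0;
    // Returns 0 when leaf 7 is not available on this CPU.
    if (__get_cpuid_count(7, 1, &a, &b, &c, &d) == 0) return false;
    edx = d;
#endif
    return TestBit(edx, 5);
#else
    return false;
#endif
}
```

A dispatcher would call such a check once at startup and fall back to the existing cast path when it returns false.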

Motivation and Context

Currently, FP16 models executed on systems that lack complete FP16 operator support run every node in single precision. This means the original FP16 weights have to be cast to FP32 for the model to run properly. This change aims to accelerate that cast by using upconvert instructions and thereby improve performance.
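
For reference, the cast itself can be sketched in portable scalar C++. This is only an illustration of the FP16 to FP32 widening that an upconvert instruction performs on several values at once, not the kernel added by this PR. `HalfBitsToFloat` and `CastHalfToFloat` are hypothetical names, and the input is assumed to be raw IEEE 754 binary16 bit patterns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Widen one IEEE 754 binary16 value (raw bits) to a 32-bit float.
inline float HalfBitsToFloat(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    std::uint32_t exponent = (h >> 10) & 0x1Fu;
    std::uint32_t mantissa = h & 0x3FFu;
    std::uint32_t bits = 0;

    if (exponent == 0x1Fu) {
        // Infinity or NaN: all-ones exponent, payload moved to the top bits.
        bits = sign | 0x7F800000u | (mantissa << 13);
    } else if (exponent != 0) {
        // Normal number: rebias the exponent from 15 to 127.
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    } else if (mantissa != 0) {
        // Subnormal half: normalize, since every one is a normal float.
        exponent = 113;  // 127 - 15 + 1
        while ((mantissa & 0x400u) == 0) {
            mantissa <<= 1;
            --exponent;
        }
        mantissa &= 0x3FFu;  // drop the now-implicit leading one
        bits = sign | (exponent << 23) | (mantissa << 13);
    } else {
        // Signed zero.
        bits = sign;
    }

    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Widen `count` half-precision values from `src` into `dst`.
inline void CastHalfToFloat(const std::uint16_t* src, float* dst, std::size_t count) {
    assert((src != nullptr && dst != nullptr) || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = HalfBitsToFloat(src[i]);
    }
}
```

The widening is exact, because every FP16 value is representable in FP32.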

Jun 26 '24 18:06 eralmual

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

Jun 26 '24 22:06 tianleiwu

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

Jun 26 '24 22:06 tianleiwu

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Jun 26 '24 22:06 tianleiwu

Azure Pipelines successfully started running 3 pipeline(s).

Jun 26 '24 22:06 azure-pipelines[bot]

Azure Pipelines successfully started running 10 pipeline(s).

Jun 26 '24 22:06 azure-pipelines[bot]

Azure Pipelines successfully started running 10 pipeline(s).

Jun 26 '24 22:06 azure-pipelines[bot]

I think the QNN CI pipeline build fails because it uses MSVC 14.36, which doesn't support the vcvtneeph2ps instruction yet. The other Windows CI pipelines use 14.40.

@snnn, any idea why the QNN CI pipeline doesn't use the same MSVC version?

Jun 27 '24 17:06 yufenglee

Hi @yufenglee @tianleiwu! Do you have any other feedback on the PR?

Jul 12 '24 15:07 eralmual

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

Jul 12 '24 18:07 tianleiwu

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

Jul 12 '24 18:07 tianleiwu

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Jul 12 '24 18:07 tianleiwu

Azure Pipelines successfully started running 10 pipeline(s).

Jul 12 '24 18:07 azure-pipelines[bot]

Azure Pipelines successfully started running 3 pipeline(s).

Jul 12 '24 18:07 azure-pipelines[bot]

Azure Pipelines successfully started running 10 pipeline(s).

Jul 12 '24 18:07 azure-pipelines[bot]

@eralmual, some build pipelines failed; the build needs to be fixed. https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1431587&view=logs&j=9d16baec-2ed2-55b0-74fb-c50315f92eff&t=39997a68-8fc6-587d-198c-e5d495a0b19a&l=1126

gcc 11.4 build error: /onnxruntime_src/onnxruntime/core/mlas/lib/x86_64/cvtfp16a.S:44: Error: no such instruction: `vcvtneeph2ps ymm0,ymmword PTR [rdi]'

Could you add some conditional compilation to make sure cvtfp16a.S is not compiled when the compiler does not support vcvtneeph2ps?

Jul 12 '24 18:07 tianleiwu

@tianleiwu @yufenglee since the new and the old .asm implementations are now in the same file (as per the request to fuse both implementations into one file), a compiler check that decides whether to include that file would lock out both versions. Do you want me to separate the two functions again so we can use the check without affecting the old version?

Jul 19 '24 22:07 eralmual

@eralmual, the solution is either to separate it into a new file and only compile that file when the compiler supports it, or to add an #if macro check in the .asm source file to conditionally compile some code blocks. The macro can check the compiler name and version (like #ifdef _MSC_VER), or check whether some custom-defined build flag (like USE_AVX_NE_CONVERT) exists.

From the pipeline builds, it seems that it is only supported by the compiler on Windows. Did you try building it on Linux?
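
A rough sketch of the custom build flag option, for illustration only. `USE_AVX_NE_CONVERT` is the example flag name from above and would have to be defined by the build system when the assembler accepts vcvtneeph2ps. Everything else (`HAS_NE_CONVERT_KERNEL`, `CastKernel`, `SelectCastKernel`) is a hypothetical name, not an identifier from the code base.

```cpp
// Collapse the build flag into a single 0/1 macro so later code can
// test it with a plain #if.
#if defined(USE_AVX_NE_CONVERT)
#define HAS_NE_CONVERT_KERNEL 1
#else
#define HAS_NE_CONVERT_KERNEL 0
#endif

enum class CastKernel { Generic, AvxNeConvert };

// True only when the new kernel was compiled into this binary.
constexpr bool kNeConvertKernelCompiled = (HAS_NE_CONVERT_KERNEL != 0);

// Use the new kernel only when it was compiled in AND the CPU reports
// support at run time; otherwise keep the existing implementation.
constexpr CastKernel SelectCastKernel(bool cpu_supports_ne_convert) {
    return (kNeConvertKernelCompiled && cpu_supports_ne_convert)
               ? CastKernel::AvxNeConvert
               : CastKernel::Generic;
}
```

The same guard (`#if HAS_NE_CONVERT_KERNEL`) would wrap the declaration of the new kernel and the block in the assembly source, so the old implementation keeps building on toolchains that lack the instruction.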

Jul 20 '24 21:07 tianleiwu

Hi @tianleiwu, could you run the pipeline again please?

Jul 26 '24 15:07 eralmual

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

Jul 27 '24 02:07 tianleiwu

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

Jul 27 '24 02:07 tianleiwu

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Jul 27 '24 02:07 tianleiwu

Azure Pipelines successfully started running 3 pipeline(s).

Jul 27 '24 02:07 azure-pipelines[bot]

Azure Pipelines successfully started running 9 pipeline(s).

Jul 27 '24 02:07 azure-pipelines[bot]

Azure Pipelines successfully started running 10 pipeline(s).

Jul 27 '24 02:07 azure-pipelines[bot]

There are build errors on iOS and Android. Try excluding it from those platforms, like

#if !defined(__APPLE__) && !defined(__ANDROID__)

Jul 27 '24 18:07 tianleiwu

@tianleiwu could you try again?

Jul 30 '24 16:07 eralmual

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

Aug 03 '24 01:08 tianleiwu

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

Aug 03 '24 01:08 tianleiwu

Azure Pipelines successfully started running 9 pipeline(s).

Aug 03 '24 01:08 azure-pipelines[bot]

Azure Pipelines successfully started running 10 pipeline(s).

Aug 03 '24 01:08 azure-pipelines[bot]
