4x16s4 fp32-gemm kernel performs better than the default (5x16) kernel on Meteor Lake

Open xujuntwt95329 opened this issue 1 year ago • 1 comments

By default, XNNPACK uses the 5x16 fp32-gemm kernel for x86_fma3, but we found that the 4x16s4 kernel performs better on a Meteor Lake CPU (Intel(R) Core(TM) Ultra 7 155H):

Benchmark	5x16 (us)	4x16s4 (us)	Reduction in inference time (%)
FP32MobileNetV1/T:1/real_time	16193	10775	33.46
FP32MobileNetV2/T:1/real_time	8809	6626	24.78
FP32MobileNetV3Large/T:1/real_time	7756	6052	21.97
FP32MobileNetV3Small/T:1/real_time	2180	1970	9.63

Here is the code to reproduce the above data: https://github.com/xujuntwt95329/XNNPACK/tree/0143aab98634c866b319decca52590e1eb54b9dd

We can submit a PR if this is welcome.

May 27 '24 15:05 xujuntwt95329

Note that this is due to register spilling in the code generated by Visual C. Clang produces better code with the 5x16 kernel.

Jun 25 '24 06:06 fbarchard