XNNPACK
4x16s4 fp32-gemm kernel performs better than the default (5x16) kernel on Meteor Lake
By default, XNNPACK uses the 5x16 fp32-gemm kernel for x86_fma3, but we found that the 4x16s4 kernel performs better on a Meteor Lake CPU (Intel(R) Core(TM) Ultra 7 155H):
| Benchmark | 5x16 (us) | 4x16s4 (us) | Reduction in inference time (%) |
|---|---|---|---|
| FP32MobileNetV1/T:1/real_time | 16193 | 10775 | 33.46 |
| FP32MobileNetV2/T:1/real_time | 8809 | 6626 | 24.78 |
| FP32MobileNetV3Large/T:1/real_time | 7756 | 6052 | 21.97 |
| FP32MobileNetV3Small/T:1/real_time | 2180 | 1970 | 9.63 |
Here is the code to reproduce the above data: https://github.com/xujuntwt95329/XNNPACK/tree/0143aab98634c866b319decca52590e1eb54b9dd
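For reference, the last column of the table can be recomputed from the two timing columns as `(t_5x16 - t_4x16s4) / t_5x16 * 100`. The snippet below is only a minimal sketch of that arithmetic; the helper name `reduction_percent` is ours and is not part of XNNPACK or its benchmark tooling.

```python
# Recompute the "reduction in inference time" column from the table.
# The timings (in microseconds) are copied from the table above.

def reduction_percent(baseline_us, candidate_us):
    """Relative time saved by the candidate kernel, in percent of the baseline."""
    return (baseline_us - candidate_us) / baseline_us * 100.0

# benchmark name -> (5x16 time, 4x16s4 time)
timings = {
    "FP32MobileNetV1/T:1/real_time": (16193, 10775),
    "FP32MobileNetV2/T:1/real_time": (8809, 6626),
    "FP32MobileNetV3Large/T:1/real_time": (7756, 6052),
    "FP32MobileNetV3Small/T:1/real_time": (2180, 1970),
}

for name, (base_us, cand_us) in timings.items():
    print(f"{name}: {reduction_percent(base_us, cand_us):.2f}%")
```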
We can submit a PR if this is welcome.
Note that this is due to register spilling in the Visual C compiler. clang produces better code with 5x16.