XNNPACK
add f32-gemm-5x16-minmax-fma3-broadcast-prfm microkernel
This change adds a variant of xnn_f32_gemm_minmax_ukernel_5x16__fma3_broadcast that prefetches the weights into the L1 cache. It improves performance by over 3% on average across the MobileNet V1, V2, V3_Large, and V3_Small models.
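To illustrate the idea, here is a minimal scalar sketch of a 5x16 GEMM tile with min/max clamping and an optional prefetch of the packed weights. It is not the XNNPACK microkernel, which uses FMA3 intrinsics. The function name `gemm_tile_minmax`, the `PREFETCH_L1` macro, the data layout, and the look-ahead distance `PREFETCH_ROWS` are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Portable prefetch hint: GCC/Clang builtin, no-op on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_L1(p) __builtin_prefetch((p), 0 /* read */, 3 /* high locality */)
#else
#define PREFETCH_L1(p) ((void) (p))
#endif

enum { MR = 5, NR = 16 };    /* tile shape, taken from the kernel name */
enum { PREFETCH_ROWS = 4 };  /* hypothetical look-ahead, in rows of packed weights */

/*
 * Hypothetical scalar model of one MR x NR output tile.
 *   a: MR rows of kc floats, row-major (activations)
 *   w: kc rows of NR floats, stored contiguously (packed weights)
 *   c: MR x NR outputs, row-major, clamped to [min, max]
 *   prefetch: non-zero to issue prefetch hints for upcoming weight rows
 */
static void gemm_tile_minmax(size_t kc, const float* a, const float* w,
                             float* c, float min, float max, int prefetch)
{
  assert(a != NULL && w != NULL && c != NULL);

  float acc[MR][NR] = {{0.0f}};
  for (size_t k = 0; k < kc; k++) {
    /* Hint the weight row needed PREFETCH_ROWS steps from now. The bounds
       check keeps the pointer arithmetic inside the weight buffer. */
    if (prefetch && k + PREFETCH_ROWS < kc) {
      PREFETCH_L1(w + (size_t) PREFETCH_ROWS * NR);
    }
    for (int m = 0; m < MR; m++) {
      const float va = a[(size_t) m * kc + k];  /* broadcast one activation */
      for (int n = 0; n < NR; n++) {
        acc[m][n] += va * w[n];
      }
    }
    w += NR;  /* advance to the next row of packed weights */
  }

  /* Clamp to [min, max] and store the tile. */
  for (int m = 0; m < MR; m++) {
    for (int n = 0; n < NR; n++) {
      float v = acc[m][n];
      if (v < min) v = min;
      if (v > max) v = max;
      c[m * NR + n] = v;
    }
  }
}
```

A row of 16 floats is 64 bytes, which matches the cache-line size on typical x86 CPUs, so one hint per `k` step covers one full row of weights. The prefetch only affects timing: calling the function with `prefetch` set to 0 or 1 produces identical outputs.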
--------------------------------------------------------------------------------------------------
Benchmark                                                             Time        CPU   Iterations
--------------------------------------------------------------------------------------------------
f32_gemm_5x16__fma3_broadcast/mobilenet_v1/real_time              11090 us   10237 us           58  <-- orig
f32_gemm_5x16__fma3_broadcast_prfm/mobilenet_v1/real_time         10049 us   10045 us           70  <-- prefetch
--------------------------------------------------------------------------------------------------
f32_gemm_5x16__fma3_broadcast/mobilenet_v2/real_time               6441 us    6366 us          108  <-- orig
f32_gemm_5x16__fma3_broadcast_prfm/mobilenet_v2/real_time          6085 us    6250 us          115  <-- prefetch
--------------------------------------------------------------------------------------------------
f32_gemm_5x16__fma3_broadcast/mobilenet_v3_large/real_time         5761 us    5682 us          121  <-- orig
f32_gemm_5x16__fma3_broadcast_prfm/mobilenet_v3_large/real_time    5743 us    5777 us          119  <-- prefetch
--------------------------------------------------------------------------------------------------
f32_gemm_5x16__fma3_broadcast/mobilenet_v3_small/real_time         1861 us    1833 us          375  <-- orig
f32_gemm_5x16__fma3_broadcast_prfm/mobilenet_v3_small/real_time    1826 us    1810 us          397  <-- prefetch
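As a quick check of the summary figure, the sketch below recomputes the per-model improvement from the Time column above and averages it. It assumes the stated average is the arithmetic mean of the per-model reductions in real time, which the description does not specify; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One benchmark pair: real time in microseconds, copied from the table. */
struct bench_pair {
  const char* model;
  double orig_us;
  double prfm_us;
};

static const struct bench_pair results[] = {
  {"mobilenet_v1",       11090.0, 10049.0},
  {"mobilenet_v2",        6441.0,  6085.0},
  {"mobilenet_v3_large",  5761.0,  5743.0},
  {"mobilenet_v3_small",  1861.0,  1826.0},
};

/* Percentage reduction in time relative to the original kernel. */
static double improvement_pct(double orig_us, double prfm_us)
{
  return 100.0 * (orig_us - prfm_us) / orig_us;
}

/* Arithmetic mean of the per-model improvements. */
static double mean_improvement_pct(const struct bench_pair* r, size_t n)
{
  assert(n > 0);
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) {
    sum += improvement_pct(r[i].orig_us, r[i].prfm_us);
  }
  return sum / (double) n;
}

/* Print one line per model, then the mean. */
static void print_report(void)
{
  const size_t n = sizeof(results) / sizeof(results[0]);
  for (size_t i = 0; i < n; i++) {
    printf("%-20s %6.2f%%\n", results[i].model,
           improvement_pct(results[i].orig_us, results[i].prfm_us));
  }
  printf("%-20s %6.2f%%\n", "mean", mean_improvement_pct(results, n));
}
```

Calling `print_report()` from `main` lists the four models and their mean. The gains are uneven: the largest is on mobilenet_v1, while mobilenet_v3_large is nearly unchanged in real time.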