NEGEMMLowpMatrixMultiplyCore: performance issue int8 vs fp16
Issues:
- The default-selected low-precision kernel is not optimal for the platform described below.
- In multithreaded mode we see only a ~30% performance gain for the low-precision kernel vs fp16. Can you confirm, please, that these are the results you expect? We expected a ~2x gain.
Platform:
system_profiler SPHardwareDataType
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: Mac15,6
Chip: Apple M3 Pro
Total Number of Cores: 12 (6 performance and 6 efficiency)
Memory: 18 GB
Operating System:
ProductName: macOS
ProductVersion: 14.2.1
BuildVersion: 23C71
Command line:
scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=0 cppthreads=0 os=macos data_layout_support=all build=native asserts=1 --jobs=8 --silent fixed_format_kernels=True validation_tests=1 examples=1 debug=0
Single thread: cppthreads=0
Multithread: cppthreads=1
Results fp16, default kernel: a64_hybrid_fp16_mla_6x32
Single thread, shapes: 4096x128 * 128x4096:
fp16 median time = 17373 microsecs
Multithread, shapes: 4096x128 * 128x4096:
fp16 median time = 2919 microsecs
Results int8, default-selected kernel: a64_hybrid_s8s32_mmla_6x16
Single thread, shapes: 4096x128 * 128x4096:
int8 median time = 12573 microsecs
Multithread, shapes: 4096x128 * 128x4096:
int8 median time = 3595 microsecs
Results int8, manually selected kernel: a64_interleaved_s8s32_mmla_8x12
Single thread, shapes: 4096x128 * 128x4096:
int8 median time = 12598 microsecs
Multithread, shapes: 4096x128 * 128x4096:
int8 median time = 2113 microsecs
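To make the comparison easier to see, here is a small sketch that derives the speedup ratios from the median timings reported above (the dictionary values are copied verbatim from this issue; the `speedup` helper is just for illustration):

```python
# Median timings from this issue, in microseconds.
timings_us = {
    ("fp16", "single"): 17373,
    ("fp16", "multi"): 2919,
    ("int8_default", "single"): 12573,
    ("int8_default", "multi"): 3595,
    ("int8_interleaved", "single"): 12598,
    ("int8_interleaved", "multi"): 2113,
}

def speedup(baseline, candidate, mode):
    """Ratio > 1 means `candidate` is faster than `baseline` in `mode`."""
    return timings_us[(baseline, mode)] / timings_us[(candidate, mode)]

print(f"single-thread int8 (default) vs fp16:   {speedup('fp16', 'int8_default', 'single'):.2f}x")
print(f"multi-thread  int8 (default) vs fp16:   {speedup('fp16', 'int8_default', 'multi'):.2f}x")
print(f"multi-thread  int8 (interleaved) vs fp16: {speedup('fp16', 'int8_interleaved', 'multi'):.2f}x")
```

Note that with the default int8 kernel the multithreaded run is actually slower than fp16 (~0.81x); only the manually selected interleaved kernel recovers the ~1.38x advantage seen single-threaded.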
Hi @eshoguli
What is the data layout used for these workloads when calling into ACL? It would also help if you could build ACL with logging=1 so that we can see more details about these workloads.
You can use the eshoguli:es/neon_gemm_s8s8s32_perf_default branch to reproduce the issue easily.
The INT8 kernels use MMLA instructions instead of the MLAs used by FP32 and FP16, and they work the core/memory system much harder. As a result, with MMLA we can get anywhere between ~70% and ~80% of speed of light, while with MLAs we can get ~95%. In terms of absolute performance, that means INT8 should be anywhere between 1.4x and 1.6x faster than FP16, and that is what you are observing in your data: the single-threaded ratio in your case is 1.38x, which is in the right ballpark once software-stack overhead is also taken into account.
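A back-of-the-envelope check of the range quoted above. The peak-throughput ratio between INT8 MMLA and FP16 MLA is taken here as 2x (an assumption for illustration, not stated in the thread); combined with the quoted efficiencies, it lands close to the ~1.4x to 1.6x range:

```python
# Assumption: theoretical int8 MMLA peak is 2x the fp16 MLA peak.
INT8_PEAK_VS_FP16 = 2.0
mmla_efficiency = (0.70, 0.80)  # MMLA reaches ~70-80% of speed of light
mla_efficiency = 0.95           # MLA reaches ~95% of speed of light

lo, hi = (INT8_PEAK_VS_FP16 * e / mla_efficiency for e in mmla_efficiency)
print(f"expected int8 speedup over fp16: {lo:.2f}x to {hi:.2f}x")
```

The observed single-threaded 1.38x sits just below this range, consistent with the software-stack overhead mentioned above.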