NEGEMMLowpMatrixMultiplyCore: performance issue int8 vs fp16
Issues:
- The default-selected low-precision kernel is not optimal for the platform described below.
- In multithreaded mode we see only a ~30% performance gain for the low-precision kernel vs fp16. Can you confirm, please, that these are the results you expect? We expected a ~2x gain.
Platform:
system_profiler SPHardwareDataType
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: Mac15,6
Chip: Apple M3 Pro
Total Number of Cores: 12 (6 performance and 6 efficiency)
Memory: 18 GB
Operating System:
ProductName: macOS
ProductVersion: 14.2.1
BuildVersion: 23C71
Command line:
scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=0 cppthreads=0 os=macos data_layout_support=all build=native asserts=1 --jobs=8 --silent fixed_format_kernels=True validation_tests=1 examples=1 debug=0
Single thread: cppthreads=0
Multithread: cppthreads=1
Results fp16, default kernel: a64_hybrid_fp16_mla_6x32
Single thread, shapes: 4096x128 * 128x4096:
fp16 median time = 17373 microsecs
Multithread, shapes: 4096x128 * 128x4096:
fp16 median time = 2919 microsecs
Results int8, default-selected kernel: a64_hybrid_s8s32_mmla_6x16
Single thread, shapes: 4096x128 * 128x4096:
int8 median time = 12573 microsecs
Multithread, shapes: 4096x128 * 128x4096:
int8 median time = 3595 microsecs
Results int8, manually selected kernel: a64_interleaved_s8s32_mmla_8x12
Single thread, shapes: 4096x128 * 128x4096:
int8 median time = 12598 microsecs
Multithread, shapes: 4096x128 * 128x4096:
int8 median time = 2113 microsecs
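To make the comparison easier to see, here is a small sketch that derives the speedup ratios from the median timings reported above (the dictionary values are copied verbatim from this issue; the `speedup` helper is just for illustration):

```python
# Median timings from this issue, in microseconds.
timings_us = {
    ("fp16", "single"): 17373,
    ("fp16", "multi"): 2919,
    ("int8_default", "single"): 12573,
    ("int8_default", "multi"): 3595,
    ("int8_interleaved", "single"): 12598,
    ("int8_interleaved", "multi"): 2113,
}

def speedup(baseline, candidate, mode):
    """Ratio > 1 means `candidate` is faster than `baseline` in `mode`."""
    return timings_us[(baseline, mode)] / timings_us[(candidate, mode)]

print(f"single-thread int8 (default) vs fp16:   {speedup('fp16', 'int8_default', 'single'):.2f}x")
print(f"multi-thread  int8 (default) vs fp16:   {speedup('fp16', 'int8_default', 'multi'):.2f}x")
print(f"multi-thread  int8 (interleaved) vs fp16: {speedup('fp16', 'int8_interleaved', 'multi'):.2f}x")
```

Note that with the default int8 kernel the multithreaded run is actually slower than fp16 (~0.81x); only the manually selected interleaved kernel recovers the ~1.38x advantage seen single-threaded.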
Hi @eshoguli
What is the data layout used for these workloads when calling into ACL? It would also help if you could build ACL with logging=1 so that we can see more details about these workloads.
You can use the eshoguli:es/neon_gemm_s8s8s32_perf_default branch to reproduce the issue easily.
The INT8 kernels use MMLA instructions instead of the MLAs used by FP32 and FP16, and they work the core/memory system much harder. As a result, with MMLA we can get anywhere between ~70% and ~80% of speed of light, while with MLAs we can get ~95%. In terms of absolute performance, that means INT8 should be anywhere between 1.4x and 1.6x faster than FP16, and that is what you are observing in your data: the single-threaded ratio in your case is 1.38x, which is in the right ballpark once software-stack overhead is also taken into account.
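A back-of-the-envelope check of the range quoted above. The peak-throughput ratio between INT8 MMLA and FP16 MLA is taken here as 2x (an assumption for illustration, not stated in the thread); combined with the quoted efficiencies, it lands close to the ~1.4x to 1.6x range:

```python
# Assumption: theoretical int8 MMLA peak is 2x the fp16 MLA peak.
INT8_PEAK_VS_FP16 = 2.0
mmla_efficiency = (0.70, 0.80)  # MMLA reaches ~70-80% of speed of light
mla_efficiency = 0.95           # MLA reaches ~95% of speed of light

lo, hi = (INT8_PEAK_VS_FP16 * e / mla_efficiency for e in mmla_efficiency)
print(f"expected int8 speedup over fp16: {lo:.2f}x to {hi:.2f}x")
```

The observed single-threaded 1.38x sits just below this range, consistent with the software-stack overhead mentioned above.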