ggml_vec_dot_f16 perf is severely slower when ARM_FEATURE_FP16_VECTOR_ARITHMETIC is enabled on Android
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Hi @hayyaw, what do you mean by "runs slower than before"? Please elaborate: what was the previous baseline, and how much slower is it?
Test command: `./test-backend-ops perf -o MUL_MAT`
Case: `MUL_MAT(type_a=f16,type_b=f32,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1])`
`ggml_vec_dot_f16` performance: 6 runs - 34869.83 us/run - 1647128 kB/run - 45.05 GB/s
When compiled with `ARM_FEATURE_FP16_VECTOR_ARITHMETIC`, performance drops to: 6 runs - 175005.83 us/run - 1647128 kB/run - 8.98 GB/s
I encountered the same problem on Android: building with `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` enabled is slower than without it. The perf numbers are below.
Background
- We are trying to optimize `MUL_MAT` perf on Arm chips.
- According to the `ggml.c` code, if `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` is defined, the `GGML_F16` family of macros maps to f16 NEON intrinsics (e.g. `GGML_F16x8_FMA` maps to `vfmaq_f16`); if it is undefined, those `GGML_F16` macros map to f32 NEON intrinsics instead. So enabling `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` should improve perf on Arm chips (a paraphrased sketch of this mapping follows the diff below).
- So we enabled it by adding `add_compile_definitions(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)` to `CMakeLists.txt`; the exact modification is:
```diff
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -634,6 +634,7 @@ if ((${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm") OR (${CMAKE_SYSTEM_PROCESSOR} MATC
         # add_compile_definitions(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC) # MSVC doesn't support vdupq_n_f16, vld1q_f16, vst1q_f16
         add_compile_definitions(__aarch64__) # MSVC defines _M_ARM64 instead
     else()
+        add_compile_definitions(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
         check_cxx_compiler_flag(-mfp16-format=ieee COMPILER_SUPPORTS_FP16_FORMAT_I3E)
         if (NOT "${COMPILER_SUPPORTS_FP16_FORMAT_I3E}" STREQUAL "")
```
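For context, the mapping in `ggml.c` looks roughly like the following. This is a paraphrased sketch, not the verbatim source: `GGML_F16x8_FMA` and `vfmaq_f16` are quoted above, while the other macro names are recalled from the source and may differ slightly between revisions.

```c
#include <arm_neon.h>

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    // native half-precision vectors: 8 f16 lanes per register
    #define GGML_F16x8              float16x8_t
    #define GGML_F16x8_ZERO         vdupq_n_f16(0.0f)
    #define GGML_F16x8_LOAD(p)      vld1q_f16((const float16_t *)(p))
    #define GGML_F16x8_FMA(a, b, c) vfmaq_f16(a, b, c)
#else
    // fallback: widen each 4-lane f16 half-vector to f32, do the math in f32
    #define GGML_F32Cx4              float32x4_t
    #define GGML_F32Cx4_ZERO         vdupq_n_f32(0.0f)
    #define GGML_F32Cx4_LOAD(p)      vcvt_f32_f16(vld1_f16((const float16_t *)(p)))
    #define GGML_F32Cx4_FMA(a, b, c) vfmaq_f32(a, b, c)
#endif
```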
Unexpected results
- Those `GGML_F16` macros mainly affect the perf of `ggml_vec_dot_f16`, so we benched `MUL_MAT` perf with `type_a=f16, type_b=f16/f32` using `test-backend-ops` (a paraphrased sketch of that dot-product loop follows this list).
- The result is disappointing: `ggml_vec_dot_f16` perf with `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` is severely slower than without it; the details are shown below.
- The `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` option is OFF by default. Why hasn't it been enabled? And why does enabling it decrease perf so heavily? @snadampal @ggerganov, looking forward to your reply. Thanks.
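For reference, `ggml_vec_dot_f16` accumulates through `GGML_F16_VEC_*` aliases, so the mapping above decides whether its hot loop runs native f16 FMAs or the f32 fallback. A simplified paraphrase of that loop (not the verbatim `ggml.c` source; the helper names are recalled from the source):

```c
// sum[] holds several vector accumulators; GGML_F16_VEC_* resolve to the
// f16 or f32 intrinsics shown earlier, depending on the feature macro.
inline static void ggml_vec_dot_f16(const int n, float * s,
                                    ggml_fp16_t * x, ggml_fp16_t * y) {
    GGML_F16_VEC sum[GGML_F16_ARR] = { GGML_F16_VEC_ZERO };
    GGML_F16_VEC ax, ay;

    const int np = (n & ~(GGML_F16_STEP - 1));   // full vector steps
    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ax     = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
            ay     = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
            sum[j] = GGML_F16_VEC_FMA(sum[j], ax, ay);
        }
    }

    ggml_float sumf = 0.0;
    GGML_F16_VEC_REDUCE(sumf, sum);          // horizontal sum of accumulators
    for (int i = np; i < n; ++i) {           // scalar tail
        sumf += GGML_FP16_TO_FP32(x[i]) * GGML_FP16_TO_FP32(y[i]);
    }
    *s = sumf;
}
```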
How to reproduce
Hardware
OnePlus 9 with a Snapdragon 888 chip.
build
after add add_compile_definitions(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
in CMakeList.txt
, build with belows script
```sh
cd examples/llava/android/build_64
../build_64.sh -DLLAMA_PERF=0
```
Bench script
Note: I modified the `dtype` and `m,n,k` values in the test as shown below, then ran:

```sh
./test-backend-ops perf -o MUL_MAT
```
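The modified case corresponds to an entry like the following in `tests/test-backend-ops.cpp`. This is a hypothetical sketch; the exact `test_mul_mat` constructor arguments may differ between revisions, with the order here following the `MUL_MAT(type_a, type_b, m, n, k, bs, nr)` label the tool prints.

```cpp
// f16 x f32 case matching the benchmark output below
test_cases.emplace_back(new test_mul_mat(GGML_TYPE_F16, GGML_TYPE_F32,
                                         2048, 67, 2048, {1, 1}, {1, 1}));
```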
Baseline perf (option OFF)
```
MUL_MAT(type_a=f32,type_b=f32,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 4 runs - 16024.25 us/run - 2195992 kB/run - 130.69 GB/s
MUL_MAT(type_a=f16,type_b=f32,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 6 runs - 35947.83 us/run - 1647128 kB/run - 43.70 GB/s
MUL_MAT(type_a=f16,type_b=f16,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 8 runs - 34055.50 us/run - 1098264 kB/run - 30.76 GB/s
```
Perf with the `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` option ON
```
MUL_MAT(type_a=f32,type_b=f32,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 4 runs - 16813.25 us/run - 2195992 kB/run - 124.56 GB/s
MUL_MAT(type_a=f16,type_b=f32,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 6 runs - 174329.17 us/run - 1647128 kB/run - 9.01 GB/s
MUL_MAT(type_a=f16,type_b=f16,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 8 runs - 172306.88 us/run - 1098264 kB/run - 6.08 GB/s
```
You shouldn't need to set this macro directly; it is set by the compiler when targeting an architecture that has FP16. If the compiler isn't setting it, it's because it thinks the target doesn't have the feature, and setting the macro explicitly is likely to cause problems.

Normally this would be set as a result of `-mcpu=native` when the compiler detects you're compiling for a CPU with FP16. Building natively on an Arm Linux system, I see `ggml_vec_dot_f16` using FP16 instructions. Are you cross-building for Android from an x86 host? You might need to carefully check your compiler options and use suitable `-march` or `-mcpu` options. But in general it's not a good idea to override the feature macros.
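For example, instead of force-defining the macro, one could pass a target-appropriate `-march` so the compiler enables the feature (and defines the macro) itself. A minimal sketch, assuming an aarch64 Android NDK build and a CPU implementing ARMv8.2-A FP16 such as the Snapdragon 888; the exact guard condition here is illustrative:

```cmake
# Let the compiler enable FP16 vector arithmetic for the target;
# it will then define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC on its own.
if (ANDROID AND CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64")
    add_compile_options(-march=armv8.2-a+fp16)
endif()
```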
Yeah, we cross-build for Android on an x86 host. Thanks for the reminder.
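One way to confirm what the toolchain actually enables is to dump its predefined macros for the chosen target. The compiler name below assumes an NDK toolchain wrapper for API level 24; adjust it to your NDK version and API level:

```sh
# If the -march setting is right, this should list
# __ARM_FEATURE_FP16_VECTOR_ARITHMETIC among the matches.
aarch64-linux-android24-clang -march=armv8.2-a+fp16 -dM -E -x c /dev/null | grep FP16
```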
This issue was closed because it has been inactive for 14 days since being marked as stale.