ggml_vec_dot_f16 perf is severely slower when ARM_FEATURE_FP16_VECTOR_ARITHMETIC is enabled on Android
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Hi @hayyaw, what do you mean by "runs slower than before"? Please elaborate: what was the previous baseline, and how much slower is it?
Test command: `./test-backend-ops perf -o MUL_MAT`
Case: `MUL_MAT(type_a=f16,type_b=f32,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1])`
`ggml_vec_dot_f16` performance: 6 runs - 34869.83 us/run - 1647128 kB/run - 45.05 GB/s
When compiled with `ARM_FEATURE_FP16_VECTOR_ARITHMETIC`, performance drops to: 6 runs - 175005.83 us/run - 1647128 kB/run - 8.98 GB/s
I encountered the same problem on Android: building with `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` enabled is slower than without it. The perf numbers are below.
Background
- We are trying to optimize `MUL_MAT` perf on Arm chips.
- According to the `ggml.c` code, if `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` is defined, the `GGML_F16` family of macros maps to f16 NEON intrinsics (e.g. `GGML_F16x8_FMA` maps to `vfmaq_f16`); if it is undefined, those `GGML_F16` macros map to f32 NEON intrinsics instead. So enabling `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` should improve perf on Arm chips (a paraphrased sketch of this mapping follows the diff below).
- So we enabled it by adding `add_compile_definitions(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)` to `CMakeLists.txt`; the exact modification is:
```diff
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -634,6 +634,7 @@ if ((${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm") OR (${CMAKE_SYSTEM_PROCESSOR} MATC
         # add_compile_definitions(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC) # MSVC doesn't support vdupq_n_f16, vld1q_f16, vst1q_f16
         add_compile_definitions(__aarch64__) # MSVC defines _M_ARM64 instead
     else()
+        add_compile_definitions(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
         check_cxx_compiler_flag(-mfp16-format=ieee COMPILER_SUPPORTS_FP16_FORMAT_I3E)
         if (NOT "${COMPILER_SUPPORTS_FP16_FORMAT_I3E}" STREQUAL "")
```
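For context, the mapping in `ggml.c` looks roughly like the following. This is a paraphrased sketch, not the verbatim source: `GGML_F16x8_FMA` and `vfmaq_f16` are quoted above, while the other macro names are recalled from the source and may differ slightly between revisions.

```c
#include <arm_neon.h>

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    // native half-precision vectors: 8 f16 lanes per register
    #define GGML_F16x8              float16x8_t
    #define GGML_F16x8_ZERO         vdupq_n_f16(0.0f)
    #define GGML_F16x8_LOAD(p)      vld1q_f16((const float16_t *)(p))
    #define GGML_F16x8_FMA(a, b, c) vfmaq_f16(a, b, c)
#else
    // fallback: widen each 4-lane f16 half-vector to f32, do the math in f32
    #define GGML_F32Cx4              float32x4_t
    #define GGML_F32Cx4_ZERO         vdupq_n_f32(0.0f)
    #define GGML_F32Cx4_LOAD(p)      vcvt_f32_f16(vld1_f16((const float16_t *)(p)))
    #define GGML_F32Cx4_FMA(a, b, c) vfmaq_f32(a, b, c)
#endif
```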
Unexpected results
- Those `GGML_F16` macros mainly affect the perf of `ggml_vec_dot_f16`, so we benched `MUL_MAT` perf with `type_a=f16, type_b=f16/f32` using `test-backend-ops` (a paraphrased sketch of that dot-product loop follows this list).
- The result is disappointing: `ggml_vec_dot_f16` perf with `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` is severely slower than without it; the details are shown below.
- The `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` option is OFF by default. Why hasn't it been enabled? And why does enabling it decrease perf so heavily? @snadampal @ggerganov, looking forward to your reply. Thanks.
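For reference, `ggml_vec_dot_f16` accumulates through `GGML_F16_VEC_*` aliases, so the mapping above decides whether its hot loop runs native f16 FMAs or the f32 fallback. A simplified paraphrase of that loop (not the verbatim `ggml.c` source; the helper names are recalled from the source):

```c
// sum[] holds several vector accumulators; GGML_F16_VEC_* resolve to the
// f16 or f32 intrinsics shown earlier, depending on the feature macro.
inline static void ggml_vec_dot_f16(const int n, float * s,
                                    ggml_fp16_t * x, ggml_fp16_t * y) {
    GGML_F16_VEC sum[GGML_F16_ARR] = { GGML_F16_VEC_ZERO };
    GGML_F16_VEC ax, ay;

    const int np = (n & ~(GGML_F16_STEP - 1));   // full vector steps
    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ax     = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
            ay     = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
            sum[j] = GGML_F16_VEC_FMA(sum[j], ax, ay);
        }
    }

    ggml_float sumf = 0.0;
    GGML_F16_VEC_REDUCE(sumf, sum);          // horizontal sum of accumulators
    for (int i = np; i < n; ++i) {           // scalar tail
        sumf += GGML_FP16_TO_FP32(x[i]) * GGML_FP16_TO_FP32(y[i]);
    }
    *s = sumf;
}
```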
How to reproduce
Hardware
OnePlus 9 with a Snapdragon 888 chip.
build
after add add_compile_definitions(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
in CMakeList.txt
, build with belows script
```sh
cd examples/llava/android/build_64
../build_64.sh -DLLAMA_PERF=0
```
Bench script
Note: I modified the `dtype` and `m,n,k` values in the test as shown below, then ran:

```sh
./test-backend-ops perf -o MUL_MAT
```
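The modified case corresponds to an entry like the following in `tests/test-backend-ops.cpp`. This is a hypothetical sketch; the exact `test_mul_mat` constructor arguments may differ between revisions, with the order here following the `MUL_MAT(type_a, type_b, m, n, k, bs, nr)` label the tool prints.

```cpp
// f16 x f32 case matching the benchmark output below
test_cases.emplace_back(new test_mul_mat(GGML_TYPE_F16, GGML_TYPE_F32,
                                         2048, 67, 2048, {1, 1}, {1, 1}));
```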
Baseline perf (option OFF)
```
MUL_MAT(type_a=f32,type_b=f32,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 4 runs - 16024.25 us/run - 2195992 kB/run - 130.69 GB/s
MUL_MAT(type_a=f16,type_b=f32,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 6 runs - 35947.83 us/run - 1647128 kB/run - 43.70 GB/s
MUL_MAT(type_a=f16,type_b=f16,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 8 runs - 34055.50 us/run - 1098264 kB/run - 30.76 GB/s
```
Perf with the `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` option ON
```
MUL_MAT(type_a=f32,type_b=f32,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 4 runs - 16813.25 us/run - 2195992 kB/run - 124.56 GB/s
MUL_MAT(type_a=f16,type_b=f32,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 6 runs - 174329.17 us/run - 1647128 kB/run - 9.01 GB/s
MUL_MAT(type_a=f16,type_b=f16,m=2048,n=67,k=2048,bs=[1,1],nr=[1,1]): 8 runs - 172306.88 us/run - 1098264 kB/run - 6.08 GB/s
```
You shouldn't need to set this macro directly; it is set by the compiler when targeting an architecture that has FP16. If the compiler isn't setting it, it's because it thinks the target doesn't have the feature, and setting the macro explicitly is likely to cause problems.

Normally this would be set as a result of `-mcpu=native` when the compiler detects you're compiling for a CPU with FP16. Building natively on an Arm Linux system, I see `ggml_vec_dot_f16` using FP16 instructions. Are you cross-building for Android from an x86 host? You might need to carefully check your compiler options and use suitable `-march` or `-mcpu` options. But in general it's not a good idea to override the feature macros.
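For example, instead of force-defining the macro, one could pass a target-appropriate `-march` so the compiler enables the feature (and defines the macro) itself. A minimal sketch, assuming an aarch64 Android NDK build and a CPU implementing ARMv8.2-A FP16 such as the Snapdragon 888; the exact guard condition here is illustrative:

```cmake
# Let the compiler enable FP16 vector arithmetic for the target;
# it will then define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC on its own.
if (ANDROID AND CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64")
    add_compile_options(-march=armv8.2-a+fp16)
endif()
```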
Yeah, we cross-build for Android on an x86 host. Thanks for the reminder.
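One way to confirm what the toolchain actually enables is to dump its predefined macros for the chosen target. The compiler name below assumes an NDK toolchain wrapper for API level 24; adjust it to your NDK version and API level:

```sh
# If the -march setting is right, this should list
# __ARM_FEATURE_FP16_VECTOR_ARITHMETIC among the matches.
aarch64-linux-android24-clang -march=armv8.2-a+fp16 -dM -E -x c /dev/null | grep FP16
```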
This issue was closed because it has been inactive for 14 days since being marked as stale.