
Error: inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on x86_64 - better support for different x86_64 CPU instruction extensions

Open xiliuya opened this issue 2 years ago • 28 comments

When I compile with make, the following error occurs

inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’: target specific option mismatch
   52 | _mm256_cvtph_ps (__m128i __A)

The error is reported when executing cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -msse3 -c ggml.c -o ggml.o, but it does not occur when executing cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -msse3 -c ggml.c -o ggml.o. Must -mavx be used together with -mf16c?


OS: Arch Linux x86_64 Kernel: 6.1.18-1-lts

xiliuya avatar Mar 16 '23 04:03 xiliuya

Looks like a duplicate of #107. Can you please confirm you're running on native x86_64 and not emulated?

gjmulder avatar Mar 16 '23 11:03 gjmulder

Looks like a duplicate of #107. Can you please confirm you're running on native x86_64 and not emulated?

Yes, not in a virtual environment such as Docker. The error also occurs when I execute cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -msse3 -c ggml.c -o ggml.o on other machines.

xiliuya avatar Mar 16 '23 12:03 xiliuya

If it is Arch I'm guessing you're using a very recent g++ version. I know it compiles with g++10 under Debian and Ubuntu. We haven't collected data on other g++ versions.

gjmulder avatar Mar 16 '23 12:03 gjmulder

If it is Arch I'm guessing you're using a very recent g++ version. I know it compiles with g++10 under Debian and Ubuntu. We haven't collected data on other g++ versions.

I tried executing gcc-10 -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -c ggml.c -o ggml.o with gcc 10 on Arch Linux. The same error occurs:

/usr/lib/gcc/x86_64-pc-linux-gnu/10.4.0/include/f16cintrin.h:52:1: error: inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’: target specific option mismatch
   52 | _mm256_cvtph_ps (__m128i __A)
      | ^~~~~~~~~~~~~~~

The gcc version is as follows:

llama.cpp (master|✔) $ gcc-10 --version
gcc-10 (Arch Linux 10.4.0-1) 10.4.0
Copyright © 2020 Free Software Foundation, Inc.

I'm not sure if this is specific to Arch Linux.

xiliuya avatar Mar 16 '23 12:03 xiliuya

_mm256_cvtph_ps requires the F16C extension(?), see here

You need to add -mf16c to the build command
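
For reference, here is a tiny check of what the compiler actually enables (a sketch; it assumes GCC/Clang, which predefine __AVX__ and __F16C__ when the corresponding -m flags are in effect):

#include <stdio.h>

int main(void) {
#ifdef __AVX__
    puts("__AVX__ defined (-mavx in effect)");
#endif
#ifdef __F16C__
    puts("__F16C__ defined (-mf16c in effect): calls to _mm256_cvtph_ps can be inlined");
#else
    puts("__F16C__ not defined: calls to _mm256_cvtph_ps hit the always_inline error");
#endif
    return 0;
}

Compiling this once with cc -mavx check.c and once with cc -mavx -mf16c check.c shows which macro each flag turns on.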

Green-Sky avatar Mar 16 '23 12:03 Green-Sky

Are the g++ defaults different for Arch than for Debian-derived distros, or is it something else?

gjmulder avatar Mar 16 '23 12:03 gjmulder

I am using the provided Makefile, which sets those flags for you: https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L92

Green-Sky avatar Mar 16 '23 12:03 Green-Sky

Have the same issue using provided Makefile. Ubuntu 22.04 LTS, gcc-11.3.0, Xeon E5 2690

ReOT20 avatar Mar 16 '23 13:03 ReOT20

I believe that CPU supports only AVX, not AVX2. ~llama.cpp requires AVX2.~ (edit: this is wrong, see below.) (The Makefile should probably check for this and emit an error message instead of this random-sounding compiler error.)

j-f1 avatar Mar 16 '23 13:03 j-f1

I believe that CPU supports only AVX, not AVX2. llama.cpp requires AVX2.

No. When I execute cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -msse3 -c ggml.c -o ggml.o to generate the .o file and then run make, it runs normally.

xiliuya avatar Mar 16 '23 13:03 xiliuya

Try running:

grep -o 'avx2' /proc/cpuinfo

If it doesn’t print avx2 then AVX2 is not supported.
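
A roughly equivalent check from C, in case /proc/cpuinfo is not convenient (a sketch using the GCC/Clang builtin __builtin_cpu_supports; whether the "f16c" feature name is accepted may depend on the compiler version):

#include <stdio.h>

int main(void) {
    /* __builtin_cpu_supports queries the CPU the program runs on,
       not the flags the program was compiled with. */
    printf("avx:  %d\n", __builtin_cpu_supports("avx"));
    printf("avx2: %d\n", __builtin_cpu_supports("avx2"));
    /* "f16c" as a feature name is an assumption; older compilers may reject it. */
    printf("f16c: %d\n", __builtin_cpu_supports("f16c"));
    return 0;
}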

j-f1 avatar Mar 16 '23 13:03 j-f1

Try running:


grep -o 'avx2' /proc/cpuinfo

If it doesn’t print avx2 then AVX2 is not supported.

My CPU does not support AVX2, but it runs normally with the above method.

xiliuya avatar Mar 16 '23 13:03 xiliuya

I am using the provided Makefile, which sets those flags for you: https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L92

I understand, but my CPU does not support F16C, only AVX. Should the Makefile be modified?

xiliuya avatar Mar 16 '23 13:03 xiliuya

We should set DEFINES for each feature flag and decide which code to use inside ggml.c on a more granular level.
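
A minimal sketch of that idea (the macro and helper names here are illustrative, not the actual ggml.c symbols):

/* Illustrative only: gate each implementation on its own feature macro
   instead of assuming that __AVX__ implies __F16C__. */
#if defined(__AVX__) && defined(__F16C__)
    /* native fp16 -> fp32 conversion, 8 elements at a time */
    #define EXAMPLE_F16x8_LOAD(p) _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(p)))
#elif defined(__AVX__)
    /* AVX without F16C: hypothetical helper that converts 8 fp16 values in scalar
       code and returns them as a __m256, so the rest of the AVX path still works */
    #define EXAMPLE_F16x8_LOAD(p) example_cvtph_ps_scalar(p)
#endif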

Green-Sky avatar Mar 16 '23 13:03 Green-Sky

We should set DEFINES for each feature flag and decide which code to use inside ggml.c on a more granular level.

I made a patch and can now run make normally:

diff --git a/Makefile b/Makefile
index 1601079..cf4a536 100644
--- a/Makefile
+++ b/Makefile
@@ -90,6 +90,8 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
                F16C_M := $(shell grep "f16c " /proc/cpuinfo)
                ifneq (,$(findstring f16c,$(F16C_M)))
                        CFLAGS += -mf16c
+               else ifneq (,$(findstring avx,$(AVX1_M)))
+                       CFLAGS := $(filter-out -mavx,$(CFLAGS))
                endif
                SSE3_M := $(shell grep "sse3 " /proc/cpuinfo)
                ifneq (,$(findstring sse3,$(SSE3_M)))

xiliuya avatar Mar 16 '23 13:03 xiliuya

@gjmulder @xiliuya

I have this reported issue on my CPU. Apparently it has AVX, but no F16C (and no AVX2). It is a quite old, 10-15 year old Intel laptop CPU.

It is probably the case that some old CPUs have AVX while having no F16C.

I had this compilation issue on Windows with the latest Clang 16 when providing -march=native. As you know, -march=native tells the compiler to use all features of the current CPU, and here that enables the AVX feature but not F16C.

My compilation was fixed and the program worked (although not very fast) after I implemented these conversion functions myself and placed the following code inside the #elif defined(__AVX__) section of ggml.c:

#if defined(__AVX__) && !defined(__F16C__)
__m256 _mm256_cvtph_ps(__m128i x) {
    ggml_fp16_t const * src = (ggml_fp16_t const *)&x;
    float dst[8];
    for (int i = 0; i < 8; ++i)
        dst[i] = GGML_FP16_TO_FP32(src[i]);
    return _mm256_loadu_ps(dst); /* unaligned load: dst is not guaranteed 32-byte aligned */
}
__m128i _mm256_cvtps_ph(__m256 x, int imm) {
    (void)imm; /* the rounding-mode argument is ignored by this generic fallback */
    float const * src = (float const *)&x;
    ggml_fp16_t dst[8];
    for (int i = 0; i < 8; ++i)
        dst[i] = GGML_FP32_TO_FP16(src[i]);
    return _mm_loadu_si128((__m128i const *)dst);
}
#endif

If some C/C++ gurus know a faster implementation of these functions for AVX, please share it here.

For now I suggest that a volunteer put the fix above into the main branch, if the code above is alright.
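
One possible speedup for the AVX-only path is a precomputed fp16-to-fp32 lookup table (a sketch only; table_f16_to_f32 is a hypothetical 65536-entry table filled once at startup, not an existing ggml symbol):

#include <stdint.h>
#include <immintrin.h>

/* Hypothetical table: table_f16_to_f32[h] holds the float value of the fp16
   bit pattern h. It must be filled once at startup (65536 entries, 256 KiB). */
extern float table_f16_to_f32[1 << 16];

static inline __m256 cvtph_ps_via_table(__m128i x) {
    uint16_t h[8];
    _mm_storeu_si128((__m128i *)h, x);
    /* _mm256_set_ps takes arguments from the highest element down to the lowest */
    return _mm256_set_ps(table_f16_to_f32[h[7]], table_f16_to_f32[h[6]],
                         table_f16_to_f32[h[5]], table_f16_to_f32[h[4]],
                         table_f16_to_f32[h[3]], table_f16_to_f32[h[2]],
                         table_f16_to_f32[h[1]], table_f16_to_f32[h[0]]);
}

This trades 256 KiB of memory for skipping the per-element conversion arithmetic; whether it is actually faster than the loop above would need measuring.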

polkovnikov avatar Mar 17 '23 19:03 polkovnikov

It would be great if @xiliuya and @polkovnikov could work together to both create a pull request with your patches so we can support a wider range of CPUs.

gjmulder avatar Mar 17 '23 20:03 gjmulder

(quoting the #if defined(__AVX__) && !defined(__F16C__) fallback implementation from the previous comment)

Yes, it works, but it doesn't perform as well on my machine as not using AVX. Test command: ./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello Bob" -t 15 -n 128. Before adding the code:

main: seed = 1679115214
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 15 / 16 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

main: prompt: ' Hello Bob'
main: number of tokens in prompt = 3
     1 -> ''
 15043 -> ' Hello'
  7991 -> ' Bob'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


 Hello Bob. Congratulations on your book "The Buzz About Bees". It is a very interesting read for adults and children alike because it shares lots of information about beekeeping in an entertainment way that makes you want to know more!
I would like congratulate you again, also with the good reviews from Amazon.com customers so far who have commented on your book saying they learned something new or were entertained by it (see below). I am impressed how quickly people started leaving comments when your publisher contacted them to ask for their thoughts after reading this "amazingly interesting"

main: mem per token = 14549044 bytes
main:     load time =  1696.78 ms
main:   sample time =   146.30 ms
main:  predict time = 279482.84 ms / 2149.87 ms per token
main:    total time = 289818.09 ms

After adding the code:

main: seed = 1679115582
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 15 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

main: prompt: ' Hello Bob'
main: number of tokens in prompt = 3
     1 -> ''
 15043 -> ' Hello'
  7991 -> ' Bob'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


 Hello Bob! I love your site, and read it every day. But today's article about the H-Sphere seemed a bit confusing to me...
http://www.mightyneato.com/2015_498673926368-1b6a8fbdcbeeebdaaeccd2fe2caecac?utm_source=buffer&amp;utm_content&ltp=Buffer &gtp=Facebook
I would like to ask for clarification on some points: 1. When you say "this

main: mem per token = 14549044 bytes
main:     load time =  1679.01 ms
main:   sample time =   145.62 ms
main:  predict time = 519427.31 ms / 3995.59 ms per token
main:    total time = 537130.75 ms

xiliuya avatar Mar 18 '23 05:03 xiliuya

@xiliuya You got two different answers for three reasons:

  1. LLaMa uses a random seed value each time you run it, so it may produce different results even with the same program.

  2. Your query "Hello Bob" could mean anything, and it looks to me like both answers are good.

  3. The Intel F16C intrinsics support several rounding modes, while the code I use is a generic conversion that may round differently. This can change the last bit of the float-16 mantissa, which may lead the neural network to different answers.

Please try a more descriptive query; for example, instead of "Hello Bob" try "Building a website can be done in 10 simple steps:". Then see whether in both cases (before and after adding my code) the answers are good enough, i.e. whether they really describe the steps of building a real website.

polkovnikov avatar Mar 18 '23 11:03 polkovnikov

@xiliuya You got two different answers for three reasons:

1. LLaMa uses a random seed value each time you run it, so it may produce different results even with the same program.

2. Your query "Hello Bob" could mean anything, and it looks to me like both answers are good.

3. The Intel F16C intrinsics support several rounding modes, while the code I use is a generic conversion that may round differently. This can change the last bit of the float-16 mantissa, which may lead the neural network to different answers.

Please try a more descriptive query; for example, instead of "Hello Bob" try "Building a website can be done in 10 simple steps:". Then see whether in both cases (before and after adding my code) the answers are good enough, i.e. whether they really describe the steps of building a real website.

I tested it and there is no problem with the generated output. The problem is that generation has slowed down since AVX was turned on.

Does turning on AVX really make the output better?

xiliuya avatar Mar 18 '23 12:03 xiliuya

@xiliuya Turning on AVX changes only the speed of computation, not the quality of the answer. I don't know why the AVX version is 1.5 times slower in your case.

Maybe it is because AVX itself is not used that much here while F16C is used more, and in my code I emulate F16C with generic algorithms, which are slow. It could be that my generic version is somehow slower than the generic version of the non-AVX code, which would explain the slowdown.

polkovnikov avatar Mar 18 '23 12:03 polkovnikov

@xiliuya Turning on AVX changes only the speed of computation, not the quality of the answer. I don't know why the AVX version is 1.5 times slower in your case.

Maybe it is because AVX itself is not used that much here while F16C is used more, and in my code I emulate F16C with generic algorithms, which are slow. It could be that my generic version is somehow slower than the generic version of the non-AVX code, which would explain the slowdown.

There is nothing wrong with your code. I tried to turn off the relevant definition and got the same result.

diff --git a/ggml.c b/ggml.c
index 535c7b7..c6cc145 100644
--- a/ggml.c
+++ b/ggml.c
@@ -850,8 +850,9 @@ void dequantize_row_q4_1(const void * restrict x, float * restrict y, int k) {
 
 #elif defined(__AVX__)
 
-#define GGML_SIMD
-
+#if defined(__F16C__)
+    #define GGML_SIMD
+#endif
 // F32 AVX
 
 #define GGML_F32_STEP 32
@@ -908,8 +909,12 @@ void dequantize_row_q4_1(const void * restrict x, float * restrict y, int k) {
 #define GGML_F32Cx8             __m256
 #define GGML_F32Cx8_ZERO        _mm256_setzero_ps()
 #define GGML_F32Cx8_SET1(x)     _mm256_set1_ps(x)
-#define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
-#define GGML_F32Cx8_STORE(x, y) _mm_storeu_si128((__m128i *)(x), _mm256_cvtps_ph(y, 0))
+
+#if  defined(__F16C__)
+    #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
+    #define GGML_F32Cx8_STORE(x, y) _mm_storeu_si128((__m128i *)(x), _mm256_cvtps_ph(y, 0))
+#endif
+
 #define GGML_F32Cx8_FMA         GGML_F32x8_FMA
 #define GGML_F32Cx8_ADD         _mm256_add_ps
 #define GGML_F32Cx8_MUL         _mm256_mul_ps
@@ -918,8 +923,10 @@ void dequantize_row_q4_1(const void * restrict x, float * restrict y, int k) {
 #define GGML_F16_VEC                GGML_F32Cx8
 #define GGML_F16_VEC_ZERO           GGML_F32Cx8_ZERO
 #define GGML_F16_VEC_SET1           GGML_F32Cx8_SET1
-#define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
-#define GGML_F16_VEC_STORE(p, r, i) GGML_F32Cx8_STORE(p, r[i])
+#if  defined(__F16C__)
+    #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
+    #define GGML_F16_VEC_STORE(p, r, i) GGML_F32Cx8_STORE(p, r[i])
+#endif
 #define GGML_F16_VEC_FMA            GGML_F32Cx8_FMA
 #define GGML_F16_VEC_ADD            GGML_F32Cx8_ADD
 #define GGML_F16_VEC_MUL            GGML_F32Cx8_MUL

xiliuya avatar Mar 18 '23 12:03 xiliuya

@xiliuya The problem with your last patch is that it removes the use of AVX, and of any other SIMD, entirely: if the GGML_SIMD macro is not defined, only the generic non-SIMD code is used everywhere.

In my code, by contrast, only the two conversion functions are implemented as generic code, while the rest of the AVX path is still used, for example the _mm256_mul_ps() function.

So if I were to compare my patch and yours, I would choose my version, as it should be faster.

But other experts may disagree; we need more ideas here about ways to improve.

polkovnikov avatar Mar 18 '23 13:03 polkovnikov

Good work guys. I am not a C++ programmer...

I am however interested in performance. I'd ideally want the most performant CPU code for any arch.

gjmulder avatar Mar 18 '23 14:03 gjmulder

Good work guys. I am not a C++ programmer...

I am however interested in performance. I'd ideally want the most performant CPU code for any arch.

Thank you.

These are CPU flame graphs for the two builds:

  1. Using AVX (image: main)

  2. Without AVX (image: main_no)

xiliuya avatar Mar 19 '23 05:03 xiliuya

CFLAGS := $(filter-out -mavx,$(CFLAGS))

We should set DEFINES for each feature flag and decide which code to use inside ggml.c on a more granular level.

I made a patch and can now run make normally:

diff --git a/Makefile b/Makefile
index 1601079..cf4a536 100644
--- a/Makefile
+++ b/Makefile
@@ -90,6 +90,8 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
                F16C_M := $(shell grep "f16c " /proc/cpuinfo)
                ifneq (,$(findstring f16c,$(F16C_M)))
                        CFLAGS += -mf16c
+               else ifneq (,$(findstring avx,$(AVX1_M)))
+                       CFLAGS := $(filter-out -mavx,$(CFLAGS))
                endif
                SSE3_M := $(shell grep "sse3 " /proc/cpuinfo)
                ifneq (,$(findstring sse3,$(SSE3_M)))

This patch allowed me to successfully run the make command.

tommybutler avatar Mar 23 '23 20:03 tommybutler

I run into the same error when executing make on Ubuntu 22.04 x86_64 within a virtual machine launched by VirtualBox.

In the virtual machine, /proc/cpuinfo is missing the CPU flags F16C and FMA, but contains AVX, AVX2 and SSE3. Then I found that my host machine in fact supports all of the above flags, including F16C and FMA (checked with Coreinfo). So I just modified the Makefile to add the missing flags like this:

diff --git a/Makefile b/Makefile
index 98a2d85..1b0f28c 100644
--- a/Makefile
+++ b/Makefile
@@ -80,6 +80,8 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
                        CFLAGS += -mavx2
                endif
        else ifeq ($(UNAME_S),Linux)
+               CFLAGS += -mfma
+               CFLAGS += -mf16c
                AVX1_M := $(shell grep "avx " /proc/cpuinfo)
                ifneq (,$(findstring avx,$(AVX1_M)))
                        CFLAGS += -mavx

After that, I can run make without error and execute the compiled result just like below: ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -p 'Hello there'

Then I got this result:

main: seed = 1679824080
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml ctx size = 4273.34 MB
llama_model_load: mem required  = 6065.34 MB (+ 1026.00 MB per state)
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0


 Hello there and welcome to my personal web-site. I'm a graphic designer, illustrator & photographer based in London, UK. This site is primarily used as an online portfolio showcasing my work but I also use it to blog about what I am doing in my spare time (and the occasional rant!)
I'm available for freelance projects and always open to new opportunities. Please feel free to contact me with any project proposals or enquiries via my details page, email address is there as well as a Twitter link. [end of text]

llama_print_timings:        load time =  4220.08 ms
llama_print_timings:      sample time =    90.68 ms /   117 runs   (    0.78 ms per run)
llama_print_timings: prompt eval time =   673.42 ms /     3 tokens (  224.47 ms per token)
llama_print_timings:        eval time = 34385.98 ms /   116 runs   (  296.43 ms per run)
llama_print_timings:       total time = 40892.13 ms

BTW, I also noticed that CMakeLists.txt currently always enables F16C, FMA, AVX and AVX2 for x86 Linux, so I could just use cmake directly:

cmake .
make
cp bin/* ./

That also works well for me.

beaclnd avatar Mar 26 '23 10:03 beaclnd

@beaclnd92 The problem with your solution is that it just enables the F16C feature of the CPU. But my old CPU has only AVX and no F16C. So your solution works for some CPUs, but not for mine.

polkovnikov avatar Mar 26 '23 14:03 polkovnikov

@beaclnd92 The problem with your solution is that it just enables the F16C feature of the CPU. But my old CPU has only AVX and no F16C. So your solution works for some CPUs, but not for mine.

Yeah, it probably only works for a guest virtual machine running on a physical CPU that has the required SIMD features. Generally, it gives better performance compared to not enabling the features.

beaclnd avatar Mar 27 '23 08:03 beaclnd

Hi, I have the same inlining problem when using -mavx, running Linux Mint on an i7-2630QM with 16GB RAM, a pretty old laptop (13 years). The problem is that I'm not able to get AVX to be used, even though I know the CPU supports it. I tried @polkovnikov's patch; it lets make finish, but there is still no AVX and the prompt reply is really slow. Any idea?

RiccaDS avatar Mar 29 '23 22:03 RiccaDS