whisper.cpp icon indicating copy to clipboard operation
whisper.cpp copied to clipboard

Make failure - error: inlining failed in call to always_inline

Open JJGO opened this issue 1 year ago • 13 comments

Thanks for releasing this software, I'm trying to build this in a Debian system but make is failing. The Compiler versions are

❯ cc --version
cc (Debian 8.3.0-6) 8.3.0

❯ c++ --version
c++ (Debian 8.3.0-6) 8.3.0

When trying make base.en in the master branch the following error happens

❯ make base.en
cc  -I.              -O3 -std=c11   -pthread -mavx   -c ggml.c -o ggml.o
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
ggml.c: In function ‘ggml_vec_mad_f16’:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:73:1: error: inlining failed in call to always_inline ‘_mm256_cvtps_ph’: target specific option mismatch
 _mm256_cvtps_ph (__m256 __A, const int __I)
 ^~~~~~~~~~~~~~~
ggml.c:1024:9: note: called from here
         _mm_storeu_si128((__m128i*)(y + i + 24), _mm256_cvtps_ph(y3, 0));
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:73:1: error: inlining failed in call to always_inline ‘_mm256_cvtps_ph’: target specific option mismatch
 _mm256_cvtps_ph (__m256 __A, const int __I)
 ^~~~~~~~~~~~~~~
ggml.c:1023:9: note: called from here
         _mm_storeu_si128((__m128i*)(y + i + 16), _mm256_cvtps_ph(y2, 0));
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:73:1: error: inlining failed in call to always_inline ‘_mm256_cvtps_ph’: target specific option mismatch
 _mm256_cvtps_ph (__m256 __A, const int __I)
 ^~~~~~~~~~~~~~~
ggml.c:1022:9: note: called from here
         _mm_storeu_si128((__m128i*)(y + i + 8 ), _mm256_cvtps_ph(y1, 0));
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:73:1: error: inlining failed in call to always_inline ‘_mm256_cvtps_ph’: target specific option mismatch
 _mm256_cvtps_ph (__m256 __A, const int __I)
 ^~~~~~~~~~~~~~~
ggml.c:1021:9: note: called from here
         _mm_storeu_si128((__m128i*)(y + i + 0 ), _mm256_cvtps_ph(y0, 0));
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1014:14: note: called from here
         x3 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(x + i + 24)));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1013:14: note: called from here
         x2 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(x + i + 16)));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1012:14: note: called from here
         x1 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(x + i + 8 )));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1011:14: note: called from here
         x0 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(x + i + 0 )));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1009:14: note: called from here
         y3 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(y + i + 24)));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1008:14: note: called from here
         y2 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(y + i + 16)));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1007:14: note: called from here
         y1 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(y + i + 8 )));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1006:14: note: called from here
         y0 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(y + i + 0 )));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1006:14: note: called from here
         y0 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(y + i + 0 )));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1007:14: note: called from here
         y1 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(y + i + 8 )));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1008:14: note: called from here
         y2 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(y + i + 16)));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1009:14: note: called from here
         y3 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(y + i + 24)));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1011:14: note: called from here
         x0 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(x + i + 0 )));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1012:14: note: called from here
         x1 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(x + i + 8 )));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1013:14: note: called from here
         x2 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(x + i + 16)));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:52:1: error: inlining failed in call to always_inline ‘_mm256_cvtph_ps’: target specific option mismatch
 _mm256_cvtph_ps (__m128i __A)
 ^~~~~~~~~~~~~~~
ggml.c:1014:14: note: called from here
         x3 = _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)(x + i + 24)));
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:73:1: error: inlining failed in call to always_inline ‘_mm256_cvtps_ph’: target specific option mismatch
 _mm256_cvtps_ph (__m256 __A, const int __I)
 ^~~~~~~~~~~~~~~
ggml.c:1021:9: note: called from here
         _mm_storeu_si128((__m128i*)(y + i + 0 ), _mm256_cvtps_ph(y0, 0));
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:73:1: error: inlining failed in call to always_inline ‘_mm256_cvtps_ph’: target specific option mismatch
 _mm256_cvtps_ph (__m256 __A, const int __I)
 ^~~~~~~~~~~~~~~
ggml.c:1022:9: note: called from here
         _mm_storeu_si128((__m128i*)(y + i + 8 ), _mm256_cvtps_ph(y1, 0));
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:73:1: error: inlining failed in call to always_inline ‘_mm256_cvtps_ph’: target specific option mismatch
 _mm256_cvtps_ph (__m256 __A, const int __I)
 ^~~~~~~~~~~~~~~
ggml.c:1023:9: note: called from here
         _mm_storeu_si128((__m128i*)(y + i + 16), _mm256_cvtps_ph(y2, 0));
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/8/include/immintrin.h:99,
                 from ggml.c:128:
/usr/lib/gcc/x86_64-linux-gnu/8/include/f16cintrin.h:73:1: error: inlining failed in call to always_inline ‘_mm256_cvtps_ph’: target specific option mismatch
 _mm256_cvtps_ph (__m256 __A, const int __I)
 ^~~~~~~~~~~~~~~
ggml.c:1024:9: note: called from here
         _mm_storeu_si128((__m128i*)(y + i + 24), _mm256_cvtps_ph(y3, 0));
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make: *** [Makefile:127: ggml.o] Error 1

JJGO avatar Nov 28 '22 18:11 JJGO

I commented out the avx lines in the makefile, like this:

diff --git a/Makefile b/Makefile
index 07c91c1..b89d835 100644
--- a/Makefile
+++ b/Makefile
@@ -62,10 +62,10 @@ ifeq ($(UNAME_M),x86_64)
								endif
				endif
				ifeq ($(UNAME_S),Linux)
-               AVX1_M := $(shell grep "avx " /proc/cpuinfo)
-               ifneq (,$(findstring avx,$(AVX1_M)))
-                       CFLAGS += -mavx
-               endif
+               #AVX1_M := $(shell grep "avx " /proc/cpuinfo)
+               #ifneq (,$(findstring avx,$(AVX1_M)))
+               #       CFLAGS += -mavx
+               #endif
								AVX2_M := $(shell grep "avx2 " /proc/cpuinfo)
								ifneq (,$(findstring avx2,$(AVX2_M)))
												CFLAGS += -mavx2

and that made it compile under Debian for me.

sachac avatar Nov 29 '22 21:11 sachac

Thanks @sachac , that did the trick. Makefile will need some extra logic to be patched I assume

JJGO avatar Nov 30 '22 13:11 JJGO

I also get this error on my Debian 11 server, with Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz ( no AVX2 only AVX )

Above patch to makefile allows compilation without AVX, however im struggling to compile with AVX support. happy to help where I can.

jaybinks avatar Dec 02 '22 09:12 jaybinks

I am also seeing this on Fedora 36. The Makefile patch mentioned above worked for me.

aryker avatar Dec 02 '22 14:12 aryker

@jaybinks Can you post the output of cat /proc/cpuinfo?

ggerganov avatar Dec 02 '22 18:12 ggerganov

@ggerganov cpu info as requested.

My CPU is a sandybridge architecture, however it shows avx instructions in the below output.

processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz stepping : 7 microcode : 0x14 cpu MHz : 3800.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown bogomips : 6822.28 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:

processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz stepping : 7 microcode : 0x14 cpu MHz : 3800.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 1 cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown bogomips : 6822.28 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:

processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz stepping : 7 microcode : 0x14 cpu MHz : 3800.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 2 cpu cores : 4 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown bogomips : 6822.28 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:

processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz stepping : 7 microcode : 0x14 cpu MHz : 3286.080 cache size : 8192 KB physical id : 0 siblings : 8 core id : 3 cpu cores : 4 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown bogomips : 6822.28 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:

processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz stepping : 7 microcode : 0x14 cpu MHz : 2772.126 cache size : 8192 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown bogomips : 6822.28 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:

processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz stepping : 7 microcode : 0x14 cpu MHz : 2313.449 cache size : 8192 KB physical id : 0 siblings : 8 core id : 1 cpu cores : 4 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown bogomips : 6822.28 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:

processor : 6 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz stepping : 7 microcode : 0x14 cpu MHz : 3800.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 2 cpu cores : 4 apicid : 5 initial apicid : 5 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown bogomips : 6822.28 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:

processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz stepping : 7 microcode : 0x14 cpu MHz : 3800.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 3 cpu cores : 4 apicid : 7 initial apicid : 7 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts vmx flags : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown bogomips : 6822.28 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:

jaybinks avatar Dec 03 '22 01:12 jaybinks

gcc -I. -mavx -march=haswell -pthread -c ggml.c -o ggml.o works as expected

gcc -I. -mavx -march=sandybridge -pthread -c ggml.c -o ggml.o or gcc -I. -mavx -march=native -pthread -c ggml.c -o ggml.o fails as above.

jaybinks avatar Dec 03 '22 02:12 jaybinks

ok this starts to get interesting .... dmidecode does NOT show AVX support !?

I found this thread... even though its about FreeBSD, its exactly what I'm seeing (https://www.dslreports.com/forum/r25939252-AVX-on-i7-2600k-and-ASUS-P8Z68-V-PRO)

dmidecode -tprocessor
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 2.6 present.

Handle 0x0004, DMI type 4, 42 bytes
Processor Information
	Socket Designation: LGA1155
	Type: Central Processor
	Family: Core 2 Duo
	Manufacturer: Intel            
	ID: A7 06 02 00 FF FB EB BF
	Signature: Type 0, Family 6, Model 42, Stepping 7
	Flags:
		FPU (Floating-point unit on-chip)
		VME (Virtual mode extension)
		DE (Debugging extension)
		PSE (Page size extension)
		TSC (Time stamp counter)
		MSR (Model specific registers)
		PAE (Physical address extension)
		MCE (Machine check exception)
		CX8 (CMPXCHG8 instruction supported)
		APIC (On-chip APIC hardware supported)
		SEP (Fast system call)
		MTRR (Memory type range registers)
		PGE (Page global enable)
		MCA (Machine check architecture)
		CMOV (Conditional move instruction supported)
		PAT (Page attribute table)
		PSE-36 (36-bit page size extension)
		CLFSH (CLFLUSH instruction supported)
		DS (Debug store)
		ACPI (ACPI supported)
		MMX (MMX technology supported)
		FXSR (FXSAVE and FXSTOR instructions supported)
		SSE (Streaming SIMD extensions)
		SSE2 (Streaming SIMD extensions 2)
		SS (Self-snoop)
		HTT (Multi-threading)
		TM (Thermal monitor supported)
		PBE (Pending break enabled)
	Version: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz       
	Voltage: 1.0 V
	External Clock: 100 MHz
	Max Speed: 3800 MHz
	Current Speed: 3439 MHz
	Status: Populated, Enabled
	Upgrade: Other
	L1 Cache Handle: 0x0005
	L2 Cache Handle: 0x0006
	L3 Cache Handle: 0x0007
	Serial Number: To Be Filled By O.E.M.
	Asset Tag: To Be Filled By O.E.M.
	Part Number: To Be Filled By O.E.M.
	Core Count: 4
	Core Enabled: 1
	Thread Count: 2
	Characteristics:
		64-bit capable

jaybinks avatar Dec 03 '22 23:12 jaybinks

@aryker what CPU do you have ?

jaybinks avatar Dec 18 '22 10:12 jaybinks

Interestingly... using intrin.sh from https://stackoverflow.com/questions/43128698/inlining-failed-in-call-to-always-inline-mm-mullo-epi32-target-specific-opti suggests we want -mf16c not -mavx , does this make sense ?

doesn't work for me on my machine ... it compiles but I get "Illegal instruction" at runtime.

A real head-scratcher, since I should have AVX support.

jaybinks avatar Dec 18 '22 10:12 jaybinks

yea so this is basically the same as https://github.com/ggerganov/whisper.cpp/issues/238

Sandy Bridge CPU ... no F16c support ... ok... ill give up and be happy with my little win from -Ofast

jaybinks avatar Dec 18 '22 11:12 jaybinks

@aryker what CPU do you have ?

Here's my cpuinfo file:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping	: 7
microcode	: 0x2f
cpu MHz		: 800.000
cache size	: 3072 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips	: 4984.27
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping	: 7
microcode	: 0x2f
cpu MHz		: 959.352
cache size	: 3072 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips	: 4984.27
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping	: 7
microcode	: 0x2f
cpu MHz		: 800.000
cache size	: 3072 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 2
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips	: 4984.27
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping	: 7
microcode	: 0x2f
cpu MHz		: 1507.355
cache size	: 3072 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 2
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips	: 4984.27
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

aryker avatar Dec 21 '22 12:12 aryker

@aryker This CPU has AVX but does not have F16C. This combination is currently not supported in ggml.c. The Makefile has to be adjusted to not add the -mavx flag if F16C is not available:

https://github.com/ggerganov/whisper.cpp/blob/4c1fe0c813135fbccc487b85f024e633a911aeba/Makefile#L53-L104

ggerganov avatar Dec 22 '22 16:12 ggerganov

Same problem here on an AMD Opteron(TM) Processor 6220 AMD flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid ex td_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 top oext perfctr_core perfctr_nb cpb hw_pstate ssbd vmmcall arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

AVX supported but F16C is not. Compiled without optimization and is being incredibly slow, will linking to openblas make this better? Thanks

nico202 avatar Mar 31 '23 16:03 nico202

I can report similar issue here. On Xeon(R) CPU E5-2670 running Proxmox 7.4 i also had to disable "-mavx" to get code compiled.

I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -msse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:
I CC:       cc (Debian 10.2.1-6) 10.2.1 20210110
I CXX:      g++ (Debian 10.2.1-6) 10.2.1 20210110

cpuinfo tells AVX but no F16C:

processor	: 31
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping	: 7
microcode	: 0x718
cpu MHz		: 3300.000
cache size	: 20480 KB
physical id	: 1
siblings	: 16
core id		: 7
cpu cores	: 8
apicid		: 47
initial apicid	: 47
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips	: 5202.46
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

Doing that everything is slow both on 4 and 32 threads ./bench -m ./models/ggml-medium.bin

system_info: n_threads = 4 / 32 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

whisper_print_timings:     load time =  1107.48 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 191019.75 ms /     1 runs (191019.75 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 192274.55 ms
system_info: n_threads = 32 / 32 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

whisper_print_timings:     load time =  1100.58 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 54810.91 ms /     1 runs (54810.91 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 56057.46 ms

That makes my i3-7100 looking like a fast machine! ;)

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

whisper_print_timings:     load time =   396.43 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 38768.12 ms /     1 runs (38768.12 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 39215.11 ms
``

AlmightyFrog avatar Apr 02 '23 00:04 AlmightyFrog