whisper.cpp '-Ofast' and '-march=native' provide significant speedup

Open ttsiodras opened this issue 1 year ago • 10 comments

'-Ofast' and '-march=native' yield a 2x speedup on machines that have SSE (but no AVX) instructions. They should help on other platforms, too.

Dec 10 '22 11:12 ttsiodras

See https://github.com/ggerganov/whisper.cpp/issues/251 for details.

Dec 10 '22 11:12 ttsiodras

I get:

c++: error: unrecognized command-line option ‘-march=native’; did you mean ‘-mcpu=native’?

idk why compilers can't standardise this stuff, but I guess it should be arch-conditional.

Dec 10 '22 20:12 luke-jr

> I get:
> c++: error: unrecognized command-line option ‘-march=native’; did you mean ‘-mcpu=native’?
> idk why compilers can't standardise this stuff, but I guess it should be arch-conditional.

From the official GCC documentation ( https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html ):

-mcpu=cpu-type

A deprecated synonym for -mtune=cpu-type

So the compiler you tried this on, Luke, is probably a rather old version of GCC.

In fact, when I try -mcpu=native in my machine, I get:

cc  -I.              -O3 -std=c11   -fPIC  -Ofast -mcpu=native -pthread   -c ggml.c -o ggml.o
cc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead

Specs of my test: Arch Linux on a Celeron N5095, with GCC 12.2. The deprecation of -mcpu goes quite a ways back, GCC-version wise. As for the speed difference between -march=native and -mtune=native, on my machine there is none.

Dec 10 '22 20:12 ttsiodras

That was with GCC 11.3.0 for ppc64le. There is no -march on this platform at all.

Dec 10 '22 20:12 luke-jr

Well, like all things, this is a balancing act...

The usual autoconf/automake machinery can be used, to have "./configure" emit a Makefile that uses whatever options apply best to the current machine. I can do that for whisper.cpp, if @ggerganov is OK with the involved complexity.

But as-is, -Ofast -march=native will work on all Intel/AMD/ARM machines with a decade-old GCC. A quick Google search shows that -mcpu has been deprecated since 2004! ( https://forums.gentoo.org/viewtopic-t-222477-start-0.html )

Dec 10 '22 20:12 ttsiodras

So I'm not 100% sure what to do here. Btw, I've already done experiments with -ffast-math and -march flags:

https://github.com/ggerganov/whisper.cpp/blob/ea38ad6e70e2b4bd0c1a79f3e2cbfd99ad9393c3/CMakeLists.txt#L77-L80

On my MacBook, building with stock clang, the -march=native flag is not recognised:

$ make
clang: error: the clang compiler does not support '-march=native'

$ clang -v
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.1.0
Thread model: posix

Using -Ofast is roughly equivalent to -O3 -ffast-math. Using -ffast-math does bring a ~10% performance gain even on my machine. I am also aware of what -ffast-math does and of its "side-effects" on the computation, and at the moment I don't think adding it really hurts. It would become a problem if someday we wanted whisper.cpp to produce exactly the same results across different CPUs, which is not the case today.
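
To make the "side-effects" concrete: one thing -ffast-math permits is reassociating floating-point operations, and floating-point addition is not associative. The snippet below is only an illustration of that property in plain Python (IEEE 754 doubles), not a reproduction of what any compiler does to whisper.cpp; the function name and the input values are made up for the demonstration.

```python
def sum_in_order(values):
    """Add the values strictly left to right, as written."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same three numbers, summed in two different orders.
as_written = sum_in_order([1e16, 1.0, -1e16])   # (1e16 + 1.0) + (-1e16)
reordered = sum_in_order([1e16, -1e16, 1.0])    # (1e16 - 1e16) + 1.0

# The 1.0 is lost in the first order because 1e16 + 1.0 rounds back to 1e16.
print("as written:", as_written)
print("reordered: ", reordered)
print("identical: ", as_written == reordered)
```

A compiler that is allowed to reorder a sum can therefore produce a different result than the strictly ordered one, and which order is picked may depend on the target's vector width. That is why bit-identical output across CPUs and fast-math do not mix well.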

Let's think about this some more. Maybe we can hear more points of view on this topic and get better insight.

Dec 11 '22 11:12 ggerganov

> So I'm not 100% sure what to do here.

Well, this is what autoconf/automake were built for: to pick the best compilation options possible for the specific target we are building on. IMHO it's a shame to leave a 2x speedup on the table...

I could write the necessary configure.ac/Makefile.am (the sources for autoconf/automake-based builds). We would then automatically get a configure script, that would try a series of compilation options and build a Makefile tailor-made for the machine we work in. Would that be acceptable? Or are you opposed to autoconf/automake?
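
As a rough idea of what "try a series of compilation options" means, here is a minimal sketch in Python rather than autoconf. It compiles an empty C program with each candidate flag and keeps the first alternative the compiler accepts. The names `compiler_accepts`, `pick_flags` and `CANDIDATES`, and the choice of `cc` as the compiler command, are assumptions for illustration; this is not the actual configure.ac logic.

```python
import os
import shutil
import subprocess
import tempfile


def compiler_accepts(flag, cc="cc"):
    """Return True if `cc` compiles an empty C program when given `flag`."""
    if shutil.which(cc) is None:
        return False  # no compiler found, so no flag can be accepted
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        with open(src, "w") as f:
            f.write("int main(void) { return 0; }\n")
        result = subprocess.run(
            [cc, flag, "-c", src, "-o", os.path.join(tmp, "probe.o")],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # Note: a deprecation warning still exits with 0 and counts as accepted.
        return result.returncode == 0


def pick_flags(candidates, accepts=compiler_accepts):
    """From each group of alternatives, keep the first flag that is accepted."""
    chosen = []
    for group in candidates:
        for flag in group:
            if accepts(flag):
                chosen.append(flag)
                break
    return chosen


# Alternatives in order of preference (hypothetical list for this sketch).
CANDIDATES = [
    ["-Ofast", "-O3"],
    ["-march=native", "-mcpu=native"],
]

if __name__ == "__main__":
    print(" ".join(pick_flags(CANDIDATES)))
```

The `accepts` parameter is injectable so the selection logic can be exercised without a compiler, for example by simulating a toolchain that rejects -march=native.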

Dec 11 '22 15:12 ttsiodras

Just one more note: in GCC land, you can ask the compiler to emit instruction-set-specific versions of the functions, and dispatch appropriately at run-time, based on the machine we run on: https://github.com/ttsiodras/MandelbrotSSE/blob/master/src/xaos.cc#L31 I used that to get maximum flexibility in there - worked like a charm. I don't know if clang supports that, though.

Dec 11 '22 16:12 ttsiodras

In order for you to have more information on the autoconf/automake decision, I just pushed a few commits - you can try it out and decide for yourself.

For now, I only implemented SSE-checking ( https://github.com/ttsiodras/whisper.cpp/blob/master/configure.ac#L83 ) but I hope the pattern is clear enough. To add support for any other instructions you want, you just add the relevant assembly check, and then emit the compiler option you want. You also get a #define inside the auto-generated config.h, so you can make more compile-time decisions with #ifdef in your C/C++ code.
I also added SDL2 checks, since I saw 2 of your binaries needed it. They detect and use SDL2 fine in my tests here.

To see for yourself: after you clone my version of the repo, launch ./configure and make.

To modify the logic, edit configure.ac and/or Makefile.am - then launch ./bootstrap. This simply invokes autoreconf and automake, creating an updated version of the ./configure machinery.

Dec 11 '22 16:12 ttsiodras

+1 to autotools. That would also make it simpler to libtoolise the library and make the examples link to it.

Dec 11 '22 17:12 luke-jr

@ttsiodras Thanks for the effort, but the automake stuff is not for this project: it's too complicated.

I did a few tests with and without -Ofast -march=native on different machines and here are the results:

-O3

CPU	OS	Config	Model	Th	Load	Enc.	Commit
Ryzen 9 5950X	Ubuntu 22.04	AVX2	tiny	4	137	297	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	base	4	183	665	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	small	4	373	2328	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	medium	4	923	7346	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	large	4	1681	14053	b8065d9
---
Ryzen 9 3900X	Ubuntu 20.04	AVX2	tiny	4	122	572	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	base	4	153	1303	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	small	4	305	4844	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	medium	4	750	16117	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	large	4	1331	37618	b8065d9
---
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	tiny	4	68	170	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	base	4	97	327	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	small	4	221	1069	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	medium	4	581	2873	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	large	4	1170	5173	b8065d9

-Ofast -march=native

CPU	OS	Config	Model	Th	Load	Enc.	Commit
Ryzen 9 5950X	Ubuntu 22.04	AVX2	tiny	4	137	320	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	base	4	180	721	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	small	4	366	2554	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	medium	4	900	8181	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	large	4	1614	15679	b8065d9
---
Ryzen 9 3900X	Ubuntu 20.04	AVX2	tiny	4	123	558	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	base	4	154	1289	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	small	4	308	4775	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	medium	4	749	16576	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	large	4	1320	30650	b8065d9
---
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	tiny	4	69	154	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	base	4	94	291	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	small	4	219	948	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	medium	4	610	2582	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	large	4	1205	4692	b8065d9

Th is the thread count; Load and Enc. are the model load and encoder times. Lower Enc. is better.

On Ryzen 9 5950X these flags actually make the performance worse by ~10%
On Ryzen 9 3900X there is ~20% improvement on the large model and almost no improvement on the other models
On MacBook M1 Pro there is ~10% improvement across all models
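
The percentages above can be rechecked from the Enc. columns. The following is a small sketch that copies the Enc. values from the two tables and computes the relative change per model; the dictionary layout and function names are just one way to organise it.

```python
MODELS = ["tiny", "base", "small", "medium", "large"]

# Enc. values copied from the two tables, in the order of MODELS.
ENC = {
    "Ryzen 9 5950X": {
        "-O3": [297, 665, 2328, 7346, 14053],
        "-Ofast -march=native": [320, 721, 2554, 8181, 15679],
    },
    "Ryzen 9 3900X": {
        "-O3": [572, 1303, 4844, 16117, 37618],
        "-Ofast -march=native": [558, 1289, 4775, 16576, 30650],
    },
    "MacBook M1 Pro": {
        "-O3": [170, 327, 1069, 2873, 5173],
        "-Ofast -march=native": [154, 291, 948, 2582, 4692],
    },
}


def percent_change(baseline, candidate):
    """Relative change in percent; negative means the candidate is faster."""
    return (candidate - baseline) / baseline * 100.0


def enc_changes(cpu):
    """Per-model change in Enc. when going from -O3 to -Ofast -march=native."""
    base = ENC[cpu]["-O3"]
    fast = ENC[cpu]["-Ofast -march=native"]
    return {m: percent_change(b, f) for m, b, f in zip(MODELS, base, fast)}


for cpu in ENC:
    row = ", ".join(f"{m}: {c:+.1f}%" for m, c in enc_changes(cpu).items())
    print(f"{cpu}: {row}")
```

Positive numbers mean Enc. got slower with the extra flags, negative numbers mean it got faster.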

Given these results, I don't think it is crucial to have these flags. Sometimes they help, sometimes they don't. Even if the benefit on a no-AVX CPU is as big as 2 times, I still don't think it is necessary to add them in general.

So I think for now, I will leave the existing Makefile as it is.

Dec 16 '22 18:12 ggerganov
