whisper.cpp
'-Ofast' and '-march=native' provide significant speedup
'-Ofast' and '-march=native' give a 2x speedup on machines with SSE (but no AVX) instructions. They should help on other platforms, too.
See https://github.com/ggerganov/whisper.cpp/issues/251 for details.
I get:
c++: error: unrecognized command-line option ‘-march=native’; did you mean ‘-mcpu=native’?
idk why compilers can't standardise this stuff, but I guess it should be arch-conditional.
From the official GCC documentation ( https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html ):
-mcpu=cpu-type
A deprecated synonym for -mtune=cpu-type
So the compiler you tried this on, Luke, is probably a rather old version of GCC.
In fact, when I try -mcpu=native on my machine, I get:
cc -I. -O3 -std=c11 -fPIC -Ofast -mcpu=native -pthread -c ggml.c -o ggml.o
cc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead
Specs of my test: Arch Linux on a Celeron N5095, with GCC 12.2. The deprecation of -mcpu goes quite a way back, GCC-version-wise. As for the speed difference between -march=native and -mtune=native, on my machine there is none.
That was with GCC 11.3.0 for ppc64le. There is no -march on this platform at all.
Well, like all things, this is a balancing act...
The usual autoconf/automake machinery can be used, to have "./configure" emit a Makefile that uses whatever options apply best to the current machine. I can do that for whisper.cpp, if @ggerganov is OK with the involved complexity.
But as-is, -Ofast -march=native will work on all Intel/AMD/ARM machines with a decade-old GCC. A quick Google search shows that -mcpu has been deprecated since 2004! ( https://forums.gentoo.org/viewtopic-t-222477-start-0.html )
So I'm not 100% sure what to do here.
Btw, I've already done experiments with the -ffast-math and -march flags:
https://github.com/ggerganov/whisper.cpp/blob/ea38ad6e70e2b4bd0c1a79f3e2cbfd99ad9393c3/CMakeLists.txt#L77-L80
On my MacBook, building with stock clang, the -march flag is not recognised:
$ make
clang: error: the clang compiler does not support '-march=native'
$ clang -v
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.1.0
Thread model: posix
Using -Ofast is essentially equivalent to -O3 -ffast-math.
Using -ffast-math does bring a ~10% performance gain even on my machine. I am also aware of what -ffast-math does and of its side effects on the computation, and at the moment I don't think adding it really hurts. It will become a problem if someday we want whisper.cpp to produce exactly the same results across different CPUs; this is not the case today.
Let's think about this some more. Maybe we can hear more points of view on this topic and get better insight.
So I'm not 100% sure what to do here.
Well, this is what autoconf/automake were built for: to pick the best compilation options possible for the specific target we are building on. IMHO it's a shame to leave a 2x speedup on the table...
I could write the necessary configure.ac/Makefile.am (the sources for autoconf/automake-based builds). We would then automatically get a configure script that would try a series of compilation options and build a Makefile tailor-made for the machine we work on. Would that be acceptable? Or are you opposed to autoconf/automake?
Just one more note: in GCC land, you can ask the compiler to emit instruction-set-specific versions of the functions, and dispatch appropriately at run-time, based on the machine we run on: https://github.com/ttsiodras/MandelbrotSSE/blob/master/src/xaos.cc#L31 I used that to get maximum flexibility in there - worked like a charm. I don't know if clang supports that, though.
In order for you to have more information on the autoconf/automake decision, I just pushed a few commits - you can try it out and decide for yourself.
- For now, I only implemented SSE-checking ( https://github.com/ttsiodras/whisper.cpp/blob/master/configure.ac#L83 ) but I hope the pattern is clear enough. To add support for any other instructions you want, you just add the relevant assembly check, and then emit the compiler option you want. You also get a #define inside the auto-generated config.h, so you can make more compile-time decisions with #ifdef in your C/C++ code.
- I also added SDL2 checks, since I saw that 2 of your binaries need it. They detect and use SDL2 fine in my tests here.
To see for yourself: after you clone my version of the repo, launch ./configure and make.
To modify the logic, edit configure.ac and/or Makefile.am, then launch ./bootstrap. This simply invokes autoreconf and automake, creating an updated version of the ./configure machinery.
+1 to autotools. That would also make it simpler to libtoolise the library and make the examples link to it.
@ttsiodras
Thanks for the effort, but the automake stuff is not for this project: it's too complicated.
I did a few tests with and without -Ofast -march=native on different machines and here are the results:
Built with -O3:
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | tiny | 4 | 137 | 297 | b8065d9 |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | base | 4 | 183 | 665 | b8065d9 |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | small | 4 | 373 | 2328 | b8065d9 |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | medium | 4 | 923 | 7346 | b8065d9 |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | large | 4 | 1681 | 14053 | b8065d9 |
--- | | | | | | | |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | tiny | 4 | 122 | 572 | b8065d9 |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | base | 4 | 153 | 1303 | b8065d9 |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | small | 4 | 305 | 4844 | b8065d9 |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | medium | 4 | 750 | 16117 | b8065d9 |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | large | 4 | 1331 | 37618 | b8065d9 |
--- | | | | | | | |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 4 | 68 | 170 | b8065d9 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 4 | 97 | 327 | b8065d9 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 4 | 221 | 1069 | b8065d9 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 4 | 581 | 2873 | b8065d9 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 4 | 1170 | 5173 | b8065d9 |
Built with -Ofast -march=native:
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | tiny | 4 | 137 | 320 | b8065d9 |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | base | 4 | 180 | 721 | b8065d9 |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | small | 4 | 366 | 2554 | b8065d9 |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | medium | 4 | 900 | 8181 | b8065d9 |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | large | 4 | 1614 | 15679 | b8065d9 |
--- | | | | | | | |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | tiny | 4 | 123 | 558 | b8065d9 |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | base | 4 | 154 | 1289 | b8065d9 |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | small | 4 | 308 | 4775 | b8065d9 |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | medium | 4 | 749 | 16576 | b8065d9 |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | large | 4 | 1320 | 30650 | b8065d9 |
--- | | | | | | | |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 4 | 69 | 154 | b8065d9 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 4 | 94 | 291 | b8065d9 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 4 | 219 | 948 | b8065d9 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 4 | 610 | 2582 | b8065d9 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 4 | 1205 | 4692 | b8065d9 |
Lower Enc. is better.
- On Ryzen 9 5950X these flags actually make the performance worse by ~10%
- On Ryzen 9 3900X there is ~20% improvement on the large model and almost no improvement on the other models
- On MacBook M1 Pro there is ~10% improvement across all models
Given these results, I don't think it is crucial to have these flags. Sometimes they help, sometimes they don't. Even if the benefit on a no-AVX CPU is as big as 2 times, I still don't think it is necessary to add them in general.
So I think for now, I will leave the existing Makefile as it is.