
'-Ofast' and '-march=native' provide significant speedup

Open ttsiodras opened this issue 1 year ago • 10 comments

'-Ofast' and '-march=native' give a 2x speedup on machines with SSE (but no AVX) instructions. They should help other platforms, too.

ttsiodras avatar Dec 10 '22 11:12 ttsiodras

See https://github.com/ggerganov/whisper.cpp/issues/251 for details.

ttsiodras avatar Dec 10 '22 11:12 ttsiodras

I get:

c++: error: unrecognized command-line option ‘-march=native’; did you mean ‘-mcpu=native’?

idk why compilers can't standardise this stuff, but I guess it should be arch-conditional.

luke-jr avatar Dec 10 '22 20:12 luke-jr

> I get:
>
> c++: error: unrecognized command-line option ‘-march=native’; did you mean ‘-mcpu=native’?
>
> idk why compilers can't standardise this stuff, but I guess it should be arch-conditional.

From the official GCC documentation ( https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html ):

-mcpu=cpu-type

A deprecated synonym for -mtune=cpu-type

So the compiler you tried this on, Luke, is probably a rather old version of GCC.

In fact, when I try -mcpu=native on my machine, I get:

cc  -I.              -O3 -std=c11   -fPIC  -Ofast -mcpu=native -pthread   -c ggml.c -o ggml.o
cc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead

Specs of my test: Arch Linux on a Celeron N5095, with GCC 12.2. The deprecation of -mcpu goes back a long way, GCC-version-wise. As for the speed difference between -march=native and -mtune=native: on my machine there is none.

ttsiodras avatar Dec 10 '22 20:12 ttsiodras

That was with GCC 11.3.0 for ppc64le. There is no -march on this platform at all.
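To make "arch-conditional" concrete, here is a minimal sketch in C that uses the compiler's predefined target macros to choose between the two spellings reported so far in this thread. The function names are hypothetical, and the mapping covers only what was observed above (GCC on x86 and GCC on ppc64le); any other target gets no flag at all rather than a guess.

```c
#include <stdio.h>

/* Hypothetical helper: returns the "tune for this machine" flag reported
 * to work for the target this file is compiled for, or "" when unsure. */
const char *native_tuning_flag(void) {
#if defined(__powerpc__) || defined(__powerpc64__) || defined(__PPC64__)
    return "-mcpu=native";   /* GCC on ppc64le has no -march, per the report above */
#elif defined(__x86_64__) || defined(__i386__)
    return "-march=native";  /* accepted by GCC on x86 */
#else
    return "";               /* unknown target: add nothing rather than guess */
#endif
}

/* Prints the flag on its own line so a build script could capture it. */
void print_native_tuning_flag(void) {
    printf("%s\n", native_tuning_flag());
}
```

A build script could compile and run such a probe once and append its output to CFLAGS. The same decision can of course be made directly in a Makefile or CMakeLists.txt.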

luke-jr avatar Dec 10 '22 20:12 luke-jr

Well, like all things, this is a balancing act...

The usual autoconf/automake machinery can be used, to have "./configure" emit a Makefile that uses whatever options apply best to the current machine. I can do that for whisper.cpp, if @ggerganov is OK with the involved complexity.

But as-is, -Ofast -march=native will work on all Intel/AMD/ARM machines with a decade-old GCC. A quick Google search shows that -mcpu has been deprecated since 2004! ( https://forums.gentoo.org/viewtopic-t-222477-start-0.html )

ttsiodras avatar Dec 10 '22 20:12 ttsiodras

So I'm not 100% sure what to do here. Btw, I've already done experiments with -ffast-math and -march flags:

https://github.com/ggerganov/whisper.cpp/blob/ea38ad6e70e2b4bd0c1a79f3e2cbfd99ad9393c3/CMakeLists.txt#L77-L80

On my MacBook, building with stock clang, the compiler does not recognise the -march=native flag:

$ make
clang: error: the clang compiler does not support '-march=native'

$ clang -v
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.1.0
Thread model: posix

Using -Ofast is essentially equivalent to -O3 -ffast-math. Using -ffast-math does bring a ~10% performance gain even on my machine. I am also aware of what -ffast-math does and what its "side effects" on the computation are, and at the moment I don't think adding it really hurts. It will become a problem if someday we want whisper.cpp to produce exactly the same results across different CPUs, but this is not the case today.
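One concrete "side effect": -ffast-math lets the compiler treat floating-point addition as associative, which it is not. The sketch below (function names are illustrative, not from whisper.cpp) shows two groupings of the same three values. Compiled without -ffast-math, each function evaluates exactly as written and the two results differ. With -ffast-math the compiler is free to pick either grouping, which is why results may vary between builds and CPUs.

```c
/* Sums in the order written: (a + b) + c. */
float sum_left_to_right(float a, float b, float c) {
    return (a + b) + c;
}

/* Same three values, grouped differently: a + (b + c). */
float sum_right_grouped(float a, float b, float c) {
    return a + (b + c);
}
```

With a = 1e20f, b = -1e20f, c = 1.0f, the first grouping cancels the large values before adding 1 and returns 1.0f. The second adds 1 to -1e20f first, where it is lost to rounding, and returns 0.0f.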

Let's think about this some more. Maybe we can hear more points of view on this topic and get better insight.

ggerganov avatar Dec 11 '22 11:12 ggerganov

> So I'm not 100% sure what to do here.

Well, this is what autoconf/automake were built for: to pick the best compilation options possible for the specific target we are building on. IMHO it's a shame to leave a 2x speedup on the table...

I could write the necessary configure.ac/Makefile.am (the sources for autoconf/automake-based builds). We would then automatically get a configure script, that would try a series of compilation options and build a Makefile tailor-made for the machine we work in. Would that be acceptable? Or are you opposed to autoconf/automake?

ttsiodras avatar Dec 11 '22 15:12 ttsiodras

Just one more note: in GCC land, you can ask the compiler to emit instruction-set-specific versions of the functions, and dispatch appropriately at run-time, based on the machine we run on: https://github.com/ttsiodras/MandelbrotSSE/blob/master/src/xaos.cc#L31 I used that to get maximum flexibility in there - worked like a charm. I don't know if clang supports that, though.
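A minimal sketch of that run-time dispatch idea, not necessarily how the linked code does it: one function is compiled with AVX2 enabled through GCC's `target` function attribute, and `__builtin_cpu_supports` selects it at run time. All function names here are hypothetical. The x86-specific part is guarded, so on other architectures the sketch simply falls back to the generic version. To my understanding Clang accepts the same attribute and builtin on x86, but treat that as an assumption to verify.

```c
#include <stddef.h>

/* Portable baseline, compiled with whatever flags the build uses. */
float dot_generic(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define HAVE_X86_DISPATCH 1
/* Same loop, but the compiler may emit AVX2 code for this function only. */
__attribute__((target("avx2")))
static float dot_avx2(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
}
#endif

typedef float (*dot_fn)(const float *, const float *, size_t);

/* Picks the best implementation for the CPU we are actually running on. */
dot_fn select_dot(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();                        /* safe to call more than once */
    if (__builtin_cpu_supports("avx2")) return dot_avx2;
#endif
    return dot_generic;
}
```

A caller would run `select_dot()` once at startup, keep the returned pointer, and call through it in the hot loop. The binary then runs on any x86 machine while still using AVX2 where it is available.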

ttsiodras avatar Dec 11 '22 16:12 ttsiodras

In order for you to have more information on the autoconf/automake decision, I just pushed a few commits - you can try it out and decide for yourself.

  • For now, I only implemented SSE-checking ( https://github.com/ttsiodras/whisper.cpp/blob/master/configure.ac#L83 ) but I hope the pattern is clear enough. To add support for any other instructions you want, you just add the relevant assembly check, and then emit the compiler option you want. You also get a #define inside the auto-generated config.h, so you can make more compile-time decisions with #ifdef in your C/C++ code.

  • I also added SDL2 checks, since I saw 2 of your binaries needed it. They detect and use SDL2 fine in my tests here.
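As a sketch of the "#define in config.h, #ifdef in the code" pattern from the first bullet: the macro name HAVE_SSE below is hypothetical (check the generated config.h for the real one), and the function is illustrative rather than taken from whisper.cpp. When the macro is absent, the same source still builds and uses the scalar loop.

```c
#include <stddef.h>

/* In an autoconf build this would come from: #include "config.h"
 * HAVE_SSE is a hypothetical macro name assumed for this sketch. */

#if defined(HAVE_SSE) && defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Adds b into a, element by element. */
void add_in_place(float *a, const float *b, size_t n) {
    size_t i = 0;
#if defined(HAVE_SSE) && defined(__SSE__)
    /* SSE path: four floats per iteration, unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar path: handles the tail, or everything when SSE is off. */
    for (; i < n; i++) a[i] += b[i];
}
```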

To see for yourself: after you clone my version of the repo, launch ./configure and make.

To modify the logic, edit configure.ac and/or Makefile.am - then launch ./bootstrap. This simply invokes autoreconf and automake, creating an updated version of the ./configure machinery.

ttsiodras avatar Dec 11 '22 16:12 ttsiodras

+1 to autotools. That would also make it simpler to libtoolise the library and make the examples link to it.

luke-jr avatar Dec 11 '22 17:12 luke-jr

@ttsiodras Thanks for the effort, but the automake stuff is not for this project - it's too complicated.

I did a few tests with and without -Ofast -march=native on different machines and here are the results:

-O3

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | tiny | 4 | 137 | 297 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | base | 4 | 183 | 665 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | small | 4 | 373 | 2328 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | medium | 4 | 923 | 7346 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | large | 4 | 1681 | 14053 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | tiny | 4 | 122 | 572 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | base | 4 | 153 | 1303 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | small | 4 | 305 | 4844 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | medium | 4 | 750 | 16117 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | large | 4 | 1331 | 37618 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 4 | 68 | 170 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 4 | 97 | 327 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 4 | 221 | 1069 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 4 | 581 | 2873 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 4 | 1170 | 5173 | b8065d9 |

-Ofast -march=native

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | tiny | 4 | 137 | 320 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | base | 4 | 180 | 721 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | small | 4 | 366 | 2554 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | medium | 4 | 900 | 8181 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | large | 4 | 1614 | 15679 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | tiny | 4 | 123 | 558 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | base | 4 | 154 | 1289 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | small | 4 | 308 | 4775 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | medium | 4 | 749 | 16576 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | large | 4 | 1320 | 30650 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 4 | 69 | 154 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 4 | 94 | 291 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 4 | 219 | 948 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 4 | 610 | 2582 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 4 | 1205 | 4692 | b8065d9 |

Lower Enc. is better.

  • On Ryzen 9 5950X these flags actually make the performance worse by ~10%
  • On Ryzen 9 3900X there is ~20% improvement on the large model and almost no improvement on the other models
  • On MacBook M1 Pro there is ~10% improvement across all models
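For reference, the percentages above follow from the Enc. columns as (after - before) / before. A small helper (the name is illustrative) makes the arithmetic reproducible:

```c
/* Relative change in percent between two Enc. timings.
 * Negative means the encoder got faster with the new flags. */
double percent_change(double before, double after) {
    return 100.0 * (after - before) / before;
}
```

For the large model this gives about +11.6% on the Ryzen 9 5950X (14053 to 15679), about -18.5% on the Ryzen 9 3900X (37618 to 30650), and about -9.3% on the MacBook M1 Pro (5173 to 4692).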

Given these results, I don't think it is crucial to have these flags. Sometimes they help, sometimes they don't. Even if the benefit on a no-AVX CPU is as big as 2 times, I still don't think it is necessary to add them in general.

So I think for now, I will leave the existing Makefile as it is.

ggerganov avatar Dec 16 '22 18:12 ggerganov