whisper.cpp Metal support

This is a quick and dirty implementation of GPU support for Apple hardware using Metal Performance Shaders (MPS). It demonstrates how part of the feed-forward layer in the encoder can be offloaded to the GPU.

On my MacBook M1 Pro, I don't observe a significant performance gain compared to the original implementation. Either there is a problem in my MPS integration, or the AMX coprocessor is already doing a good enough job and adding Metal does not really help.

In any case, this PR can be a good starting point for anyone interested in adding GPU support to ggml. I think a similar approach can be taken for CUDA.

For now, I don't plan to merge this into master unless the performance gets better.

Nov 07 '22 19:11 ggerganov

Can't build it on M1 Max:

    c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate
    Undefined symbols for architecture arm64:
      "_ggml_mtl_alloc", referenced from:
          _ggml_new_tensor_mtl_impl in ggml.o
      "_ggml_mtl_init", referenced from:
          _ggml_init in ggml.o
      "_ggml_mtl_mul_mat_f16", referenced from:
          _ggml_compute_forward_mul_mat_f16_f32 in ggml.o
    ld: symbol(s) not found for architecture arm64
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    make: *** [main] Error 1

Nov 11 '22 23:11 DiegoGiovany

@DiegoGiovany I forgot to update the Makefile. It should work now: run make clean, then make.

Nov 12 '22 06:11 ggerganov

This may or may not be helpful, but Warren Moore writes:

I don’t have any Apple Silicon devices, nor do I know much about ML or Whisper, so I’m not of much help.

But, the use of managed buffers without the use of explicit synchronization (via a blit encoder) is suspicious; I don’t see how this could work on a discrete GPU as-written.

Also, I’m not sure if the data dependencies allow concurrent execution, but calling -waitUntilCompleted forces the CPU thread to wait for GPU work to finish. There would be less overhead if encoders could be batched into fewer command buffers.

Finally, it leaks all of the Metal resources it creates, since ARC is disabled in the target. Any thread that encodes commands should have an autorelease pool, and resources should be explicitly released if ARC is disabled.

Dec 10 '22 03:12 latenitefilms

Hi. Firstly, thanks for this repo. This project is awesome!

Forgive me if I'm misunderstanding the ramifications of this, but one thought after a brief look at this PR: it might make sense to decouple the command buffer commit / wait / read-back cycle from each function call, as in ggml_mtl_mul_mat_vec_f16.

Would it be feasible instead to upload the first set of inputs to an MTLBuffer as necessary, keep the compute on the GPU, encode all of the multiplies in a single command buffer, skip the intermediate read-backs, and do a single

    // completion handlers have to be registered before the buffer is committed
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> conformBuffer) {
        memcpy(...);
    }];
    [commandBuffer commit];

At the very end of calculation? This would remove any CPU / GPU pipeline stalls, keep compute on the GPU, and also allow for some work to be done on the CPU while waiting for the GPU to complete.

Forgive me if I'm missing side effects of this proposed change (I'm not familiar enough with the internals of how Whisper works).

Thank you!

Dec 20 '22 21:12 vade

@vade Yes, absolutely. It's definitely better to put as many operations as possible in a single command buffer and only read the data back once at the end. The thing is that this will require refactoring the ggml interface, or something clever.

For example, if I have the following operations:

auto c = ggml_mul_mat(ctx, a, b);
auto e = ggml_mul_mat(ctx, c, d);
auto g = ggml_mul_mat(ctx, e, f);

// do something with "g"
...

Ideally I would want this to be a single command buffer with 3 matrix multiplications that starts with a and b as input and returns g without waiting for the intermediate results. To achieve that, we either need some clever logic in ggml to determine when a command buffer starts and ends by analysing the forward compute graph, or we need some explicit ggml interface calls to be called manually by the user whenever CPU/GPU synchronisation has to occur.
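To make the first option more concrete, here is a minimal sketch of how command buffer boundaries could be derived from a compute graph. It is not ggml code: `Node`, `Plan` and `plan_command_buffers` are hypothetical names, and the only rule it applies is the assumption that a synchronisation is needed exactly when a CPU node reads a GPU result that has not been committed yet.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a node of the forward compute graph.
// These names are illustrative and are not part of ggml.
struct Node {
    std::string name;
    bool on_gpu;                     // true if the op would be encoded for the GPU
    std::vector<std::size_t> inputs; // indices of earlier nodes this node reads
};

struct Plan {
    std::vector<int> segment; // command buffer index per node, -1 for CPU nodes
    int n_syncs = 0;          // number of commit + wait points
};

// Walk the graph in topological order. GPU nodes are appended to the
// currently open command buffer. The buffer is closed (commit + wait) only
// when a CPU node reads a result that is still pending inside it.
Plan plan_command_buffers(const std::vector<Node> & graph) {
    Plan plan;
    plan.segment.assign(graph.size(), -1);
    int current = 0;
    for (std::size_t i = 0; i < graph.size(); ++i) {
        const Node & node = graph[i];
        if (node.on_gpu) {
            plan.segment[i] = current;
            continue;
        }
        for (std::size_t in : node.inputs) {
            assert(in < i && "nodes must be listed in topological order");
            if (plan.segment[in] == current) {
                // the CPU needs a pending GPU result: synchronise here
                ++plan.n_syncs;
                ++current;
                break;
            }
        }
    }
    return plan;
}
```

For the three chained multiplications above, followed by a CPU node that reads g, this plan puts c, e and g into the same command buffer and reports a single synchronisation point.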

The proposal in this PR is a very rough starting point and is for sure far from optimal. Many things can be improved.

Dec 23 '22 08:12 ggerganov

Thanks @ggerganov - and to be clear, I wasn't trying to point out any flaws. I'm aware this entire endeavor is a work in progress and there are a lot of moving pieces (and bravo on that!).

I was hesitant to mention it only because I'm not entirely familiar with the code base or Whisper's internals as implemented here.

Does it make sense to break this down into changes that would benefit pipelining to the GPU on all supported platforms? My suspicion is that anything Metal benefits from would benefit CUDA, etc.

May I propose a few baby steps to break this potentially large change into manageable changes for all platforms and make integration easier?

1. Identify which functions are ripe for pipelining, and which groups of layers in the Whisper encoder / decoder can benefit from GPU work.
2. Refactor the method signatures of those functions without any changes to how the code currently works. This would allow for a baseline.
3. Identify the locations in the code that require GPU synchronization.
4. Stub in a GPU submit / blocking / wait-for-GPU-to-finish function with a defined method signature that doesn't actually do anything.
5. Use the above code as a base to rebase the Metal branch on, giving us entry points to add the pipelined GPU operations and allowing other platforms to eventually benefit from the proposed changes.

Apologies, I'm not intending to step in and try to manage your project, just to start a conversation and make a set of actionable proposals that the community can rally around :)

Thank you, and again, this project is really awesome.

My assumptions for the changes would be:

1. A new function creates a GPU context.
2. A new function creates a GPU command buffer from the context / command queue created above.
3. The method signatures of existing functions that require the GPU are refactored to take an additional argument (the command buffer).
4. A new function handles GPU submission and blocking, and takes in the command buffer created by the new function above.
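As a rough illustration of what those four pieces could look like, here is a CPU-only sketch. Every name in it (`gpu_context`, `gpu_command_buffer`, `gpu_init`, `gpu_command_buffer_create`, `mul_mat`, `gpu_submit_and_wait`) is made up for this example and does not exist in ggml or whisper.cpp. The "command buffer" simply records closures and runs them on the CPU at submit time, standing in for real GPU encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// All names below are hypothetical; none of them exist in ggml or whisper.cpp.

// 1. A GPU context (a real one would own the device and the command queue).
struct gpu_context {
    int n_submits = 0; // counts CPU/GPU synchronisation points
};

// 2. A command buffer created from the context. Here it only records
//    closures; a real backend would encode GPU kernels instead.
struct gpu_command_buffer {
    gpu_context * ctx;
    std::vector<std::function<void()>> ops;
};

gpu_context gpu_init() { return gpu_context{}; }

gpu_command_buffer gpu_command_buffer_create(gpu_context & ctx) {
    return gpu_command_buffer{&ctx, {}};
}

// 3. An op whose signature takes the command buffer as an extra argument.
//    It enqueues dst = a * b (row-major, a is m x k, b is k x n) instead of
//    computing it right away. The vectors are captured by reference, so the
//    caller has to keep them alive until the buffer is submitted.
void mul_mat(gpu_command_buffer & cb,
             const std::vector<float> & a, const std::vector<float> & b,
             std::vector<float> & dst, int m, int k, int n) {
    cb.ops.push_back([&a, &b, &dst, m, k, n]() {
        assert(a.size() == static_cast<std::size_t>(m) * k);
        assert(b.size() == static_cast<std::size_t>(k) * n);
        dst.assign(static_cast<std::size_t>(m) * n, 0.0f);
        for (int i = 0; i < m; ++i)
            for (int p = 0; p < k; ++p)
                for (int j = 0; j < n; ++j)
                    dst[i * n + j] += a[i * k + p] * b[p * n + j];
    });
}

// 4. Submit and block until done. This stand-in runs the recorded ops in
//    order on the CPU, so there is one synchronisation point per submit
//    rather than one per op.
void gpu_submit_and_wait(gpu_command_buffer & cb) {
    for (auto & op : cb.ops) op();
    cb.ops.clear();
    cb.ctx->n_submits += 1;
}
```

With this shape, several `mul_mat` calls can be queued on one command buffer and followed by a single `gpu_submit_and_wait`, so intermediate results are only touched by the CPU after that one call returns.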

LMK - I'm happy to help, and potentially even sponsor some of this development.

Dec 23 '22 16:12 vade

Hi, just curious whether this is still on the roadmap and being actively worked on? Thanks for your hard work.

Apr 14 '23 10:04 voidfel

https://github.com/ggerganov/llama.cpp/pull/1642 llama.cpp has been updated to support Metal. I hope that whisper.cpp will also be updated to have the same capability.

Jun 08 '23 16:06 williamjeong2

Yes, it will come for sure

Jun 08 '23 17:06 ggerganov

It is already optimized for Apple silicon via ARM NEON, Accelerate framework and Core ML. I am using the medium.en model and it is super fast on my M1 Pro 16GB, it is absolutely amazing. Only the first run will be slow. Can Metal make it even faster? That would be unbelievable.

Jun 14 '23 12:06 voidfel
