
Core ML support

Open ggerganov opened this issue 1 year ago • 6 comments

Running Whisper inference on Apple Neural Engine (ANE) via Core ML

Huge thanks to @wangchou for demonstrating how to use Core ML and making the initial port: https://github.com/ggerganov/whisper.cpp/discussions/548

WIP: everything in this branch is subject to change

Currently, we have the Encoder fully running on the ANE through Core ML inference. The performance gain seems to be more than x3 compared to 8-thread CPU (tested with the tiny, base and small models).

Here are initial performance benchmarks for the Encoder with (top) and without (bottom) Core ML:

CPU OS Config Model Th Load [ms] Encode [ms] Commit
MacBook M1 Pro MacOS 13.2.1 CORE ML tiny 4 50 30 b0ac915
MacBook M1 Pro MacOS 13.2.1 CORE ML base 4 74 64 b0ac915
MacBook M1 Pro MacOS 13.2.1 CORE ML small 4 188 208 b0ac915
MacBook M1 Pro MacOS 13.2.1 CORE ML medium 4 533 1033 b0ac915
MacBook M1 Pro MacOS 13.2.1 CORE ML large 4 ? ? b0ac915
---
MacBook M1 Pro MacOS 13.0.1 NEON BLAS tiny 8 71 102 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS base 8 96 220 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS small 8 233 685 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS medium 8 603 1928 206fc93
MacBook M1 Pro MacOS 13.0.1 NEON BLAS large 8 1158 3350 206fc93
---
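
As a sanity check, the encode-time speedup implied by the two tables can be computed directly from the numbers above (the large row is omitted since its Core ML timings are still missing):

```python
# Encode times [ms] copied from the tables above (MacBook M1 Pro)
coreml_ms = {"tiny": 30, "base": 64, "small": 208, "medium": 1033}
cpu_ms    = {"tiny": 102, "base": 220, "small": 685, "medium": 1928}

for model in coreml_ms:
    speedup = cpu_ms[model] / coreml_ms[model]
    print(f"{model}: x{speedup:.1f}")
```

This matches the quoted "more than x3" for tiny, base and small; medium gains less (about x1.9), and note the two runs also differ in thread count and commit.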

Usage

  • Download the Core ML encoder .mlmodel and compile it to .mlmodelc:

    ./models/download-coreml-model.sh base.en
    xcrun coremlc compile ./models/ggml-base.en.mlmodel ./models
    

    The .mlmodel files are currently hosted at:

    https://huggingface.co/datasets/ggerganov/whisper.cpp-coreml

  • Build whisper.cpp with Core ML support:

    # using Makefile
    make clean
    WHISPER_COREML=1 make -j
    
    # using CMake
    mkdir build && cd build
    cmake -DWHISPER_COREML=1 ..
    make -j
    
  • Run the examples as usual. The first run on a device is slow, since the ANE service compiles the Core ML model to a device-specific format. Subsequent runs are faster.

TODO

  • [ ] Can the Decoder be ported to ANE too? https://github.com/ggerganov/whisper.cpp/discussions/548#discussioncomment-5199310
  • [ ] Convert the medium and large models to Core ML format and upload to HF. Needs an Apple Silicon Mac with 64 GB RAM to do the conversion from PyTorch -> Core ML
  • [ ] Unified ggml + Core ML model file. We currently load both the full ggml model (encoder + decoder) and the Core ML encoder - not optimal
  • [ ] Scripts for generating Core ML model files (e.g. https://github.com/wangchou/callCoreMLFromCpp)
  • [ ] Support loading the Core ML model from a memory buffer. Currently we only support loading from a folder on disk
  • [ ] Progress report for initial-run model processing
  • [ ] Adjust memory usage buffers when using Core ML
  • [ ] Try to avoid the first on-device automatic model generation (it takes a long time)
  • [ ] The medium model takes more than 30 minutes to convert on the first run. Is there a work-around?
  • [ ] Can we run the Core ML inference on the GPU?

ggerganov avatar Mar 05 '23 09:03 ggerganov

Great work!

I tested coreml branch on Mac Mini M2 (base $599 model).

The performance gain seems to be more than x5 compared to 4-thread CPU (thanks to the much faster ANE on the M2; on the base Mac Mini M2, the 8-thread CPU is slower than 4-thread).

Performance benchmarks for the Encoder with (top) and without (bottom) Core ML:

CPU OS Config Model Th Load Enc. Commit
Mac Mini M2 macOS 13.2.1 CORE ML tiny 4 44 25 17a1459
Mac Mini M2 macOS 13.2.1 CORE ML base 4 66 54 17a1459
Mac Mini M2 macOS 13.2.1 CORE ML small 4 163 190 17a1459
Mac Mini M2 macOS 13.2.1 CORE ML medium 4 ? ? 17a1459
Mac Mini M2 macOS 13.2.1 CORE ML large 4 ? ? 17a1459

CPU OS Config Model Th Load Enc. Commit
Mac Mini M2 macOS 13.2.1 NEON BLAS tiny 4 40 142 59fdcd1
Mac Mini M2 macOS 13.2.1 NEON BLAS base 4 67 299 59fdcd1
Mac Mini M2 macOS 13.2.1 NEON BLAS small 4 152 980 59fdcd1
Mac Mini M2 macOS 13.2.1 NEON BLAS medium 4 ? ? 59fdcd1
Mac Mini M2 macOS 13.2.1 NEON BLAS large 4 ? ? 59fdcd1

brozkrut avatar Mar 06 '23 16:03 brozkrut

I compiled whisper.cpp with Core ML support using make and also built the mlmodel, but I'm getting an error:

whisper_init_from_file: loading model from 'models/ggml-base.en.mlmodelc'
whisper_model_load: loading model
whisper_model_load: invalid model data (bad magic)
whisper_init: failed to load model
error: failed to initialize whisper context

Is there anything else I'm missing? 🤔

DontEatOreo avatar Mar 09 '23 11:03 DontEatOreo

@DontEatOreo

On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin. The code will automatically also load the models/ggml-base.en.mlmodelc if it is present in the same folder.
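
In other words, the lookup is a suffix swap next to the ggml file. A minimal sketch of the probing logic (illustrative only; the exact naming convention in whisper.cpp may change between versions):

```python
from pathlib import Path

def coreml_model_path(ggml_path: str) -> str:
    # Swap the .bin suffix for .mlmodelc to get the companion Core ML model path
    return str(Path(ggml_path).with_suffix(".mlmodelc"))

print(coreml_model_path("models/ggml-base.en.bin"))
# -> models/ggml-base.en.mlmodelc
```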

ggerganov avatar Mar 09 '23 11:03 ggerganov

@ggerganov Thank you! I was very confused about why it wasn't working even though I did everything right

DontEatOreo avatar Mar 09 '23 12:03 DontEatOreo

This is great. Excited to see how this feature develops. Leveraging ANE would be huge, even more if the decoder was possible to port to it.

dennislysenko avatar Mar 22 '23 19:03 dennislysenko

Just saw this was announced, is it useful? https://github.com/apple/ml-ane-transformers

strangelearning avatar Mar 24 '23 17:03 strangelearning

@DontEatOreo

On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin. The code will automatically also load the models/ggml-base.en.mlmodelc if it is present in the same folder.

Does this mean we have to bundle both files with the app? Asking since the file size gets fairly large having to include them all.

cerupcat avatar Apr 05 '23 18:04 cerupcat

Hey, thanks for this awesome project! I am trying to run the whisper.objc example with CoreML but running into some issues. Has someone successfully done this and could guide me on how to set it up?

lucabeetz avatar Apr 14 '23 14:04 lucabeetz

@DontEatOreo On the command line, you still have to specify the non-coreml model: models/ggml-base.en.bin. The code will automatically also load the models/ggml-base.en.mlmodelc if it is present in the same folder.

Does this mean we have to bundle both files with the app? Asking since the file size gets fairly large having to include them all.

The solution is to produce an encoder-only Core ML model in one file and a decoder-only standard model in another file. This is not very difficult to achieve, but supporting so many model files might become too difficult for me. So I will probably rely on someone helping out and demonstrating how this can be done, either as an example in this repo or in a fork.

ggerganov avatar Apr 14 '23 17:04 ggerganov

This is almost ready to merge. I am hoping to do it tomorrow.

The most important part that currently needs testing is the creation of the CoreML models, following the instructions here:

https://github.com/ggerganov/whisper.cpp/discussions/548#discussioncomment-5327027

If you give this a try, please let us know the results and whether you encountered any issues. Also, let us know if you used quantized or non-quantized Core ML models and what the experience has been.

I believe that the tiny, base and small models should be supported, while medium and large do not seem viable for this approach.

ggerganov avatar Apr 14 '23 19:04 ggerganov

1.4 GB for medium sounds fine for users, but you're saying there are other limitations against it?

aehlke avatar Apr 14 '23 20:04 aehlke

@aehlke The scripts for generating Core ML models support all sizes, but on my M1 Pro it takes a very long time (i.e. more than half an hour) to generate the medium model. After that, the first run is also very slow. Subsequent runs are about 2 times faster than CPU-only.

In any case, you can follow the instructions in this PR and see how it works on your device.

ggerganov avatar Apr 15 '23 09:04 ggerganov

CPU OS Config Model Th Load Enc. Commit
MacBook Air M2 MacOS 13.3.1 NEON BLAS COREML tiny 4 41 31 f19e23f
MacBook Air M2 MacOS 13.3.1 NEON BLAS COREML base 4 59 57 f19e23f
MacBook Air M2 MacOS 13.3.1 NEON BLAS COREML small 4 147 195 f19e23f
MacBook Air M2 MacOS 13.3.1 NEON BLAS COREML medium 4 576 783 f19e23f
MacBook Air M2 MacOS 13.3.1 NEON BLAS COREML large 4 1196 2551 f19e23f

Great work! Converting the large model consumed ~9.7 GB of memory (with a short peak of 15.03 GB); it worked fine on an 8 GB Air.

Edit: I measured the Core ML model conversion time and the first-load conversion time (second load minus first load).

Model   Conversion time   First-load conversion [s]
tiny    4.915             0.72
base    8.564             1.34
small   26.050            4.72
medium  1:35.85           15.57
large   3:43.32           35.10
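
For easier comparison, the min:sec entries can be normalized to seconds (a small helper; the parsing format simply mirrors how the times are written in the table above):

```python
def to_seconds(t: str) -> float:
    # Parse "M:SS.ss" or "SS.ss" into seconds
    parts = t.split(":")
    if len(parts) == 2:
        return 60 * int(parts[0]) + float(parts[1])
    return float(parts[0])

conversion = {"tiny": "4.915", "base": "8.564", "small": "26.050",
              "medium": "1:35.85", "large": "3:43.32"}
for model, t in conversion.items():
    print(f"{model}: {to_seconds(t):.2f} s")
```

So the conversion time grows roughly 45x from tiny (~5 s) to large (~223 s) on this machine.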

neurostar avatar Apr 15 '23 19:04 neurostar

When running this script:

./models/generate-coreml-model.sh base.en

I got the error:

xcrun: error: unable to find utility "coremlc", not a developer tool or in PATH

CarberryChai avatar Apr 16 '23 04:04 CarberryChai

Is it just me, or is the link to the Core ML models missing on Hugging Face?

Btw, @ggerganov, if you need help converting the models, I'd be glad to contribute. It seems to me that it only needs to be done once. :)

flexchar avatar Apr 16 '23 09:04 flexchar

For now, you should generate the Core ML models locally by following the instructions. I don't want to host them on HF yet, because it is very likely that the models will change soon - there are some pending improvements (see https://github.com/ggerganov/whisper.cpp/discussions/548#discussioncomment-5622733). If I upload them now, we will later get new models and everyone will be confused about which model they are using, etc.

ggerganov avatar Apr 16 '23 10:04 ggerganov

In that regard, I'd like to ask for help, since I can't seem to succeed with it:

python3.10 ./models/convert-whisper-to-coreml.py --model tiny

100% 72.1M/72.1M [00:05<00:00, 14.3MiB/s]
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=384, n_audio_head=6, n_audio_layer=4, n_vocab=51865, n_text_ctx=448, n_text_state=384, n_text_head=6, n_text_layer=4)
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100% 367/368 [00:00<00:00, 6681.50 ops/s]
Running MIL frontend_pytorch pipeline: 100% 5/5 [00:00<00:00, 1047.63 passes/s]
Running MIL default pipeline: 100% 57/57 [00:00<00:00, 147.77 passes/s]
Running MIL backend_mlprogram pipeline: 100% 10/10 [00:00<00:00, 2599.51 passes/s]
Traceback (most recent call last):
  File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 331, in <module>
    decoder = convert_decoder(hparams, decoder, quantize=args.quantize)
  File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 283, in convert_decoder
    traced_model = torch.jit.trace(model, (token_data, audio_data))
  File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 741, in trace
    return trace_module(
  File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 958, in trace_module
    module._c._create_method_from_trace(
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 211, in forward
    x = block(x, xa, mask=self.mask, kv_cache=kv_cache)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 138, in forward
    x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0]
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 83, in forward
    k = self.key(x if xa is None else xa)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 37, in forward
    return F.linear(
RuntimeError: mat1 and mat2 shapes cannot be multiplied (384x1500 and 384x384)

flexchar avatar Apr 16 '23 11:04 flexchar

When running this script:

./models/generate-coreml-model.sh base.en

I got the error:

xcrun: error: unable to find utility "coremlc", not a developer tool or in PATH

I was able to resolve by sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer

neurostar avatar Apr 16 '23 13:04 neurostar

Hi, which version of Python should I use to install these dependencies? I tried 3.11 and 3.10, but failed to install all dependencies.

pip install ane_transformers
pip install openai-whisper
pip install coremltools

flyisland avatar Apr 17 '23 02:04 flyisland

Hi, which version of Python should I use to install these dependencies? I tried 3.11 and 3.10, but failed to install all dependencies.

pip install ane_transformers
pip install openai-whisper
pip install coremltools

https://github.com/openai/whisper/discussions/906#discussioncomment-4803242 My computer has both Python 3.9 and 3.11 installed. After setting the default to 3.9, I still couldn't find the whisper module and had to uninstall Python 3.11 to make it work. This suggests that pip needs to be fully linked to a Python version below 3.10 to function properly.

adolphnov avatar Apr 17 '23 03:04 adolphnov

@adolphnov and @flyisland I have no idea how these Python versions work. I'm just using whatever is default on my M1. You can give me some commands I can run to tell you the versions that I have, or send a PR to improve the setup process.
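
For reference, a generic snippet along these lines reports which interpreter is in use (nothing whisper.cpp-specific; 3.9/3.10 are the versions reported to work in this thread):

```python
import sys
import platform

# Print the Python toolchain details useful for debugging the conversion setup
print("python :", platform.python_version())
print("machine:", platform.machine())
print("exe    :", sys.executable)
```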

@flexchar You are running the wrong script. Use ./models/generate-coreml-model.sh tiny as specified in the instructions

ggerganov avatar Apr 17 '23 09:04 ggerganov

Thank you, G. To clarify for others: I also ran into the xcrun: error: unable to find utility "coremlc", not a developer tool or in PATH problem.

I didn't know (I'm new to Mac) that I had to install Xcode. I also had trouble installing pip packages on Python 3.11 (the latest at the time of writing), so I purged Python and did a fresh brew install [email protected]. Then I had to add export PATH="/opt/homebrew/opt/[email protected]/libexec/bin:$PATH" to my shell file, and then the conversion worked.

This is awesome.

Georgi, you should consider sponsor button on this repo. I believe there are many that appreciate your work. Thank you for doing this.

PS: It took 2.5 min to convert the largest model. It's going really smoothly.

flexchar avatar Apr 17 '23 10:04 flexchar

Yes, 3.11 fails for me as well when installing one of the packages via pip, but 3.10.x should work (although converting "large" got stuck on my M1 Pro for hours and I had to force quit it; I will try again later to see how it goes, since it seems to work for others here).

For managing Python versions, you can also use a manager such as pyenv or asdf. You can set a local version so that 3.10.x is always used when you enter the whisper.cpp directory, and some other version elsewhere.

wzxu avatar Apr 17 '23 10:04 wzxu

@ggerganov Thanks, I managed to install those dependencies using Python 3.9, but ran into the xcrun: error: unable to find utility "coremlc", not a developer tool or in PATH problem.

Do I need to install the Full Xcode Package to have the "coremlc"?

flyisland avatar Apr 17 '23 13:04 flyisland

@flyisland as I've mentioned in my reply, yes, you need to install Xcode.

https://apps.apple.com/us/app/xcode/id497799835

flexchar avatar Apr 17 '23 13:04 flexchar

My Xcode installation was pointing at the wrong location, so I used sudo xcode-select --reset to resolve the missing coremlc problem. In practice it sets the active developer directory to the same path as mentioned earlier (/Applications/Xcode.app/Contents/Developer). You can check the current path with xcode-select -p.

sriver avatar Apr 17 '23 14:04 sriver

Hi @ggerganov, thanks for merging the CoreML branch into master!

I'm seeing a 10%-13% performance drop though. Is that expected?

Running bench with

$ ./build-1.3.0/bin/bench -m models/ggml-small.en.bin
$ ./build-coreml/bin/bench -m models/ggml-small.en.bin

on MacBook Pro M1 Max:

Coreml branch [ms] V1.3.0 [ms]
598 669
591 680
580 671

Thanks :)

bjnortier avatar Apr 17 '23 14:04 bjnortier

Yes 3.11 fails for me as well during installing one of the package via pip, but 3.10.x should work (although converting "large" got stuck on my M1 Pro for hours so I had to force quit it; will try again later to see how it goes since it seems to work for others here).

Still getting stuck at this step forever, with no further output whatsoever. 😕 I can't interrupt with Ctrl-C either, so I had to quit the terminal and force quit ANECompilerService. Any idea what the cause may be? An M1 Pro with 16 GB should be sufficient…? [screenshot: SCR-20230418-biqs]

wzxu avatar Apr 17 '23 17:04 wzxu

Thank you, @ggerganov and @flexchar. With Python 3.9 and the full Xcode package installed on my laptop, it is now working. I can see the message "whisper_init_state: loading Core ML model from 'models/ggml-base.en-encoder.mlmodelc'" in the output of both the ./main and ./stream programs.

flyisland avatar Apr 17 '23 23:04 flyisland

Everything is working, except I'm getting the whisper_init_state: first run on a device may take a while ... notice (and resulting 15–30 minute wait) on every run. Is there some way around this?

ecormany avatar Apr 18 '23 20:04 ecormany