whisper.cpp
Garbage transcription with OpenVINO / NPU
cc @RyanMetcalfeInt8
Compiled the project with OpenVINO support (SHA-1: 1cf679dec4eca99aeaed4fe09a8092803bdecfc1), using OpenVINO 2023.3.0.
It seems to work very nicely on CPU and GPU, but on the Meteor Lake NPU you get completely garbled output.
See:
- https://cloud.videocom.com/media/fi_01HNMGYR7028Y22JEWA5JF4XKK (includes driver version)
- https://cloud.videocom.com/media/fi_01HNMKECMXTHTS5VGJ773VWF69 (includes command line)
Compiled version and models here: https://drive.google.com/drive/folders/1oyQZHJQuO6I5EEjRtttaNy_6PF9F15bV
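A quick way to sanity-check, outside of whisper.cpp, that the NPU is visible and that the encoder IR at least compiles for it (a minimal sketch using the OpenVINO Python runtime; the IR path is just an example):

```python
# Minimal sketch: confirm the NPU plugin is available and that the
# whisper.cpp encoder IR compiles for it. The path is an example.
from openvino.runtime import Core

core = Core()
print("Available devices:", core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU']

model = core.read_model("models/ggml-base-encoder-openvino.xml")
compiled = core.compile_model(model, "NPU")  # raises if the NPU compiler rejects the IR
print("Compiled for NPU:", compiled is not None)
```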
Hi @hlevring,
Yes, actually I just recently noticed this too. It appears that an update to one of the Python packages (torch, onnx, etc. -- still trying to figure out exactly which one) causes an OpenVINO IR (.xml / .bin) to be produced that doesn't work well on the NPU (the NPU compiler present in the v1688 driver doesn't compile it properly).
- A fix is coming for this in a future NPU driver release.
- In the meantime, as a workaround, if you grab the ggml-base-encoder-openvino.xml / ggml-base-encoder-openvino.bin files from the openvino-models.zip attached to my Audacity AI plugin releases (https://github.com/intel/openvino-plugins-ai-audacity/releases), you should find that these work fine on the NPU. (See the sketch after this list for one way to compare a known-good IR against a bad one.)
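Since the bad IR compiles but produces garbage, comparing the encoder output on CPU vs NPU over the same input is a quick way to tell a healthy IR from one the NPU compiler mishandles. A rough sketch, assuming the base encoder IR takes a single (1, 80, 3000) float32 mel input (which is what the whisper.cpp conversion script produces, if I recall correctly):

```python
# Rough sketch: run the same random mel-shaped input through the encoder IR
# on CPU and NPU and compare. Grossly wrong numerics show up immediately.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("ggml-base-encoder-openvino.xml")

# Assumption: single (1, 80, 3000) float32 mel input.
mel = np.random.rand(1, 80, 3000).astype(np.float32)

results = {}
for device in ("CPU", "NPU"):
    compiled = core.compile_model(model, device)
    results[device] = compiled([mel])[compiled.output(0)]

# Outputs should agree closely across devices; a bad IR/driver combo will not.
print("max abs diff:", float(np.abs(results["CPU"] - results["NPU"]).max()))
```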
Thanks, Ryan
Yep, I can confirm it's absolutely working with the IR files from your link.
@hlevring Have you experimented with Whisper models larger than base? From my experience, languages other than English, such as Chinese, typically need at least the Whisper large model to achieve acceptable accuracy from a user's perspective.
Yeah, large is not going to work I think, but I got medium working a while back. There were some hiccups (it got stuck a few times), but generally it was working. (I have to get back to trying it again at some point with updated IR files, drivers, and OpenVINO 2024.)
To get better support for some targeted languages with a model that can run faster on Intel NPU/GPU, I was thinking of getting involved with training a language-specific / multilingual model using https://github.com/huggingface/distil-whisper. I think that would give both better results and faster inference than fine-tuning a medium checkpoint. (Danish would be my use case due to some client requirements; I haven't checked whether someone has already done work on a distilled large model with Chinese support.)
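In case it's useful, the existing English-only distilled checkpoints can be tried directly through Hugging Face transformers to get a feel for the speed/accuracy trade-off (a minimal sketch; the checkpoint name is real, the audio file is a placeholder, and a Danish/multilingual variant like the one discussed above would still have to be trained):

```python
# Minimal sketch: run an existing distil-whisper checkpoint via transformers.
# distil-large-v2 is English-only; a Danish/multilingual distilled model
# would need to be trained following the distil-whisper repo.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v2")
print(asr("sample.wav")["text"])  # "sample.wav" is a placeholder path
```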