whisper.cpp
Garbage transcription with OpenVINO / NPU
cc @RyanMetcalfeInt8
Compiled the project with OpenVINO support (SHA-1: 1cf679dec4eca99aeaed4fe09a8092803bdecfc1), using OpenVINO 2023.3.0.
It seems to work very nicely on CPU and GPU, but on the Meteor Lake NPU you get completely garbled output.
See:
- https://cloud.videocom.com/media/fi_01HNMGYR7028Y22JEWA5JF4XKK (includes driver version)
- https://cloud.videocom.com/media/fi_01HNMKECMXTHTS5VGJ773VWF69 (includes command line)
Compiled version and models here: https://drive.google.com/drive/folders/1oyQZHJQuO6I5EEjRtttaNy_6PF9F15bV
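A quick way to sanity-check, outside of whisper.cpp, that the NPU is visible and that the encoder IR at least compiles for it (a minimal sketch using the OpenVINO Python runtime; the IR path is just an example):

```python
# Minimal sketch: confirm the NPU plugin is available and that the
# whisper.cpp encoder IR compiles for it. The path is an example.
from openvino.runtime import Core

core = Core()
print("Available devices:", core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU']

model = core.read_model("models/ggml-base-encoder-openvino.xml")
compiled = core.compile_model(model, "NPU")  # raises if the NPU compiler rejects the IR
print("Compiled for NPU:", compiled is not None)
```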
Hi @hlevring,
Yes, actually I just recently noticed this too. It appears that an update to one of the Python packages (torch, onnx, etc. -- still trying to figure out exactly which one) causes an OpenVINO IR (.xml / .bin) to be produced that doesn't work well on the NPU (the NPU compiler present in the v1688 driver doesn't compile it properly).
- A fix is coming for this in a future NPU driver release.
- In the meantime, as a workaround, if you grab the ggml-base-encoder-openvino.xml / ggml-base-encoder-openvino.bin files from the openvino-models.zip attached to my Audacity AI plugin releases (https://github.com/intel/openvino-plugins-ai-audacity/releases), you should find that these work fine on the NPU. (See the sketch after this list for one way to compare a known-good IR against a bad one.)
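Since the bad IR compiles but produces garbage, comparing the encoder output on CPU vs NPU over the same input is a quick way to tell a healthy IR from one the NPU compiler mishandles. A rough sketch, assuming the base encoder IR takes a single (1, 80, 3000) float32 mel input (which is what the whisper.cpp conversion script produces, if I recall correctly):

```python
# Rough sketch: run the same random mel-shaped input through the encoder IR
# on CPU and NPU and compare. Grossly wrong numerics show up immediately.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("ggml-base-encoder-openvino.xml")

# Assumption: single (1, 80, 3000) float32 mel input.
mel = np.random.rand(1, 80, 3000).astype(np.float32)

results = {}
for device in ("CPU", "NPU"):
    compiled = core.compile_model(model, device)
    results[device] = compiled([mel])[compiled.output(0)]

# Outputs should agree closely across devices; a bad IR/driver combo will not.
print("max abs diff:", float(np.abs(results["CPU"] - results["NPU"]).max()))
```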
Thanks, Ryan
Yep, I can confirm it's absolutely working with the IR files from your link.
@hlevring Have you experimented with Whisper models larger than base? From my experience, languages other than English, such as Chinese, typically need at least the Whisper large model to achieve acceptable accuracy from a user's perspective.
Yeah, large is not going to work I think, but I got medium working a while back. There were some hiccups (it got stuck a few times), but generally it was working. (I have to get back to trying it again at some point with updated IR files, drivers, and OpenVINO 2024.)
To get better support for some targeted languages with a model that can run faster on Intel NPU/GPU, I was thinking of getting involved with training a language-specific / multilingual model using https://github.com/huggingface/distil-whisper. I think that would give both better results and faster inference than fine-tuning a medium checkpoint. (Danish would be my use case due to some client requirements; I haven't checked whether someone has already done work on a distilled large model with Chinese support.)
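In case it's useful, the existing English-only distilled checkpoints can be tried directly through Hugging Face transformers to get a feel for the speed/accuracy trade-off (a minimal sketch; the checkpoint name is real, the audio file is a placeholder, and a Danish/multilingual variant like the one discussed above would still have to be trained):

```python
# Minimal sketch: run an existing distil-whisper checkpoint via transformers.
# distil-large-v2 is English-only; a Danish/multilingual distilled model
# would need to be trained following the distil-whisper repo.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v2")
print(asr("sample.wav")["text"])  # "sample.wav" is a placeholder path
```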