feat: use whisper as ML model
Replaces vosk with faster_whisper, resulting in much faster audio transcriptions. On my system the same audio file took multiple hours with vosk and less than 10 minutes with the patch.
It does appear faster, though there's a lot of work that would need to be done before this is ready:
- Option changes need to be made for whisper, including model selection. The model is hard-coded at the moment (see the sketch after this list).
- Installation instructions for whisper, which can be a pain depending on which versions of Python and libraries you have installed
- Faster-whisper is different from whisper and needs to be installed separately
- Instructions/options for CPU vs GPU usage of whisper
- It appears to do transcription in roughly 30-second chunks, which isn't high enough resolution to find a chapter boundary. The current vosk solution transcribes in 2-3 second chunks. I found https://github.com/linto-ai/whisper-timestamped, which might work, though it's not clear whether it works with faster-whisper.
This has a lot of potential. Whisper is pretty awesome.
- Help is welcome :)
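For context, a minimal sketch of what the faster_whisper path might look like. The model size and file name are placeholders; `WhisperModel` and `transcribe` are faster_whisper's actual API, and the library downloads the model on first use.

```python
from faster_whisper import WhisperModel

# Model size is hard-coded for now (making it an option is a TODO above).
# faster_whisper fetches the model automatically on first use.
model = WhisperModel("base", device="auto")

segments, info = model.transcribe("audiobook.mp3")  # placeholder file name

for seg in segments:
    # Each segment carries start/end timestamps; these are the values
    # that end up in the .srt output discussed below.
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```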
- `faster_whisper` handles the download of the models.
- See 2.
- Should also be handled by `faster_whisper`, but I can't verify.
- Not sure what you mean by this? I've been running it like this for about half a year and have had no problems with timestamp accuracy compared to vosk.
- I need to learn more about how faster_whisper works and how it modifies what whisper does :)
- There are different inputs to the `WhisperModel()` function to enable CUDA. Whisper itself has different flags to install for GPU support. I don't know enough about faster_whisper. I get 0% GPU usage with the current flags. I can get around 90% GPU usage with `WhisperModel(model_size, device="cuda", compute_type="float32")` on an NVIDIA GPU (see the sketch after this list).
- If you look at the text output (.srt file), faster-whisper puts a single timestamp every 30 seconds or more, while vosk puts one every 2-3 seconds. This means that when the program goes through and searches for keywords, the closest timestamp when using the whisper model can be around 30 seconds from the actual "chapter marker" word (the sketch below prints these timestamps).
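To make both points concrete, here's a sketch combining the explicit CUDA flags with a dump of the timestamps that end up in the .srt. The model size and file name are placeholders, and `word_timestamps=True` exists in newer faster_whisper releases and might recover vosk-like resolution, but I haven't verified it against this patch:

```python
from faster_whisper import WhisperModel

model_size = "base"  # placeholder; the patch currently hard-codes the model

# device="auto" gave 0% GPU usage here; explicit CUDA flags give ~90% on NVIDIA.
model = WhisperModel(model_size, device="cuda", compute_type="float32")

# word_timestamps=True asks faster_whisper for per-word marks in addition
# to the coarse per-segment ones (unverified against this patch).
segments, _ = model.transcribe("audiobook.mp3", word_timestamps=True)

for seg in segments:
    print(f"segment: {seg.start:.1f}s -> {seg.end:.1f}s")  # often ~30 s apart
    for word in seg.words or []:
        # Per-word marks would let the keyword search land much closer
        # to the actual "chapter marker" word.
        print(f"  word: {word.start:.1f}s {word.word}")
```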
> There are different inputs to the `WhisperModel()` function to enable CUDA. Whisper itself has different flags to install for GPU support. I don't know enough about faster_whisper. I get 0% GPU usage with the current flags. I can get around 90% GPU usage with `WhisperModel(model_size, device="cuda", compute_type="float32")` on an NVIDIA GPU.
The default device is set to `auto`, so I would've assumed it would use CUDA when available?
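One quick way to check is to ask CTranslate2 (the inference backend faster_whisper runs on) what it sees; if it reports zero CUDA devices, `auto` will silently fall back to CPU:

```python
# Probe the CTranslate2 backend directly; faster_whisper's device="auto"
# can only pick CUDA if CTranslate2 was built with it and a GPU is visible.
import ctranslate2

print("CUDA devices visible to CTranslate2:", ctranslate2.get_cuda_device_count())
```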
I did some digging and found this: https://github.com/m-bain/whisperX
Potentially solves every issue (rough usage sketch after the list):
- Has word-level timestamps
- Multi-speaker support (through diarization)
- Uses faster-whisper
- CUDA support for those with NVIDIA GPUs
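A rough usage sketch adapted from the whisperX README (untested here; API details may have shifted between versions, and the file name is a placeholder):

```python
import whisperx

device = "cuda"  # or "cpu"
audio_file = "audiobook.mp3"  # placeholder

# 1. Transcribe with a faster-whisper backend.
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Align the output to get word-level timestamps.
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

# result["segments"] now carries per-word start/end times, which is exactly
# the resolution the chapter-boundary search needs.
print(result["segments"])
```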
Thoughts? If it seems like it might work, I can go ahead and work on a PR for it.
> Thoughts? If it seems like it might work, I can go ahead and work on a PR for it.
Can't hurt to try. It looks relatively similar to faster_whisper, so feel free to base your PR on top of this one :)