feat: use whisper as ML model
Replaces vosk with faster_whisper, resulting in much faster audio transcriptions. On my system the same audio file took multiple hours with vosk and less than 10 minutes with the patch.
It does appear faster, though there's a lot of work that would need to be done before this is ready:
- Option changes need to be made for whisper, including model selection. The model is hard-coded at the moment (see the sketch after this list).
- Installation instructions for whisper, which can be a pain depending on which versions of Python and libraries you have installed
- Faster-whisper is different from whisper and needs to be installed separately
- Instructions/options for CPU vs GPU usage of whisper
- It appears to do transcription in roughly 30-second chunks, which isn't high enough resolution to find a chapter boundary. The current vosk solution transcribes in 2-3 second chunks. I found https://github.com/linto-ai/whisper-timestamped, which might work, though it's not clear whether it works with faster-whisper.
This has a lot of potential. Whisper is pretty awesome.
- Help is welcome :)
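For context, a minimal sketch of what the faster_whisper path might look like. The model size and file name are placeholders; `WhisperModel` and `transcribe` are faster_whisper's actual API, and the library downloads the model on first use.

```python
from faster_whisper import WhisperModel

# Model size is hard-coded for now (making it an option is a TODO above).
# faster_whisper fetches the model automatically on first use.
model = WhisperModel("base", device="auto")

segments, info = model.transcribe("audiobook.mp3")  # placeholder file name

for seg in segments:
    # Each segment carries start/end timestamps; these are the values
    # that end up in the .srt output discussed below.
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```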
- `faster_whisper` handles the download of the models.
- See 2.
- Should also be handled by `faster_whisper`, but I can't verify.
- Not sure what you mean by this? I've been running it like this for about half a year and have had no problems with timestamp accuracy compared to vosk.
- I need to learn more about how faster_whisper works and how it modifies what whisper does :)
- There are different inputs to the `WhisperModel()` function to enable CUDA. Whisper itself has different flags to install for GPU support. I don't know enough about faster_whisper. I get 0% GPU usage with the current flags. I can get around 90% GPU usage with `WhisperModel(model_size, device="cuda", compute_type="float32")` on an NVIDIA GPU (see the sketch after this list).
- If you look at the text output (.srt file), faster-whisper puts a single timestamp every 30 seconds or more, while vosk puts one every 2-3 seconds. This means that when the program goes through and searches for keywords, the closest timestamp when using the whisper model can be around 30 seconds from the actual "chapter marker" word (the sketch below prints these timestamps).
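To make both points concrete, here's a sketch combining the explicit CUDA flags with a dump of the timestamps that end up in the .srt. The model size and file name are placeholders, and `word_timestamps=True` exists in newer faster_whisper releases and might recover vosk-like resolution, but I haven't verified it against this patch:

```python
from faster_whisper import WhisperModel

model_size = "base"  # placeholder; the patch currently hard-codes the model

# device="auto" gave 0% GPU usage here; explicit CUDA flags give ~90% on NVIDIA.
model = WhisperModel(model_size, device="cuda", compute_type="float32")

# word_timestamps=True asks faster_whisper for per-word marks in addition
# to the coarse per-segment ones (unverified against this patch).
segments, _ = model.transcribe("audiobook.mp3", word_timestamps=True)

for seg in segments:
    print(f"segment: {seg.start:.1f}s -> {seg.end:.1f}s")  # often ~30 s apart
    for word in seg.words or []:
        # Per-word marks would let the keyword search land much closer
        # to the actual "chapter marker" word.
        print(f"  word: {word.start:.1f}s {word.word}")
```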
> There are different inputs to the `WhisperModel()` function to enable CUDA. Whisper itself has different flags to install for GPU support. I don't know enough about faster_whisper. I get 0% GPU usage with the current flags. I can get around 90% GPU usage with `WhisperModel(model_size, device="cuda", compute_type="float32")` on an NVIDIA GPU.
The default device is set to `auto`, so I would've assumed it would use CUDA when available?
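One quick way to check is to ask CTranslate2 (the inference backend faster_whisper runs on) what it sees; if it reports zero CUDA devices, `auto` will silently fall back to CPU:

```python
# Probe the CTranslate2 backend directly; faster_whisper's device="auto"
# can only pick CUDA if CTranslate2 was built with it and a GPU is visible.
import ctranslate2

print("CUDA devices visible to CTranslate2:", ctranslate2.get_cuda_device_count())
```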
I did some digging and found this: https://github.com/m-bain/whisperX
Potentially solves every issue (rough usage sketch after the list):
- Has word-level timestamps
- Multi-speaker support (through diarization)
- Uses faster-whisper
- CUDA support for those with NVIDIA GPUs
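A rough usage sketch adapted from the whisperX README (untested here; API details may have shifted between versions, and the file name is a placeholder):

```python
import whisperx

device = "cuda"  # or "cpu"
audio_file = "audiobook.mp3"  # placeholder

# 1. Transcribe with a faster-whisper backend.
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Align the output to get word-level timestamps.
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

# result["segments"] now carries per-word start/end times, which is exactly
# the resolution the chapter-boundary search needs.
print(result["segments"])
```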
Thoughts? If it seems like it might work, I can go ahead and work on a PR for it.
> Thoughts? If it seems like it might work, I can go ahead and work on a PR for it.
Can't hurt to try. It looks relatively similar to faster_whisper, so feel free to base your PR on top of this one :)