word-level timestamps in `transcribe()`
This DTW dependency introduces a licence incompatibility, but if I remember correctly, an alternative was suggested earlier in the discussions.
Edit: Alternative library recommended in https://github.com/openai/whisper/discussions/813#discussioncomment-4617447
Hi!
I tried out this branch with kwargs['word_level_timestamps'] = True
but the model performed very slowly. In addition (or rather, because of that), it started to hallucinate like mad.
I'm using chunks of short (a couple of seconds) audio data in German, produced by a VAD, for live transcription.
Maybe it's a problem on my side; maybe someone can try to reproduce?
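For reference, this is roughly how I'm invoking it (a simplified sketch; the `word_level_timestamps` kwarg name is the one from this branch, and `chunk.wav` stands in for one short VAD chunk):

```python
# Simplified reproduction sketch for this branch's word-level timestamps.
# Assumptions: transcribe() on this branch accepts word_level_timestamps=True,
# and "chunk.wav" is a ~2 s German speech chunk produced by the VAD.
import whisper

model = whisper.load_model("medium")
result = model.transcribe(
    "chunk.wav",
    language="de",
    word_level_timestamps=True,  # kwarg name as used on this branch
)
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```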
Thanks for the comments, all -- this is work in progress and not quite ready for merging. I'm trying to address both hallucination and performance concerns.
Yet another DTW implementation, FYI. I can't vouch for it other than to say that it is Apache licensed, recently updated, and has both pure Python and C implementations.
https://github.com/wannesm/dtaidistance
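A minimal usage sketch, not integrated with this PR's code (the two sequences here are made-up 1-D arrays, just to show the API):

```python
# dtaidistance usage sketch: distance and warping path between two 1-D series.
# token_curve / audio_curve are made-up example sequences, not Whisper features.
import numpy as np
from dtaidistance import dtw

token_curve = np.array([0.0, 0.2, 0.9, 1.0], dtype=np.double)
audio_curve = np.array([0.0, 0.1, 0.3, 0.8, 1.0], dtype=np.double)

distance = dtw.distance(token_curve, audio_curve)   # DTW distance (C backend if available)
path = dtw.warping_path(token_curve, audio_curve)   # list of (i, j) index pairs
print(distance, path)
```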
Hi, thanks for the great work!
I would like to ask whether it is safe to swap in a smaller model (e.g. tiny) for word-level alignment, i.e. to compute the attention scores, instead of using the same model (e.g. medium or large) that generated the transcription. I suspect this could improve inference speed if the option were supported.
I found an interesting edge case with the small model where enabling the word-level timestamps option causes it to repeat the prompt at the end of the audio while also failing to infer the last word.
$ ffmpeg -t 29 -i https://audio2.redcircle.com/episodes/6b196013-8672-43d9-be52-4332b3207d93/stream.mp3 test.mp3
$ whisper --model small test.mp3
.../whisper/transcribe.py:98: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:15.920] Military veteran Eric Weinstein began 69 Whiskey as a college radio show on 107.7 The
[00:15.920 --> 00:21.720] Bronx, located on the campus of Ryder University in Lawrenceville, New Jersey.
[00:21.720 --> 00:27.560] A show once restrained by rules and boundaries now comes straight to you raw, uncensored and
[00:27.560 --> 00:28.960] unapologetic.
$ whisper --model small --output_format json --word_timestamps True test.mp3
.../whisper/transcribe.py:98: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:08.040 --> 00:15.940] Military veteran Eric Weinstein began 69 Whiskey as a college radio show on 107.7 The
[00:15.940 --> 00:21.320] Bronx, located on the campus of Ryder University in Lawrenceville, New Jersey.
[00:21.720 --> 00:28.980] A show once restrained by rules and boundaries now comes straight to you raw, uncensored and
[00:28.960 --> 00:28.960] Military veteran Eric Weinstein began 69 Whiskey as a college radio show on 107.7 The
[00:28.960 --> 00:28.960] Bronx, located on the campus of Ryder University in Lawrenceville, New Jersey.
[00:28.960 --> 00:28.960] A show once restrained by rules and boundaries now comes straight to you raw, uncensored and
[00:28.960 --> 00:28.960] Military veteran Eric Weinstein began 69 Whiskey as a college radio show on 107.7 The
[00:28.960 --> 00:28.960] Bronx, located on the campus of Ryder University in Lawrenceville, New Jersey.
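For completeness, here is a small sketch of reading the per-word timings from the JSON file the second command writes (the "words"/"start"/"end" field names are what I see in this branch's output; treat them as an assumption if the schema changes):

```python
# Read word-level timings from the JSON written by
#   whisper --model small --output_format json --word_timestamps True test.mp3
import json

with open("test.json") as f:
    result = json.load(f)

for segment in result["segments"]:
    for word in segment.get("words", []):  # present when word timestamps are enabled
        print(f'{word["start"]:7.2f} -> {word["end"]:7.2f}  {word["word"]}')
```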
Hi @jongwook, since you first released the notebook for obtaining word-level timestamps I've been working on adding this to the whisper pipeline, and I've been testing alignment methods other than DTW. Have you tried anything else and found that it works better?
Also, I've been struggling a lot with hallucinations, especially for Spanish content. I've created a cleaner function at the segment level; is there a smarter way?
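For reference, this is roughly the kind of segment-level cleaner I mean (a rough sketch; the thresholds are arbitrary examples, and no_speech_prob / avg_logprob are the per-segment fields Whisper already reports):

```python
# Rough sketch of a segment-level hallucination cleaner:
# drop exact repeats of the previous segment and segments that the model
# itself flags as unlikely speech. Thresholds are arbitrary examples.
def clean_segments(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    cleaned = []
    previous_text = None
    for seg in segments:
        text = seg["text"].strip()
        if text == previous_text:  # repeated line, likely a hallucination loop
            continue
        if seg.get("no_speech_prob", 0.0) > no_speech_threshold:
            continue
        if seg.get("avg_logprob", 0.0) < logprob_threshold:
            continue
        cleaned.append(seg)
        previous_text = text
    return cleaned
```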
Is there any chance of getting word-level timestamps in the Whisper API?
Hi @IgnacioSan22, the custom DTW implementation in this PR was for the license issue as noted by others and also for the speed. An alternative is to use the timestamp predictions from the model, but we found that it's less reliable than using the attention patterns like in this PR. If you have solutions using any other algorithms for alignment, please let me know!
The community had some success handling hallucinations by preprocessing the inputs with VAD, like:
- A possible solution to Whisper hallucination #679
- Whisper WebUI with a VAD for more accurate non-English transcripts (Japanese) #397
- https://github.com/m-bain/whisperX
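For illustration, a minimal way to get speech regions with Silero VAD before passing audio to Whisper (a sketch only; how you chunk and merge the detected regions depends on your pipeline):

```python
# Sketch: detect speech regions with Silero VAD so that silence never reaches Whisper.
# Assumes a 16 kHz input file named "audio.wav".
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("audio.wav", sampling_rate=16000)
speech_regions = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_regions)  # list of {"start": sample_idx, "end": sample_idx} dicts
```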
Hi @ioskevinshah, this feature is still experimental but we do plan to add it to the API as an option, once we're sure that it's reliable enough.
@jongwook is there a way to access it via a beta flag for instance? How can we know when something is/isn't added to the API?
For the API, the speech-to-text guide and the audio API reference provide the full documentation of the available features. These documents will be updated accordingly as we roll out new features.
Hi @jongwook, I've tried the Hungarian algorithm and in some cases the results are better; however, due to a lack of resources I'm not able to run a proper study to find the best alignment algorithm. For hallucinations I've developed a post-processing function that cleans the segments. It improves things quite a lot, but I'll check those references.
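To give a concrete idea of what I tried, here is an illustrative sketch of assignment-based alignment via scipy's linear_sum_assignment; the cost matrix below is random just to show the call, whereas in practice it would come from the cross-attention weights, and unlike DTW this does not enforce a monotonic path:

```python
# Illustrative Hungarian-algorithm alignment sketch (not the code from this PR).
# `cost` is a made-up (num_tokens x num_frames) matrix; in a real run it would be
# derived from cross-attention weights, e.g. cost = -attention.
import numpy as np
from scipy.optimize import linear_sum_assignment

num_tokens, num_frames = 4, 10
rng = np.random.default_rng(0)
cost = rng.random((num_tokens, num_frames))

token_idx, frame_idx = linear_sum_assignment(cost)  # one frame per token, minimizing total cost
for t, f in zip(token_idx, frame_idx):
    print(f"token {t} -> frame {f}")
```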
Thanks
One more question: when will this new feature be rolled out?
Is there any workaround, or logic we could apply after the API response?
This is awesome! Is there a way to pass in pre-transcribed text that whisper can use for more accurate alignment?