word-level timestamps in `transcribe()`
This DTW dependency introduces a licence incompatibility, but if I remember correctly, an alternative was suggested earlier in the discussions.
Edit: Alternative library recommended in https://github.com/openai/whisper/discussions/813#discussioncomment-4617447
Hi!
I tried out this branch with kwargs['word_level_timestamps'] = True
but the model performed very slowly. In addition (or rather, because of that), it started to hallucinate like mad.
I'm using chunks of short (a couple of seconds) audio data in German, produced by a VAD, for live transcription.
Maybe it's a problem on my side; maybe someone can try to reproduce?
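For reference, this is roughly how I'm invoking it (a simplified sketch; the `word_level_timestamps` kwarg name is the one from this branch, and `chunk.wav` stands in for one short VAD chunk):

```python
# Simplified reproduction sketch for this branch's word-level timestamps.
# Assumptions: transcribe() on this branch accepts word_level_timestamps=True,
# and "chunk.wav" is a ~2 s German speech chunk produced by the VAD.
import whisper

model = whisper.load_model("medium")
result = model.transcribe(
    "chunk.wav",
    language="de",
    word_level_timestamps=True,  # kwarg name as used on this branch
)
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```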
Thanks for the comments, all -- this is work in progress and not quite ready for merging. I'm trying to address both hallucination and performance concerns.
Yet another DTW implementation, FYI. I can't vouch for it other than to say that it is Apache licensed, recently updated, and has both pure Python and C implementations.
https://github.com/wannesm/dtaidistance
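A minimal usage sketch, not integrated with this PR's code (the two sequences here are made-up 1-D arrays, just to show the API):

```python
# dtaidistance usage sketch: distance and warping path between two 1-D series.
# token_curve / audio_curve are made-up example sequences, not Whisper features.
import numpy as np
from dtaidistance import dtw

token_curve = np.array([0.0, 0.2, 0.9, 1.0], dtype=np.double)
audio_curve = np.array([0.0, 0.1, 0.3, 0.8, 1.0], dtype=np.double)

distance = dtw.distance(token_curve, audio_curve)   # DTW distance (C backend if available)
path = dtw.warping_path(token_curve, audio_curve)   # list of (i, j) index pairs
print(distance, path)
```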
Hi, thanks for the great work!
I would like to ask whether it is safe to swap in a smaller model (e.g. tiny) for word-level alignment, i.e. to compute the attention scores, instead of using the same model (e.g. medium or large) that generated the transcription. I suspect this could improve inference speed if the option were supported.
I found an interesting edge case with the small model where enabling the word-level timestamps option causes it to repeat the prompt at the end of the audio while also failing to infer the last word.
$ ffmpeg -t 29 -i https://audio2.redcircle.com/episodes/6b196013-8672-43d9-be52-4332b3207d93/stream.mp3 test.mp3
$ whisper --model small test.mp3
.../whisper/transcribe.py:98: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:15.920] Military veteran Eric Weinstein began 69 Whiskey as a college radio show on 107.7 The
[00:15.920 --> 00:21.720] Bronx, located on the campus of Ryder University in Lawrenceville, New Jersey.
[00:21.720 --> 00:27.560] A show once restrained by rules and boundaries now comes straight to you raw, uncensored and
[00:27.560 --> 00:28.960] unapologetic.
$ whisper --model small --output_format json --word_timestamps True test.mp3
.../whisper/transcribe.py:98: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:08.040 --> 00:15.940] Military veteran Eric Weinstein began 69 Whiskey as a college radio show on 107.7 The
[00:15.940 --> 00:21.320] Bronx, located on the campus of Ryder University in Lawrenceville, New Jersey.
[00:21.720 --> 00:28.980] A show once restrained by rules and boundaries now comes straight to you raw, uncensored and
[00:28.960 --> 00:28.960] Military veteran Eric Weinstein began 69 Whiskey as a college radio show on 107.7 The
[00:28.960 --> 00:28.960] Bronx, located on the campus of Ryder University in Lawrenceville, New Jersey.
[00:28.960 --> 00:28.960] A show once restrained by rules and boundaries now comes straight to you raw, uncensored and
[00:28.960 --> 00:28.960] Military veteran Eric Weinstein began 69 Whiskey as a college radio show on 107.7 The
[00:28.960 --> 00:28.960] Bronx, located on the campus of Ryder University in Lawrenceville, New Jersey.
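For completeness, here is a small sketch of reading the per-word timings from the JSON file the second command writes (the "words"/"start"/"end" field names are what I see in this branch's output; treat them as an assumption if the schema changes):

```python
# Read word-level timings from the JSON written by
#   whisper --model small --output_format json --word_timestamps True test.mp3
import json

with open("test.json") as f:
    result = json.load(f)

for segment in result["segments"]:
    for word in segment.get("words", []):  # present when word timestamps are enabled
        print(f'{word["start"]:7.2f} -> {word["end"]:7.2f}  {word["word"]}')
```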
Hi @jongwook, since you first released the notebook for obtaining word-level timestamps I've been working on adding this to the whisper pipeline, and I've been testing alignment methods other than DTW. Have you tried anything else and found that it works better?
Also, I've been struggling a lot with hallucinations, especially for Spanish content. I've created a cleaner function at the segment level; is there a smarter way?
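For reference, this is roughly the kind of segment-level cleaner I mean (a rough sketch; the thresholds are arbitrary examples, and no_speech_prob / avg_logprob are the per-segment fields Whisper already reports):

```python
# Rough sketch of a segment-level hallucination cleaner:
# drop exact repeats of the previous segment and segments that the model
# itself flags as unlikely speech. Thresholds are arbitrary examples.
def clean_segments(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    cleaned = []
    previous_text = None
    for seg in segments:
        text = seg["text"].strip()
        if text == previous_text:  # repeated line, likely a hallucination loop
            continue
        if seg.get("no_speech_prob", 0.0) > no_speech_threshold:
            continue
        if seg.get("avg_logprob", 0.0) < logprob_threshold:
            continue
        cleaned.append(seg)
        previous_text = text
    return cleaned
```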
Is there any chance of getting word-level timestamps in the Whisper API?
Hi @IgnacioSan22, the custom DTW implementation in this PR was for the license issue as noted by others and also for the speed. An alternative is to use the timestamp predictions from the model, but we found that it's less reliable than using the attention patterns like in this PR. If you have solutions using any other algorithms for alignment, please let me know!
The community had some success handling hallucinations by preprocessing the inputs with VAD, like:
- A possible solution to Whisper hallucination #679
- Whisper WebUI with a VAD for more accurate non-English transcripts (Japanese) #397
- https://github.com/m-bain/whisperX
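For illustration, a minimal way to get speech regions with Silero VAD before passing audio to Whisper (a sketch only; how you chunk and merge the detected regions depends on your pipeline):

```python
# Sketch: detect speech regions with Silero VAD so that silence never reaches Whisper.
# Assumes a 16 kHz input file named "audio.wav".
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("audio.wav", sampling_rate=16000)
speech_regions = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_regions)  # list of {"start": sample_idx, "end": sample_idx} dicts
```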
Hi @ioskevinshah, this feature is still experimental but we do plan to add it to the API as an option, once we're sure that it's reliable enough.
@jongwook is there a way to access it via a beta flag for instance? How can we know when something is/isn't added to the API?
For the API, the speech-to-text guide and the audio API reference provide the full documentation of the available features. These documents will be updated accordingly as we roll out new features.
Hi @jongwook, I've tried the Hungarian algorithm and in some cases the results are better; however, due to a lack of resources I'm not able to run a proper study to find the best alignment algorithm. For hallucinations I've developed a post-processing function that cleans the segments. It improves things quite a lot, but I'll check those references.
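To give a concrete idea of what I tried, here is an illustrative sketch of assignment-based alignment via scipy's linear_sum_assignment; the cost matrix below is random just to show the call, whereas in practice it would come from the cross-attention weights, and unlike DTW this does not enforce a monotonic path:

```python
# Illustrative Hungarian-algorithm alignment sketch (not the code from this PR).
# `cost` is a made-up (num_tokens x num_frames) matrix; in a real run it would be
# derived from cross-attention weights, e.g. cost = -attention.
import numpy as np
from scipy.optimize import linear_sum_assignment

num_tokens, num_frames = 4, 10
rng = np.random.default_rng(0)
cost = rng.random((num_tokens, num_frames))

token_idx, frame_idx = linear_sum_assignment(cost)  # one frame per token, minimizing total cost
for t, f in zip(token_idx, frame_idx):
    print(f"token {t} -> frame {f}")
```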
Thanks
One more question: when will this new feature be rolled out?
Is there any workaround, or logic we could apply after the API response?
This is awesome! Is there a way to pass in pre-transcribed text that whisper can use for more accurate alignment?