
Improving timestamps for words

akatkov7 opened this issue 1 year ago

Hi,

Wanted to start off with a big thanks for porting whisper to C++. It has been very useful for integration on iOS.

I wanted to open this issue to see if you had any thoughts or suggestions on how to address an issue we're seeing in one of our sample audio files.

Roughly 56 seconds into the audio file, one of the people says 'What?' after a longish pause (6-7 seconds) since the previous word.

Running it through whisper.cpp:

[49.9 --> 50.47] | What| (Confidence: 0.83800673)

Running it through openai/whisper:

>>> stab_segments[-3]
{'id': 26, 'seek': 2840, 'start': 49.4, 'end': 50.4, 'text': ' What?', 'tokens': [708, 30], 'temperature': 0.0, 'avg_logprob': -0.36360436898690685, 'compression_ratio': 1.6474820143884892, 'no_speech_prob': 0.00023565757146570832}

Running it through jianfch/stable-ts:

>>> results['segments'][-2]['whole_word_timestamps']
[{'word': ' What?', 'timestamp': 56.8799991607666}]

So it looks like stable-ts has made some changes that properly detect the timing of "What?". It appears stable-ts has some silence detection that is likely helping it in this scenario.

So my question is: is there anything I can do on the input side to improve these scenarios? Is the only solution to adjust the core logic in the library itself? Are some of the improvements from stable-ts scheduled to be added to this repo as well?
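One input-side idea I've been thinking about is to locate long silent gaps in the audio beforehand and use them to sanity-check the reported timestamps. Here is a minimal energy-based sketch in plain Python (the frame size, RMS threshold, and minimum gap length are made-up values, and none of this is part of whisper.cpp):

```python
# Sketch: energy-based silence detection over raw PCM samples.
# Flags long silent spans so a downstream step can sanity-check
# segment timestamps that fall inside them.

def silent_gaps(samples, sample_rate, frame_ms=20, threshold=0.01, min_gap_s=2.0):
    """Return (start_s, end_s) spans whose per-frame RMS stays below `threshold`."""
    frame_len = int(sample_rate * frame_ms / 1000)
    gaps, gap_start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(x * x for x in frame) / frame_len) ** 0.5
        t = i / sample_rate
        if rms < threshold:
            if gap_start is None:
                gap_start = t          # silence begins at this frame
        else:
            if gap_start is not None and t - gap_start >= min_gap_s:
                gaps.append((gap_start, t))
            gap_start = None           # speech resumed; discard short gaps
    # Handle trailing silence at the end of the file.
    if gap_start is not None and len(samples) / sample_rate - gap_start >= min_gap_s:
        gaps.append((gap_start, len(samples) / sample_rate))
    return gaps

if __name__ == "__main__":
    import math
    sr = 1000  # toy sample rate to keep the demo small
    tone = [0.5 * math.sin(2 * math.pi * 50 * t / sr) for t in range(3 * sr)]
    samples = tone + [0.0] * (4 * sr) + tone   # 4-second silent gap in the middle
    print(silent_gaps(samples, sr))            # one gap, roughly (3.0, 7.0)
```

With a gap list like this, a segment such as `[49.9 --> 50.47] What` that starts inside a 6-7 second silent span could at least be flagged as suspicious.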

I have attached the sample audio file. (GitHub didn't like the wav directly so I zipped it up).

bad_caption_timing.wav.zip

Thanks again.

akatkov7 avatar Dec 13 '22 17:12 akatkov7

I'm not sure what the best way to fix this is. There is, however, an interesting effect that I have observed, and it is easily reproduced with your audio file. To demonstrate, here are 4 different results from whisper.cpp for different starting offsets (0 ms, 100 ms, 200 ms, 300 ms):

offset = 0 ms (default)
$  ./main -m ./models/ggml-base.en.bin -f ./bad_caption_timing.wav
whisper_model_load: loading model from './models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | 

main: processing './bad_caption_timing.wav' (987847 samples, 61.7 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:05.340]   You're going to start feeling it better and better and better.
[00:00:05.340 --> 00:00:08.300]   See the work we're doing now is essential.
[00:00:08.300 --> 00:00:10.100]   But feel that, exactly.
[00:00:10.100 --> 00:00:11.900]   Connect and boom.
[00:00:11.900 --> 00:00:13.900]   Yeah, there you go.
[00:00:13.900 --> 00:00:14.900]   Tuck in.
[00:00:14.900 --> 00:00:15.900]   More.
[00:00:15.900 --> 00:00:16.900]   Straight legs.
[00:00:16.900 --> 00:00:17.900]   Feet together.
[00:00:17.900 --> 00:00:18.900]   Tuck in.
[00:00:18.900 --> 00:00:19.900]   More.
[00:00:19.900 --> 00:00:20.900]   Yeah, here.
[00:00:20.900 --> 00:00:22.900]   And now we track here.
[00:00:22.900 --> 00:00:23.900]   Yeah, there you go.
[00:00:23.900 --> 00:00:24.900]   Stay there.
[00:00:24.900 --> 00:00:25.900]   Stay there.
[00:00:25.900 --> 00:00:26.900]   There you go.
[00:00:26.900 --> 00:00:27.900]   Breathe.
[00:00:27.900 --> 00:00:30.900]   There you go.
[00:00:30.900 --> 00:00:33.900]   There you go.
[00:00:33.900 --> 00:00:35.900]   That's good.
[00:00:35.900 --> 00:00:36.900]   Better.
[00:00:36.900 --> 00:00:37.900]   That's a brutal part.
[00:00:37.900 --> 00:00:39.900]   That's good right there.
[00:00:39.900 --> 00:00:40.900]   You got it.
[00:00:40.900 --> 00:00:41.900]   Forward.
[00:00:41.900 --> 00:00:42.900]   Let's go.
[00:00:42.900 --> 00:00:43.900]   Exhale.
[00:00:43.900 --> 00:00:44.900]   Let's go.
[00:00:44.900 --> 00:00:45.900]   Yeah, let's go.
[00:00:45.900 --> 00:00:46.900]   Tuck in.
[00:00:46.900 --> 00:00:47.900]   Tuck in.
[00:00:47.900 --> 00:00:48.900]   Yeah, there you go.
[00:00:48.900 --> 00:00:49.900]   Boom.
[00:00:49.900 --> 00:00:50.900]   What?
[00:00:50.900 --> 00:00:51.900]   Let's go.
[00:00:51.900 --> 00:00:58.900]   Nice.
[00:00:58.900 --> 00:01:00.900]   It's a...


whisper_print_timings:     load time =   109.74 ms
whisper_print_timings:      mel time =    92.28 ms
whisper_print_timings:   sample time =    32.94 ms
whisper_print_timings:   encode time =  1295.63 ms / 215.94 ms per layer
whisper_print_timings:   decode time =  1124.42 ms / 187.40 ms per layer
whisper_print_timings:    total time =  2656.81 ms
offset = 100 ms
$  ./main -m ./models/ggml-base.en.bin -f ./bad_caption_timing.wav -ot 100
whisper_model_load: loading model from './models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | 

main: processing './bad_caption_timing.wav' (987847 samples, 61.7 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.100 --> 00:00:05.440]   You're going to start feeling it better and better and better.
[00:00:05.440 --> 00:00:08.260]   See the work we're doing now is essential.
[00:00:08.260 --> 00:00:11.540]   We feel that, exactly, connect and boom.
[00:00:11.540 --> 00:00:13.540]   Yeah, there you go.
[00:00:13.540 --> 00:00:15.700]   Tuck in, more.
[00:00:15.700 --> 00:00:18.100]   Straight legs, fit together.
[00:00:18.100 --> 00:00:20.180]   Tuck in, more.
[00:00:20.180 --> 00:00:22.940]   Yeah, here and now retract here.
[00:00:22.940 --> 00:00:24.140]   Yeah, there you go.
[00:00:24.140 --> 00:00:25.940]   Stay there, stay there.
[00:00:25.940 --> 00:00:27.140]   There you go, breathe.
[00:00:27.140 --> 00:00:32.140]   There you go, there you go.
[00:00:32.140 --> 00:00:33.140]   Better.
[00:00:33.140 --> 00:00:38.140]   That's a brutal part.
[00:00:38.140 --> 00:00:40.140]   That's good right there.
[00:00:40.140 --> 00:00:41.140]   You got it.
[00:00:41.140 --> 00:00:43.140]   Forward, let's go, accept.
[00:00:43.140 --> 00:00:45.140]   Let's go, accept.
[00:00:45.140 --> 00:00:46.140]   Yeah, let's go.
[00:00:46.140 --> 00:00:49.140]   Tuck in, tuck in, yeah, there you go.
[00:00:49.140 --> 00:00:50.140]   Boom.
[00:00:50.140 --> 00:00:51.140]   What?
[00:00:51.140 --> 00:00:58.140]   Let's go.
[00:00:58.140 --> 00:01:01.140]   Nice.


whisper_print_timings:     load time =   113.64 ms
whisper_print_timings:      mel time =    98.95 ms
whisper_print_timings:   sample time =    28.41 ms
whisper_print_timings:   encode time =  1046.77 ms / 174.46 ms per layer
whisper_print_timings:   decode time =   922.95 ms / 153.82 ms per layer
whisper_print_timings:    total time =  2212.26 ms
offset = 200 ms
$  ./main -m ./models/ggml-base.en.bin -f ./bad_caption_timing.wav -ot 200
whisper_model_load: loading model from './models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | 

main: processing './bad_caption_timing.wav' (987847 samples, 61.7 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.200 --> 00:00:05.540]   You're going to start feeling it better and better and better.
[00:00:05.540 --> 00:00:08.360]   See the work we're doing now is essential.
[00:00:08.360 --> 00:00:16.640]   The feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel,
[00:00:16.640 --> 00:00:23.720]   the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel,
[00:00:23.720 --> 00:00:28.720]   the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel,
[00:00:28.720 --> 00:00:35.720]   the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel,
[00:00:35.720 --> 00:00:43.720]   the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel,
[00:00:43.720 --> 00:00:51.720]   the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel,
[00:00:51.720 --> 00:00:57.720]   the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel,
[00:00:57.720 --> 00:01:03.720]   the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel, the feel,


whisper_print_timings:     load time =   120.85 ms
whisper_print_timings:      mel time =    92.84 ms
whisper_print_timings:   sample time =    35.38 ms
whisper_print_timings:   encode time =  1014.18 ms / 169.03 ms per layer
whisper_print_timings:   decode time =  1093.42 ms / 182.24 ms per layer
whisper_print_timings:    total time =  2358.17 ms
offset = 300 ms
$  ./main -m ./models/ggml-base.en.bin -f ./bad_caption_timing.wav -ot 300
whisper_model_load: loading model from './models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | 

main: processing './bad_caption_timing.wav' (987847 samples, 61.7 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.300 --> 00:00:05.580]   You're going to start feeling it better and better and better.
[00:00:05.580 --> 00:00:08.460]   See the work we're doing now is essential.
[00:00:08.460 --> 00:00:10.100]   But feel that, exactly.
[00:00:10.100 --> 00:00:11.740]   Connect and boom.
[00:00:11.740 --> 00:00:14.580]   Yeah, there you go.
[00:00:14.580 --> 00:00:15.580]   Tuck in.
[00:00:15.580 --> 00:00:16.580]   More.
[00:00:16.580 --> 00:00:17.580]   Straight legs.
[00:00:17.580 --> 00:00:18.580]   Feet together.
[00:00:18.580 --> 00:00:19.580]   Tuck in.
[00:00:19.580 --> 00:00:20.580]   More.
[00:00:20.580 --> 00:00:21.580]   Yeah, here.
[00:00:21.580 --> 00:00:22.580]   And now we track here.
[00:00:22.580 --> 00:00:23.580]   Yeah, there you go.
[00:00:23.580 --> 00:00:24.580]   Stay there.
[00:00:24.580 --> 00:00:25.580]   Stay there.
[00:00:25.580 --> 00:00:26.580]   There you go.
[00:00:26.580 --> 00:00:27.580]   Breathe.
[00:00:27.580 --> 00:00:32.580]   There you go.
[00:00:32.580 --> 00:00:34.580]   There you go.
[00:00:34.580 --> 00:00:35.580]   It's good.
[00:00:35.580 --> 00:00:36.580]   Better.
[00:00:36.580 --> 00:00:37.580]   That's a brutal part.
[00:00:37.580 --> 00:00:39.580]   That's good right there.
[00:00:39.580 --> 00:00:40.580]   You got it.
[00:00:40.580 --> 00:00:41.580]   Forward.
[00:00:41.580 --> 00:00:42.580]   Let's go.
[00:00:42.580 --> 00:00:43.580]   Exhale.
[00:00:43.580 --> 00:00:44.580]   Let's go.
[00:00:44.580 --> 00:00:45.580]   Yeah, let's go.
[00:00:45.580 --> 00:00:46.580]   Tuck in.
[00:00:46.580 --> 00:00:47.580]   Tuck in.
[00:00:47.580 --> 00:00:48.580]   Yeah, there you go.
[00:00:48.580 --> 00:00:49.580]   Boom.
[00:00:49.580 --> 00:00:50.580]   What?
[00:00:50.580 --> 00:00:51.580]   That's good.
[00:00:51.580 --> 00:00:57.580]   What?
[00:00:57.580 --> 00:00:58.580]   Let's go.
[00:00:58.580 --> 00:00:59.580]   Nice.
[00:00:59.580 --> 00:01:01.580]   That's it.


whisper_print_timings:     load time =   124.40 ms
whisper_print_timings:      mel time =    93.07 ms
whisper_print_timings:   sample time =    34.79 ms
whisper_print_timings:   encode time =   967.82 ms / 161.30 ms per layer
whisper_print_timings:   decode time =  1026.30 ms / 171.05 ms per layer
whisper_print_timings:    total time =  2247.85 ms

As you can see, a very minor change in the initial offset of the audio processing can lead to dramatic variations in both the transcription result and the detected timestamps (when using the Greedy decoder). In the last case, for example, the timing of "What?" is better (but still not correct).

So yeah, I'm not sure what this means exactly, but I feel we should first understand better why this variation occurs, use that to build a more robust decoding strategy, and from there hopefully get better timestamp generation.
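For illustration, a stable-ts-style correction could look roughly like this: if a segment's start falls inside a detected silent span that ends before the segment does, snap the start to the end of the silence. This is a plain-Python sketch with made-up numbers, not the actual stable-ts or whisper.cpp logic:

```python
# Sketch: snap segment starts out of known silent spans.
# `segments` are (start_s, end_s, text); `gaps` are silent (start_s, end_s) spans,
# e.g. from an energy- or VAD-based pre-pass. Hypothetical helper, not an API.

def snap_to_speech(segments, gaps):
    fixed = []
    for start, end, text in segments:
        for g_start, g_end in gaps:
            # Segment begins inside a silent span that ends before the segment
            # does: the spoken content can only start once the silence is over.
            if g_start <= start < g_end <= end:
                start = g_end
                break
        fixed.append((start, end, text))
    return fixed

if __name__ == "__main__":
    # Toy version of the "What?" case: the decoder assigns a start inside
    # the long pause, while the word is actually spoken near the end of it.
    segments = [(48.9, 49.9, "Boom."), (49.9, 57.0, "What?")]
    gaps = [(49.9, 56.6)]
    print(snap_to_speech(segments, gaps))
```

This only moves starts forward out of silence; a real implementation would also need to handle ends, overlapping gaps, and segments that lie entirely inside a silent span.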

ggerganov avatar Dec 13 '22 21:12 ggerganov

Is it possible to port forced alignment with phoneme-based ASR models? https://github.com/m-bain/whisperX

gut4 avatar Dec 26 '22 14:12 gut4

@gut4 This is too complicated for the whisper.cpp project so likely won't be included.

There are some alternative ideas that might help with word timestamps in the future. Keep an eye on #291

ggerganov avatar Dec 29 '22 11:12 ggerganov

+1 on this. I have many scenarios where pauses exist in the audio. As an example, a word may be detected 3.8 seconds into the start of an audio transcription, but the start time whisper returns is 0 microseconds. A handful of tokens later, the start times correct their alignment. This makes it tricky to determine when the timestamps realign with the audio. The problem seems to be triggered primarily by long pauses. Attempting to offset the alignment issue with a VAD model helps, but is not perfect.
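One way to at least detect when the timestamps have realigned is to compare each reported segment start against the nearest speech onset from the VAD pass and flag large drifts. A small sketch (hypothetical helper, not a whisper.cpp or VAD-library API):

```python
# Sketch: flag segments whose reported start drifts far from the nearest
# VAD speech onset, to spot spans where timestamps are misaligned.

def drift_report(segment_starts, vad_onsets, max_drift_s=1.0):
    """For each segment start, report (start, nearest_onset, drift, aligned?)."""
    report = []
    for start in segment_starts:
        nearest = min(vad_onsets, key=lambda o: abs(o - start))
        drift = start - nearest
        report.append((start, nearest, drift, abs(drift) <= max_drift_s))
    return report

if __name__ == "__main__":
    # First segment reported at 0.0 even though speech starts at 3.8,
    # as in the scenario described above; later segments line up.
    starts = [0.0, 5.2, 9.1]
    onsets = [3.8, 5.0, 9.0]
    for row in drift_report(starts, onsets):
        print(row)
```

Segments flagged as misaligned could then be post-processed (e.g. snapped to the VAD onset) instead of trusting the raw timestamp.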

chriskyndrid avatar Jul 04 '23 00:07 chriskyndrid

I'm running into this issue as well. If there are pauses, or the audio starts several seconds in, the timestamps get "stretched" over the silent part. I'm not sure where in the code to even start looking to debug this.

bmurray avatar Jul 05 '23 21:07 bmurray