
Bulk repetition

Open garthk opened this issue 1 year ago • 2 comments

I'm not sure if this is a variant of #412, but check out this partial output:

[00:25:16.880 --> 00:25:20.240]   And you're like, this character needs some like thigh highs and like, it should have
[00:25:20.240 --> 00:25:21.240]   been a bit of a dresser.
[00:25:21.240 --> 00:25:22.240]   It should have been a dresser.
[00:25:22.240 --> 00:25:23.240]   It should have been a dresser.
[00:25:23.240 --> 00:25:24.240]   It should have been a dresser.
[00:25:24.240 --> 00:25:25.240]   It should have been a dresser.
[3333 additional repetitions elided]
[01:21:40.240 --> 01:21:41.240]   It should have been a dresser.
[01:21:41.240 --> 01:21:42.240]   It should have been a dresser.
[01:21:42.240 --> 01:21:43.240]   It should have been a dresser.
[01:21:43.240 --> 01:21:44.240]   It should have been a dresser.
[01:21:44.240 --> 01:21:45.240]   It should have been a dresser.
[01:21:45.240 --> 01:21:51.240]   Whether it's true or not is first and foremost a bluff to stop you from doing the right thing.

Reproduction:

./models/download-ggml-model.sh base.en
make
curl -o episode.mp3 -L https://mcdn.podbean.com/mf/web/5ein65/07-31-Clear-Present-free.mp3
ffmpeg -i episode.mp3 -ar 16000 episode.wav
./main -f episode.wav 
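
(main expects 16 kHz WAV input, so the MP3 has to be resampled to 16000 Hz during conversion.)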

Standard error:

whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: processing 'episode.wav' (94221793 samples, 5888.9 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


whisper_print_timings:     fallbacks =   3 p /   9 h
whisper_print_timings:     load time =   120.07 ms
whisper_print_timings:      mel time =  8174.57 ms
whisper_print_timings:   sample time = 21253.98 ms / 46180 runs (    0.46 ms per run)
whisper_print_timings:   encode time = 84284.79 ms /   246 runs (  342.62 ms per run)
whisper_print_timings:   decode time = 139710.86 ms / 46321 runs (    3.02 ms per run)
whisper_print_timings:    total time = 253756.25 ms

I'm on the main branch at v1.2.0.

garthk · Feb 05 '23

Hi, thanks for the detailed steps - this helps a lot.

After debugging with WHISPER_DEBUG enabled, I can see immediately that in this case the entropy-based check for repetition didn't trigger. The computed entropy was just slightly above the default threshold of 2.4:

whisper_full: decoder  0: score = -0.15161, result_len = 220, avg_logprobs = -0.15161, entropy =  2.44152
whisper_full: best decoder = 0
[00:25:20.240 --> 00:25:21.240]   been a bit of a dresser.
[00:25:21.240 --> 00:25:22.240]   It should have been a dresser.
[00:25:22.240 --> 00:25:23.240]   It should have been a dresser.
[00:25:23.240 --> 00:25:24.240]   It should have been a dresser.
[00:25:24.240 --> 00:25:25.240]   It should have been a dresser.
[00:25:25.240 --> 00:25:26.240]   It should have been a dresser.
[00:25:26.240 --> 00:25:27.240]   It should have been a dresser.
[00:25:27.240 --> 00:25:28.240]   It should have been a dresser.
[00:25:28.240 --> 00:25:29.240]   It should have been a dresser.
[00:25:29.240 --> 00:25:30.240]   It should have been a dresser.
[00:25:30.240 --> 00:25:31.240]   It should have been a dresser.
[00:25:31.240 --> 00:25:32.240]   It should have been a dresser.
[00:25:32.240 --> 00:25:33.240]   It should have been a dresser.
[00:25:33.240 --> 00:25:34.240]   It should have been a dresser.
[00:25:34.240 --> 00:25:35.240]   It should have been a dresser.
[00:25:35.240 --> 00:25:36.240]   It should have been a dresser.
[00:25:36.240 --> 00:25:37.240]   It should have been a dresser.
[00:25:37.240 --> 00:25:38.240]   It should have been a dresser.
[00:25:38.240 --> 00:25:39.240]   It should have been a dresser.
[00:25:39.240 --> 00:25:40.240]   It should have been a dresser.
[00:25:40.240 --> 00:25:41.240]   It should have been a dresser.
[00:25:41.240 --> 00:25:42.240]   It should have been a dresser.
seek = 154224, seek_delta = 2200

This means the decoder didn't "detect" that there was a repetition and therefore didn't use the fallback strategy to correct it: the fallback only kicks in when the entropy of the recent tokens falls below the threshold.
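
For context, the check works roughly like this: compute the Shannon entropy of the token distribution over the tail of the decoded sequence, and treat a low value (many repeated tokens) as a sign of looping that warrants a higher-temperature retry. Here is an illustrative C++ sketch, not the actual whisper.cpp code; the function name, the window size, and the example token IDs are all made up:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <map>
#include <vector>

// Shannon entropy of the token distribution over the last `window` tokens.
// Heavy repetition -> few distinct tokens -> low entropy.
static double tail_entropy(const std::vector<int> & tokens, size_t window) {
    const size_t n = std::min(window, tokens.size());
    if (n == 0) {
        return 0.0;
    }

    // count how often each token id appears in the tail
    std::map<int, int> counts;
    for (size_t i = tokens.size() - n; i < tokens.size(); ++i) {
        counts[tokens[i]]++;
    }

    // entropy = -sum(p * log(p)) over the observed token frequencies
    double entropy = 0.0;
    for (const auto & kv : counts) {
        const double p = double(kv.second) / double(n);
        entropy -= p * std::log(p);
    }
    return entropy;
}

int main() {
    // A looping segment like "It should have been a dresser." reuses the
    // same handful of tokens, so its tail entropy is low.
    const std::vector<int> looping = {
        10, 11, 12, 13, 14, 15, 10, 11, 12, 13, 14, 15,
        10, 11, 12, 13, 14, 15, 10, 11, 12, 13, 14, 15,
    };

    const double entropy_thold = 2.4; // the default threshold discussed above
    const double e = tail_entropy(looping, 32);
    printf("entropy = %.5f -> %s\n", e, e < entropy_thold ? "fallback" : "keep");
    return 0;
}

For this toy sequence the entropy is ln(6) ≈ 1.79, comfortably below 2.4; the failing segment here had just enough distinct tokens mixed in to land at 2.44152, barely above the threshold, which is why the fallback never fired.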

Rerunning the transcription with a slightly increased entropy threshold of --entropy-thold 2.5 resolves the issue.
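
For example:

./main -f episode.wav --entropy-thold 2.5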

Obviously, this is not a very nice approach, since there is normally no way to see this debug information. But that is the general problem with this kind of free parameter: the default values are not always going to work and might need a little tuning in some cases.

I'll try to think of some more robust way to detect the repetitions.

ggerganov · Feb 05 '23

That entropy threshold did the trick for that episode. Thanks!

garthk · Feb 06 '23