
Add option to carry initial_prompt with the sliding window

Open kittsil opened this pull request 1 year ago • 5 comments

Background

Whisper's transcribe() struggles with contextual proper nouns if they appear after the initial prompt has been consumed; see some experimental results here. This PR solves that issue by allowing the initial "context" prompt to be carried along as the sliding window moves through the audio.

Changes

Add an option carry_initial_prompt = False to whisper.transcribe().

When carry_initial_prompt is set to True, initial_prompt is prepended to each internal decode() call's prompt. If there is not enough context space at the start of the prompt, the prompt is left-sliced to make space.
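
A minimal sketch of that behavior (hypothetical helper and names, not the PR's actual diff), assuming the 223-token prompt budget discussed later in this thread:

    # Sketch only: prepend the carried initial prompt, then left-slice the
    # rolling context so the combined prompt fits the decoder's budget.
    def build_window_prompt(initial_prompt_tokens, context_tokens, max_prompt_len=223):
        carried = initial_prompt_tokens[-max_prompt_len:]  # never exceed the budget
        remaining = max_prompt_len - len(carried)
        if remaining <= 0:
            return carried
        # Drop the oldest context tokens to make room for the carried prompt.
        return carried + context_tokens[-remaining:]

For example, with a 10-token budget, a 4-token initial prompt, and 20 tokens of rolling context, the prompt becomes the 4 carried tokens plus the 6 most recent context tokens.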

kittsil avatar Sep 18 '24 03:09 kittsil

There are outstanding issues with this PR:

  1. I have not found the definition of the 224 context token length.
  2. It prepends the initial_prompt to itself before enough tokens have been generated, resulting in a tendency to loop (see the sketch after this list).
  3. I have not written tests.
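
One hypothetical guard for item 2 (not part of this PR): skip the carry until the rolling context has grown past the initial prompt itself, so the prompt is never stacked onto its own copy.

    # Hypothetical guard, not in the PR: carry only once the rolling context
    # no longer begins with the initial prompt's own tokens.
    def should_carry(initial_prompt_tokens, context_tokens):
        return context_tokens[:len(initial_prompt_tokens)] != initial_prompt_tokens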

Closing this PR since I can't find a way to move it to draft.

kittsil avatar Sep 18 '24 04:09 kittsil

Closing this PR since I can't find a way to move it to draft.

How to: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/changing-the-stage-of-a-pull-request

ryanheise avatar Sep 18 '24 04:09 ryanheise

Also a relevant discussion here: https://github.com/openai/whisper/pull/1040#issuecomment-1457651898

I have not found the definition of the 224 context token length.

It's part of the model dimensions: 448 tokens total, with half of that reserved for the prompt. The logic is in decoding.py; look for self.n_ctx: int = model.dims.n_text_ctx and follow its references.
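
The slice in question, lightly paraphrased from decoding.py's _get_initial_tokens (a sketch from memory, not an exact quote):

    # Paraphrased from whisper/decoding.py: the prompt is left-sliced to
    # n_ctx // 2 - 1 tokens so that, together with the sot_prev marker,
    # it occupies at most half of the 448-token text context.
    if prompt:
        prompt_tokens = tokenizer.encode(" " + prompt.strip()) if isinstance(prompt, str) else prompt
        tokens = [tokenizer.sot_prev] + prompt_tokens[-(n_ctx // 2 - 1):] + tokens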

ryanheise avatar Sep 18 '24 04:09 ryanheise

@ryanheise Thank you for your input; it was helpful. Do you mind providing any additional feedback?


Aside: I did find the left-slice in the code, and it turns out the docs are wrong: the maximum prompt length is actually 223 tokens, not 224!

Confirming with the medium.en model...

>>> medium = torch.load('/home/kittsil/.cache/whisper/medium.en.pt')
>>> medium['dims']
{'n_mels': 80, 'n_vocab': 51864, 'n_audio_ctx': 1500, 'n_audio_state': 1024, 'n_audio_head': 16, 'n_audio_layer': 24, 'n_text_ctx': 448, 'n_text_state': 1024, 'n_text_head': 16, 'n_text_layer': 24}
>>> medium['dims']['n_text_ctx'] // 2 - 1
223

kittsil avatar Sep 19 '24 04:09 kittsil

Hello, if I merge this locally, what command option do I add to prevent Whisper from losing punctuation during transcription?

Can you also update it here so I can install it directly: https://github.com/kittsil/whisper/tree/patch-1

@kittsil

FurkanGozukara avatar Oct 05 '24 15:10 FurkanGozukara

Why is this very important feature still not merged, @jongwook?

FurkanGozukara avatar Oct 22 '24 08:10 FurkanGozukara

@kittsil I use the CLI, so adding --carry_initial_prompt will work, right?

FurkanGozukara avatar Oct 22 '24 08:10 FurkanGozukara

I am transcribing a 3-hour video; it is working great so far.

How can errors like this be fixed?

[screenshot of the transcription errors]

FurkanGozukara avatar Oct 22 '24 09:10 FurkanGozukara

How can errors like this be fixed?

[screenshot of the transcription errors]

@FurkanGozukara, that's an issue with Whisper, not with your prompt. You can try setting compression_ratio_threshold lower; I have found some success with 1.7 (as opposed to the default 2.4).
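
For example, a minimal sketch in Python (file name and prompt text are placeholders; carry_initial_prompt is the option added by this PR):

    import whisper

    model = whisper.load_model("medium.en")
    result = model.transcribe(
        "talk.mp3",                         # placeholder audio file
        initial_prompt="Kittsil, Whisper",  # placeholder proper nouns to bias decoding
        carry_initial_prompt=True,          # option added by this PR
        compression_ratio_threshold=1.7,    # stricter than the 2.4 default
    )
    print(result["text"])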

In general, though, I wouldn't comment on a PR for debugging help; it's best to keep PRs focused on the request / review process.

kittsil avatar Oct 23 '24 02:10 kittsil

@kittsil, thank you so much; your PR saved me a lot of trouble.

I transcribed this 3-hour video, and without your PR I would have been devastated, because YouTube's auto-timing also failed :D

https://youtu.be/FvpWy1x5etM

FurkanGozukara avatar Oct 25 '24 00:10 FurkanGozukara