whisper
add hotwords feature
Hello! During transcription I often encounter proprietary or newly coined vocabulary that Whisper cannot handle well. I searched for solutions, and the community offers two options:
- Fine-tuning the model: this approach is costly, and it's not practical to fine-tune the model every time a new term emerges.
- Using `initial_prompt`: however, `initial_prompt` only applies to the first window. If specialized terms don't appear at the beginning, this method is ineffective.
Looking at other transcription models, I found that supporting hotwords is common practice, so I implemented this feature. My approach is to add hotword-related prompts before each transcription window. Since there's a maximum length limit, I reuse the space previously occupied by the prefix: hotwords take effect when the prefix isn't set. In my testing, this resolved the issue of specialized vocabulary in my scenario.
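The idea can be sketched roughly like this (a simplified sketch of the approach, not the actual PR diff; `tokenizer_encode` stands in for Whisper's tokenizer, and the token IDs in the usage below are illustrative):

```python
# Rough sketch of hotword injection: before decoding each window, the
# hotword tokens are placed after <|startofprev|>, in the slot normally
# used by the prefix/prompt. Names here are illustrative, not the PR diff.
def build_initial_tokens(tokenizer_encode, sot_prev, sot_sequence,
                         hotwords=None, n_ctx=448):
    tokens = list(sot_sequence)
    if hotwords is not None:
        hot = tokenizer_encode(" " + hotwords.strip())
        # keep at most half the text context, minus one slot for <|startofprev|>
        hot = hot[: n_ctx // 2 - 1]
        tokens = [sot_prev] + hot + tokens
    return tokens
```

Because the hotword tokens are rebuilt for every window, the terms stay in scope for the whole file, unlike `initial_prompt`.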
The following are the community discussions on this issue:
- https://github.com/openai/whisper/discussions/1477
- https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311
- https://stackoverflow.com/questions/73833916/how-can-i-give-some-hint-phrases-to-openais-whisper-asr
@jongwook Hello, please check out this PR.
Would this be duplicated effort, since there is already a parameter that serves the same purpose, `condition_on_previous_text`? If `condition_on_previous_text` is set to `True`, the previous output of the model is provided as a prompt for the next window. Correct me if I'm wrong. Thank you.
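For context, the feedback loop that `condition_on_previous_text` implements can be modeled in a few lines (a toy sketch with no Whisper dependency; `decode` stands in for the model):

```python
# Toy model of condition_on_previous_text: each window's decoded text is
# fed forward as the prompt for the NEXT window only. Hotwords differ in
# that a fixed prompt is injected before every window, regardless of what
# the previous window produced.
def transcribe_windows(windows, decode, condition_on_previous_text=True):
    prompt, pieces = "", []
    for window in windows:
        text = decode(window, prompt)
        pieces.append(text)
        if condition_on_previous_text:
            prompt = text  # previous output becomes the next prompt
    return " ".join(pieces)
```

So if a rare term never appears in the model's own output, conditioning on previous text cannot introduce it, which is the gap hotwords aim to fill.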
@James-Shared-Studios This isn't used to add context; it's used to add hotwords so that Whisper can recognize new words or terms when they come up. For example, ComfyUI is a new word (it is a powerful and modular Stable Diffusion GUI and backend); if you don't add it as a hotword, it won't be recognized correctly.
I tried it with a video where the following words were misspelled:

- "Kalichain" => "cl chain", "cali chain"
- "Kalicertif" => "c cerff", "cl ciff", "Cali certif"
- "Kalismarket" => "C's Market"
- "Kalishare" => "Cali share"
- "Kalistoken" => "Cali's token"
- "kijiji" => "kiji"
And indeed it worked to make these words no longer misspelled, with the following args:

```
whisper video.opus --hotwords "Kalichain, Kalicertif, Kalismarket, Kalishare, Kalistoken, kijiji, MEXC, Kalissa, FireHustle"
```

But it didn't work 100% of the time; sometimes they were still misspelled. Notably, `Kalicertif` was misspelled as `Kalistertif`.
So, when passing a series of proper nouns via hotwords, what is the maximum length that is actually supported? @jax-explorer
@JiweiZh It depends on the `n_text_ctx` value in the model's `dims`.
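For reference, the released Whisper checkpoints ship with `n_text_ctx = 448` (check `model.dims.n_text_ctx` for your own model), so under this PR's `n_ctx // 2 - 1` split the hotword budget works out as follows (a quick arithmetic sketch, assuming that standard value):

```python
# Token-budget arithmetic under the PR's split (assumes n_text_ctx = 448,
# the value used by the released Whisper checkpoints).
n_text_ctx = 448
hotword_budget = n_text_ctx // 2 - 1  # half the context, minus <|startofprev|>
print(hotword_budget)  # 223
```

That budget is in tokens, not words, so multi-token proper nouns consume more of it than their word count suggests.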
@jax-explorer Hello, I find this commit very useful and hope it gets merged soon. Currently, I'm using your forked repository to enjoy this feature. By the way, I have some questions about your implementation:
- You say that you occupy the space for `prefix`, but I'm not sure where the `prefix` comes from. Is `condition_on_previous_text` related to `prefix`?
- The current implementation divides `n_ctx` by 2 and assigns `prompt` and `hotwords` evenly. If I want to use `hotwords` more, is it valid to change `n_ctx // 2` to some other number? For example, I would skip `prompt` and use only `hotwords` whenever `hotwords` are provided, like below:
```python
if (hotwords := self.options.hotwords) is not None:
    hotwords_tokens = self.tokenizer.encode(" " + hotwords.strip())
    hotwords_tokens = hotwords_tokens[: self.n_ctx]  # use more hotwords
    tokens = (
        [self.tokenizer.sot_prev]
        + hotwords_tokens
        # + (prompt_tokens[-(self.n_ctx // 2 - 1) :] if self.options.prompt is not None else [])
        + tokens
    )
```
Thanks!
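One caveat with slicing to `self.n_ctx` as in the snippet above: the SOT sequence and the tokens generated during decoding live in the same text context, so an unconditional cap at `n_ctx` can overflow it. A hedged sketch of a safer bound (an illustrative helper, not part of the PR):

```python
# Bound the hotword tokens by the context that is actually left over,
# reserving one slot for <|startofprev|> itself. Hypothetical helper,
# not code from the PR.
def cap_hotwords(hotwords_tokens, existing_tokens, n_ctx):
    room = n_ctx - len(existing_tokens) - 1
    return hotwords_tokens[: max(room, 0)]
```

Whatever cap is chosen, leaving some headroom for generated tokens is what the original `n_ctx // 2 - 1` split implicitly provides.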