feat: Whisper prompting
What does this PR do?
Closes #22395, thank you @sanchit-gandhi for the descriptive ask!
Adds the following functionality for Whisper prompting that is compatible with both `model.generate()` and the pipeline (and includes accompanying tests). The scope expanded from the initial issue per the asks in the comments below.
- 3 new `model.generate()` params:
  - `prompt_ids` - Optional param of initial prompt ids to condition the first chunk in `model.generate()`.
  - `condition_on_previous_text` - Whether or not to condition a chunk's generated ids on the previously generated ids. Defaults to True to match the OpenAI Whisper implementation, and can't be False when `prompt_ids` are provided.
  - `always_use_initial_prompt` - Enables using only the prompt provided through the `prompt_ids` param to condition the generation of all chunks. This is currently a feature request in a PR on the OpenAI Whisper repo linked in the comments below. Can't be True if `prompt_ids` aren't provided or if `condition_on_previous_text` is False.
- `get_prompt_ids` - Processor method to create initial prompt ids to pass to generate.
- Tokenizer `decode` properly removes the prompt if `skip_special_tokens=True`.
- Tokenizer `_decode_asr` method for the pipeline always removes the prompt from the generated text.
Example new API usage:
```python
pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-tiny")

# Comments below represent the decoding of the generated tokens inside the pipeline with skip_special_tokens=False
# Also implemented for the `model.generate()` method

prompt_ids = processor.get_prompt_ids("")
pipe(samples, generate_kwargs={ "condition_on_previous_text": False, "prompt_ids": prompt_ids, "always_use_initial_prompt": False })
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|> Nor is Mr. Quilters' manner less interesting than his matter.<|endoftext|>
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|> He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|endoftext|>

prompt_ids = processor.get_prompt_ids("")
pipe(samples, generate_kwargs={ "condition_on_previous_text": True, "prompt_ids": prompt_ids, "always_use_initial_prompt": False })
# <|startofprev|><|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>
# <|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Nor is Mr. Quilter's manner less interesting than his matter.<|endoftext|>
# <|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|endoftext|>

prompt_ids = processor.get_prompt_ids("This is the initial prompt, and Mr. Quilter is one of the names in this conversation.")
pipe(samples, generate_kwargs={ "condition_on_previous_text": True, "prompt_ids": prompt_ids, "always_use_initial_prompt": True })
# <|startofprev|> This is the initial prompt, and Mr. Quilter is one of the names in this conversation.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>
# <|startofprev|> This is the initial prompt, and Mr. Quilter is one of the names in this conversation.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Nor is Mr. Quilter's manner, less interesting than his matter.<|endoftext|>
# <|startofprev|> This is the initial prompt, and Mr. Quilter is one of the names in this conversation.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|endoftext|>
```
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings. I haven't added documentation anywhere outside of documenting the new `generate()` args directly on the function.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sanchit-gandhi
Hey, this PR looks really good (although I'll leave the actual review to Sanchit or Arthur).

I was just wondering whether it also makes sense to support the `condition_on_previous_text` option that the OpenAI repo has, since that uses the same mechanism (using the `<|startofprev|>` token).

In addition, there's this PR that suggests an `always_use_initial_prompt` option that uses the prompt on every segment, not just the first. Might be useful to consider that here as well.
Hey Matthijs, thanks, I'm happy to add what's wanted. Will look for HF guidance on that and whether it should be added here or in a follow-on PR. `temperature` was another factor I saw in the Whisper model; if it was > 0.5, no prompt tokens were added (link).
To-do list before re-requesting review
- [x] Converting the prompt token to an ID in an instance variable gives an incorrect ID, unlike when it's called in decode
  - Given we're only using it in two places and it's an inexpensive op to call `convert_tokens_to_ids`, I've left this, at least for now, to focus more on the below
- [x] Bug I found where if the ending text of the prompt matches the start of the transcribed text, that text will not be included in the transcription output. Example:
  - I'm actually not sure this is a bug now. The model seems to have learned to avoid repeating itself, and this only happens if the end of the prompt matches the beginning of the transcription almost exactly. It also appears to be happening inside the model itself, as opposed to in the logits processing or other modification before / after.
Added from @hollance's below two comments:
- [x] Add `always_use_initial_prompt` and `condition_on_previous_text` options to the pipeline and `model.generate()`
- [x] Add prompting functionality to the `automatic-speech-recognition` pipeline
One more thing we'll need to do is change the `automatic-speech-recognition` pipeline so that it will actually call `model.generate()` with the prompt, but only for the first chunk (or always, if we also decide to support an `always_use_initial_prompt` option). This logic cannot be part of the modeling code, as `model.generate()` has no knowledge of which chunk of audio it's processing.
I looked a bit more into how this works today, and it turns out that 🤗 Transformers does things a bit differently than the original OpenAI code.
OpenAI does the following:

For the first 30-second chunk of audio, it passes the following token sequence to the model's decoder on the first iteration: `<|startofprev|> initial prompt<|startoftranscript|><|en|><|transcribe|>`. And then it decodes the rest of the sequence autoregressively.

Then for the second chunk of audio, it passes the following sequence to the decoder on the first iteration: `<|startofprev|> initial prompt output of the first chunk<|startoftranscript|><|en|><|transcribe|>`.

For the next chunk, it uses `<|startofprev|> initial prompt output of the first chunk output of the second chunk<|startoftranscript|><|en|><|transcribe|>`.

And so on... The list of tokens that it passes in the `<|startofprev|>` section grows longer and longer with each new chunk.

(When you set the `condition_on_previous_text` option to False, it only uses the output from the previous chunk instead of the complete history. In that case the initial prompt text is only used for the very first chunk.)
Our ASR `pipeline` works quite differently. It also splits up the audio in 30-second chunks, but they partially overlap, and then it runs the model on these chunks in parallel. That makes it impossible to pass the previous context to these chunks, as each chunk is processed independently. So we have no way of sending `<|startofprev|> initial prompt output of the first chunk<|startoftranscript|><|en|><|transcribe|>` to the second chunk.

The best we can do is send `<|startofprev|> initial prompt<|startoftranscript|><|en|><|transcribe|>` to the very first chunk only, or always send it to all chunks. So we ignore the "previous context" part and always include the prompt. (The latter would do the same as this open PR on the OpenAI repo for always passing the initial prompt inside `<|startofprev|>` instead of the previous context.)
The suggested modifications to `model.generate()` in this PR make it possible to have both `initial_prompt` and the `condition_on_previous_text` options as in OpenAI, but it would require the user to write their own processing loop to get the same results as OpenAI. So we should definitely continue with this PR, but if we also want to support `initial_prompt` in the `pipeline` we'll have to decide on which approach we want. (It's not possible to have `condition_on_previous_text` in the current pipeline.)
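For reference, a rough, untested sketch of what such a user-side processing loop could look like with the API from this PR; `chunks` (a list of 30-second audio arrays at 16 kHz) and `initial_prompt` are assumed here purely for illustration:

```python
# Untested sketch: emulate OpenAI-style condition_on_previous_text by running
# model.generate() over the 30-second chunks sequentially and building the next
# prompt from the accumulated transcription.
history = initial_prompt
transcription = []
for chunk in chunks:
    input_features = processor(chunk, sampling_rate=16_000, return_tensors="pt").input_features
    prompt_ids = processor.get_prompt_ids(history)
    pred_ids = model.generate(input_features, prompt_ids=prompt_ids)
    # Decoding with skip_special_tokens=True also strips the prompt from the output
    text = processor.batch_decode(pred_ids, skip_special_tokens=True)[0].strip()
    transcription.append(text)
    # Grow the <|startofprev|> context; in practice it would need to be truncated
    # to fit within the model's maximum context length
    history = (initial_prompt + " " + " ".join(transcription)).strip()
print(" ".join(transcription))
```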
- We can provide a prompt in the pipeline like the example below without modifying the pipeline at all; it works for me locally. Is this sufficient / what you had in mind?
You are correct that when you do the following,

```python
pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-tiny")
res = pipe(samples, generate_kwargs={ "prompt_ids": prompt_ids })
```

the pipeline will automatically pass the `prompt_ids` to `model.generate()`. However, note that this pipeline only processes the first 30 seconds of the audio file. This is fine for audio that is shorter than 30 seconds.
However, to process an audio file that is longer than 30 seconds, we have to do:

```python
res = pipe(example, generate_kwargs={ "prompt_ids": prompt_ids }, chunk_length_s=30, stride_length_s=[6, 0])
```

Now the same `prompt_ids` are passed to `model.generate()` for each 30-second chunk. In effect, this is the `always_use_initial_prompt` option.
To get the regular `initial_prompt` (i.e. `always_use_initial_prompt` disabled) and `condition_on_previous_text` behavior as they work in OpenAI with the current pipeline, we'd have to pass in `stride_length_s=[0, 0]` and `batch_size=1` to make the loop work sequentially rather than in parallel, and somehow keep track of the previous outputs.
Ok the additional requested features are now added so I believe this is ready for re-review. Thank you for your comments!
> However note that this pipeline only processes the first 30 seconds of the audio file. This is fine for audio that is shorter than 30 seconds... In effect, this is the `always_use_initial_prompt` option.

I think I'm missing something here, as I've tried this on >1 min of audio in the example below, where I also added a debug line to decode the tokens inside of the pipeline as they were generated, and it appears to be properly sequential. In any case, if we don't want this I'll remove `condition_on_previous_text` from the pipeline, just lmk!
```python
pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-tiny")
res = pipe(samples, generate_kwargs={ "condition_on_previous_text": True, "prompt_ids": prompt_ids })
# ['<|startofprev|><|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>']
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Nor is Mr. Quilter's manner less interesting than his matter.<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Lennils, pictures are a sort of upguards and atom paintings and Mason's exquisite itals are as national as a jingo poem. Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo or a turkish bath. Next man<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca. Lennils, pictures are a sort of upguards and atom paintings and Mason's exquisite itals are as national as a jingo poem. Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo or a turkish bath. Next man<|startoftranscript|><|en|><|transcribe|><|notimestamps|> it is obviously unnecessary for us to point out how luminous these criticisms are, how delicate and expression.<|endoftext|>"]
# ["<|startofprev|> middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca. Lennils, pictures are a sort of upguards and atom paintings and Mason's exquisite itals are as national as a jingo poem. Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo or a turkish bath. Next man it is obviously unnecessary for us to point out how luminous these criticisms are, how delicate and expression.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> On the general principles of art and Mr. Quilter writes with equal lucidity.<|endoftext|>"]
```
> The suggested modifications to `model.generate()` in this PR make it possible to have both `initial_prompt` and the `condition_on_previous_text` options as in OpenAI, but it would require the user to write their own processing loop to get the same results as OpenAI.

Aimed to address this with the new sequential loop over chunks of the input. Right now this way is incompatible with `return_dict_in_generate=True`, as I wasn't sure how / if we'd still want to return several ModelOutputs; looking for guidance here.

Also, there are hacks in a few places related to getting the id of the prompt start token and separating it from the prompt text ids. Would this be something we could add to the model or generation config?
cc'ing in @gante re generate
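For reference, the kind of lookup being referred to is roughly the following (illustrative only; it assumes `prompt_ids` comes from `get_prompt_ids`, which places the `<|startofprev|>` token first):

```python
# Illustrative sketch of the "hack": fetch the prompt-start token id from the
# tokenizer and split it off from the prompt text ids.
prompt_start_id = tokenizer.convert_tokens_to_ids("<|startofprev|>")
prompt_text_ids = prompt_ids[1:]  # everything after the leading <|startofprev|>
```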
> 1. Add the `prompt_ids` to `model.generate()` as in your earlier version of the PR. All this does is insert the prompt in the `<|startofprev|>` section. This doesn't give us the OpenAI functionality yet, it only adds `<|startofprev|>` support to the modeling and tokenizer code.
Thanks @hollance, I definitely agree splitting this into >1 PR is ideal; I've pushed back up code for number 1 above so this PR can just address that portion. It now implicitly does `always_use_initial_prompt`.

Curious if by adding `return_tensors` to `get_prompt_ids` you're setting up to effectively do `condition_on_previous_text` via cleverly feeding batches / prompts to `model.generate()` calls (i.e. the first chunk of a second `model.generate` call would use the text from the first chunk of the first `model.generate` call as a prompt, and so on for each chunk in the batch), but that's more of a question for subsequent PRs.
The reason I asked for the `return_tensors` argument is that passing the `prompt_ids` into `model.generate()` as a `torch.LongTensor` instead of `List[int]` is more consistent with how we normally pass tokens into Transformers models. I understand that inside the model you might need to turn it into a list anyway for the `forced_decoder_ids`, but that's really an internal implementation detail. When we generate, the output token sequence is also a Tensor, and so we can concat this to the previous `prompt_ids` to create the next one, etc. I hope that makes sense. :-)
All right, I think this all looks very good. Pinging @sanchit-gandhi for an additional review since he opened the issue.
Is there an estimate of when this branch will be merged?
Rebased to include a tolerance increase for an unrelated flaky PT-FLAX Whisper test.
Thanks for the latest round of changes @connor-henderson! Kindly requesting a final review from @amyeroberts!
Since we're all happy with it, I'm pinging @amyeroberts from the core maintainers team to have a final look.
@amyeroberts @connor-henderson Hi all, thank you for your great contribution; however, I would like to raise a small concern. We tried to run inference with the model using this branch at the latest commit and got some strange results. We provide the audio sample in addition to the prompts for easy reproduction: WAV file link
code:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

input_speech, sr = torchaudio.load("sample.wav")

model_name = "openai/whisper-medium"
processor = WhisperProcessor.from_pretrained(model_name, cache_dir="artifacts")
model = WhisperForConditionalGeneration.from_pretrained(model_name, cache_dir="artifacts")
input_features = processor(input_speech.squeeze(), sampling_rate=sr, return_tensors="pt").input_features

# --- Without prompt ---
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0], skip_special_tokens=False))
print(processor.decode(output_without_prompt[0], skip_special_tokens=True))

# --- With prompt ---
prompt_ids = processor.get_prompt_ids("Mexico city")
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0], skip_special_tokens=False))
print(processor.decode(output_with_prompt[0], skip_special_tokens=True))
```
and this is the output:

```
<|startoftranscript|><|en|><|transcribe|><|notimestamps|> San Francisco educators. She was teaching in Mexico City.<|endoftext|>
San Francisco educators. She was teaching in Mexico City.
<|startofprev|> Mexico city<|startoftranscript|><|en|><|transcribe|><|notimestamps|> and<|endoftext|>
and
```
When we don't pass prompts we get the expected output, but when we do pass prompts (that appear in the transcription) we end up with a bad output.
Note that we did not commit any code changes before running this script.
System:
- pytorch 2.0.1
- The test was run on CPU
@AvivSham thanks for sharing, I looked at this and I think it may just be that prompting can be finicky. I believe the model perceives the prompt as previous context, so having 'Mexico city' be followed by 'San Francisco' with no grammar in between might've been viewed as unlikely by the model, which could then have led to further model confusion in successive generations.

I tried your example with the tiny model and the prompt actually corrected the output. Trying it with the medium Whisper model, I was able to repro your issue but also to address it by adding a period to the end of the prompt:
```python
# --- Without prompt ---
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0], skip_special_tokens=False))
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|> San Francisco educators. She was teaching in Mexico City.<|endoftext|>
print(processor.decode(output_without_prompt[0], skip_special_tokens=True))
# San Francisco educators. She was teaching in Mexico City.

# --- With prompt ---
prompt_ids = processor.get_prompt_ids("Mexico city.")  # Added a period to the end
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0], skip_special_tokens=False))
# <|startofprev|> Mexico city.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> San Francisco educators. She was teaching in Mexico city.<|endoftext|>
print(processor.decode(output_with_prompt[0], skip_special_tokens=True))
# San Francisco educators. She was teaching in Mexico City.
```
Awesome - thanks for the reviews @amyeroberts and @gante, and for the fast iteration and detailed explanations from you @connor-henderson! Excited to see this PR merged when confirmed as ready 🤗
Regarding prompt engineering, my advice would be to try and emulate a full sentence, complete with punctuation and casing, since really what we're providing as the 'prompt' is just the target transcription from a previous window (see https://github.com/openai/whisper/discussions/963#discussioncomment-4987057).
Hi all, thanks for the great work on adding prompts to `model.generate`. Is it possible to add `initial_prompt` to the fine-tuning code with a `prompt_use_rate` to control how often prompts are added to the sentences in training sets? That way we may improve the performance for some special prompts via prompt-tuning.
@AvivSham Thanks for reporting and @connor-henderson thanks for investigating!
I think we're good to merge 👍
Thank you so much for adding this! I've found that I occasionally get the following:
```
Traceback (most recent call last):
  File "G:\Conda\hfwhisper\lib\site-packages\transformers\models\whisper\modeling_whisper.py", line 1662, in generate
    return super().generate(
  File "G:\Conda\hfwhisper\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "G:\Conda\hfwhisper\lib\site-packages\transformers\generation\utils.py", line 1518, in generate
    return self.greedy_search(
  File "G:\Conda\hfwhisper\lib\site-packages\transformers\generation\utils.py", line 2345, in greedy_search
    next_token_logits = outputs.logits[:, -1, :]
IndexError: index -1 is out of bounds for dimension 1 with size 0
```
My workaround is to catch the exception and try again without the `prompt_ids`.
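A minimal sketch of that workaround (variable names are illustrative):

```python
# Fall back to generating without the prompt when this failure occurs
try:
    pred_ids = model.generate(input_features, prompt_ids=prompt_ids)
except IndexError:
    pred_ids = model.generate(input_features)
```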
Do you have a reproducible example for this @dgram0? That seems like a serious enough bug that needs investigating further.
@Johnson-NLP

> Is it possible to add 'initial_prompt' in the Fine-Tune code with a 'prompt_use_rate' to control how often to add prompts to the sentences in training sets?

Sounds like an interesting idea. Would you mind opening a new issue for this? Thanks!
To get prompting working with fine-tuning, we probably don't want to explicitly add 'prompted' examples per se, but rather split longer examples up into shorter ones and feed them sequentially through the model, providing previous passages as 'context' to the model.

For example, if we had a training sample that looked like:

```
This is the first sentence. This is the second sentence. And finally, this is the third.
```

Currently what we do is feed it to the model all at once:

```
<|startoftranscript|> This is the first sentence. This is the second sentence. And finally, this is the third. <|endoftranscript|>
```

What we can do is feed the first sentence in:

```
<|startoftranscript|> This is the first sentence. <|endoftranscript|>
```

Then the second sentence, with the first sentence as context:

```
<|startofprev|> This is the first sentence.<|startoftranscript|> This is the second sentence. <|endoftranscript|>
```

And then the third, with both the first and second sentences as context:

```
<|startofprev|> This is the first sentence. This is the second sentence.<|startoftranscript|> And finally, this is the third.<|endoftranscript|>
```

At inference time, we then just provide the "context" as our prompts:

```
<|startofprev|> This is the prompt.<|startoftranscript|> (model generates the rest)
```

See section 2.3 of the Whisper paper for an in-depth explanation as to how they achieve this during pre-training. We essentially want to do the same for fine-tuning.
For this to work, ideally we need an original sentence that is >> 30s in duration. That way when we split it up, we don't have super short examples that we feed to the model.
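As a rough, untested sketch of how such context-conditioned label sequences could be built with the tokenizer (names and the masking choice here are illustrative, not a prescribed recipe):

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="english", task="transcribe")

prev_text = "This is the first sentence."
curr_text = "This is the second sentence."

# <|startofprev|> followed by the previous passage, with no other special tokens
startofprev_id = tokenizer.convert_tokens_to_ids("<|startofprev|>")
context_ids = [startofprev_id] + tokenizer(" " + prev_text, add_special_tokens=False).input_ids

# The current passage with the usual <|startoftranscript|>...<|endoftext|> special tokens
target_ids = tokenizer(curr_text).input_ids

labels = context_ids + target_ids
# Optionally mask the context so the loss is only computed on the target portion
masked_labels = [-100] * len(context_ids) + target_ids
```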
> Do you have a reproducible example for this @dgram0? That seems like a serious enough bug that needs investigating further.
I'll try reproducing in a small toy example. It's reproducible on my side with the fine-tuned large private model I've been working with.
> Do you have a reproducible example for this @dgram0? That seems like a serious enough bug that needs investigating further.

The following triggers the bug on the 13th iteration of the loop. (Usually, it takes a lot more iterations.)
```python
from datasets import load_dataset, DatasetDict
from transformers import WhisperForConditionalGeneration, WhisperProcessor

it = iter(load_dataset("librispeech_asr", "all", split="test.other", streaming=True))
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="English", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

prompt = 'some text rich in domain specific vocabulary lives here'
past_prompts = ["I am from the cutter lying off the coast"]

while it:
    _ = [next(it) for x in range(3)]
    clip = next(it)
    input_features = processor(clip['audio']['array'], sampling_rate=clip['audio']['sampling_rate'], return_tensors="pt").input_features
    prompt_ids = processor.get_prompt_ids(prompt + ' - ' + ' - '.join(past_prompts))
    pred_ids = model.generate(input_features, language="english", task="transcribe", max_new_tokens=128, prompt_ids=prompt_ids)
    result = processor.batch_decode(pred_ids, skip_special_tokens=True)[0].strip()
    result_text = result.removesuffix('.')
    print(result_text)
    if result_text != '':
        past_prompts.append(result_text)
    if len(past_prompts) > 12:
        past_prompts = past_prompts[1:]
```
@dgram0 thanks for sharing, I was able to repro this. As far as its relation to prompting, I think this is another case of prompt sensitivity as opposed to a bug, but it may still be of interest with regards to Whisper generally, since it's the same error message as issue #22682.

I noticed that joining the prompts with `' - '` was causing the model to start predicting Chinese characters, and using `'. '` instead did not lead to the error (at least through 30 loops, at which point I stopped testing). I did notice degraded predictions over time, though, since a period did not necessarily belong after each result, and every now and again a Chinese char was still predicted. So I'd just be cautious about how prompts are chained together.
@connor-henderson It's a bit of a contrived example meant just to recreate the issue without having to loop too much, while at the same time showing what may be considered a normal use case. Even without it predicting non-English characters or words, you'll eventually encounter the issue within a few hundred loops.