
feat: Whisper prompting

Open connor-henderson opened this issue 1 year ago • 12 comments

What does this PR do?

Closes #22395, thank you @sanchit-gandhi for the descriptive ask!

Adds the following functionality for Whisper prompting that is compatible with both model.generate() and the pipeline (and includes accompanying tests). The scope expanded from the initial issue per the asks in the comments below.

  • 3 new model.generate() params:
    • prompt_ids - Optional param of initial prompt ids to condition the first chunk in model.generate().
    • condition_on_previous_text - Whether or not to condition a chunk's generated ids on the previously generated ids. Defaults to True to match the OpenAI Whisper implementation, and can't be False when prompt_ids are provided.
    • always_use_initial_prompt - Enables using only the prompt provided through the prompt_ids param to condition the generation of all chunks. This is currently a feature request in a PR on the OpenAI Whisper repo linked in the comments below. Can't be True if prompt_ids aren't provided or if condition_on_previous_text is False.
  • get_prompt_ids Processor method to create initial prompt ids to pass to generate.
  • tokenizer decode properly removes the prompt if skip_special_tokens=True
  • tokenizer _decode_asr method for the pipeline always removes the prompt from the generated text

Example new API usage:

pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-tiny")

# Comments below represent the decoding of the generated tokens inside the pipeline with skip_special_tokens=False
# Also implemented for the `model.generate()` method 

prompt_ids = processor.get_prompt_ids("")
pipe(samples, generate_kwargs={ "condition_on_previous_text": False, "prompt_ids": prompt_ids, "always_use_initial_prompt": False })
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|> Nor is Mr. Quilters' manner less interesting than his matter.<|endoftext|>
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|> He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|endoftext|>

prompt_ids = processor.get_prompt_ids("")
pipe(samples, generate_kwargs={ "condition_on_previous_text": True, "prompt_ids": prompt_ids, "always_use_initial_prompt": False })
# <|startofprev|><|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>
# <|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Nor is Mr. Quilter's manner less interesting than his matter.<|endoftext|>
# <|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|endoftext|>

prompt_ids = processor.get_prompt_ids("This is the initial prompt, and Mr. Quilter is one of the names in this conversation.")
pipe(samples, generate_kwargs={ "condition_on_previous_text": True, "prompt_ids": prompt_ids, "always_use_initial_prompt": True })
# <|startofprev|> This is the initial prompt, and Mr. Quilter is one of the names in this conversation.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>
# <|startofprev|> This is the initial prompt, and Mr. Quilter is one of the names in this conversation.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Nor is Mr. Quilter's manner, less interesting than his matter.<|endoftext|>
# <|startofprev|> This is the initial prompt, and Mr. Quilter is one of the names in this conversation.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|endoftext|>
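For reference, the same prompting flow through model.generate() directly looks roughly like the following (a minimal sketch; it assumes a processor, model, and input_features prepared in the usual way, and the prompt text is just an example):

prompt_ids = processor.get_prompt_ids("Mr. Quilter")
output_ids = model.generate(input_features, prompt_ids=prompt_ids)
# skip_special_tokens=False keeps the prompt and the special tokens in the decoded text
print(processor.decode(output_ids[0], skip_special_tokens=False))
# skip_special_tokens=True strips the prompt along with the special tokens
print(processor.decode(output_ids[0], skip_special_tokens=True))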

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings. I haven't added docs anywhere outside of documenting the new generate() args directly on the function.
  • [x] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@sanchit-gandhi

connor-henderson avatar Mar 31 '23 15:03 connor-henderson

The documentation is not available anymore as the PR was closed or merged.

Hey this PR looks really good (although I'll leave the actual review to Sanchit or Arthur).

I was just wondering whether it also makes sense to support the condition_on_previous_text option that the OpenAI repo has, since that uses the same mechanism (using the <|startofprev|> token).

In addition, there's this PR that suggests an always_use_initial_prompt option that uses the prompt on every segment, not just the first. Might be useful to consider that here as well.

hollance avatar Apr 03 '23 09:04 hollance

Hey Matthijs thanks, I'm happy to add what's wanted. Will look for HF guidance on that and whether it should be added here or in a follow on PR. temperature was another factor I saw in the Whisper model, if it was > 0.5 no prompt tokens were added (link).

connor-henderson avatar Apr 03 '23 12:04 connor-henderson

To-do list before re-requesting review

  • [x] Converting the prompt token to an ID in an instance variable gives an incorrect ID, unlike when it's called in decode --Given we're only using it in two places and it's an inexpensive op to call convert_tokens_to_ids, I've left this, at least for now, to focus more on the below
  • [x] Bug I found where if the ending text of the prompt matches the start of the transcribed text, that text will not be included in the transcription output. Example: --I'm actually not sure this is a bug now. The model has learned to be penalized for repeating itself, and this only happens if the end of the prompt matches the beginning of the transcription almost exactly. It also appears to be happening inside the model itself, as opposed to in the logits processing or other modification before/after. (See attached screenshot.)

Added from @hollance's below two comments:

  • [x] Add always_use_initial_prompt and condition_on_previous_text options to pipeline and model.generate()
  • [x] Add prompting functionality to the automatic-speech-recognition pipeline

connor-henderson avatar Apr 05 '23 05:04 connor-henderson

One more thing we'll need to do is change the automatic-speech-recognition pipeline so that it will actually call model.generate() with the prompt, but only for the first chunk (or always, if we also decide to support an always_use_initial_prompt option). This logic cannot be part of the modeling code, as model.generate() has no knowledge of which chunk of audio it's processing.

hollance avatar Apr 05 '23 09:04 hollance

I looked a bit more into how this works today, and it turns out that 🤗 Transformers does things a bit differently than the original OpenAI code.

OpenAI does the following:

For the first 30-second chunk of audio, it passes the following token sequence to the model's decoder on the first iteration: <|startofprev|> initial prompt<|startoftranscript|><|en|><|transcribe|>. And then it decodes the rest of the sequence autoregressively.

Then for the second chunk of audio, it passes the following sequence to the decoder on the first iteration: <|startofprev|> initial prompt output of the first chunk<|startoftranscript|><|en|><|transcribe|>.

For the next chunk, it uses <|startofprev|> initial prompt output of the first chunk output of the second chunk<|startoftranscript|><|en|><|transcribe|>

And so on... This list of tokens that it passes in the <|startofprev|> section grows longer and longer with each new chunk.

(When you set the condition_on_previous_text option to False, it only uses the output from the previous chunk instead of the complete history. In that case the initial prompt text is only used for the very first chunk.)
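Schematically, the OpenAI scheme looks something like this (pure-Python pseudocode of the behavior described above, not the actual implementation; decode_chunk is a hypothetical helper that decodes one chunk autoregressively from the given prefix):

history = []  # decoded text of the previous chunks
for chunk in chunks:
    if condition_on_previous_text:
        context = " ".join([initial_prompt] + history)
    elif history:
        context = history[-1]  # only the previous chunk's output
    else:
        context = initial_prompt  # initial prompt is used for the very first chunk only
    prefix = f"<|startofprev|> {context}<|startoftranscript|><|en|><|transcribe|>"
    text = decode_chunk(chunk, prefix)  # hypothetical helper
    history.append(text)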

Our ASR pipeline works quite differently. It also splits the audio up into 30-second chunks, but they partially overlap, and then it runs the model on these chunks in parallel. That makes it impossible to pass the previous context to these chunks, as each chunk is processed independently. So we have no way of sending <|startofprev|> initial prompt output of the first chunk<|startoftranscript|><|en|><|transcribe|> to the second chunk.

The best we can do is send <|startofprev|> initial prompt<|startoftranscript|><|en|><|transcribe|> to the very first chunk only, or always send it to all chunks. So we ignore the "previous context" part and always include the prompt. (The latter would do the same as this open PR on the OpenAI repo for always passing the initial prompt inside <|startofprev|> instead of the previous context.)

The suggested modifications to model.generate() in this PR make it possible to have both initial_prompt and the condition_on_previous_text options as in OpenAI, but it would require the user to write their own processing loop to get the same results as OpenAI. So we should definitely continue with this PR, but if we also want to support initial_prompt in the pipeline we'll have to decide on which approach we want. (It's not possible to have condition_on_previous_text in the current pipeline.)

hollance avatar Apr 05 '23 16:04 hollance

  • We can provide a prompt in the pipeline like the below without modifying the pipeline at all; it works for me locally. Is this sufficient / what you had in mind?

You are correct that when you do the following,

pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-tiny")
res = pipe(samples, generate_kwargs={ "prompt_ids": prompt_ids })

the pipeline will automatically pass the prompt_ids to model.generate(). However note that this pipeline only processes the first 30 seconds of the audio file. This is fine for audio that is shorter than 30 seconds.

However, to process an audio file that is longer than 30 seconds, we have to do:

res = pipe(example, generate_kwargs={ "prompt_ids": prompt_ids }, chunk_length_s=30, stride_length_s=[6, 0])

Now the same prompt_ids are passed to model.generate() for each 30-second chunk. In effect, this is the always_use_initial_prompt option.

To get the regular initial_prompt behavior (i.e. always_use_initial_prompt disabled) and condition_on_previous_text as they work in OpenAI with the current pipeline, we'd have to pass in stride_length_s=[0, 0] and batch_size=1 to make the loop run sequentially rather than in parallel, and somehow keep track of the previous outputs.
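A rough sketch of what such a manual sequential loop could look like (illustrative only and not part of this PR; split_audio_into_30s_chunks is a hypothetical helper, and raw_audio / initial_prompt are assumed inputs):

previous_text = ""
transcript = []
for audio_chunk in split_audio_into_30s_chunks(raw_audio):  # hypothetical chunking helper
    input_features = processor(audio_chunk, sampling_rate=16_000, return_tensors="pt").input_features
    prompt = (initial_prompt + " " + previous_text).strip()
    prompt_ids = processor.get_prompt_ids(prompt) if prompt else None
    output_ids = model.generate(input_features, prompt_ids=prompt_ids)
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    previous_text = (previous_text + " " + text).strip()
    transcript.append(text)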

hollance avatar Apr 11 '23 09:04 hollance

Ok the additional requested features are now added so I believe this is ready for re-review. Thank you for your comments!

However note that this pipeline only processes the first 30 seconds of the audio file. This is fine for audio that is shorter than 30 seconds... In effect, this is the always_use_initial_prompt option.

I think I’m missing something here. I’ve tried this on >1 min of audio in the below example, where I also added a debug line to decode the tokens inside the pipeline as they were generated, and it appears to be properly sequential. In any case, if we don’t want this I’ll remove condition_on_previous_text from the pipeline, just lmk!

pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-tiny")
res = pipe(samples, generate_kwargs={ "condition_on_previous_text": True, "prompt_ids": prompt_ids })
# ['<|startofprev|><|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>']
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Nor is Mr. Quilter's manner less interesting than his matter.<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Lennils, pictures are a sort of upguards and atom paintings and Mason's exquisite itals are as national as a jingo poem. Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo or a turkish bath. Next man<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca. Lennils, pictures are a sort of upguards and atom paintings and Mason's exquisite itals are as national as a jingo poem. Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo or a turkish bath. Next man<|startoftranscript|><|en|><|transcribe|><|notimestamps|> it is obviously unnecessary for us to point out how luminous these criticisms are, how delicate and expression.<|endoftext|>"]
# ["<|startofprev|> middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca. Lennils, pictures are a sort of upguards and atom paintings and Mason's exquisite itals are as national as a jingo poem. Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo or a turkish bath. Next man it is obviously unnecessary for us to point out how luminous these criticisms are, how delicate and expression.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> On the general principles of art and Mr. Quilter writes with equal lucidity.<|endoftext|>"]


The suggested modifications to model.generate() in this PR make it possible to have both initial_prompt and the condition_on_previous_text options as in OpenAI, but it would require the user to write their own processing loop to get the same results as OpenAI.

Aimed to address this with the new sequential loop over chunks of the input. Right now this approach is incompatible with return_dict_in_generate=True, as I wasn't sure how / if we'd still want to return several ModelOutputs; looking for guidance here.

Also, there are hacks in a few places related to getting the id of the prompt start token and separating it from the prompt text ids. Would this be something we could add to the model or generation config?
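(For context, the lookup in question is essentially the one-liner below, per the earlier to-do note about convert_tokens_to_ids:)

prompt_start_id = processor.tokenizer.convert_tokens_to_ids("<|startofprev|>")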

connor-henderson avatar Apr 14 '23 04:04 connor-henderson

cc'ing in @gante re generate

amyeroberts avatar Apr 24 '23 13:04 amyeroberts

  1. Add the prompt_ids to model.generate() as in your earlier version of the PR. All this does is insert the prompt in the <|startofprev|> section. This doesn't give us the OpenAI functionality yet; it only adds <|startofprev|> support to the modeling and tokenizer code.

Thanks @hollance, I definitely agree splitting this into >1 PR is ideal. I've pushed up code for number 1 above so this PR can address just that portion. It now implicitly does always_use_initial_prompt.

connor-henderson avatar Apr 25 '23 21:04 connor-henderson

Curious whether, by adding return_tensors to get_prompt_ids, you're setting up to effectively do condition_on_previous_text by cleverly feeding batches / prompts to model.generate() calls (i.e. the first chunk of a second model.generate call would use the text from the first chunk of the first model.generate call as its prompt, and so on for each chunk in the batch), but that's more of a question for subsequent PRs.

connor-henderson avatar Apr 25 '23 21:04 connor-henderson

The reason I asked for the return_tensors argument is that passing the prompt_ids into model.generate() as a torch.LongTensor instead of a List[int] is more consistent with how we normally pass tokens into Transformers models. I understand that inside the model you might need to turn it into a list anyway for the forced_decoder_ids, but that's really an internal implementation detail. When we generate, the output token sequence is also a Tensor, so we can concat this to the previous prompt_ids to create the next one, etc. I hope that makes sense. :-)
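For instance, something along these lines (an illustrative sketch only; it assumes processor, model, and input_features are set up as usual, and in practice the special tokens would be stripped from the output before reuse):

import torch

prompt_ids = processor.get_prompt_ids("initial prompt", return_tensors="pt")
output_ids = model.generate(input_features, prompt_ids=prompt_ids)
# Conceptually, the next prompt is the previous prompt plus the newly generated tokens
next_prompt_ids = torch.cat([prompt_ids, output_ids[0]])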

hollance avatar Apr 26 '23 08:04 hollance

All right, I think this all looks very good. Pinging @sanchit-gandhi for an additional review since he opened the issue.

hollance avatar May 08 '23 09:05 hollance

Is there an estimation of when this branch will be merged?

AvivSham avatar May 10 '23 17:05 AvivSham

Rebased to include a tolerance increase for an unrelated flaky PT-Flax Whisper test.

connor-henderson avatar May 15 '23 13:05 connor-henderson

Thanks for the latest round of changes @connor-henderson! Kindly requesting a final review from @amyeroberts!

sanchit-gandhi avatar May 15 '23 16:05 sanchit-gandhi

Since we're all happy with it, I'm pinging @amyeroberts from the core maintainers team to have a final look.

hollance avatar May 17 '23 11:05 hollance

@amyeroberts @connor-henderson Hi all, thank you for your great contribution, however I would like to raise a small concern. We tried running inference with the model using this branch at the latest commit and got some weird results. We provide the audio sample in addition to the prompts for easy reproduction: WAV file link

code:

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio


input_speech, sr = torchaudio.load(
    "sample.wav"
)
model_name = "openai/whisper-medium"
processor = WhisperProcessor.from_pretrained(model_name, cache_dir="artifacts")
model = WhisperForConditionalGeneration.from_pretrained(model_name, cache_dir="artifacts")
input_features = processor(input_speech.squeeze(), sampling_rate=sr, return_tensors="pt").input_features

# --- Without prompt ---
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0], skip_special_tokens=False))
print(processor.decode(output_without_prompt[0], skip_special_tokens=True))

# --- With prompt ---
prompt_ids = processor.get_prompt_ids("Mexico city")
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0], skip_special_tokens=False))
print(processor.decode(output_with_prompt[0], skip_special_tokens=True))

and this is the trace:

<|startoftranscript|><|en|><|transcribe|><|notimestamps|> San Francisco educators. She was teaching in Mexico City.<|endoftext|>
 San Francisco educators. She was teaching in Mexico City.
<|startofprev|> Mexico city<|startoftranscript|><|en|><|transcribe|><|notimestamps|> and<|endoftext|>
 and

When we don't pass prompts we get the expected output, but when we do pass prompts (that appear in the transcription) we end up with a bad output.

Note that we did not commit any code changes before running this script.

System:

  • pytorch 2.0.1
  • The test was run on CPU

AvivSham avatar May 18 '23 12:05 AvivSham

@AvivSham thanks for sharing, I looked at this and I think it may just be that prompting can be finicky. I believe the model perceives the prompt as previous context, so having 'Mexico city' be followed by 'San Francisco' with no grammar in between might've been viewed as unlikely, which could then have led to further confusion in successive generations.

I tried your example with the tiny model and the prompt actually corrected the output. Trying it with the medium Whisper model, I was able to repro your issue, but also to address it by adding a period to the end of the prompt:

# --- Without prompt ---
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0], skip_special_tokens=False))
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|> San Francisco educators. She was teaching in Mexico City.<|endoftext|>
print(processor.decode(output_without_prompt[0], skip_special_tokens=True))
# San Francisco educators. She was teaching in Mexico City.

# --- With prompt ---
prompt_ids = processor.get_prompt_ids("Mexico city.") # Added a period to the end
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0], skip_special_tokens=False))
# <|startofprev|> Mexico city.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> San Francisco educators. She was teaching in Mexico city.<|endoftext|>
print(processor.decode(output_with_prompt[0], skip_special_tokens=True))
# San Francisco educators. She was teaching in Mexico City.

connor-henderson avatar May 18 '23 14:05 connor-henderson

Awesome - thanks for the reviews @amyeroberts and @gante, and for the fast iteration and detailed explanations from you @connor-henderson! Excited to see this PR merged when confirmed as ready 🤗

Regarding prompt engineering, my advice would be to try and emulate a full sentence, complete with punctuation and casing, since really what we're providing as the 'prompt' is just the target transcription from a previous window (see https://github.com/openai/whisper/discussions/963#discussioncomment-4987057)

sanchit-gandhi avatar May 18 '23 17:05 sanchit-gandhi

Hi all, thanks for the great work on adding prompting in 'model.generate'. Is it possible to add 'initial_prompt' in the Fine-Tune code with a 'prompt_use_rate' to control how often to add prompts to the sentences in training sets? That way we may improve the performance for some special prompts via prompt-tuning.

Johnson-NLP avatar May 19 '23 03:05 Johnson-NLP

@AvivSham Thanks for reporting and @connor-henderson thanks for investigating!

I think we're good to merge 👍

amyeroberts avatar May 19 '23 08:05 amyeroberts

Thank you so much for adding this! I've found that I occasionally get the following:

Traceback (most recent call last):
  File "G:\Conda\hfwhisper\lib\site-packages\transformers\models\whisper\modeling_whisper.py", line 1662, in generate
    return super().generate(
  File "G:\Conda\hfwhisper\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "G:\Conda\hfwhisper\lib\site-packages\transformers\generation\utils.py", line 1518, in generate
    return self.greedy_search(
  File "G:\Conda\hfwhisper\lib\site-packages\transformers\generation\utils.py", line 2345, in greedy_search
    next_token_logits = outputs.logits[:, -1, :]
IndexError: index -1 is out of bounds for dimension 1 with size 0

My workaround is to catch the exception and try again without the prompt_ids.
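i.e. roughly:

try:
    pred_ids = model.generate(input_features, prompt_ids=prompt_ids, max_new_tokens=128)
except IndexError:
    # Fall back to generating without the prompt when this happens
    pred_ids = model.generate(input_features, max_new_tokens=128)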

dgram0 avatar May 20 '23 23:05 dgram0

Do you have a reproducible example for this @dgram0? That seems like a serious enough bug that needs investigating further.

hollance avatar May 22 '23 09:05 hollance

@Johnson-NLP

Is it possible to add 'initial_prompt' in the Fine-Tune code with a 'prompt_use_rate' to control how often to add prompts to the sentences in training sets?

Sounds like an interesting idea. Would you mind opening a new issue for this? Thanks!

hollance avatar May 22 '23 10:05 hollance

To get prompting working with fine-tuning, we probably don't want to explicitly add 'prompted' examples per se, but rather split longer examples up into shorter ones and feed them sequentially through the model, providing previous passages as 'context' to the model.

For example, if we had a training sample that looked like:

This is the first sentence. This is the second sentence. And finally, this is the third.

Currently what we do is feed it to the model all at once:

<|startoftranscript|> This is the first sentence. This is the second sentence. And finally, this is the third. <|endoftext|>

What we can do is feed the first sentence in:

<|startoftranscript|> This is the first sentence. <|endoftext|>

Then the second sentence, with the first sentence as context:

<|startofprev|> This is the first sentence.<|startoftranscript|> This is the second sentence. <|endoftext|>

And then the third, with both the first and second sentences as context:

<|startofprev|> This is the first sentence. This is the second sentence.<|startoftranscript|> And finally, this is the third. <|endoftext|>

At inference time, we then just provide the "context" as our prompts:

<|startofprev|> This is the prompt.<|startoftranscript|> (model generates the rest)

See section 2.3 of the Whisper paper for an in-depth explanation as to how they achieve this during pre-training. We essentially want to do the same for fine-tuning.

For this to work, ideally we need an original example that is >> 30s in duration. That way, when we split it up, we don't end up with super short examples to feed to the model.
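As a rough sketch (illustrative only, not something this PR implements), the labels for one such split-up example could be assembled like so:

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="English", task="transcribe")

context = "This is the first sentence."
target = " This is the second sentence."

startofprev_id = tokenizer.convert_tokens_to_ids("<|startofprev|>")
context_ids = tokenizer(" " + context, add_special_tokens=False).input_ids
# The default call adds <|startoftranscript|><|en|><|transcribe|><|notimestamps|> ... <|endoftext|>
target_ids = tokenizer(target).input_ids

labels = [startofprev_id] + context_ids + target_ids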

sanchit-gandhi avatar May 22 '23 16:05 sanchit-gandhi

Do you have a reproducible example for this @dgram0? That seems like a serious enough bug that needs investigating further.

I'll try reproducing in a small toy example. It's reproducible on my side with the fine-tuned large private model I've been working with.

dgram0 avatar May 23 '23 02:05 dgram0

Do you have a reproducible example for this @dgram0? That seems like a serious enough bug that needs investigating further.

The following triggers the bug on the 13th iteration of the loop. (Usually, it takes a lot more iterations.)

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

it = iter(load_dataset("librispeech_asr", "all", split="test.other", streaming=True))
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="English", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
prompt = 'some text rich in domain specific vocabulary lives here'
past_prompts = ["I am from the cutter lying off the coast"]
while it:
  _ = [next(it) for x in range(3)]
  clip = next(it)
  input_features = processor(clip['audio']['array'], sampling_rate=clip['audio']['sampling_rate'], return_tensors="pt").input_features
  prompt_ids = processor.get_prompt_ids(prompt + ' - ' + ' - '.join(past_prompts))
  pred_ids = model.generate(input_features, language="english", task="transcribe", max_new_tokens=128, prompt_ids=prompt_ids)
  result = processor.batch_decode(pred_ids, skip_special_tokens=True)[0].strip()
  result_text = result.removesuffix('.')
  print(result_text)
  if result_text != '':
    past_prompts.append(result_text)
    if len(past_prompts) > 12:
      past_prompts = past_prompts[1:]

dgram0 avatar May 23 '23 13:05 dgram0

@dgram0 thanks for sharing, I was able to repro this. As far as its relation to prompting goes, I think this is another case of prompt sensitivity as opposed to a bug, but it may still be of interest with regards to Whisper generally since it's the same error message as issue #22682.

I noticed that joining the prompts with ' - ' was causing the model to start predicting Chinese characters, and using '. ' instead did not lead to the error (at least through 30 loops, at which point I stopped testing). I did notice degraded predictions over time though, since a period did not necessarily belong after each result, and every now and again a Chinese char was still predicted, so I'd just be cautious about how prompts are chained together.

connor-henderson avatar May 23 '23 15:05 connor-henderson

@connor-henderson It's a bit of a contrived example, meant just to recreate the issue without having to loop too much while still showing what might be considered a normal use case. Even without it predicting non-English characters or words, you'll eventually encounter the issue within a few hundred loops.

dgram0 avatar May 23 '23 17:05 dgram0