[Whisper] Add SpecAugment
What does this PR do?
Hi @sanchit-gandhi @ArthurZucker,
Thanks for pointing out the flaw in the other PR (https://github.com/huggingface/transformers/pull/21063)! Here I will add SpecAugment to modeling_whisper.py
Several things have been done or remain to be done:
- [x] Return attention_mask from WhisperFeatureExtractor, which will be used to guide the mask function along the time axis (see the sketch after this list)
- [x] Rescale attention_mask from the sample level (480000) to the feature level (3000) by hop_length (160). This is done inside WhisperFeatureExtractor since hop_length is defined there, but I'm not sure whether the returned attention_mask has other uses
- [x] Copy _compute_mask_indices from wav2vec2, the utility function that generates the masks
- [x] Add _mask_input_features to mask input_features, following _mask_hidden_states in wav2vec2
- [x] Add the related parameters to the model config
- [ ] Add tests
- [ ] Update the training script run_speech_recognition_seq2seq.py, adding attention_mask to prepare_dataset
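For illustration, a minimal sketch (not the library's internal code) of how the feature extractor can return a padding mask already downsampled to the feature level; the exact shapes assume the whisper-tiny defaults (16 kHz sampling, 80 mel bins, hop_length of 160, padding to 30 s):
import numpy as np
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
# 10 s of audio at 16 kHz; the extractor pads it to 30 s (480000 samples)
audio = np.random.randn(16000 * 10).astype(np.float32)
inputs = feature_extractor(
    audio,
    sampling_rate=16000,
    return_attention_mask=True,  # also return the padding mask
    return_tensors="np",
)
print(inputs["input_features"].shape)  # (1, 80, 3000) log-mel features
print(inputs["attention_mask"].shape)  # (1, 3000): 480000 samples / hop_length (160)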
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
The documentation is not available anymore as the PR was closed or merged.
Looks very nice now that everything's living in the modelling file! Great that you've followed Wav2Vec2 so closely as well.
I see the problem that you're currently working around with the padded audio inputs! Here you're passing an attention mask from the feature extractor to the model, which tells the model where the audio inputs have been padded to 30s. When we mask using SpecAug, we don't want to mask any of these padded features, only the real audio inputs.
With regards to whether we should use an attention mask for SpecAug, it's hard to know because we don't have a reference implementation. However, my feeling is that we should use an attention mask and only mask the real audio inputs, not any of the padded zeros. It makes little sense to pass an audio input of length 10s to the model and then mask the spectrogram from 20-25s (which would just be silence...).
WDYT here @bofenghuang @ArthurZucker? Pass an attention mask and only compute SpecAug on the real audio inputs? Or ditch the attention mask and compute SpecAug uniformly across the 30s input (whether that input be audio or padded silence)?
Hi @sanchit-gandhi, thanks for the explanation! I agree with you; here I've tried to mask only the real values using attention_mask
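For concreteness, a hypothetical numpy sketch of that idea (not the actual _mask_input_features implementation): only the frames marked as real by attention_mask are eligible for time masking, and the padded tail is left untouched. The mask_prob/mask_length arguments are illustrative names, not the library's exact parameters:
import numpy as np
def mask_time_only_real(input_features, attention_mask, mask_prob=0.05, mask_length=10):
    # input_features: (batch, num_mel_bins, seq_len); attention_mask: (batch, seq_len) of 0/1
    batch_size = input_features.shape[0]
    for b in range(batch_size):
        real_len = int(attention_mask[b].sum())  # number of non-padded frames
        if real_len <= mask_length:
            continue
        num_masks = int(mask_prob * real_len / mask_length)
        starts = np.random.randint(0, real_len - mask_length, size=num_masks)
        for start in starts:
            # zero out a span of real frames across all mel bins
            input_features[b, :, start : start + mask_length] = 0.0
    return input_features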
@bofenghuang is it ready for review?
@ArthurZucker yes please! Once validated, we could add this option to run_speech_recognition_seq2seq.py
Yes! Let's go for numpy! Especially given that current users would have a breaking change if they do not have librosa.
@sanchit-gandhi @ArthurZucker thanks for the review! Just added the test!
Can you make sure to fix the conflicts, and rebase on main to use the latest linter? (you'll need to do another pip install -e ".[dev]")
@ArthurZucker thanks for the tips! I think it's done
Will review again!
Done! Thanks for all the reviews and discussions @ArthurZucker @sanchit-gandhi @sgugger!
Thanks a lot for your contributions! And congrats on the PR 😉
@bofenghuang thanks for the PR
Looking forward to this (is it already available?)
Update training script run_speech_recognition_seq2seq.py, adding attention_mask to prepare_dataset
Any tips/hints on how to apply it during training would also be very helpful.
Hey @acul3! You just need to set these config parameters according to your spec aug requirements:
https://github.com/huggingface/transformers/blob/8c40ba73d8091ebe0bdc8da5b634bf7951d18f99/src/transformers/models/whisper/configuration_whisper.py#L139-L167
The rest will be taken care of for you in the training script!
The easiest way of doing this is probably by first downloading the config to your local device and setting the SpecAug params:
from transformers import WhisperConfig
config = WhisperConfig.from_pretrained("openai/whisper-tiny") # update to the checkpoint you want to fine-tune from
config.apply_spec_augment = True
... # set all the other spec aug params as required
config.save_pretrained("/some/local/path/to/save")
Then in the training script, either add the argument to your bash script:
--config_name="/some/local/path/to/save" \
Or load the config from the place you saved it if you're using a notebook:
config = WhisperConfig.from_pretrained("/some/local/path/to/save")
@acul3 please see this PR https://github.com/huggingface/transformers/pull/21942 :)