[Whisper] Add SpecAugment
What does this PR do?
Hi @sanchit-gandhi @ArthurZucker,
Thanks for pointing out the flaw in the other PR (https://github.com/huggingface/transformers/pull/21063)! Here I will add SpecAugment to modeling_whisper.py
Several things have been done or remain to be done:
- [x] Return attention_mask from WhisperFeatureExtractor, which will be used to guide the mask function along the time axis (see the sketch after this list)
- [x] Rescale attention_mask from the sample level (480000) to the feature level (3000) by hop_length (160). This is done inside WhisperFeatureExtractor since hop_length is defined there, but I'm not sure whether the returned attention_mask has other uses
- [x] Copy _compute_mask_indices from wav2vec2, the utility function that generates the masks
- [x] Add _mask_input_features to mask input_features, following _mask_hidden_states in wav2vec2
- [x] Add the related parameters to the model config
- [ ] Add tests
- [ ] Update the training script run_speech_recognition_seq2seq.py, adding attention_mask to prepare_dataset
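For illustration, a minimal sketch (not the library's internal code) of how the feature extractor can return a padding mask already downsampled to the feature level; the exact shapes assume the whisper-tiny defaults (16 kHz sampling, 80 mel bins, hop_length of 160, padding to 30 s):
import numpy as np
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
# 10 s of audio at 16 kHz; the extractor pads it to 30 s (480000 samples)
audio = np.random.randn(16000 * 10).astype(np.float32)
inputs = feature_extractor(
    audio,
    sampling_rate=16000,
    return_attention_mask=True,  # also return the padding mask
    return_tensors="np",
)
print(inputs["input_features"].shape)  # (1, 80, 3000) log-mel features
print(inputs["attention_mask"].shape)  # (1, 3000): 480000 samples / hop_length (160)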
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
The documentation is not available anymore as the PR was closed or merged.
Looks very nice now that everything's living in the modelling file! Great that you've followed Wav2Vec2 so closely as well.
I see the problem that you're currently working around with the padded audio inputs! Here you're passing an attention mask from the feature extractor to the model, which tells the model where the audio inputs have been padded to 30s. When we mask using SpecAug, we don't want to mask any of these padded features, only the real audio inputs.
With regards to whether we should use an attention mask for SpecAug, it's hard to know because we don't have a reference implementation. However, my feeling is that we should use an attention mask and only mask the real audio inputs, not any of the padded zeros. It makes little sense to pass an audio input of length 10s to the model and then mask the spectrogram from 20-25s (which would just be silence...).
WDYT here @bofenghuang @ArthurZucker? Pass an attention mask and only compute SpecAug on the real audio inputs? Or ditch the attention mask and compute SpecAug uniformly across the 30s input (whether that input be audio or padded silence)?
Hi @sanchit-gandhi, thanks for the explanation! I agree with you; here I've tried to mask only the real values using attention_mask
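For concreteness, a hypothetical numpy sketch of that idea (not the actual _mask_input_features implementation): only the frames marked as real by attention_mask are eligible for time masking, and the padded tail is left untouched. The mask_prob/mask_length arguments are illustrative names, not the library's exact parameters:
import numpy as np
def mask_time_only_real(input_features, attention_mask, mask_prob=0.05, mask_length=10):
    # input_features: (batch, num_mel_bins, seq_len); attention_mask: (batch, seq_len) of 0/1
    batch_size = input_features.shape[0]
    for b in range(batch_size):
        real_len = int(attention_mask[b].sum())  # number of non-padded frames
        if real_len <= mask_length:
            continue
        num_masks = int(mask_prob * real_len / mask_length)
        starts = np.random.randint(0, real_len - mask_length, size=num_masks)
        for start in starts:
            # zero out a span of real frames across all mel bins
            input_features[b, :, start : start + mask_length] = 0.0
    return input_features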
@bofenghuang is it ready for review?
@ArthurZucker yes please! Once validated, we could add this option to run_speech_recognition_seq2seq.py
Yes! Let's go for numpy! Especially given that current users would have a breaking change if they do not have librosa.
@sanchit-gandhi @ArthurZucker thanks for the review! Just added the test!
Can you make sure to fix the conflicts, and rebase on main to use the latest linter? (you'll need to do another pip install -e ".[dev]")
@ArthurZucker thanks for the tips! I think it's done
Will review again!
Done! Thanks for all the reviews and discussions @ArthurZucker @sanchit-gandhi @sgugger!
Thanks a lot for your contributions! And congrats on the PR 😉
@bofenghuang thanks for the PR
Looking forward to this (is it already available?)
Update training script run_speech_recognition_seq2seq.py, adding attention_mask to prepare_dataset
Any tips/hints on how to apply it during training would also be very helpful.
Hey @acul3! You just need to set these config parameters according to your spec aug requirements:
https://github.com/huggingface/transformers/blob/8c40ba73d8091ebe0bdc8da5b634bf7951d18f99/src/transformers/models/whisper/configuration_whisper.py#L139-L167
The rest will be taken care of for you in the training script!
The easiest way of doing this is probably by first downloading the config to your local device and setting the SpecAug params:
from transformers import WhisperConfig
config = WhisperConfig.from_pretrained("openai/whisper-tiny") # update to the checkpoint you want to fine-tune from
config.apply_spec_augment = True
... # set all the other spec aug params as required
config.save_pretrained("/some/local/path/to/save")
Then in the training script, either add the argument to your bash script:
--config_name="/some/local/path/to/save" \
Or load the config from the place you saved it if you're using a notebook:
config = WhisperConfig.from_pretrained("/some/local/path/to/save")
@acul3 please see this PR https://github.com/huggingface/transformers/pull/21942 :)