
TTS fine-tuning for SpeechT5

Open · hollance opened this pull request 1 year ago • 2 comments

What does this PR do?

Adds fine-tuning support for SpeechT5, in particular the TTS model.

The loss function is a combination of L1 loss on the mel-spectrograms, BCE loss for the stop-token prediction, and (optionally) a guided attention loss that encourages the cross-attentions to be diagonal.
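For reference, a minimal sketch of how such a combined loss might be assembled (function and argument names here are illustrative, not the actual implementation):

```python
import torch
import torch.nn.functional as F

def tts_loss(pred_spectrogram, target_spectrogram, stop_logits, stop_labels,
             guided_attn_loss=None, attn_weight=1.0):
    # L1 reconstruction loss on the predicted mel-spectrogram frames
    l1 = F.l1_loss(pred_spectrogram, target_spectrogram)
    # Binary cross-entropy on the per-frame stop-token prediction
    bce = F.binary_cross_entropy_with_logits(stop_logits, stop_labels)
    loss = l1 + bce
    # Optional guided attention term pushing cross-attention toward the diagonal
    if guided_attn_loss is not None:
        loss = loss + attn_weight * guided_attn_loss
    return loss
```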

The STFT feature extraction has been sped up; as a consequence it currently assumes the frame size is a power of two and raises an error otherwise.
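The power-of-two constraint comes from the radix-2 FFT used by the faster STFT. A hedged sketch of the validation (the function name is hypothetical):

```python
def check_frame_size(frame_size: int) -> None:
    # A radix-2 FFT only works when the frame length is a power of two,
    # so reject anything else up front with a clear error.
    if frame_size <= 0 or frame_size & (frame_size - 1) != 0:
        raise ValueError(
            f"frame_size must be a power of two for the fast STFT, got {frame_size}"
        )
```

The `n & (n - 1)` trick clears the lowest set bit, so the result is zero exactly when `n` has a single set bit, i.e. is a power of two.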

The feature extractor no longer outputs a stop_labels target. Padded areas in the spectrogram target are assumed to have the value -100 during training; the stop labels are computed automatically from this.
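A minimal sketch of how stop labels could be derived from the -100 padding convention (illustrative names, not the actual code):

```python
import torch

def infer_stop_labels(labels, pad_value=-100.0):
    # A frame counts as padding when every mel bin equals the pad value;
    # the stop target is then 1.0 exactly on those padded frames.
    is_padding = (labels == pad_value).all(dim=-1)  # (batch, frames)
    return is_padding.to(labels.dtype)
```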

Various other small fixes to the tokenizer, processor, etc., to support fine-tuning.

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [x] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

hollance avatar Feb 27 '23 16:02 hollance

The documentation is not available anymore as the PR was closed or merged.

Requesting review from @ArthurZucker for the custom STFT / log-Mel feature extraction components (feature_extraction_speecht5.py is the file of interest)

sanchit-gandhi avatar Mar 23 '23 09:03 sanchit-gandhi

Gently pinging @ArthurZucker :)

sanchit-gandhi avatar Apr 03 '23 10:04 sanchit-gandhi

Will review in 1h! Sorry for the delay

ArthurZucker avatar Apr 03 '23 10:04 ArthurZucker

  • Have the slow integration tests for the SpeechT5 models been run to check outputs are the same with the processing updates?

The outputs are not the same because the processing of the labels changed. But that's OK since the labels weren't used up to this point anyway.

  • Am I right in understanding stop_labels were never used (and so removal doesn't affect things?)

Correct.

  • With reduction_factor being moved to shift_spectrograms_right, does this effectively mean the input_values output from the processor has changed for the same config?

It didn't affect the input_values, only the labels. So nothing changed there for the normal operation of the model.
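For context, a minimal sketch of what a shift-right with a reduction factor might look like (details are illustrative, not the actual implementation):

```python
import torch

def shift_spectrograms_right(labels, reduction_factor=2):
    # Subsample the target frames by the reduction factor, then shift one
    # step to the right so the decoder sees frame t-1 when predicting frame t.
    if reduction_factor > 1:
        labels = labels[:, reduction_factor - 1 :: reduction_factor]
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    return shifted
```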

hollance avatar Apr 12 '23 09:04 hollance

@amyeroberts If you're OK with the changes, I think this can be merged now. The failing tests seem unrelated to SpeechT5.

hollance avatar Apr 12 '23 16:04 hollance

I'm pretty sure no one was using any of these properties before, since we only released SpeechT5 very recently and no one would have used it for training yet. Adding deprecation warnings seems excessive to me in this case.

hollance avatar Apr 13 '23 08:04 hollance

OK, put frame_signal_scale and reduction_factor back and added a deprecation warning.
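A sketch of the deprecation pattern in question, keeping the old attribute readable while warning on access (hypothetical class name, not the actual transformers code):

```python
import warnings

class FeatureExtractorSketch:
    # Illustrative only: retain a deprecated config attribute but emit a
    # FutureWarning whenever it is read.
    def __init__(self, frame_signal_scale=1.0):
        self._frame_signal_scale = frame_signal_scale

    @property
    def frame_signal_scale(self):
        warnings.warn(
            "frame_signal_scale is deprecated and will be removed in a future release",
            FutureWarning,
        )
        return self._frame_signal_scale
```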

hollance avatar Apr 13 '23 13:04 hollance

If you're all happy with it, feel free to merge (I don't have rights for that). 😃

hollance avatar Apr 18 '23 08:04 hollance

@hollance - sorry, my bad, I thought you did!

amyeroberts avatar Apr 18 '23 09:04 amyeroberts