[WIP] add SpeechT5 model
What does this PR do?
Add the SpeechT5 model to Transformers. See also https://github.com/huggingface/transformers/issues/17569
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Current status of this PR
To-do list
- verify that the ASR model (SpeechT5ForSpeechToText) can be fine-tuned on new data
- verify that the TTS model (SpeechT5ForTextToSpeech) can be fine-tuned on new data (loss still needs to be implemented)
- verify that the voice conversion model (SpeechT5ForSpeechToSpeech) can be fine-tuned on new data (loss still needs to be implemented)
- replace the "Matthijs/speech_xxx" checkpoints with official ones on the HF Hub
We decided not to implement SpeechT5ForPreTraining for now.
Notes
- When attention_mask is not all ones, the output is slightly different from the original model at the point where the mask goes from 1 to 0. This is due to a difference in how both models subsample the attention mask (_get_feature_vector_attention_mask in SpeechT5SpeechEncoderPrenet). This is not a big deal but may cause tiny ripple differences in the outputs elsewhere.
- The original model sets the attention_mask to 0 for padding tokens in the decoder input_ids. I disabled this because it does not play nice with model.generate. So the predictions are slightly different for the timesteps following the padding token (which really only happens when the sequence is complete but other sequences in the same batch have not completed yet).
Hello @hollance. I'll be helping with the Transformer Encoder-Decoder.
Hi @hollance I will be helping with TextDecoderPrenet / TextDecoderPostnet
@anuragshas:
I will be helping with TextDecoderPrenet / TextDecoderPostnet
Great! Are you also interested in looking at the tokenizer, since I believe the text pre- and post-net need to use it? The original model uses a BPE tokenizer (there is a download link in the README). I'm not sure what the current tokenizer in the code is; it was copied from Wav2Vec2, but I haven't looked at it in detail yet.
The SpeechEncoderPrenet is complete now and gives the same results as the original model. However, there are still some TODOs in this part of the code to look at later.
The encoder is complete and verified to work (although some parts could possibly be rewritten; these are marked with TODO). I've started adding the decoder, but it doesn't work yet (I haven't had time to fix it up).
Thanks for the in-depth review, @sanchit-gandhi!
Wondering if it would have been better to copy everything from Speech2Text, including for the encoder? I know I pointed you in the direction of W2V2! But it has a whole bunch of functionality that is pretty specific to W2V2 pre-training that isn't used in SpeechT5 (e.g. SpecAug). It might be possible to condense the code by copying the Speech2Text encoder model, rather than that from W2V2.
Aside from pre-training stuff, I did bring the code in line with Speech2Text. It's not exactly the same but a bit of a hybrid between Wav2Vec2 and Speech2Text. ;-)
We can change the function _get_feature_vector_attention_mask to match the original implementation if there's a difference. Better to have correctness here rather than duplicated code from UniSpeech.
I don't think either approach is more "correct" than the other. The question is: at the point where the attention mask goes from 1 to 0, this may happen halfway inside a block of 320 samples (or whatever the frame size is). Does that partially padded block get included in the predictions, or is the entire block considered padding and excluded? Basically: do we round up or down? SpeechT5 simply makes a different choice here than _get_feature_vector_attention_mask, but either one works fine.
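To make the rounding choice concrete, here is a toy sketch (the frame size of 320 is just the illustrative number from above; the real code derives the downsampled lengths from the conv feature extractor's kernel sizes and strides):

```python
import torch

frame_size = 320
audio_lengths = torch.tensor([1000, 640])  # unpadded lengths of two batch items

# round down: a partially filled frame at the 1 -> 0 boundary counts as padding
frames_round_down = audio_lengths // frame_size                    # tensor([3, 2])

# round up: the partially filled frame is still attended to
frames_round_up = (audio_lengths + frame_size - 1) // frame_size   # tensor([4, 2])
```

The downsampled attention mask then marks that many frames as 1 per item, so the two choices only disagree for the single frame that straddles the boundary, which is why the differences stay tiny.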
Ideally SpeechT5Model should load all the weights for the Transformer backbone, and SpeechT5ForConditionalGeneration all the weights for the Transformer backbone and pre-/post-nets. We could try and match the attribute names more closely between SpeechT5Model and SpeechT5ForConditionalGeneration, i.e. always assigning the encoder as self.encoder (rather than self.wrapped_encoder). And then try to make sure the structures follow as closely as possible. Not loading the decoder weights for CTC is fine!
The problem here is that SpeechT5ForConditionalGeneration will have encoder.wrapped_encoder in the checkpoint while SpeechT5Model only has encoder. I could fix this by making a "fake" wrapper that just calls the encoder without applying a pre-net, so that SpeechT5Model also has the encoder.wrapped_encoder path. (BartForCausalLM does something similar so it's not unprecedented.) EDIT: implemented this. Now the models load as expected.
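For reference, the pass-through wrapper boils down to something like this minimal sketch (simplified signature; the real SpeechT5EncoderWithoutPrenet in this PR forwards more arguments, and SpeechT5Encoder is the plain transformer encoder class from this PR):

```python
import torch.nn as nn

class SpeechT5EncoderWithoutPrenet(nn.Module):
    def __init__(self, config):
        super().__init__()
        # same attribute name as the pre-net wrappers, so checkpoints share
        # the encoder.wrapped_encoder.* weight paths
        self.wrapped_encoder = SpeechT5Encoder(config)

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        # no pre-net: hand the hidden states straight to the wrapped encoder
        return self.wrapped_encoder(hidden_states, attention_mask=attention_mask, **kwargs)
```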
Aside from pre-training stuff, I did bring the code in line with Speech2Text. It's not exactly the same but a bit of a hybrid between Wav2Vec2 and Speech2Text
That sounds great - this is a new model that sits somewhere between the two (acoustic encoder is more Wav2Vec2-like, but the transformer decoder is similar to Speech2Text), so taking elements from each is no issue!
Basically: do we round up or down? SpeechT5 simply makes a different choice here than _get_feature_vector_attention_mask but either one works fine.
I see, the numerical differences are tiny as you say. Feel free to pick the one you think is more appropriate! I'd opt to bring ours in-line with the 'official' implementation, but you know more about it!
EDIT: implemented this. Now the models load as expected.
Amazing! With similar logic to BartForCausalLM in the end?
Design issues & questions
Philosophical question for the Transformers team:
TL;DR: SpeechT5 is different from the other models in Transformers and doesn't quite fit in with the design of the library. Is the current approach OK, or should we split it up into multiple different, completely independent models?
Some background on the model: SpeechT5 is a speech-to-text (or ASR) model, but also a text-to-speech (TTS) model, as well as a speech-to-speech model, and even text-to-text. These are four different model types but they all share the same encoder-decoder structure. The only difference is that they have different so-called pre-nets and post-nets.
For example, in the ASR model the encoder pre-net is basically the first set of layers from Wav2Vec2, and the decoder pre- and post-nets are essentially the first and last layers of BART. By swapping in different pre & post-nets, and fine-tuning the model, the same pretrained architecture can handle different tasks.
So far I've implemented only the ASR and TTS model, but there are also checkpoints for voice conversion (speech-to-speech) and pretraining that we might want to add.
Specifically, these are the issues I ran into:
- The current design of Transformers assumes that a model always has one kind of input and one kind of output. This is not true for SpeechT5: some versions of the model have text as input, others speech. Likewise for the output.

- In other seq2seq models, there is a ForConditionalGeneration class that does the predictions. Here, we have at least two such classes, so I named them ForSpeechToText (ASR) and ForTextToSpeech (TTS) instead.

- Normally, we'd have an Encoder and a Decoder class. In SpeechT5, the encoder and decoder classes also need to run a pre-net. This is why there are wrapper classes such as:
  - SpeechT5EncoderWithSpeechPrenet
  - SpeechT5EncoderWithTextPrenet
  - SpeechT5EncoderWithoutPrenet
  - SpeechT5DecoderWithSpeechPrenet
  - SpeechT5DecoderWithTextPrenet
  - SpeechT5DecoderWithoutPrenet

  The SpeechT5ForSpeechToText and SpeechT5ForTextToSpeech models instantiate the appropriate encoder and decoder wrapper classes (and also run the post-net); see the sketch after this list. The base SpeechT5Model class needs special logic to handle these different wrappers. It shouldn't be used with the "naked" SpeechT5Encoder / SpeechT5Decoder classes, since they don't have any pre-nets. This approach works, but it's also trying to shoehorn a model that doesn't quite fit into the design of Transformers.

- One side effect of having these different pre- and post-nets is that SpeechT5Model cannot know in advance what sort of data it gets as input. The input could be tokens (input_ids), raw speech (input_values), or spectrograms (input_features). To allow for this ambiguity, I named the input argument input_values everywhere. However, that's the same term that is used for raw audio input. None of the other terms (input_ids, input_features, or inputs_embeds) is really suitable either. Suggestions for a better generic input name that covers the three different modalities are welcome. 😄

- Our seq2seq models combine the preprocessing for the encoder and for the decoder into a Processor class. The SpeechT5 ASR model needs a different Processor than the TTS model, so I made SpeechT5ProcessorForSpeechToText and SpeechT5ProcessorForTextToSpeech. These also use different FeatureExtractor objects, as they process the audio in different ways. The design of Transformers assumes that each model has only one processor / feature extractor, but here we have two, and we might need a third one (SpeechT5ProcessorForSpeechToSpeech) for the voice conversion checkpoint. Having multiple processors / feature extractors for the same model type doesn't work very well with the Auto classes, as they assume there is always only one.

- The TTS model applies a vocoder to the output of the encoder-decoder model. The weights for this vocoder are kept separate from the main model and it has its own Config object, but the implementation lives in modeling_speecht5.py. Currently there is no way to share vocoders between audio models, but they probably should live in their own separate world. (This also affects the WIP SpeechToSpeech and FastSpeech2 models.)

- The model.generate() logic works fine for the ASR model but not for the TTS model. It would be nice if the GenerationMixin could handle the TTS generation logic as well.

- There is a version of the ASR model that only uses the encoder and outputs CTC tokens. This uses its own tokenizer, SpeechT5CTCTokenizer, which derives from SpeechT5Tokenizer. I haven't seen that pattern in any of the other models in the library.

- The SpeechT5ProcessorForSpeechToSpeech doesn't really fit in with the design of ProcessorMixin. It has two feature extractors and no tokenizer. In principle this works, except that saving two feature extractors is not supported, as they overwrite each other's properties. (We could fix this by overriding the save/load_pretrained logic to add namespacing to the JSON file.)

- Pipelines don't work. When you do the following, it always tries to instantiate the ForCTC model. This happens because we have both a CTC and a Seq2Seq model for ASR, while the pipeline logic assumes there's only one of these.

  generator = pipeline(task="automatic-speech-recognition", model="Matthijs/speecht5_asr")
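As a sketch of the wrapper point above: roughly, the task-specific models pick their encoder/decoder wrappers like this (simplified; the real constructors in this PR also build the post-nets and output heads, and the exact signatures may differ):

```python
import torch.nn as nn

# SpeechT5Model and the wrapper classes are the ones listed in the bullet above.
class SpeechT5ForSpeechToText(nn.Module):  # ASR: speech in, text out
    def __init__(self, config):
        super().__init__()
        self.speecht5 = SpeechT5Model(
            config,
            encoder=SpeechT5EncoderWithSpeechPrenet(config),
            decoder=SpeechT5DecoderWithTextPrenet(config),
        )

class SpeechT5ForTextToSpeech(nn.Module):  # TTS: text in, speech out
    def __init__(self, config):
        super().__init__()
        self.speecht5 = SpeechT5Model(
            config,
            encoder=SpeechT5EncoderWithTextPrenet(config),
            decoder=SpeechT5DecoderWithSpeechPrenet(config),
        )
```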
Except for fixing some small issues, the implementation of SpeechT5 is mostly complete, so you can look at the source in this PR in case the above sounds a bit vague. 😃
What I'd like to know is: How do you feel about the approach I've taken to make this model fit into Transformers?
Obviously, I wouldn't expect a complete redesign of Transformers just to accommodate SpeechT5, but I would like some feedback on whether you think the above decisions are acceptable. It works, but it also kind of breaks some of the conventions that users of the library might expect.
An alternative would be to create completely different models in different folders, such as speecht5_asr and speecht5_tts and to treat these as unrelated. One of these would largely be a copy of the other, but with different pre- and post-nets. (We could simply ignore the ForPreTraining model, as it's unlikely to be in high demand.)
@hollance Re .generate() not supporting TTS -- transformers doesn't have any TTS model, and in fact .generate() only supports text (or other sets of integers) output. I'm not sure whether expanding .generate() is the way to go, I would have to think about it, but I'd be happy to support in whatever is needed from the conditional generation angle!
@sanchit-gandhi you folks are working on the generation of audio, correct? Do you have plans for generate() or anything related to conditional generation?
Vocoders
The TTS and voice conversion models use a vocoder to convert the predicted mel spectrogram into an audio waveform. Currently this is implemented as SpeechT5HiFiGAN inside the SpeechT5 modeling file.
The vocoder is treated as a separate model (on the Hub under Matthijs/speecht5_hifigan). It has its own weights and config that are separate from the SpeechT5 model.
To generate speech, you optionally pass the vocoder object to model.generate_speech(). Without it, this method outputs the spectrogram. With the vocoder, it outputs speech.
This allows the user to provide their own vocoder instead of the pretrained one.
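A hedged usage sketch of the above (the class names are the ones from this PR at this point; "Matthijs/speecht5_tts" is a placeholder checkpoint name following the same pattern as the others, and the exact processor call and generate_speech signature may still change):

```python
from transformers import SpeechT5ForTextToSpeech, SpeechT5ProcessorForTextToSpeech, SpeechT5HiFiGAN

processor = SpeechT5ProcessorForTextToSpeech.from_pretrained("Matthijs/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("Matthijs/speecht5_tts")
vocoder = SpeechT5HiFiGAN.from_pretrained("Matthijs/speecht5_hifigan")

inputs = processor(text="Hello world", return_tensors="pt")

# without a vocoder: the predicted mel spectrogram
spectrogram = model.generate_speech(inputs["input_ids"])

# with a vocoder: an audio waveform
waveform = model.generate_speech(inputs["input_ids"], vocoder=vocoder)
```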
(Note that automapping of SpeechT5HiFiGANConfig is not working in this implementation because it has its own config file.)
My suggestion is that we treat vocoders as separate model types, just like feature extractors and tokenizers, and that they are owned by the processor, which calls the vocoder in the postprocessing / decoding step.
Note that the original checkpoint for the voice conversion model comes with trained vocoders for the different voices, which do not use the HiFi-GAN architecture but the one from Parallel WaveGAN. I did not implement this, since the HiFi-GAN vocoder works fine here too.
@sanchit-gandhi:
Is SpeechT5ProcessorForSpeechToSpeech working or are the feature extractors still overriding each other?
They are still overriding each other. I think the only way to fix this is to override save_pretrained and from_pretrained that are inherited from ProcessorMixin.
Even though what gets saved into preprocessor_config.json is wrong, the processor actually does process the data OK, so we could get away with it — but this is mostly due to both feature extractors using the same property names. And that would be asking for bugs when someone uses different configuration values.
SpeechT5ProcessorForSpeechToSpeech
(Writing this in case we want to fix this issue properly at some point.)
The problem: processor objects are assumed to have a tokenizer and a feature extractor. The config for the feature extractor is saved to preprocessor_config.json. However, SpeechT5ProcessorForSpeechToSpeech has no tokenizer and two feature extractors. As a result, the second feature extractor overwrites the JSON from the first.
In my opinion, the correct approach here would be to not hardcode the filename for feature extractors. Rather than using the FEATURE_EXTRACTOR_NAME constant, each feature extractor would get a class variable feature_extractor_config_name = FEATURE_EXTRACTOR_NAME. By default this is preprocessor_config.json but a class can override it if necessary. For the S2S model, we'd have preprocessor_encoder_config.json and preprocessor_decoder_config.json, for example.
However, the above solution would affect all of the models in Transformers, and it may still not work due to certain assumptions being made (i.e. you need to know the class name of the feature extractor so you can look up what its filename should be, which is a chicken-and-egg problem). So making this change just for SpeechT5 seems excessive at this point.
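A minimal sketch of that proposal (hypothetical: feature_extractor_config_name does not exist in the library, and the two feature extractor class names are only illustrative):

```python
from transformers import SequenceFeatureExtractor

class SpeechT5EncoderFeatureExtractor(SequenceFeatureExtractor):
    # hypothetical class attribute; overrides the default "preprocessor_config.json"
    feature_extractor_config_name = "preprocessor_encoder_config.json"

class SpeechT5DecoderFeatureExtractor(SequenceFeatureExtractor):
    feature_extractor_config_name = "preprocessor_decoder_config.json"

# FeatureExtractionMixin.save_pretrained / from_pretrained would then look up
# type(self).feature_extractor_config_name instead of the hardcoded
# FEATURE_EXTRACTOR_NAME constant.
```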
Hacks I've tried to work around this:
- Override save_pretrained in SpeechT5ProcessorForSpeechToSpeech to save each feature extractor's config in a subdir. This works OK for saving. (It requires some changes to _upload_modified_files so that it picks up the config files inside these subdirs.) However, it does not work for loading, since the feature extractor's from_pretrained does not know that it's supposed to look inside a subdir, and there's no way to tell it to do so. Fixing this would require duplicating a lot of code from FeatureExtractionMixin. And even then, it doesn't work with AutoProcessor.
- Create a FeatureExtractionMixinHack class that extends FeatureExtractionMixin. This duplicates the loading and saving code in order to save using different filenames for each feature extractor. The SpeechT5 feature extractors now extend from this. Very messy and brittle; I'm not even sure it works OK in all situations.
- Save the properties of both feature extractors in the same file, as nested dictionaries (a rough sketch of this idea follows below). This requires massive changes, as the code is currently set up to save one file per object.

For now, the "solution" is to not use save_pretrained and from_pretrained with SpeechT5ProcessorForSpeechToSpeech and pretend everything is fine. 😅
To the reviewer: This PR has been reviewed by the audio team several times already, and this is (hopefully 😄) the final review before merging.
The only remaining thing is replacing the checkpoints with official ones, but I'd rather hold off on creating these until the PR has been approved.
(We decided not to implement fine-tuning right now.)
Hi @sgugger, I made the fixes you asked for. There is now just one feature extractor / processor and the auto classes have been removed again.
(There are two tests that seem to fail but they're not related to this PR.)
@sgugger Hi Sylvain, I made the changes to the feature extractor you asked for. Also, the checkpoints have been updated to point to the microsoft organization on the hub. If all is good with you, feel free to merge this PR at your leisure. Thanks! 😄
(Again there is a failing test but this seems unrelated to this PR.)
EDIT: Unless @sanchit-gandhi wants to give this the final once-over too.
Good to merge for me too!
@sgugger I don't have rights to merge this myself. So someone else needs to press that button. 😅