[WIP] add SpeechT5 model
What does this PR do?
Add the SpeechT5 model to Transformers. See also https://github.com/huggingface/transformers/issues/17569
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Current status of this PR
To-do list
- verify that the ASR model (SpeechT5ForSpeechToText) can be fine-tuned on new data
- verify that the TTS model (SpeechT5ForTextToSpeech) can be fine-tuned on new data (loss still needs to be implemented)
- verify that the voice conversion model (SpeechT5ForSpeechToSpeech) can be fine-tuned on new data (loss still needs to be implemented)
- replace the "Matthijs/speech_xxx" checkpoints with official ones on the HF Hub
We decided not to implement SpeechT5ForPreTraining for now.
Notes
- When attention_mask is not all ones, the output is slightly different from the original model at the point where the mask goes from 1 to 0. This is due to a difference in how both models subsample the attention mask (_get_feature_vector_attention_mask in SpeechT5SpeechEncoderPrenet). This is not a big deal but may cause tiny ripple differences in the outputs elsewhere.
- The original model sets the attention_mask to 0 for padding tokens in the decoder input_ids. I disabled this because it does not play nice with model.generate. So the predictions are slightly different for the timesteps following the padding token (which really only happens when the sequence is complete but other sequences in the same batch have not completed yet).
Hello @hollance. I'll be helping with the Transformer Encoder-Decoder.
Hi @hollance I will be helping with TextDecoderPrenet / TextDecoderPostnet
@anuragshas:
I will be helping with TextDecoderPrenet / TextDecoderPostnet
Great! Are you also interested in looking at the tokenizer, since I believe the text pre- and post-net need to use it? The original model uses a BPE tokenizer (there is a download link in the README). I'm not sure what the current tokenizer in the code is; it was copied from Wav2Vec2, but I haven't looked at it in detail yet.
The SpeechEncoderPrenet is complete now and gives the same results as the original model. However, there are still some TODOs in this part of the code to look at later.
The encoder is complete and verified to work (although some parts could possibly be rewritten; these are marked with TODO). I've started adding the decoder, but it doesn't work yet (I haven't had time to fix it up).
Thanks for the in-depth review, @sanchit-gandhi!
Wondering if it would have been better to copy everything from Speech2Text, including for the encoder? I know I pointed you in the direction of W2V2! But it has a whole bunch of functionality that is pretty specific to W2V2 pre-training that isn't used in SpeechT5 (e.g. SpecAug). It might be possible to condense the code by copying the Speech2Text encoder model, rather than that from W2V2.
Aside from pre-training stuff, I did bring the code in line with Speech2Text. It's not exactly the same but a bit of a hybrid between Wav2Vec2 and Speech2Text. ;-)
We can change the function _get_feature_vector_attention_mask to match the original implementation if there's a difference. Better to have correctness here rather than duplicated code from UniSpeech.
I don't think either approach is more "correct" than the other. The question is: at the point where the attention mask goes from 1 to 0, this may happen halfway inside a block of 320 samples (or whatever the frame size is). Does that partially padded block get included in the predictions, or is the entire block considered padding and excluded? Basically: do we round up or down? SpeechT5 simply makes a different choice here than _get_feature_vector_attention_mask, but either one works fine.
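To make the rounding choice concrete, here is a toy sketch (the frame size of 320 is just the illustrative number from above; the real code derives the downsampled lengths from the conv feature extractor's kernel sizes and strides):

```python
import torch

frame_size = 320
audio_lengths = torch.tensor([1000, 640])  # unpadded lengths of two batch items

# round down: a partially filled frame at the 1 -> 0 boundary counts as padding
frames_round_down = audio_lengths // frame_size                    # tensor([3, 2])

# round up: the partially filled frame is still attended to
frames_round_up = (audio_lengths + frame_size - 1) // frame_size   # tensor([4, 2])
```

The downsampled attention mask then marks that many frames as 1 per item, so the two choices only disagree for the single frame that straddles the boundary, which is why the differences stay tiny.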
Ideally SpeechT5Model should load all the weights for the Transformer backbone, and SpeechT5ForConditionalGeneration all the weights for the Transformer backbone and pre-/post-nets. We could try and match the attribute names more closely between SpeechT5Model and SpeechT5ForConditionalGeneration, i.e. always assigning the encoder as self.encoder (rather than self.wrapped_encoder). And then try to make sure the structures follow as closely as possible. Not loading the decoder weights for CTC is fine!
The problem here is that SpeechT5ForConditionalGeneration will have encoder.wrapped_encoder in the checkpoint while SpeechT5Model only has encoder. I could fix this by making a "fake" wrapper that just calls the encoder without applying a pre-net, so that SpeechT5Model also has the encoder.wrapped_encoder path. (BartForCausalLM does something similar so it's not unprecedented.) EDIT: implemented this. Now the models load as expected.
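For reference, the pass-through wrapper boils down to something like this minimal sketch (simplified signature; the real SpeechT5EncoderWithoutPrenet in this PR forwards more arguments, and SpeechT5Encoder is the plain transformer encoder class from this PR):

```python
import torch.nn as nn

class SpeechT5EncoderWithoutPrenet(nn.Module):
    def __init__(self, config):
        super().__init__()
        # same attribute name as the pre-net wrappers, so checkpoints share
        # the encoder.wrapped_encoder.* weight paths
        self.wrapped_encoder = SpeechT5Encoder(config)

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        # no pre-net: hand the hidden states straight to the wrapped encoder
        return self.wrapped_encoder(hidden_states, attention_mask=attention_mask, **kwargs)
```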
Aside from pre-training stuff, I did bring the code in line with Speech2Text. It's not exactly the same but a bit of a hybrid between Wav2Vec2 and Speech2Text
That sounds great - this is a new model that sits somewhere between the two (acoustic encoder is more Wav2Vec2-like, but the transformer decoder is similar to Speech2Text), so taking elements from each is no issue!
Basically: do we round up or down? SpeechT5 simply makes a different choice here than _get_feature_vector_attention_mask but either one works fine.
I see, the numerical differences are tiny as you say. Feel free to pick the one you think is more appropriate! I'd opt to bring ours in-line with the 'official' implementation, but you know more about it!
EDIT: implemented this. Now the models load as expected.
Amazing! With similar logic to BartForCausalLM in the end?
Design issues & questions
Philosophical question for the Transformers team:
TL;DR: SpeechT5 is different from the other models in Transformers and doesn't quite fit in with the design of the library. Is the current approach OK, or should we split it up into multiple different, completely independent models?
Some background on the model: SpeechT5 is a speech-to-text (or ASR) model, but also a text-to-speech (TTS) model, as well as a speech-to-speech model, and even text-to-text. These are four different model types but they all share the same encoder-decoder structure. The only difference is that they have different so-called pre-nets and post-nets.
For example, in the ASR model the encoder pre-net is basically the first set of layers from Wav2Vec2, and the decoder pre- and post-nets are essentially the first and last layers of BART. By swapping in different pre & post-nets, and fine-tuning the model, the same pretrained architecture can handle different tasks.
So far I've implemented only the ASR and TTS model, but there are also checkpoints for voice conversion (speech-to-speech) and pretraining that we might want to add.
Specifically, these are the issues I ran into:
- The current design of Transformers assumes that a model always has one kind of input and one kind of output. This is not true for SpeechT5: some versions of the model have text as input, others speech. Likewise for the output.

- In other seq2seq models, there is a ForConditionalGeneration class that does the predictions. Here, we have at least two such classes, so I named them ForSpeechToText (ASR) and ForTextToSpeech (TTS) instead.

- Normally, we'd have an Encoder and a Decoder class. In SpeechT5, the encoder and decoder classes also need to run a pre-net. This is why there are wrapper classes such as:
  - SpeechT5EncoderWithSpeechPrenet
  - SpeechT5EncoderWithTextPrenet
  - SpeechT5EncoderWithoutPrenet
  - SpeechT5DecoderWithSpeechPrenet
  - SpeechT5DecoderWithTextPrenet
  - SpeechT5DecoderWithoutPrenet

  The SpeechT5ForSpeechToText and SpeechT5ForTextToSpeech models instantiate the appropriate encoder and decoder wrapper classes (and also run the post-net); see the sketch after this list. The base SpeechT5Model class needs special logic to handle these different wrappers. It shouldn't be used with the "naked" SpeechT5Encoder / SpeechT5Decoder classes, since they don't have any pre-nets. This approach works, but it's also trying to shoehorn a model that doesn't quite fit into the design of Transformers.

- One side effect of having these different pre- and post-nets is that SpeechT5Model cannot know in advance what sort of data it gets as input. The input could be tokens (input_ids), raw speech (input_values), or spectrograms (input_features). To allow for this ambiguity, I named the input argument input_values everywhere. However, that's the same term that is used for raw audio input. None of the other terms (input_ids, input_features, or inputs_embeds) is really suitable either. Suggestions for a better generic input name that covers the three different modalities are welcome. 😄

- Our seq2seq models combine the preprocessing for the encoder and for the decoder into a Processor class. The SpeechT5 ASR model needs a different Processor than the TTS model, so I made SpeechT5ProcessorForSpeechToText and SpeechT5ProcessorForTextToSpeech. These also use different FeatureExtractor objects, as they process the audio in different ways. The design of Transformers assumes that each model has only one processor / feature extractor, but here we have two, and we might need a third one (SpeechT5ProcessorForSpeechToSpeech) for the voice conversion checkpoint. Having multiple processors / feature extractors for the same model type doesn't work very well with the Auto classes, as they assume there is always only one.

- The TTS model applies a vocoder to the output of the encoder-decoder model. The weights for this vocoder are kept separate from the main model and it has its own Config object, but the implementation lives in modeling_speecht5.py. Currently there is no way to share vocoders between audio models, but they probably should live in their own separate world. (This also affects the WIP SpeechToSpeech and FastSpeech2 models.)

- The model.generate() logic works fine for the ASR model but not for the TTS model. It would be nice if the GenerationMixin could handle the TTS generation logic as well.

- There is a version of the ASR model that only uses the encoder and outputs CTC tokens. This uses its own tokenizer, SpeechT5CTCTokenizer, which derives from SpeechT5Tokenizer. I haven't seen that pattern in any of the other models in the library.

- The SpeechT5ProcessorForSpeechToSpeech doesn't really fit in with the design of ProcessorMixin. It has two feature extractors and no tokenizer. In principle this works, except that saving two feature extractors is not supported, as they overwrite each other's properties. (We could fix this by overriding the save/load_pretrained logic to add namespacing to the JSON file.)

- Pipelines don't work. When you do the following, it always tries to instantiate the ForCTC model. This happens because we have both a CTC and a Seq2Seq model for ASR, while the pipeline logic assumes there's only one of these.

  generator = pipeline(task="automatic-speech-recognition", model="Matthijs/speecht5_asr")
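As a sketch of the wrapper point above: roughly, the task-specific models pick their encoder/decoder wrappers like this (simplified; the real constructors in this PR also build the post-nets and output heads, and the exact signatures may differ):

```python
import torch.nn as nn

# SpeechT5Model and the wrapper classes are the ones listed in the bullet above.
class SpeechT5ForSpeechToText(nn.Module):  # ASR: speech in, text out
    def __init__(self, config):
        super().__init__()
        self.speecht5 = SpeechT5Model(
            config,
            encoder=SpeechT5EncoderWithSpeechPrenet(config),
            decoder=SpeechT5DecoderWithTextPrenet(config),
        )

class SpeechT5ForTextToSpeech(nn.Module):  # TTS: text in, speech out
    def __init__(self, config):
        super().__init__()
        self.speecht5 = SpeechT5Model(
            config,
            encoder=SpeechT5EncoderWithTextPrenet(config),
            decoder=SpeechT5DecoderWithSpeechPrenet(config),
        )
```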
Except for fixing some small issues, the implementation of SpeechT5 is mostly complete, so you can look at the source in this PR in case the above sounds a bit vague. 😃
What I'd like to know is: How do you feel about the approach I've taken to make this model fit into Transformers?
Obviously, I wouldn't expect a complete redesign of Transformers just to accommodate SpeechT5, but I would like some feedback on whether you think the above decisions are acceptable. It works, but it also kind of breaks some of the conventions that users of the library might expect.
An alternative would be to create completely different models in different folders, such as speecht5_asr and speecht5_tts and to treat these as unrelated. One of these would largely be a copy of the other, but with different pre- and post-nets. (We could simply ignore the ForPreTraining model, as it's unlikely to be in high demand.)
@hollance Re .generate() not supporting TTS -- transformers doesn't have any TTS model, and in fact .generate() only supports text (or other sets of integers) output. I'm not sure whether expanding .generate() is the way to go, I would have to think about it, but I'd be happy to support in whatever is needed from the conditional generation angle!
@sanchit-gandhi you folks are working on the generation of audio, correct? Do you have plans for generate() or anything related to conditional generation?
Vocoders
The TTS and voice conversion models use a vocoder to convert the predicted mel spectrogram into an audio waveform. Currently this is implemented as SpeechT5HiFiGAN inside the SpeechT5 modeling file.
The vocoder is treated as a separate model (on the Hub under Matthijs/speecht5_hifigan). It has its own weights and config that are separate from the SpeechT5 model.
To generate speech, you optionally pass the vocoder object to model.generate_speech(). Without it, this method outputs the spectrogram. With the vocoder, it outputs speech.
This allows the user to provide their own vocoder instead of the pretrained one.
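A hedged usage sketch of the above (the class names are the ones from this PR at this point; "Matthijs/speecht5_tts" is a placeholder checkpoint name following the same pattern as the others, and the exact processor call and generate_speech signature may still change):

```python
from transformers import SpeechT5ForTextToSpeech, SpeechT5ProcessorForTextToSpeech, SpeechT5HiFiGAN

processor = SpeechT5ProcessorForTextToSpeech.from_pretrained("Matthijs/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("Matthijs/speecht5_tts")
vocoder = SpeechT5HiFiGAN.from_pretrained("Matthijs/speecht5_hifigan")

inputs = processor(text="Hello world", return_tensors="pt")

# without a vocoder: the predicted mel spectrogram
spectrogram = model.generate_speech(inputs["input_ids"])

# with a vocoder: an audio waveform
waveform = model.generate_speech(inputs["input_ids"], vocoder=vocoder)
```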
(Note that automapping of SpeechT5HiFiGANConfig is not working in this implementation because it has its own config file.)
My suggestion is that we treat vocoders as separate model types, just like feature extractors and tokenizers, and that they are owned by the processor, which calls the vocoder in the postprocessing / decoding step.
Note that the original checkpoint for the voice conversion model comes with trained vocoders for the different voices, which do not use the HiFi-GAN architecture but the one from Parallel WaveGAN. I did not implement this, since the HiFi-GAN vocoder works fine here too.
@sanchit-gandhi:
Is SpeechT5ProcessorForSpeechToSpeech working or are the feature extractors still overriding each other?
They are still overriding each other. I think the only way to fix this is to override save_pretrained and from_pretrained that are inherited from ProcessorMixin.
Even though what gets saved into preprocessor_config.json is wrong, the processor actually does process the data OK, so we could get away with it — but this is mostly due to both feature extractors using the same property names. And that would be asking for bugs when someone uses different configuration values.
SpeechT5ProcessorForSpeechToSpeech
(Writing this in case we want to fix this issue properly at some point.)
The problem: processor objects are assumed to have a tokenizer and a feature extractor. The config for the feature extractor is saved to preprocessor_config.json. However, SpeechT5ProcessorForSpeechToSpeech has no tokenizer and two feature extractors. As a result, the second feature extractor overwrites the JSON from the first.
In my opinion, the correct approach here would be to not hardcode the filename for feature extractors. Rather than using the FEATURE_EXTRACTOR_NAME constant, each feature extractor would get a class variable feature_extractor_config_name = FEATURE_EXTRACTOR_NAME. By default this is preprocessor_config.json but a class can override it if necessary. For the S2S model, we'd have preprocessor_encoder_config.json and preprocessor_decoder_config.json, for example.
However, the above solution would affect all of the models in Transformers, and it may still not work due to certain assumptions being made (i.e. you need to know the class name of the feature extractor so you can look up what its filename should be, which is a chicken-and-egg problem). So making this change just for SpeechT5 seems excessive at this point.
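A minimal sketch of that proposal (hypothetical: feature_extractor_config_name does not exist in the library, and the two feature extractor class names are only illustrative):

```python
from transformers import SequenceFeatureExtractor

class SpeechT5EncoderFeatureExtractor(SequenceFeatureExtractor):
    # hypothetical class attribute; overrides the default "preprocessor_config.json"
    feature_extractor_config_name = "preprocessor_encoder_config.json"

class SpeechT5DecoderFeatureExtractor(SequenceFeatureExtractor):
    feature_extractor_config_name = "preprocessor_decoder_config.json"

# FeatureExtractionMixin.save_pretrained / from_pretrained would then look up
# type(self).feature_extractor_config_name instead of the hardcoded
# FEATURE_EXTRACTOR_NAME constant.
```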
Hacks I've tried to work around this:
- Override save_pretrained in SpeechT5ProcessorForSpeechToSpeech to save each feature extractor's config in a subdir. This works OK for saving. (It requires some changes to _upload_modified_files so that it picks up the config files inside these subdirs.) However, it does not work for loading, since the feature extractor's from_pretrained does not know that it's supposed to look inside a subdir, and there's no way to tell it to do so. Fixing this would require duplicating a lot of code from FeatureExtractionMixin. And even then, it doesn't work with AutoProcessor.
- Create a FeatureExtractionMixinHack class that extends FeatureExtractionMixin. This duplicates the loading and saving code in order to save using different filenames for each feature extractor. The SpeechT5 feature extractors now extend from this. Very messy and brittle; I'm not even sure it works OK in all situations.
- Save the properties of both feature extractors in the same file, as nested dictionaries (a rough sketch of this idea follows below). This requires massive changes, as the code is currently set up to save one file per object.

For now, the "solution" is to not use save_pretrained and from_pretrained with SpeechT5ProcessorForSpeechToSpeech and pretend everything is fine. 😅
To the reviewer: This PR has been reviewed by the audio team several times already, and this is (hopefully 😄) the final review before merging.
The only remaining thing is replacing the checkpoints with official ones, but I'd rather hold off on creating these until the PR has been approved.
(We decided not to implement fine-tuning right now.)
Hi @sgugger, I made the fixes you asked for. There is now just one feature extractor / processor and the auto classes have been removed again.
(There are two tests that seem to fail but they're not related to this PR.)
@sgugger Hi Sylvain, I made the changes to the feature extractor you asked for. Also, the checkpoints have been updated to point to the microsoft organization on the hub. If all is good with you, feel free to merge this PR at your leisure. Thanks! 😄
(Again there is a failing test but this seems unrelated to this PR.)
EDIT: Unless @sanchit-gandhi wants to give this the final once-over too.
Good to merge for me too!
@sgugger I don't have rights to merge this myself. So someone else needs to press that button. 😅