                        add VITS model
What does this PR do?
Adds the VITS model for text-to-speech, in particular to support the MMS-TTS checkpoints (which use the same model architecture but a different tokenizer).
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Notes about the tokenizer:
- This is not the VITS tokenizer but the one for MMS-TTS.
- The vocab doesn't have padding (or unknown) tokens in it, but uses token_id 0 for this. That breaks with the HF tokenizers because they split the input text on the padding token, so if I set pad_token_id = 0, the letters that token_id 0 corresponds to disappear from the text.
- To fix this issue, I'm adding <pad> and <unk> to the vocab, but then in the model we set such token_ids to 0 before feeding the input into the first layer. It's a bit hacky; ideas for a nicer solution are appreciated.
- The tokenizer also inserts an additional token_id 0 in between every token (illustrated in the sketch below). No idea why, but that's how it works.
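To make the pad/unk workaround and the interleaving concrete, here is a minimal sketch. It is not the code in this PR; the tiny vocab and the helper are made up purely for illustration:

import torch

# Hypothetical toy vocab: note that id 0 is a real character, not a padding token.
vocab = {"k": 0, "h": 1, "e": 2, "l": 3, "o": 4, "<pad>": 5, "<unk>": 6}

def encode(text):
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text]
    # Insert token_id 0 in between every token, as described above.
    interleaved = []
    for i in ids:
        interleaved += [i, 0]
    return interleaved[:-1]

input_ids = torch.tensor([encode("hello")])
# Workaround from the notes above: before the first layer, map the extra
# <pad>/<unk> ids back to 0 so they never reach the embedding table.
input_ids = torch.where(input_ids >= 5, torch.zeros_like(input_ids), input_ids)
print(input_ids)  # tensor([[1, 0, 2, 0, 3, 0, 3, 0, 4]])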
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
This is now ready for a first review.
Two checkpoints are currently available:
- https://huggingface.co/Matthijs/mms-tts-eng
- https://huggingface.co/Matthijs/mms-tts-nld
Small usage example:
from transformers import VitsMmsTokenizer, VitsModel
import torch
tokenizer = VitsMmsTokenizer.from_pretrained("Matthijs/mms-tts-eng")
model = VitsModel.from_pretrained("Matthijs/mms-tts-eng")

inputs = tokenizer(text="Hello, my dog is cute", return_tensors="pt")
outputs = model(inputs["input_ids"])
speech = outputs.audio
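To actually listen to the result, a short follow-up; scipy and the 16 kHz rate are assumptions here (the rate matches the MMS-TTS sampling rate discussed further down):

import scipy.io.wavfile

# MMS-TTS checkpoints generate 16 kHz audio; write the waveform out as a mono WAV file.
scipy.io.wavfile.write("speech.wav", rate=16000, data=speech.detach().squeeze().numpy())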
The current model is the MMS-TTS version, not the original VITS version. The conversion scripts can handle both, but for original VITS support the tokenizer is still missing.
Still needs to be done:
- tests
- tokenizer for actual VITS
@Vaibhavs10 For this review, could you in particular verify that the names of the layers in the flow modules etc. make sense? Thanks!
Some of the MMS-TTS checkpoints require the uroman tool from https://github.com/isi-nlp/uroman to convert the input script into the Latin alphabet. Since this is a separate Perl script, it is not included in Transformers, and the user has to run uroman.pl themselves before using the tokenizer (see the sketch below).
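A hedged sketch of that preprocessing step, assuming the uroman repository has been cloned locally; the path below is a placeholder, the Korean text is just an example of a non-Latin script, and the tokenizer object is reused from the usage example above:

import subprocess

def uromanize(text, uroman_path="uroman/bin/uroman.pl"):
    # Pipe the text through the uroman Perl script, which reads stdin and
    # writes the romanized text to stdout.
    result = subprocess.run(
        ["perl", uroman_path],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

romanized = uromanize("안녕하세요")
inputs = tokenizer(text=romanized, return_tensors="pt")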
I'm not too sure why I'm asked for a review here as all comments from @sanchit-gandhi are being ignored.
No they aren't?! I've integrated most of his suggestions and replied with counterarguments otherwise.
Tokenizer can now handle both the original VITS models (which require phonemization) and the MMS-TTS models.
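For context, a hedged sketch of what phonemization for the original VITS checkpoints typically looks like; the phonemizer package with the espeak backend is an assumption here, not necessarily what the tokenizer in this PR calls internally:

from phonemizer import phonemize

# Convert graphemes to IPA phonemes, keeping punctuation and stress marks,
# which is what the original VITS pipeline expects as input.
phonemes = phonemize(
    "Hello, my dog is cute",
    language="en-us",
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)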
Hey @sgugger / @amyeroberts - this one is ready for a review! We've got one open discussion around variable namings: https://github.com/huggingface/transformers/pull/24085#discussion_r1243884355
But otherwise the comments have been resolved and the code cleaned up. Please direct any comments / suggestions to me, as I'll be taking over this PR for the rest of the integration.
Would be really great to get your review here @amyeroberts! We're aiming to feature this model in the next unit of the audio transformers course 🤗 https://github.com/huggingface/audio-transformers-course/pull/61
This is ready for a second look @amyeroberts
It would be awesome to get a second look here @amyeroberts before you go on leave!
I just installed transformers from this branch, but I'm having some issues both with the provided examples and with the course code. Here is a minimal reproduction https://colab.research.google.com/drive/1nyCvTpAhS89_LgY2JdxSeCSBCbhzMWC3?usp=sharing
- With example from https://hf.co/learn/audio-course/chapter6/pre-trained_models#massive-multilingual-speech-mms
from transformers import VitsModel, VitsTokenizer
import torch
model = VitsModel.from_pretrained("Matthijs/mms-tts-deu")
tokenizer = VitsTokenizer.from_pretrained("Matthijs/mms-tts-deu")
text_example = (
    "Ich bin Schnappi das kleine Krokodil, komm aus Ägypten das liegt direkt am Nil."
)
inputs = tokenizer(text_example, return_tensors="pt")
input_ids = inputs["input_ids"]
with torch.no_grad():
    outputs = model(input_ids)
speech = outputs.audio[0]
This fails with "index out of range" in the embedding lookup:
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
- With example from docs
from transformers import VitsTokenizer
tokenizer = VitsTokenizer.from_pretrained("sanchit-gandhi/mms-tts-eng")
inputs = tokenizer(text="Hello, my dog is cute", return_tensors="pt")
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
Hey @osanseviero - for the example in the course, did you pip install from the specific commit ID listed in the course instructions? The structure of the weights has changed, so the latest code isn't compatible with the weights pushed under the repo id "Matthijs/mms-tts-deu". So to use the latest commit, we need to use an updated version of the weights. The API is also a bit different, since this is still a WIP PR. For example, the .audio return field has now been replaced by .waveform. I think the best thing would be to wait until this PR gets its final reviews and is merged before committing to example use cases! Hopefully it's not long now!
Thanks for highlighting the tokenizer issue - will take a look at why that's failing! That's indeed a bug that needs to be fixed before merge (shouldn't block the next review though!)
All points addressed, so this is ready for a final review @amyeroberts. Thanks for your in-depth reviews here - the PR looks in pretty good shape!
Hey @amyeroberts - to have full compatibility with the text-to-audio pipeline class, we need to indicate the sampling_rate of the predicted audio waveforms in the model config:
https://github.com/huggingface/transformers/blob/2be8a9098e06262bdd5c16b5e8a70f145df88e96/src/transformers/pipelines/text_to_audio.py#L82
The sampling_rate corresponds to the sampling rate of the target audio that the model was trained on. It cannot be determined in any way other than from the value in the model's original config. MMS-TTS models use a sampling rate of 16 kHz and original VITS models use 22 kHz, but otherwise their configs are the same. The user needs to know the sampling rate the model generates at in order to play the audio back correctly; otherwise they are prone to silent errors. IMO adding it as an attribute of the main model class should suffice here:
https://github.com/huggingface/transformers/blob/ff3b08c3b2b5b33651f30356e634a5efca1c5f2a/src/transformers/models/vits/modeling_vits.py#L1374
Note that we cannot just add the sampling_rate to the config without also referencing it in the modelling file; the CI does not allow this:
https://app.circleci.com/pipelines/github/huggingface/transformers/71815/workflows/c0[…]cb-a064-7c1019e03630/jobs/904373/parallel-runs/0/steps/0-116
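A rough sketch of the shape being discussed; the class bodies below are illustrative stand-ins, not the actual configuration or modelling code:

# sampling_rate lives in the config and is surfaced on the model, so the
# text-to-audio pipeline can read it to know the playback rate of the waveform.
class VitsConfig:
    def __init__(self, sampling_rate=16_000, **kwargs):
        # 16 kHz for the MMS-TTS checkpoints, 22 kHz for the original VITS ones.
        self.sampling_rate = sampling_rate

class VitsModel:
    def __init__(self, config):
        self.config = config
        # Referencing the attribute in the modelling code keeps the config
        # checker happy (see the CI failure linked above).
        self.sampling_rate = config.sampling_rate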
cc @ylacombe
As discussed offline with @amyeroberts, we'll add it as an allowed attribute in the config checker: https://github.com/huggingface/transformers/pull/24085/commits/8b01633bccd298d3f9ff8f628b75336202fc53c4
Hi @hollance, thank you for adding this model to transformers 🤗.
There is a failing test:
python3 -m pytest -v tests/models/vits/test_modeling_vits.py::VitsModelTest::test_initialization
I have skipped it on the main branch. Would you like to help us investigate this test if you have some bandwidth? Otherwise we can take it on our side too.
If you decide to take a look, you have to remove the following line
https://github.com/huggingface/transformers/blob/ab8cba824e3887d90cb9f4d5866fde9243f2c9fe/tests/models/vits/test_modeling_vits.py#L172
so the test will be collected and run by pytest.
Let me know :-) Thank you!