                        add VITS model
What does this PR do?
Adds the VITS model for text-to-speech, in particular to support the MMS-TTS checkpoints (which use the same model architecture but a different tokenizer).
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Notes about the tokenizer:
- This is not the VITS tokenizer but the one for MMS-TTS.
- The vocab doesn't have padding (or unknown) tokens in it, but uses token_id 0 for this. That breaks with the HF tokenizers because they split the input text on the padding token, so if I set pad_token_id = 0, the letters that token_id 0 corresponds to disappear from the text.
- To fix this issue, I'm adding <pad> and <unk> to the vocab, but then in the model we set such token_ids to 0 before feeding the input into the first layer. It's a bit hacky; ideas for a nicer solution are appreciated.
- The tokenizer also inserts an additional token_id 0 in between every token (illustrated in the sketch below). No idea why, but that's how it works.
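To make the pad/unk workaround and the interleaving concrete, here is a minimal sketch. It is not the code in this PR; the tiny vocab and the helper are made up purely for illustration:

import torch

# Hypothetical toy vocab: note that id 0 is a real character, not a padding token.
vocab = {"k": 0, "h": 1, "e": 2, "l": 3, "o": 4, "<pad>": 5, "<unk>": 6}

def encode(text):
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text]
    # Insert token_id 0 in between every token, as described above.
    interleaved = []
    for i in ids:
        interleaved += [i, 0]
    return interleaved[:-1]

input_ids = torch.tensor([encode("hello")])
# Workaround from the notes above: before the first layer, map the extra
# <pad>/<unk> ids back to 0 so they never reach the embedding table.
input_ids = torch.where(input_ids >= 5, torch.zeros_like(input_ids), input_ids)
print(input_ids)  # tensor([[1, 0, 2, 0, 3, 0, 3, 0, 4]])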
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
This is now ready for a first review.
Two checkpoints are currently available:
- https://huggingface.co/Matthijs/mms-tts-eng
- https://huggingface.co/Matthijs/mms-tts-nld
Small usage example:
from transformers import VitsMmsTokenizer, VitsModel
import torch
tokenizer = VitsMmsTokenizer.from_pretrained("Matthijs/mms-tts-eng")
model = VitsModel.from_pretrained("Matthijs/mms-tts-eng")

inputs = tokenizer(text="Hello, my dog is cute", return_tensors="pt")
outputs = model(inputs["input_ids"])
speech = outputs.audio
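To actually listen to the result, a short follow-up; scipy and the 16 kHz rate are assumptions here (the rate matches the MMS-TTS sampling rate discussed further down):

import scipy.io.wavfile

# MMS-TTS checkpoints generate 16 kHz audio; write the waveform out as a mono WAV file.
scipy.io.wavfile.write("speech.wav", rate=16000, data=speech.detach().squeeze().numpy())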
The current model is the MMS-TTS version, not the original VITS version. The conversion scripts can handle both, but for original VITS support the tokenizer is still missing.
Still needs to be done:
- tests
- tokenizer for actual VITS
@Vaibhavs10 For this review, could you in particular verify that the names of the layers in the flow modules etc. make sense? Thanks!
Some of the MMS-TTS checkpoints require the uroman tool from https://github.com/isi-nlp/uroman to convert the input script into the Latin alphabet. Since this is a separate Perl script, it is not included in Transformers, and the user has to run uroman.pl themselves before using the tokenizer (see the sketch below).
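A hedged sketch of that preprocessing step, assuming the uroman repository has been cloned locally; the path below is a placeholder, the Korean text is just an example of a non-Latin script, and the tokenizer object is reused from the usage example above:

import subprocess

def uromanize(text, uroman_path="uroman/bin/uroman.pl"):
    # Pipe the text through the uroman Perl script, which reads stdin and
    # writes the romanized text to stdout.
    result = subprocess.run(
        ["perl", uroman_path],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

romanized = uromanize("안녕하세요")
inputs = tokenizer(text=romanized, return_tensors="pt")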
I'm not too sure why I'm asked for a review here as all comments from @sanchit-gandhi are being ignored.
No they aren't?! I've integrated most of his suggestions and replied with counterarguments otherwise.
Tokenizer can now handle both the original VITS models (which require phonemization) and the MMS-TTS models.
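For context, a hedged sketch of what phonemization for the original VITS checkpoints typically looks like; the phonemizer package with the espeak backend is an assumption here, not necessarily what the tokenizer in this PR calls internally:

from phonemizer import phonemize

# Convert graphemes to IPA phonemes, keeping punctuation and stress marks,
# which is what the original VITS pipeline expects as input.
phonemes = phonemize(
    "Hello, my dog is cute",
    language="en-us",
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)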
Hey @sgugger / @amyeroberts - this one is ready for a review! We've got one open discussion around variable namings: https://github.com/huggingface/transformers/pull/24085#discussion_r1243884355
But otherwise the comments have been resolved and the code cleaned up. Please direct any comments / suggestions to me, as I'll be taking over this PR for the rest of the integration.
Would be really great to get your review here @amyeroberts! We're aiming to feature this model in the next unit of the audio transformers course 🤗 https://github.com/huggingface/audio-transformers-course/pull/61
This is ready for a second look @amyeroberts
It would be awesome to get a second look here @amyeroberts before you go on leave!
I just installed transformers from this branch, but I'm having some issues both with the provided examples and with the course code. Here is a minimal reproduction https://colab.research.google.com/drive/1nyCvTpAhS89_LgY2JdxSeCSBCbhzMWC3?usp=sharing
- With example from https://hf.co/learn/audio-course/chapter6/pre-trained_models#massive-multilingual-speech-mms
from transformers import VitsModel, VitsTokenizer
import torch
model = VitsModel.from_pretrained("Matthijs/mms-tts-deu")
tokenizer = VitsTokenizer.from_pretrained("Matthijs/mms-tts-deu")
text_example = (
    "Ich bin Schnappi das kleine Krokodil, komm aus Ägypten das liegt direkt am Nil."
)
inputs = tokenizer(text_example, return_tensors="pt")
input_ids = inputs["input_ids"]
with torch.no_grad():
    outputs = model(input_ids)
speech = outputs.audio[0]
This fails with "index out of range" in the embedding lookup:
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
- With example from docs
from transformers import VitsTokenizer
tokenizer = VitsTokenizer.from_pretrained("sanchit-gandhi/mms-tts-eng")
inputs = tokenizer(text="Hello, my dog is cute", return_tensors="pt")
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
Hey @osanseviero - for the example in the course, did you pip install from the specific commit ID listed in the course instructions? The structure of the weights has changed, so the latest code isn't compatible with the weights pushed under the repo id "Matthijs/mms-tts-deu". So to use the latest commit, we need to use an updated version of the weights. The API is also a bit different, since this is still a WIP PR. For example, the .audio return field has now been replaced by .waveform. I think the best thing would be to wait until this PR gets its final reviews and is merged before committing to example use cases! Hopefully it's not long now!
Thanks for highlighting the tokenizer issue - will take a look at why that's failing! That's indeed a bug that needs to be fixed before merge (shouldn't block the next review though!)
All points addressed, so this is ready for a final review @amyeroberts. Thanks for your in-depth reviews here - the PR looks in pretty good shape!
Hey @amyeroberts - to have full compatibility with the text-to-audio pipeline class, we need to indicate the sampling_rate of the predicted audio waveforms in the model config:
https://github.com/huggingface/transformers/blob/2be8a9098e06262bdd5c16b5e8a70f145df88e96/src/transformers/pipelines/text_to_audio.py#L82
The sampling_rate corresponds to the sampling rate of the target audio that the model was trained on. It cannot be determined in any way other than from the value in the model's original config. MMS-TTS models use a sampling rate of 16 kHz and original VITS models use 22 kHz, but otherwise their configs are the same. The user needs to know the sampling rate the model generates at in order to play the audio back correctly; otherwise they are prone to silent errors. IMO adding it as an attribute of the main model class should suffice here:
https://github.com/huggingface/transformers/blob/ff3b08c3b2b5b33651f30356e634a5efca1c5f2a/src/transformers/models/vits/modeling_vits.py#L1374
Note that we cannot just add the sampling_rate to the config without also referencing it in the modelling file; the CI does not allow this:
https://app.circleci.com/pipelines/github/huggingface/transformers/71815/workflows/c0[…]cb-a064-7c1019e03630/jobs/904373/parallel-runs/0/steps/0-116
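A rough sketch of the shape being discussed; the class bodies below are illustrative stand-ins, not the actual configuration or modelling code:

# sampling_rate lives in the config and is surfaced on the model, so the
# text-to-audio pipeline can read it to know the playback rate of the waveform.
class VitsConfig:
    def __init__(self, sampling_rate=16_000, **kwargs):
        # 16 kHz for the MMS-TTS checkpoints, 22 kHz for the original VITS ones.
        self.sampling_rate = sampling_rate

class VitsModel:
    def __init__(self, config):
        self.config = config
        # Referencing the attribute in the modelling code keeps the config
        # checker happy (see the CI failure linked above).
        self.sampling_rate = config.sampling_rate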
cc @ylacombe
As discussed offline with @amyeroberts, we'll add it as an allowed attribute in the config checker: https://github.com/huggingface/transformers/pull/24085/commits/8b01633bccd298d3f9ff8f628b75336202fc53c4
Hi @hollance, thank you for adding this model to transformers 🤗.
There is a failing test:
python3 -m pytest -v tests/models/vits/test_modeling_vits.py::VitsModelTest::test_initialization
I have skipped it on the main branch. Would you like to help us investigate this test if you have some bandwidth? Otherwise we can take it on our side too.
If you decide to take a look, you have to remove the following line
https://github.com/huggingface/transformers/blob/ab8cba824e3887d90cb9f4d5866fde9243f2c9fe/tests/models/vits/test_modeling_vits.py#L172
so the test will be collected and run by pytest.
Let me know :-) Thank you!