IMS-Toucan
Is there a way to change the Speaker Embedding layer to other models?
Hi, is there any chance we can change the Speaker Embedding layer from the current SpeechBrain ECAPA-TDNN and SpeechBrain x-vector to some other model, like the speaker embedding model from Coqui TTS?
With the current model, the output voice gender is sometimes female even when the reference audio is male, so I want to try some other speaker embedding models too.
Also, do we need to change the sample rate of the reference file to 16 kHz before passing it to the tts.set_utterance_embedding(path_to_reference_audio=reference) function?
Thanks
Hi! I never had issues with the speaker embeddings we are currently using sounding incorrect on the masculinity/femininity scale. An earlier version used an instance of the same GE2E loss that Coqui is using, which can be found here: https://github.com/yistLin/dvector
I found, however, that the combination of ECAPA and x-vector worked slightly better in my experiments. It is fairly easy to exchange the speaker embedding: all you need to change is the interface to the speaker embedding model here: https://github.com/DigitalPhonetics/IMS-Toucan/blob/2cd5d893639e8d4bfa9acffa09a519b37a908768/Preprocessing/ProsodicConditionExtractor.py#L11 and the dimensionality of the new speaker embedding here: https://github.com/DigitalPhonetics/IMS-Toucan/blob/2cd5d893639e8d4bfa9acffa09a519b37a908768/InferenceInterfaces/InferenceArchitectures/InferenceFastSpeech2.py#L72 and here: https://github.com/DigitalPhonetics/IMS-Toucan/blob/2cd5d893639e8d4bfa9acffa09a519b37a908768/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/FastSpeech2.py#L100
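For illustration, such a replacement interface could look roughly like the sketch below. It assumes a TorchScript d-vector model along the lines of the repo linked above; the checkpoint path, the embed_utterance call and the mel parameters are assumptions, not actual Toucan code, and the embedding dimensionality in the two FastSpeech2 files linked above would then have to be changed from 704 to 256.

```python
import torch
import torchaudio


class CustomProsodicConditionExtractor:
    """Sketch of a drop-in speaker embedding interface using a d-vector model.
    The checkpoint, its embed_utterance call and the mel parameters are
    assumptions; adjust them to whatever embedding model you actually use."""

    def __init__(self, sr=16000, device=torch.device("cpu")):
        self.device = device
        # hypothetical path to a TorchScript d-vector checkpoint
        self.dvector = torch.jit.load("dvector.pt", map_location=device).eval()
        self.wave_to_mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sr, n_fft=1024, hop_length=256, n_mels=40).to(device)

    def extract_condition_from_reference_wave(self, wave):
        wave = torch.as_tensor(wave, dtype=torch.float32, device=self.device)
        with torch.no_grad():
            mel = self.wave_to_mel(wave).transpose(0, 1)   # (frames, n_mels)
            embedding = self.dvector.embed_utterance(mel)  # (256,) for this model
        return embedding.squeeze()
```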
When passing a reference sample to the set_utterance_embedding function, it can be in any sample rate. The function loads the sample from the file, at which point it knows the sampling rate and resamples it on its own to the rate that is needed.
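In other words, roughly the following happens under the hood (an illustration using torchaudio only, not the actual loader code; no manual resampling is needed before calling the function):

```python
import torchaudio

# set_utterance_embedding handles loading and resampling internally; 16 kHz is
# shown here because that is what the speechbrain embedding models expect.
wave, sr = torchaudio.load("reference.wav")   # any sample rate works
if sr != 16000:
    wave = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wave)
```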
While we found that the current speaker embedding setup works pretty well in terms of target voice similarity, this is actually one of the points we are currently working on improving. It's still in the middle of experimentation, but we hope to have the next version that includes this out by November.
Thanks for the response. I tried changing the dimensions and using the model from the above link with the released model, but got the following error:
Error(s) in loading state_dict for FastSpeech2: size mismatch for encoder.embedding_projection.0.weight: copying a param with shape torch.Size([128, 704]) from checkpoint, the shape in current model is torch.Size([128, 256]).
I think that since the released FastSpeech2 model was trained on an embedding dimension of 704, it will not work with the new dimension (256). Do you think there is any way to use the current model without retraining it completely from scratch, maybe some transfer learning possibility?
You're right, the pretrained model won't work with a different embedding function. You can train a simple monolingual model from scratch; that only takes about a day and can be done on a single GPU. It does however require a fair bit of RAM, since I keep the whole dataset in RAM, because that's usually not the bottleneck and it makes it easier to be fast.
Stitching weights from the checkpoint together with a different embedding function should also be doable, but I've never done that, so I can't tell you how. encoder.embedding_projection should be the only thing that needs to be exchanged, so I'm sure it's not too bad.
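Roughly, such stitching could look like the sketch below; the "model" key in the checkpoint and the way the new model is constructed are assumptions, so adjust them to how your checkpoint was actually saved.

```python
import torch


def stitch_pretrained_weights(new_model, checkpoint_path):
    """Load all pretrained weights except the speaker embedding projection.

    new_model: a FastSpeech2 instance built with the new embedding size (e.g. 256).
    checkpoint_path: path to a pretrained checkpoint; the "model" key below is
    an assumption about how the weights were stored.
    """
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    pretrained = checkpoint["model"]
    # drop the projection that expects the old 704-dimensional embedding
    filtered = {k: v for k, v in pretrained.items()
                if not k.startswith("encoder.embedding_projection")}
    missing, unexpected = new_model.load_state_dict(filtered, strict=False)
    # "missing" should only list the freshly initialised projection parameters
    return missing, unexpected
```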
I'm unsure however whether the model will learn to adapt to the new embedding function without first forgetting everything else and then re-learning it. I can also tell you from experiments that simply freezing everything but the embedding projection, by excluding the parameter groups from the optimizer, will likely not work when training with a new embedding function: I tried finetuning only parts of the model at some point, and the model seems to need the flexibility of changing downstream layers when an input projection is updated.
Thanks for the quick reply. I actually tried replacing the x-vector part of the embedding (512) with 2 instances of d-vector from the above model (2×256), and the results improved a little compared to x-vector. There are still some samples with the gender conversion issue, since the main gender-related information seems to come from the ECAPA part of the embeddings, but the overall quality of the audio improved, especially when using unseen languages.
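For reference, one possible wiring of this combination is sketched below; extract_ecapa and extract_dvector are hypothetical placeholders for the actual extractor calls, and the exact way the two d-vector instances are combined is an assumption. The point is that 192 (ECAPA) + 2 × 256 (d-vector) keeps the total at 704, so the released FastSpeech2 checkpoint can still consume the embedding.

```python
import torch


def combined_condition(wave, extract_ecapa, extract_dvector):
    # extract_ecapa / extract_dvector are hypothetical helpers standing in for
    # the actual embedding model calls.
    ecapa_emb = extract_ecapa(wave)       # (192,) speechbrain ECAPA-TDNN, unchanged
    dvec_emb_1 = extract_dvector(wave)    # (256,) d-vector, first instance
    dvec_emb_2 = extract_dvector(wave)    # (256,) d-vector, second instance
    return torch.cat([ecapa_emb, dvec_emb_1, dvec_emb_2], dim=0)  # (704,)
```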
I would like to know your thoughts on this combination of embeddings too.
Thanks
For how long did you finetune the model with the new speaker embeddings in place, and on which data? Changes in quality usually come from higher-quality finetuning data, but that usually comes with other drawbacks, like forgetting how to speak some languages when finetuning only on English.
I actually didn't fine-tune the model; I simply replaced the x-vector embeddings with d-vector embeddings in the following class:
IMS-Toucan/Preprocessing/ProsodicConditionExtractor.py, line 11 at commit 2cd5d89 (class ProsodicConditionExtractor)
I tried it with unseen languages, like some of the Indian languages, and saw an improvement in pronunciation quality compared to x-vector, at least for the test sentences I used. I don't know whether the quality improvement holds in every scenario, but I am still experimenting.
I did, however, fine-tune the Aligner and FastSpeech2 models on an in-house dataset of Indian-language data using the original embeddings, and the quality is decent. There are still errors in some of the pronunciations, but I think it will get better if I can train on a bigger dataset.
The Aligner seems to overfit too easily on my dataset: the loss started increasing sharply just after 5 epochs, so I stopped training there.
I also noticed a reduction in utterance cloner quality for the fine-tuned FastSpeech2 model, possibly due to the low number of speakers in my dataset.
You are also right that finetuning on new languages makes the model forget or disturb existing ones. I have noticed that English and the other old languages no longer work properly on the model finetuned with the new-language dataset.
I would like to hear your input on this too.
Thanks
If you exchange the speaker embedding function without finetuning the FastSpeech2 model on the new speaker embeddings, you might as well put in completely random numbers. The dimensions of the speaker embeddings are not interpretable; they are a latent representation, so when you exchange the speaker embeddings, the model does not know at all what to do with them.
The Aligner is not overfitting, the overall loss is increasing because the reconstruction loss is scaled up over the course of training. To see whether the aligner works well enough to be used, refer to the progress plots of the posteriograms in the directory you save the aligner to.
A reduction in utterance cloner quality can mean lots of things, 'quality' is very ambiguous in this case, but I assume you mean the similarity to the target speaker is reduced? That can very well be the case if you finetune on a small amount of speakers. Is it the same when you just set the utterance embedding and run the TTS normally without the full prosody cloning etc?
I will try finetuning on the d-vector embeddings too, then, and see the results.
I will continue training the Aligner and check the posteriograms instead of relying purely on the loss.
By reduction in quality, I mean both reduced similarity to the target speaker and a few words being mispronounced when using full prosody cloning. What do you think is the ideal amount of finetuning for the FastSpeech2 model when finetuning on around 8 hours of data?
Thanks
With 8 hours of data you don't even need to finetune; anything over 5 hours is usually good enough to train from scratch. So I'm not sure the ideal number of steps is even limited; my guess is that after about 30k steps you won't see further improvements. The checkpoint for finetuning is meant more for cases where you have less than 1 hour of data available.
Do those pronunciation errors you mention only appear when you run the utterance cloning that clones the utterance of the reference phone by phone, or do these problems also happen when you run the TTS normally without the prosody reference?
Thanks for the tip, I will try training a model from scratch. I tried finetuning on the d-vector embeddings and it solved most of my gender conversion issues, but as you mentioned before, the overall clarity and similarity to the source speaker are a bit better with the combination of ECAPA and x-vector embeddings. I still want to experiment more with the d-vector and see whether the quality can be improved.
The pronunciation errors only occur when I run the utterance cloning that clones the reference phone by phone; normal TTS without a prosody reference works fine. I think the errors are mainly due to prosody extraction issues: since I finetuned the FastSpeech2 model on a small number of speakers, it may be struggling to adapt to an unknown speaker.