CaraDuf
Hi, I fine-tuned the Toucan Meta model for 1k steps on a reduced dataset to understand the difference between Avocodo and BigVGAN. Here are the spectrograms: apart from the 12kHz...
Hi, I tried to fine-tune the new Meta model on an 87-sample dataset in French that I have already used several times, but now the results are very bad. I...
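A sketch of how such a side-by-side comparison can be plotted; the two file names are placeholders for the Avocodo and BigVGAN outputs:

```python
# Sketch: plot the spectrograms of two vocoder outputs side by side.
# "avocodo.wav" and "bigvgan.wav" are placeholder file names.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for ax, path in zip(axes, ["avocodo.wav", "bigvgan.wav"]):
    wave, sr = librosa.load(path, sr=None)  # keep the native sample rate
    spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(wave)), ref=np.max)
    img = librosa.display.specshow(spec_db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(path)
fig.colorbar(img, ax=axes, format="%+2.0f dB")
plt.show()
```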
Hi, I want to check what the scorer has to say about my dataset and why it is keeping only 77 samples out of 98 (which all sound ok to...
Hi, In a previous [answer](https://github.com/DigitalPhonetics/IMS-Toucan/issues/109#issuecomment-1475011361) you wrote that you were looking for ways to improve training speed even though you were already satisfied with Toucan's training performance. Have you ever...
Hi, I tried `run_utterance_cloner` and noticed very bad results when the transcription text does not match the reference audio. In another project I tried (Coqui), which also does voice...
Hi, Given a target speaker dataset, roughly how many fine-tuning steps should be run? [NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb) "recommends 1000 steps per minute of audio" for FastPitch...
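For concreteness, the NeMo rule of thumb boils down to simple arithmetic; whether the same ratio carries over to Toucan is exactly the open question:

```python
# NeMo's FastPitch rule of thumb: ~1000 fine-tuning steps per minute of audio.
# Whether this ratio transfers to Toucan is an assumption to be tested.
def recommended_steps(total_audio_seconds: float, steps_per_minute: int = 1000) -> int:
    return round(total_audio_seconds / 60 * steps_per_minute)

# e.g. roughly 5 minutes of target speaker audio
print(recommended_steps(5 * 60))  # 5000
```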
Hi, I merged all my single-speaker datasets into a bigger one and fine-tuned the Meta model on it. Now at inference the output sounds like a mixture of all...
Hi, Out of curiosity, I want to test BigVGAN. On their [page](https://github.com/NVIDIA/BigVGAN) they say that it accepts `.npy` as input. I browsed the code but could not find where the...
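For what it is worth, the `.npy` input is just a mel spectrogram saved with NumPy. A minimal sketch, assuming `mel` is the mel tensor produced before vocoding (the shape and file name below are only illustrations):

```python
# Minimal sketch: dump a mel spectrogram to .npy for BigVGAN's mel-input inference.
# `mel` is assumed to already be a [n_mels, frames] torch tensor; the values
# below are a stand-in, not real model output.
import numpy as np
import torch

mel = torch.randn(80, 400)  # placeholder for the acoustic model output
np.save("sample_0001.npy", mel.detach().cpu().numpy())
```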
Hi, When using the `read_texts` function, how long should the `speaker_reference` be, and what should it be like to give the best results? By "how long" I mean its duration in seconds...
Hi, I am running training in one remote terminal and running inference with the current model in another one. Sometimes I test the current model and to...
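One way to experiment with the duration question is to trim the same recording to different lengths before passing it as `speaker_reference`. A hypothetical helper (the 10 s default is just a value to try):

```python
# Hypothetical helper: trim a reference clip to a target duration so different
# lengths of the same recording can be compared as speaker_reference.
import soundfile as sf

def trim_reference(in_path: str, out_path: str, seconds: float = 10.0) -> str:
    audio, sr = sf.read(in_path)
    sf.write(out_path, audio[: int(seconds * sr)], sr)
    return out_path

# e.g. trim_reference("speaker_full.wav", "speaker_10s.wav", seconds=10.0)
```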
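One way to keep the quick test from interfering with the ongoing run, as a sketch (the checkpoint path is a placeholder for wherever your run writes its weights):

```python
# Sketch: copy the latest checkpoint and load the copy on CPU, so the quick
# test neither reads a file the trainer is still writing nor competes with it
# for GPU memory. The path below is a placeholder.
import shutil
import torch

CKPT = "Models/ToucanTTS_Finetune/best.pt"  # placeholder for your run's checkpoint
COPY = "/tmp/toucan_inference_snapshot.pt"

shutil.copyfile(CKPT, COPY)
state = torch.load(COPY, map_location="cpu")
```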