Cross-Speaker-Emotion-Transfer
Synthesis with a speaker outside of RAVDESS
Hello author, firstly, thank you for providing this repo, it is really nice. I have a question:
- I downloaded CMU data for a single speaker (100 audios), built a speaker embedding vector from it, and synthesized with that embedding, but the performance is not good: I cannot make out any words. (A sketch of the embedding step I used is below.)
- Do we need to fine-tune the deep-speaker model to generate speaker embeddings for my data?
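For clarity, a minimal sketch of the embedding step described above, assuming a generic `embed_utterance` function as a stand-in for the actual deep-speaker inference call:

```python
import numpy as np

def build_speaker_embedding(wav_paths, embed_utterance):
    # Average per-utterance d-vectors into one unit-norm speaker vector.
    # `embed_utterance` is a placeholder for the embedding model's inference.
    embs = np.stack([embed_utterance(p) for p in wav_paths])  # (N, d)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)
```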
Thank you
Hi @hathubkhn , thanks for your attention.
- There could be various reasons for such a case. Could you please share the tensorboard logs and some sample audios with mel-spectrograms?
- It might be necessary, but it depends on the number of speakers and their features.
Hi,
- While waiting for your response, I tried fine-tuning on LJSpeech data. I can synthesize the sentence, but the quality is not high. I will attach my mel-spectrogram below; please help me figure out how to improve it.
- I want to use your repo for voice cloning, but I am not sure it can do that as-is, so, based on YourTTS, I added another loss for speaker similarity and trained from scratch. Is this feasible? (A sketch of such a loss is at the end of this comment.)
Here is my training from scratch, adding the speaker loss (SCL, speaker consistency loss) and training with LJSpeech:
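For reference, a minimal sketch of a YourTTS-style speaker consistency loss, assuming a frozen speaker encoder; `spk_encoder` and the variable names are placeholders:

```python
import torch.nn.functional as F

def speaker_consistency_loss(spk_encoder, wav_real, wav_gen):
    # SCL as in YourTTS: pull the speaker-encoder embeddings of the
    # ground-truth and generated audio together via cosine similarity.
    e_real = spk_encoder(wav_real)  # (B, d)
    e_gen = spk_encoder(wav_gen)    # (B, d)
    return -F.cosine_similarity(e_real, e_gen, dim=-1).mean()

# total_loss = tts_loss + lambda_scl * scl, with lambda_scl tuned empirically
```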
Ah, so sorry for the late response. I thought I had replied to your comments.
- It might be due to the lightweight conv. Replacing it with a normal transformer block should resolve the quality issue.
- Yes, if the lambda (the weight of each loss) is carefully assigned based on some experiments.
@keonlee9420,
Regarding point 1 of your last reply, about a potential solution to the quality issue: can you provide an example of replacing the lightweight conv with a normal transformer block? Thanks
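For illustration, a minimal sketch of such a replacement in PyTorch: each lightweight-conv block is swapped for a standard Transformer encoder layer (multi-head self-attention plus a feed-forward network). The class name and hyperparameters here are assumptions, and the repo's actual blocks carry extra conditioning, so the wiring would need to be adapted:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Hypothetical drop-in for a lightweight-conv block: vanilla
    self-attention + feed-forward, as in a standard FFT block."""

    def __init__(self, d_model=256, n_heads=2, d_ff=1024, dropout=0.1):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_ff,
            dropout=dropout,
            batch_first=True,  # inputs are (B, T, d_model)
        )

    def forward(self, x, padding_mask=None):
        # padding_mask: (B, T) bool, True at padded frames
        return self.layer(x, src_key_padding_mask=padding_mask)
```

Stacking several of these in place of the conv blocks yields a FastSpeech2-style encoder; self-attention's global receptive field may be why it helps quality here.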