Add new speaker voice

Open marlon-br opened this issue 4 years ago • 5 comments

Hi Jaehyeon,

Could you please provide instructions on how to use the pretrained model and add a new speaker voice?

I have created a Google Colab notebook based on your work: https://github.com/marlon-br/glow-tts-colab Now I want to add the possibility of having more speaker voices.

marlon-br avatar Jun 09 '20 08:06 marlon-br

Add these two hparams:

"n_speakers": 10,
"gin_channels": 16     

I'm not sure what the ideal value for gin_channels is to get a rich embedding, and I asked in another thread.
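As a rough sketch of where those two values might go, here is a small Python helper that patches a config dictionary. Putting them under a `model` section is an assumption about how the JSON config is laid out, and the helper name `add_speaker_hparams` and the trimmed sample config are made up for illustration.

```python
import json

def add_speaker_hparams(config, n_speakers=10, gin_channels=16):
    """Return a copy of config with the two multi-speaker hparams added.

    Assumption: model hparams live under a "model" section of the config.
    """
    patched = json.loads(json.dumps(config))  # cheap deep copy for JSON-like data
    model = patched.setdefault("model", {})
    model["n_speakers"] = n_speakers
    model["gin_channels"] = gin_channels
    return patched

# Hypothetical, heavily trimmed config used only for illustration.
base = {"model": {"hidden_channels": 192}}
multi = add_speaker_hparams(base)
print(json.dumps(multi["model"], indent=2))
```

The original dictionary is left untouched, so the single-speaker config can still be used as-is.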

Your training data and validation CSVs should be in this format:

filename|numeric_speaker_id|transcript
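To sanity-check a filelist before training, something like the sketch below may help. It only assumes the pipe-separated format above; the function names and sample rows are invented for illustration, and the range check assumes speaker ids are used as indices into a table of `n_speakers` entries.

```python
def parse_filelist_line(line):
    """Split one 'filename|numeric_speaker_id|transcript' row into typed fields."""
    # maxsplit=2 keeps any '|' inside the transcript intact.
    filename, speaker_id, transcript = line.rstrip("\n").split("|", 2)
    return filename, int(speaker_id), transcript

def speaker_ids_in_range(rows, n_speakers):
    """Check that every speaker id falls in [0, n_speakers)."""
    return all(0 <= sid < n_speakers for _, sid, _ in rows)

# Hypothetical rows, not taken from any real dataset.
lines = [
    "wavs/spk0_0001.wav|0|Hello there.",
    "wavs/spk3_0042.wav|3|A second speaker says this.",
]
parsed = [parse_filelist_line(line) for line in lines]
print(speaker_ids_in_range(parsed, n_speakers=10))
```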

You'll need to swap out the loader:

-from data_utils import TextMelLoader, TextMelCollate 
+from data_utils import TextMelSpeakerLoader, TextMelSpeakerCollate       

You'll also need to change the forward function to accept the g speaker id parameter, and unpack the speaker ids from each batch when enumerating the loader.
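Roughly, the idea looks like the sketch below. It uses a stand-in function instead of the real network so it runs without PyTorch. The order of the five batch fields and the stand-in's signature are assumptions, so check them against what the speaker collate class and the model's forward actually use.

```python
def train_step(model, batch):
    """Unpack a five-field batch and pass the speaker ids to the model as g."""
    # Assumed field order: text, text lengths, mels, mel lengths, speaker ids.
    x, x_lengths, y, y_lengths, speaker_ids = batch
    return model(x, x_lengths, y, y_lengths, g=speaker_ids)

def fake_model(x, x_lengths, y, y_lengths, g=None):
    """Stand-in for the real network: reports whether speaker ids arrived."""
    return {"batch_size": len(x), "has_speaker_ids": g is not None, "g": g}

# Hypothetical batch with one utterance from speaker 7.
batch = ([[1, 2, 3]], [3], [[0.0, 0.1]], [2], [7])
out = train_step(fake_model, batch)
print(out["has_speaker_ids"])
```

The same two changes (one extra field unpacked, one extra keyword argument passed) would be needed anywhere else the loader is enumerated, such as the evaluation loop.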

echelon avatar Jun 10 '20 01:06 echelon

I meant without retraining the whole model again, only adding one more voice.

marlon-br avatar Jun 15 '20 13:06 marlon-br

Sorry for jumping in, could you please elaborate on the last part about changing the forward function? Thanks in advance!

dechubby avatar Sep 08 '20 08:09 dechubby

Hi @echelon, this information is really useful. I believe I've made the necessary changes you suggested. In my case I've set n_speakers = 24 and gin_channels = 256, and the rest of the parameters in base.json are unchanged. The training set has 9102 samples. I'm getting the runtime error below.

RuntimeError: Given groups=1, weight of size 256 448 3, expected input[1, 192, 89] to have 448 channels, but got 192 channels instead

Can you please advise what is going wrong here?
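One way to read the shapes in that error, offered as a guess rather than a diagnosis: 448 is exactly 192 + 256. That would fit a layer built to take the 192-channel hidden features concatenated with the 256-channel speaker embedding, but which received only the hidden features, for example if the speaker ids never reach forward as g. The variable names below are just labels for the numbers in the error and the settings above.

```python
hidden_channels = 192  # channels the layer actually received, per the error
gin_channels = 256     # speaker embedding size chosen in the settings above
expected_in = 448      # input channels the layer's weight expects, per the error

# The expected width matches hidden features plus speaker embedding.
matches_concat = hidden_channels + gin_channels == expected_in
missing_channels = expected_in - hidden_channels
print(matches_concat, missing_channels)
```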

ppanja avatar Jun 13 '21 22:06 ppanja

Hi @marlon-br, @dechubby, were you able to run in multi-speaker mode? Did you make any other changes apart from those mentioned by echelon? I'm hitting an issue that I'm not able to debug.

Any help will be really appreciated.

Regards, Prasanta

ppanja avatar Jun 14 '21 09:06 ppanja