glow-tts
Ideal size of gin_channels for multiple speaker embeddings?
Hi Jaehyeon, I modified your code to train multiple speakers and it seems to be training and inferring pretty well. Thanks for leaving the code in a state that makes this relatively easy!
Here are my hparams:
"n_speakers": 10,
"gin_channels": 16
I have nine speakers, but mistakenly didn't zero index them.
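For reference, in multi-speaker setups like this the speaker ID simply indexes an embedding table of shape `(n_speakers, gin_channels)`, which is why the 1-indexed IDs 1–9 still fit under `n_speakers: 10` (only row 0 goes unused). A minimal stdlib sketch, with a random stand-in for the learned table:

```python
import random

# Names mirror the hparams above; the table itself is a hypothetical stand-in
# for the embedding that the model would learn during training.
n_speakers, gin_channels = 10, 16
random.seed(0)
speaker_table = [[random.gauss(0, 1) for _ in range(gin_channels)]
                 for _ in range(n_speakers)]

def speaker_embedding(sid):
    """Return the gin_channels-dim conditioning vector for speaker sid."""
    assert 0 <= sid < n_speakers, "speaker ID must index into the table"
    return speaker_table[sid]

# Nine speakers, mistakenly 1-indexed: IDs 1..9 are still in range; row 0 unused.
for sid in range(1, 10):
    assert len(speaker_embedding(sid)) == gin_channels
```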
Is `gin_channels` too small? Should this be appreciably larger to capture the voice characteristics? 32? 64? ...?
Two of the speakers have four hours of data. Other speakers have far less. Oddly, the speaker with the smallest amount of data seems to have one of the clearest voices. Other speakers don't sound like their source at all.
I'm only at epoch 1400 so far, and I had to train from zero, so this has a long way to go. Should I abandon this run and increase `gin_channels`, or does it seem fair to proceed?
@echelon Hi echelon.
As I haven't tested on such small datasets, I can't give you a definitive answer. Sorry about that.
In my case, I didn't care much about the dimension, so I set `gin_channels` to be big enough. Therefore, I trained my model on LibriTTS with a 256-dimensional `gin_channels`.
I think a big `gin_channels` does no harm in your case, either.
I hope it helps :)
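One way to see why a big `gin_channels` does little harm: the speaker table grows only linearly with the embedding size, so even 256 dimensions adds a tiny number of parameters relative to the rest of the model. A back-of-envelope sketch, counting the table alone (not any downstream conditioning layers):

```python
# Parameter count of just the speaker-embedding table, for the hparams above.
n_speakers = 10
for gin_channels in (16, 32, 64, 256):
    params = n_speakers * gin_channels
    print(f"gin_channels={gin_channels:>3}: {params} table parameters")
```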
Thanks so much for the feedback! 256 dimensions performs much better as far as I can tell.
It's perhaps a little premature to report my findings, but I've performed the following:
- Train a 64-`n_speakers` model with 256 `gin_channels`. All channels are trained and validated on LJS sample data distributed evenly across the speaker tokens [0, 64), with 10% withheld for validation evenly across the same channels.
- After training all 64 channels on LJS, substitute an arbitrary number of low-numbered speakers with novel datasets. (I'm currently training 10 voices.) The remaining speaker channels must continue to be trained on LJS or the model loses fit.
I'll report back when I've had longer to let this run on my two 1080 Ti cards, but the early results already seem promising.
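The even 90/10 split described above could be sketched as follows, assuming a filelist of `(speaker_id, utterance)` pairs (the helper name is hypothetical, not from the repo):

```python
from collections import defaultdict

def split_by_speaker(items, val_frac=0.1):
    """Per-speaker train/validation split, so the held-out fraction is
    distributed evenly across speaker tokens rather than globally.
    items: list of (speaker_id, utterance) pairs."""
    by_speaker = defaultdict(list)
    for sid, utt in items:
        by_speaker[sid].append(utt)
    train, val = [], []
    for sid, utts in by_speaker.items():
        n_val = max(1, int(len(utts) * val_frac))  # hold out at least one
        val += [(sid, u) for u in utts[:n_val]]
        train += [(sid, u) for u in utts[n_val:]]
    return train, val

# Example: 2 speakers with 10 utterances each -> 1 validation item per speaker.
items = [(s, f"utt{i}") for s in range(2) for i in range(10)]
train, val = split_by_speaker(items)
```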
Could you please make a small Google Colab notebook demonstrating how to add one more speaker, i.e. how to convert text to the new speaker's voice? Thanks!