vits

Problems adding a new speaker

Open · JoanisTriandafilidi opened this issue 2 years ago · 1 comment

Hello! Thanks for the great work! I use VITS in various personal mini-projects, and I had an idea about adding new speakers to a multi-speaker model.

The essence of my idea is this:

  1. I trained a good multi-speaker model with 200 speakers.
  2. Using Speakernet, I extracted an embedding in the matching format for a new speaker.
  3. I want to add the new speaker to the existing multi-speaker model by appending this embedding. That is, emb_g.shape goes from [200, 192] to [201, 192]. I append the new embedding inside the utils.load_checkpoint function.
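Step 3 can be sketched as a small helper that appends one row to the speaker table in a loaded state dict. This is a framework-free sketch that assumes the table is stored under the key "emb_g.weight" and represents rows as plain lists of floats; `append_speaker` and `EMB_KEY` are hypothetical names, not part of the VITS code. On a real checkpoint the equivalent step would be `torch.cat([weight, new_row.unsqueeze(0)], dim=0)`.

```python
import copy

# Assumption: key of the speaker embedding table in the VITS state dict.
EMB_KEY = "emb_g.weight"


def append_speaker(state_dict, new_embedding):
    """Return (new_state_dict, new_speaker_id) with one extra row in the speaker table.

    Hypothetical helper: rows are lists of floats standing in for tensor rows.
    """
    table = state_dict[EMB_KEY]
    dim = len(table[0])
    if len(new_embedding) != dim:
        raise ValueError(f"expected a {dim}-dim embedding, got {len(new_embedding)}")
    new_state = copy.copy(state_dict)  # shallow copy; other entries stay shared
    new_state[EMB_KEY] = [list(row) for row in table] + [list(new_embedding)]
    # The new speaker's id is the old number of rows (0-based indexing).
    return new_state, len(table)


# Toy stand-in for a [200, 192] table.
state = {EMB_KEY: [[0.0] * 192 for _ in range(200)]}
state, sid = append_speaker(state, [0.1] * 192)
print(len(state[EMB_KEY]), len(state[EMB_KEY][0]), sid)  # 201 192 200
```

For the shapes to line up, the model presumably also has to be constructed with 201 speakers (the n_speakers value in the config), and inference has to request the new id (200 here); passing any of ids 0 to 199 simply selects an already trained voice.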

The model loads without problems. However, at inference, instead of the expected new (!) voice, I get one of the 200 already trained voices. Moreover, if I feed in a different embedding, I get a different voice from those 200. So I conclude that the model can, in principle, generate voices for artificially added speakers, but I can't get the voice to match the target.
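One way to probe this symptom, under the unconfirmed assumption that the rows of emb_g learned during training do not live in the same space as raw Speakernet outputs: rank the trained rows by cosine similarity to the new vector. If the voice you hear matches the top-ranked speaker, the model is effectively reading the new embedding as a nearby trained speaker. The helpers below (`cosine`, `nearest_speakers`) are hypothetical plain-Python names; on real tensors, `torch.nn.functional.cosine_similarity` does the same job.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest_speakers(table, query, k=3):
    """Return the k trained speaker ids most similar to `query`, as (id, score), best first."""
    scores = [(cosine(row, query), sid) for sid, row in enumerate(table)]
    scores.sort(reverse=True)
    return [(sid, score) for score, sid in scores[:k]]


# Toy 2-D table with three "trained" speakers.
table = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
result = nearest_speakers(table, [0.9, 0.1], k=2)
print([sid for sid, _ in result])  # [0, 2]
```

Comparing vector norms (trained rows vs. the new embedding) in the same way may also show whether the new vector is on a very different scale from what the model saw during training.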

Could you please tell me how I can solve this problem? Why does the model produce a different voice when it is given the new embedding?

JoanisTriandafilidi, Dec 18 '23 16:12