MultiSpeaker setup

Open rishikksh20 opened this issue 2 years ago • 6 comments

Have you tried this in a multi-speaker setup?

rishikksh20 avatar Aug 31 '22 12:08 rishikksh20

@bshall Can we also replace the Encoder and Decoder with Transformers?

rishikksh20 avatar Oct 26 '22 08:10 rishikksh20

Hi @rishikksh20, sorry about the delay on this. I only noticed this issue now.

I have tried a multi-speaker setup (about 10 speakers) using one-hot codes for each speaker. It works pretty well, but I think there is a small degradation compared to the single-speaker model. In my experience, fine-tuning the acoustic model on a small amount of target data seems to work better. I haven't experimented with using speaker embeddings for a zero-shot model though, so I can't comment on how well it performs in that setting.
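
For readers unfamiliar with one-hot speaker conditioning, here is a minimal sketch of the general idea. It is not the code used in the experiment above. It assumes the speaker code is simply appended to every frame of soft units before the acoustic model sees them. The function names, the toy 4-dimensional units, and the choice of concatenation are all illustrative assumptions.

```python
# Illustrative sketch only: plain Python lists stand in for tensors.
# Assumption: the one-hot speaker code is concatenated to every frame.

def one_hot(speaker_id, num_speakers):
    """Return a one-hot list with a 1.0 at index `speaker_id`."""
    if not 0 <= speaker_id < num_speakers:
        raise ValueError("speaker_id must be in [0, num_speakers)")
    return [1.0 if i == speaker_id else 0.0 for i in range(num_speakers)]


def condition_on_speaker(units, speaker_id, num_speakers):
    """Append the same one-hot speaker code to each frame of `units`."""
    code = one_hot(speaker_id, num_speakers)
    return [frame + code for frame in units]


# Toy input: 3 frames of 4-dimensional "soft units" (hypothetical sizes).
units = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
conditioned = condition_on_speaker(units, speaker_id=2, num_speakers=10)
```

Each frame grows by `num_speakers` dimensions, so the input layer of the acoustic model would need to be widened accordingly. A learned embedding table would be the usual alternative to a raw one-hot code.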

I'd imagine that using Transformers would be fine. I don't think such heavy machinery is required though. I have done some experiments training HiFi-GAN directly on the soft units (augmented with the pitch contours) and this seems to work well. It also simplifies the pipeline, since it makes the acoustic model unnecessary.
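
As a rough illustration of what "augmenting soft units with pitch contours" could look like, here is a minimal sketch. The comment above does not say how the pitch was combined with the units, so everything here is an assumption: per-frame concatenation, a log-F0 value plus a voiced/unvoiced flag, unvoiced frames marked by an F0 of 0 Hz, and 256-dimensional units. The function names are hypothetical.

```python
import math

# Illustrative sketch only: plain Python lists stand in for tensors.
# Assumptions: unvoiced frames have F0 == 0 Hz, and the pitch features
# are concatenated to each frame rather than added to it.

def pitch_features(f0_hz):
    """Map each F0 value (Hz) to [log-F0, voiced flag]."""
    feats = []
    for f0 in f0_hz:
        voiced = f0 > 0.0
        feats.append([math.log(f0) if voiced else 0.0, 1.0 if voiced else 0.0])
    return feats


def augment_with_pitch(units, f0_hz):
    """Concatenate per-frame pitch features to per-frame soft units."""
    if len(units) != len(f0_hz):
        raise ValueError("units and f0_hz must have the same number of frames")
    return [frame + p for frame, p in zip(units, pitch_features(f0_hz))]


# Toy input: 4 frames of 256-dimensional units and a matching F0 track.
units = [[0.0] * 256 for _ in range(4)]
f0_hz = [0.0, 110.0, 220.0, 0.0]
augmented = augment_with_pitch(units, f0_hz)
```

With this layout the vocoder's input channel count would grow from 256 to 258. The F0 track also has to be extracted at, or resampled to, the same frame rate as the units.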

bshall avatar Oct 27 '22 09:10 bshall

@bshall Could you please tell us more about the HuBERT-to-HiFi-GAN experiments? Which HiFi-GAN parameters should be changed? Did you use 256 dimensions, as in HuBERT, or did you retrain HuBERT with 128 dimensions? How did you augment the soft units with pitch contours: somewhere in the DataLoader, or in the Generator or Discriminator, with the pitch passed through nn.Embedding? Did you concatenate the pitch contours with the soft units or add them?

juliakorovsky avatar Nov 08 '22 12:11 juliakorovsky

@rishikksh20 Hi, did you try soft units in a multi-speaker setup for any-to-many voice conversion? If so, did you succeed? I'm currently trying plain one-hot codes for a multi-speaker setup, but I'm suffering from speaker identity degradation, even though the resulting speech is quite intelligible.

seastar105 avatar Feb 08 '23 05:02 seastar105

Yes, I see the same thing in my training.

rishikksh20 avatar Feb 08 '23 08:02 rishikksh20

@rishikksh20 @seastar105 Have you tried VITS/YourTTS as the acoustic model + vocoder in the multi-speaker setting?

MuruganR96 avatar Feb 08 '23 08:02 MuruganR96