StarGANv2-VC
How to disentangle style and speaker information?
I would like to transfer the speech style of one speaker to another speaker while preserving the target speaker's identity.
Do you have any advice on how to use it for emotional cross-speaker style transfer? I thought about adding an additional discriminator to classify speaker ID, but how should the domains be defined in that case?
Thanks
You can define the domains in terms of emotions instead of speakers. This way you can preserve the speakers but only convert emotions.
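For anyone trying this: switching the domains from speakers to emotions mostly comes down to relabeling the training list so the domain column holds an emotion index (and setting the number of domains to the number of emotions in the config). Here is a minimal sketch, assuming the repo's `path|domain` line format and ESD-style paths where the emotion appears as a folder name; the exact paths and label set are placeholders:

```python
# Hypothetical sketch: rewrite a StarGANv2-VC train list so the domain
# column is an emotion ID instead of a speaker ID. Assumes each line is
# "path|domain" and the emotion is encoded in the file path, e.g.
# ESD-style folders like .../0011/Angry/0011_000351.wav (assumption).

EMOTIONS = ["Angry", "Happy", "Neutral", "Sad", "Surprise"]  # assumed label set

def emotion_domain(path: str) -> int:
    """Map a wav path to its emotion domain index via its folder name."""
    for idx, emo in enumerate(EMOTIONS):
        if f"/{emo}/" in path:
            return idx
    raise ValueError(f"no emotion folder found in {path}")

def relabel(lines):
    """Rewrite 'path|speaker_id' lines as 'path|emotion_id'."""
    out = []
    for line in lines:
        path, _speaker = line.strip().split("|")
        out.append(f"{path}|{emotion_domain(path)}")
    return out

sample = [
    "Data/ESD/0011/Angry/0011_000351.wav|0",
    "Data/ESD/0012/Happy/0012_000702.wav|1",
]
relabeled = relabel(sample)
print(relabeled)
```

If you go this route, remember to also set `num_domains` in the config to `len(EMOTIONS)`, since the mapping network and discriminator branch per domain.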
Thanks. Defining the domains as emotions instead of speakers worked, but it sometimes corrupted speaker identity for specific emotion domains. I found an interesting paper on EVC based on StarGANv2-VC by Sony Research India: https://arxiv.org/pdf/2302.10536.pdf. They add a second encoder and a classifier for the speaker domain to get better disentanglement.
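One common way to implement that kind of speaker disentanglement (a sketch of the general idea, not necessarily the paper's exact architecture) is an adversarial speaker classifier on the style code with a gradient-reversal layer: the classifier tries to predict the speaker, while the reversed gradient pushes the style encoder to drop speaker information. The dimensions and layer sizes below are placeholders:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses (and scales) gradients."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lamb * grad, None

class SpeakerAdversary(nn.Module):
    """Predicts the speaker from the style code through gradient reversal,
    so minimizing its loss removes speaker cues from the style encoder."""
    def __init__(self, style_dim: int, n_speakers: int, lamb: float = 1.0):
        super().__init__()
        self.lamb = lamb
        self.net = nn.Sequential(
            nn.Linear(style_dim, 128), nn.ReLU(),
            nn.Linear(128, n_speakers),
        )

    def forward(self, style_code):
        return self.net(GradReverse.apply(style_code, self.lamb))

# Usage sketch (style_dim=64, n_speakers=10, batch of 8 are placeholders):
adv = SpeakerAdversary(style_dim=64, n_speakers=10)
style = torch.randn(8, 64, requires_grad=True)
logits = adv(style)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (8,)))
loss.backward()
print(logits.shape)  # torch.Size([8, 10])
```

In training, this loss would be added to the style encoder's objective alongside the existing StarGANv2-VC losses; the paper's second-encoder approach differs in detail, so treat this purely as a starting point.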
Maybe that's because the same speaker utters too many sentences with the same emotion?
@yl4579 Hi, thanks for this project. Should each emotion domain contain utterances from a single speaker or from many speakers?
> You can define the domains in terms of emotions instead of speakers. This way you can preserve the speakers but only convert emotions.
@CONGLUONG12 The domains should contain multiple speakers. You can refer to https://arxiv.org/pdf/2302.10536.pdf for more details. This is a good example of how to modify StarGANv2-VC for emotion conversion.
@yl4579 Thank you very much. In your demo, you chose a speaker with a specific emotion. If I instead pick another speaker from the training set (call them speaker A) as the target, will the output carry that emotion with speaker A's timbre?
@CONGLUONG12 Probably yes, if speaker A has samples with similar emotions in the training set; otherwise it might not work.
Hey there! I made something similar for my MSc degree in AI, starting from @yl4579's great implementation. Take a look Here for some hints.