
AutoVC on large-scale data?

Open iyah4888 opened this issue 5 years ago • 2 comments

Hi @auspicious3000, thanks for sharing your research code. I've spent a lot of time getting the training code to work (mostly because of input hyperparameter issues that others are also struggling with). I'm currently working on the VoxCeleb2 dataset (nearly 6,000 speakers, with 1M utterances). I could not get the model to train with an MSE loss, but with an L1 loss I managed to get the following auto-encoding reconstruction.

[Original] image [Voice converted with another speaker embedding] image

The problem is that while the network learns auto-encoding, at test time it does not generalize to voice conversion. It just performs auto-encoding and nothing else. The pair above is a voice conversion example, and the fundamental frequency looks very similar in both mel-spectrograms.

Could you share your experience or any comments? I'd appreciate it.

iyah4888 avatar Sep 01 '19 18:09 iyah4888

For a different dataset you need to retune the bottleneck. Also, feel free to try different encoder and decoder architectures. The paper proposes a framework rather than specific architectures. VoxCeleb2 is not very clean. For example, if the channel effects and background noise differ across recordings, you need to disentangle them by conditioning on that information. Otherwise, the model will not achieve disentangled representations for conversion. I suggest you start with a clean dataset such as VCTK.
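As a rough illustration of what retuning the bottleneck can involve, here is a minimal sketch that assumes the bottleneck is controlled by two knobs: a code dimension and a temporal down-sampling factor. The names `dim_neck` and `freq`, the frame count, and the random features are hypothetical, and this is not the repository's implementation. It only shows how a larger factor leaves fewer codes to carry content.

```python
import numpy as np


def downsample_codes(features, freq):
    """Keep one frame out of every `freq` along the time axis."""
    return features[::freq]


def upsample_codes(codes, freq, num_frames):
    """Repeat each kept frame `freq` times, then trim to the original length."""
    return np.repeat(codes, freq, axis=0)[:num_frames]


# Hypothetical sizes, chosen only for illustration.
num_frames, dim_neck = 128, 4
rng = np.random.default_rng(0)
features = rng.normal(size=(num_frames, dim_neck))

for freq in (16, 32):
    codes = downsample_codes(features, freq)
    restored = upsample_codes(codes, freq, num_frames)
    # A larger factor keeps fewer codes, i.e. a narrower bottleneck in time.
    print(f"freq={freq}: kept {codes.shape[0]} codes, restored shape {restored.shape}")
```

A narrower bottleneck (smaller `dim_neck` or larger `freq`) leaves less room for speaker information to leak through, at the cost of reconstruction quality, which is why the setting may need to change with the dataset.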

auspicious3000 avatar Sep 03 '19 11:09 auspicious3000

Thanks for sharing. In my experience, the temporal resolution of the bottleneck feature (determined by the mel-spectrogram hop length and the down-sampling frequency) seems to be important for the encoder to disentangle. When I extracted mel-spectrograms with a hop length of 250, a down-sampling frequency of 32 gave better conversion performance than a down-sampling frequency of 16. Currently, I extract mel-spectrograms with a hop length of 200 and have increased the down-sampling frequency to 40, but the conversion performance is still worse than with a hop length of 250 and a down-sampling frequency of 32.
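To make the temporal resolution concrete, the sketch below computes how much audio one bottleneck code spans for the three settings mentioned above. The helper name `seconds_per_code` is hypothetical, and the 16 kHz sample rate is an assumption (it is not stated in this thread).

```python
def seconds_per_code(hop_length, downsample_factor, sample_rate=16000):
    """Time span in seconds covered by one bottleneck code.

    sample_rate=16000 is an assumption, not a value given in the thread.
    """
    return hop_length * downsample_factor / sample_rate


# (hop length, down-sampling frequency) pairs from the comment above.
settings = [(250, 32), (250, 16), (200, 40)]

for hop, freq in settings:
    samples = hop * freq
    span = seconds_per_code(hop, freq)
    print(f"hop={hop}, freq={freq}: {samples} samples per code, {span:.3f} s")
```

Regardless of the sample rate, the (250, 32) and (200, 40) settings both cover 8000 samples per code, so the span per code alone does not account for the difference between them. What does differ is the frame resolution: 250 versus 200 samples per mel frame.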

light1726 avatar Sep 11 '19 03:09 light1726