Example for raw audio
Hello, and thanks for the code! I want to replicate the audio results from the paper, but the DeepMind repo does not have a VQ-VAE example for audio (see https://github.com/google-deepmind/sonnet/issues/141), and the audio setup described in the paper seems quite different from the CIFAR example:
We train a VQ-VAE where the encoder has 6 strided convolutions with stride 2 and window-size 4. This yields a latent space 64x smaller than the original waveform. The latents consist of one feature map and the discrete space is 512-dimensional.
Could you please include an example of using your code for audio?
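For concreteness, here is my rough reading of that encoder sketched in PyTorch. The stride, kernel size, layer count, 64x downsampling, and 512-entry codebook come from the quoted paragraph; the channel widths, padding, and ReLU activation are my guesses, and "one feature map" I read as a single 1-D latent sequence whose channel dimension is the code embedding size:

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Sketch of the paper's audio encoder: six strided 1-D convolutions
    (stride 2, kernel/window size 4), so the output is 2**6 = 64x shorter
    than the input waveform. Channel widths, padding, and the ReLU are my
    assumptions, not taken from the paper."""

    def __init__(self, hidden_channels: int = 128, latent_channels: int = 64):
        super().__init__()
        layers = []
        in_ch = 1  # raw mono waveform
        for i in range(6):
            out_ch = hidden_channels if i < 5 else latent_channels
            layers.append(nn.Conv1d(in_ch, out_ch, kernel_size=4, stride=2, padding=1))
            if i < 5:
                layers.append(nn.ReLU())
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time) -> (batch, latent_channels, time // 64)
        return self.net(x)

# Shape check: 64x temporal downsampling; each latent vector would then be
# quantized against a 512-entry codebook.
z = AudioEncoder()(torch.randn(2, 1, 16384))
print(z.shape)  # torch.Size([2, 64, 256])
```

Is this roughly the right structure?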
Why not take a look at AudioDec and Descript-Audio-Codec? They are open source.
Thank you @UkiTenzai . I checked the GitHub pages for both (https://github.com/facebookresearch/AudioDec and https://github.com/descriptinc/descript-audio-codec) and neither seems to do vocal cloning, i.e. neural voice transfer, right? That's what I would like to do with the VQ-VAE.
Sorry, AudioDec and DAC are for compression. You could try SpeechTokenizer (https://github.com/ZhangXInFD/SpeechTokenizer/), which uses a VQ-VAE and can be used for zero-shot voice conversion (VC). Although many similar VQ-VAE models have since surpassed it, they all basically build on it, so it is worth learning this one first.
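The key idea is that SpeechTokenizer's first RVQ quantizer captures content (semantic tokens) while the remaining quantizers capture timbre, which is what enables zero-shot VC. A minimal tokenization sketch follows, based on my reading of the repo's README; the file paths are placeholders and you should double-check the exact call signatures and attribute names against the repo:

```python
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

# Placeholders: config and checkpoint come from the repo's releases.
model = SpeechTokenizer.load_from_checkpoint('config.json', 'SpeechTokenizer.pt')
model.eval()

wav, sr = torchaudio.load('speech.wav')  # placeholder input file
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)
wav = wav[:1].unsqueeze(0)  # keep one channel -> (1, 1, time)

with torch.no_grad():
    codes = model.encode(wav)  # (n_q, batch, time) discrete RVQ codes

semantic_tokens = codes[:1]  # first quantizer: content, "what is said"
timbre_tokens = codes[1:]    # remaining quantizers: speaker timbre etc.

# Resynthesize from all codes; a VC system would keep the source's
# semantic tokens and replace the timbre tokens with ones predicted
# for the target speaker.
recon = model.decode(codes)
```

Note this only gives you the tokenizer; a full VC pipeline needs an additional model on top of these tokens.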
Thank you. I checked the repo and it doesn't mention vocal cloning either, and an online search for SpeechTokenizer and vocal cloning didn't turn up any applications, so I wouldn't know where to start. Could you please point me to an application or sample code that uses SpeechTokenizer for neural voice transfer?