Example for raw audio
Hello, and thanks for the code! I want to replicate the audio results from the paper, but the DeepMind repo does not have a VQ-VAE example for audio (see https://github.com/google-deepmind/sonnet/issues/141), and the audio setup described in the paper seems quite different from the CIFAR example:
We train a VQ-VAE where the encoder has 6 strided convolutions with stride 2 and window-size 4. This yields a latent space 64x smaller than the original waveform. The latents consist of one feature map and the discrete space is 512-dimensional.
Could you please include an example of using your code for audio?
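For concreteness, here is my rough reading of that encoder sketched in PyTorch. The stride, kernel size, layer count, 64x downsampling, and 512-entry codebook come from the quoted paragraph; the channel widths, padding, and ReLU activation are my guesses, and "one feature map" I read as a single 1-D latent sequence whose channel dimension is the code embedding size:

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Sketch of the paper's audio encoder: six strided 1-D convolutions
    (stride 2, kernel/window size 4), so the output is 2**6 = 64x shorter
    than the input waveform. Channel widths, padding, and the ReLU are my
    assumptions, not taken from the paper."""

    def __init__(self, hidden_channels: int = 128, latent_channels: int = 64):
        super().__init__()
        layers = []
        in_ch = 1  # raw mono waveform
        for i in range(6):
            out_ch = hidden_channels if i < 5 else latent_channels
            layers.append(nn.Conv1d(in_ch, out_ch, kernel_size=4, stride=2, padding=1))
            if i < 5:
                layers.append(nn.ReLU())
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time) -> (batch, latent_channels, time // 64)
        return self.net(x)

# Shape check: 64x temporal downsampling; each latent vector would then be
# quantized against a 512-entry codebook.
z = AudioEncoder()(torch.randn(2, 1, 16384))
print(z.shape)  # torch.Size([2, 64, 256])
```

Is this roughly the right structure?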
Why not take a look at AudioDec and Descript-Audio-Codec? They are open source.
Thank you @UkiTenzai . I checked the GitHub pages for both (https://github.com/facebookresearch/AudioDec and https://github.com/descriptinc/descript-audio-codec) and neither seems to do vocal cloning, i.e. neural voice transfer, right? That's what I would like to do with the VQ-VAE.
Sorry, AudioDec and DAC are for compression. You could try SpeechTokenizer (https://github.com/ZhangXInFD/SpeechTokenizer/), which uses a VQ-VAE and can be used for zero-shot voice conversion (VC). Although many similar VQ-VAE models have since surpassed it, they all basically build on it, so it is worth learning this one first.
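The key idea is that SpeechTokenizer's first RVQ quantizer captures content (semantic tokens) while the remaining quantizers capture timbre, which is what enables zero-shot VC. A minimal tokenization sketch follows, based on my reading of the repo's README; the file paths are placeholders and you should double-check the exact call signatures and attribute names against the repo:

```python
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

# Placeholders: config and checkpoint come from the repo's releases.
model = SpeechTokenizer.load_from_checkpoint('config.json', 'SpeechTokenizer.pt')
model.eval()

wav, sr = torchaudio.load('speech.wav')  # placeholder input file
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)
wav = wav[:1].unsqueeze(0)  # keep one channel -> (1, 1, time)

with torch.no_grad():
    codes = model.encode(wav)  # (n_q, batch, time) discrete RVQ codes

semantic_tokens = codes[:1]  # first quantizer: content, "what is said"
timbre_tokens = codes[1:]    # remaining quantizers: speaker timbre etc.

# Resynthesize from all codes; a VC system would keep the source's
# semantic tokens and replace the timbre tokens with ones predicted
# for the target speaker.
recon = model.decode(codes)
```

Note this only gives you the tokenizer; a full VC pipeline needs an additional model on top of these tokens.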
Thank you. I checked the repo and it doesn't mention vocal cloning either, and an online search for SpeechTokenizer and vocal cloning didn't turn up any applications, so I wouldn't know where to start. Could you please point me to an application or sample code that uses SpeechTokenizer for neural voice transfer?