[Help]: Questions about the tokenizers of vevo1.5
Thank you for your great work! I am now working on audio-to-audio voice conversion and have two questions regarding Vevo1.5.
- I would like to train a content tokenizer for audio, replacing the text tokens during inference. Does the vocab size remain 32, as stated in the paper?
- I am curious why the vocab size of the content-style tokenizer increased from 4096 (suggested in the paper) to 16384 (the Vevo1.5 setting), which would retain more timbre information, as mentioned in the paper.
@josephwong14wkh Thank you for raising such good questions!
A New Idea in Vevo1.5 for the Information Bottleneck
During the development of Vevo1.5, we refined Vevo's concept of the information bottleneck. Specifically, we posit that in the design of the tokenizer, the encoding of information is primarily influenced by two key factors: (1) the source hidden features (e.g., HuBERT for Vevo, and Whisper and Chromagram for Vevo1.5); and (2) the vector quantization (VQ) design, which encompasses both the frame rate (i.e., the downsampling rate of the source features) and the vocabulary size, as proposed in Vevo.
An intuitive interpretation is that the vocabulary size represents the spatial bottleneck width, while the frame rate corresponds to the temporal bottleneck width. Consequently, the bitrate (i.e., frame rate × log₂(vocabulary size)) may serve as a more meaningful metric than vocabulary size alone.
Regarding Q2:
In Vevo, we employ 50 Hz HuBERT tokens (vocabulary size = 4096) as the content-style tokens, yielding a bitrate of 50 × log₂(4096) = 600. For Vevo1.5, despite adopting a larger vocabulary size (16,384 compared to Vevo's 4,096), the frame rate is significantly lower (12.5 Hz), resulting in a bitrate of 12.5 × log₂(16,384) = 175, which is actually lower than that of Vevo.
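For concreteness, here is a quick back-of-envelope check of these numbers (a standalone snippet, not part of the Vevo codebase), using the bitrate formula above:

```python
import math

def bitrate(frame_rate_hz: float, vocab_size: int) -> float:
    """Bitrate of a discrete token stream: frame rate x log2(vocabulary size)."""
    return frame_rate_hz * math.log2(vocab_size)

print(bitrate(50, 4096))     # Vevo content-style tokens:    50 * 12 = 600 bits/s
print(bitrate(12.5, 16384))  # Vevo1.5 content-style tokens: 12.5 * 14 = 175 bits/s
```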
Note: The rationale behind reducing the frame rate was to shorten the sequence length during autoregressive (AR) prediction, thereby improving the speed and performance of the first-stage AR model.
Regarding Q1:
This remains an open question. We have not yet conducted systematic experiments to determine the optimal design for the content tokenizer in Vevo1.5. From a feature selection perspective, one could explore using only Whisper features (as opposed to the current approach in Vevo1.5, which combines Whisper and Chromagram features for content-style tokens).
From a VQ design standpoint, experimenting with a 6.25 Hz frame rate (which may align more closely with the frame rate of "text") could be insightful. However, my intuition is that a vocabulary size of 32 would be too small at that frame rate. Instead, it may be necessary to account for language-specific characteristics (e.g., the number of characters, syllables, or words in a given language), though further empirical analysis would be required to validate this intuition.
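As a rough illustration of this point, using the same bitrate formula as above (the vocabulary sizes here are arbitrary examples, not proposed settings):

```python
import math

def bitrate(frame_rate_hz: float, vocab_size: int) -> float:
    return frame_rate_hz * math.log2(vocab_size)

# Hypothetical content tokenizers at a near-"text" frame rate of 6.25 Hz.
for vocab in (32, 1024, 4096, 16384):
    print(f"6.25 Hz, vocab {vocab:>5}: {bitrate(6.25, vocab):5.1f} bits/s")

# vocab 32 yields only ~31 bits/s, far below the 175 bits/s of the Vevo1.5
# content-style tokens, which is one way to see why 32 may be too narrow
# a bottleneck at this frame rate.
```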
Can I say that Vevo1.5 does not support using audio alone as the AR input? I don't see a content tokenizer for audio provided.
Yes, we would need to train the content tokenizer ourselves in order to use audio as input to the AR model.
@RMSnow Thank you for your detailed explanation. Also, may I know why you set `model.coco.codebook_dim = 8` in `contentstyle_fvq16384_12.5hz.json`, which is so small? As far as I know, it is the dimension of each vector in the codebook. In the RepCodec project, which you referenced, they set the dimension equal to the hidden dim.
> Can I say that Vevo1.5 does not support using audio alone as the AR input? I don't see a content tokenizer for audio provided.
@liu4lin Yes. The text is required for the AR input.
> @RMSnow Thank you for your detailed explanation. Also, may I know why you set `model.coco.codebook_dim = 8` in `contentstyle_fvq16384_12.5hz.json`, which is so small? As far as I know, it is the dimension of each vector in the codebook. In the RepCodec project, which you referenced, they set the dimension equal to the hidden dim.
@josephwong14wkh A good question. I think it is a practical trick to improve the utilization rate of the VQ codebook during optimization. This implementation is also adopted by MaskGCT. You can see the following two papers for detailed discussions; a rough sketch of the idea follows the references:
- High-fidelity audio compression with improved RVQGAN. NeurIPS 2023.
- Vector-quantized image modeling with improved VQGAN. ICLR 2022.
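Concretely, and only as an illustrative sketch (not the actual Vevo1.5 or MaskGCT implementation), the idea in those papers is to project the encoder hidden state down to a very low-dimensional space such as 8, do the nearest-neighbor codebook lookup there on L2-normalized vectors, and project back up, which tends to keep more of the codebook in use. All names and dimensions below are illustrative, and the commitment/codebook losses are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedVQ(nn.Module):
    """Minimal sketch of a VQ layer with a tiny codebook dimension (e.g. 8).

    The encoder hidden state (hidden_dim) is projected down to codebook_dim,
    quantized by nearest-neighbor lookup in that low-dimensional space, then
    projected back up. Doing the lookup in a small, L2-normalized space is
    the trick used to improve codebook utilization.
    """

    def __init__(self, hidden_dim=1024, codebook_size=16384, codebook_dim=8):
        super().__init__()
        self.down = nn.Linear(hidden_dim, codebook_dim)   # project in
        self.up = nn.Linear(codebook_dim, hidden_dim)     # project out
        self.codebook = nn.Embedding(codebook_size, codebook_dim)

    def forward(self, x):                                  # x: (B, T, hidden_dim)
        z = F.normalize(self.down(x), dim=-1)              # (B, T, codebook_dim)
        emb = F.normalize(self.codebook.weight, dim=-1)    # (K, codebook_dim)
        # Nearest neighbor in the small space; Euclidean distance on
        # normalized vectors gives the same argmin as cosine similarity.
        dist = torch.cdist(z, emb.unsqueeze(0).expand(z.size(0), -1, -1))
        indices = dist.argmin(dim=-1)                      # (B, T) token ids
        q = F.embedding(indices, emb)                      # quantized latents
        q = z + (q - z).detach()                           # straight-through gradient
        return self.up(q), indices
```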
Thanks for the recommendations. I'll definitely check them out to dive deeper! I have another question about the training loss. I am training the tokenizer from scratch with my own data. Here's my training loss data, and I'm not sure if the trend looks good or not:
[Figure: prosody tokenizer training loss curves]
[Figure: content-style tokenizer training loss curves]
I'm concerned about the training loss because:
- The codebook loss first decreases to ~0.01, then rises to roughly 30-60, and falls again afterwards. Is this rise-then-fall a good sign?
- The reconstruction loss is highly unstable (for both the Chromagram and Whisper encoder features).
I see the same trend for content tokenizer training too. Thanks again for your great work!
Hi @josephwong14wkh, I think the loss trend is reasonable. You can refer to issue #440 to see my training loss curves.
Got it! Thank you very much!