
[Help]: Questions about the tokenizers of vevo1.5

Open josephwong14wkh opened this issue 8 months ago • 9 comments

Thank you for your great works! I am now working on audio-to-audio voice conversion and there are two questions regarding vevo1.5.

  1. I would like to train a content tokenizer for audio, replacing the text tokens during inference. Does the vocab size remain 32, as stated in the paper?

  2. I am curious why the vocab size of the content-style tokenizer increased from 4096 (as suggested in the paper) to 16384 (the Vevo1.5 setting), since a larger vocabulary will retain more timbre information, as mentioned in the paper.

josephwong14wkh avatar Apr 15 '25 04:04 josephwong14wkh

@josephwong14wkh Thank you for raising such good questions!

New Idea of Vevo1.5 for Information Bottleneck

During the development of Vevo1.5, we refined the concept of Vevo's information bottleneck. Specifically, we posit that in the design of the tokenizer, the encoding of information is primarily influenced by two key factors: (1) the source hidden features (e.g., HuBERT for Vevo; Whisper and Chromagram for Vevo1.5); (2) the vector quantization (VQ) design, which encompasses both the frame rate (i.e., the downsampling rate of the source features) and the vocabulary size proposed by Vevo.

An intuitive interpretation is that the vocabulary size represents the spatial bottleneck width, while the frame rate corresponds to the temporal bottleneck width. Consequently, the bitrate (i.e., frame rate × log2(vocabulary size)) may serve as a more meaningful metric than vocabulary size alone.

Regarding Q2

In Vevo, we employ 50 Hz HuBERT tokens (vocabulary size = 4096) as the content-style tokens, yielding a bitrate of 50 × log2(4096) = 600. For Vevo1.5, despite adopting a larger vocabulary size (16,384 versus Vevo's 4,096), the frame rate is significantly lower (12.5 Hz), resulting in a bitrate of 12.5 × log2(16,384) = 175, which is actually lower than that of Vevo.
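As a sanity check, the two bitrates quoted above can be reproduced in a few lines (a small illustrative script, not part of the Amphion codebase; note the log is base 2, i.e., bits per token):

```python
import math

def bitrate(frame_rate_hz: float, vocab_size: int) -> float:
    """Bitrate in bits/s: frame rate times bits per token (log2 of vocab size)."""
    return frame_rate_hz * math.log2(vocab_size)

# Vevo: 50 Hz HuBERT tokens, vocabulary size 4096
print(bitrate(50, 4096))      # 600.0
# Vevo1.5: 12.5 Hz tokens, vocabulary size 16384
print(bitrate(12.5, 16384))   # 175.0
```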

Note: The rationale behind reducing the frame rate was to shorten the sequence length during autoregressive (AR) prediction, thereby improving the speed and performance of the first-stage AR model.

Regarding Q1

This remains an open question. We have not yet conducted systematic experiments to determine the optimal design for the content tokenizer in Vevo1.5. From a feature-selection perspective, one could explore using only Whisper features (as opposed to the current approach in Vevo1.5, which combines Whisper and Chromagram features for the content-style tokens).

From a VQ-design standpoint, experimenting with a 6.25 Hz token rate (which may align more closely with the frame rate of "text") could be insightful. However, my intuition is that a vocabulary size of 32 at this frame rate might be too small. Instead, it may be necessary to account for language-specific characteristics (e.g., the number of characters, syllables, or words in a given language), though further empirical analysis would be required to validate this intuition.
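Applying the same bitrate metric to the hypothetical 6.25 Hz content tokenizer illustrates the concern (the vocabulary sizes below are illustrative candidates, not tested settings):

```python
import math

# Hypothetical content-tokenizer designs at 6.25 Hz (illustrative, untested)
for vocab in (32, 1024, 4096):
    bits = 6.25 * math.log2(vocab)
    print(f"vocab={vocab:5d} -> {bits:.2f} bits/s")
# vocab=32 gives only 31.25 bits/s, far below Vevo1.5's 175 bits/s
```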

RMSnow avatar Apr 18 '25 16:04 RMSnow

Can I say that Vevo1.5 does not support using audio alone as AR input? I don't see a content tokenizer for audio provided.

liu4lin avatar May 22 '25 08:05 liu4lin

Yes, we would need to train the content tokenizer ourselves in order to use audio as the AR input.

josephwong14wkh avatar May 27 '25 10:05 josephwong14wkh

@RMSnow Thank you for your detailed explanation. Also, may I know why you set model.coco.codebook_dim = 8 in contentstyle_fvq16384_12.5hz.json? It seems very small. As I understand it, this is the dimension of each vector in the codebook. In the RepCodec project, which you referenced, they set the dimension equal to the hidden dim.

josephwong14wkh avatar May 27 '25 10:05 josephwong14wkh

> Can I say that Vevo1.5 does not support using audio alone as AR input? I don't see a content tokenizer for audio provided.

@liu4lin Yes. The text is required for the AR input.

RMSnow avatar Jun 03 '25 03:06 RMSnow

> @RMSnow Thank you for your detailed explanation. Also, may I know why you set model.coco.codebook_dim = 8 in contentstyle_fvq16384_12.5hz.json? It seems very small. As I understand it, this is the dimension of each vector in the codebook. In the RepCodec project, which you referenced, they set the dimension equal to the hidden dim.

@josephwong14wkh A good question. I think it is a practical trick to improve the codebook utilization rate during VQ optimization. This implementation is also adopted by MaskGCT. See the following two papers for detailed discussions:

  1. High-fidelity audio compression with improved RVQGAN. NeurIPS 2023.
  2. Vector-quantized image modeling with improved VQGAN. ICLR 2022.
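To make the trick concrete, here is a minimal NumPy sketch of the low-dimensional, L2-normalized codebook lookup described in those papers: encoder features are projected down to the small codebook_dim (8 here), matched against L2-normalized codes by cosine similarity, then projected back up. This is an illustrative sketch under those assumptions, not the actual Amphion/MaskGCT implementation (the real projections are learned; the random matrices here are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_dim, codebook_dim, vocab_size = 1024, 8, 16384

# Stand-ins for learned projection matrices and codebook (random here)
proj_in = rng.standard_normal((hidden_dim, codebook_dim)) / np.sqrt(hidden_dim)
proj_out = rng.standard_normal((codebook_dim, hidden_dim)) / np.sqrt(codebook_dim)
codebook = rng.standard_normal((vocab_size, codebook_dim))

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def quantize(h):
    """Project encoder features into the low-dim code space, then do a
    nearest-neighbor lookup on L2-normalized vectors (cosine similarity)."""
    z = l2_normalize(h @ proj_in)            # (T, codebook_dim)
    cb = l2_normalize(codebook)              # (vocab_size, codebook_dim)
    indices = np.argmax(z @ cb.T, axis=-1)   # (T,) nearest code per frame
    quantized = cb[indices] @ proj_out       # project back to hidden_dim
    return indices, quantized

h = rng.standard_normal((10, hidden_dim))    # 10 frames of encoder features
idx, q = quantize(h)
print(idx.shape, q.shape)                    # (10,) (10, 1024)
```

The intuition from the improved-VQGAN and RVQGAN papers is that searching in a small, normalized code space makes more codebook entries reachable during training, which raises codebook utilization.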

RMSnow avatar Jun 03 '25 03:06 RMSnow

Thanks for the recommendations. I'll definitely check them out to dive deeper! I have another question about the training loss. I am training the tokenizer from scratch with my own data. Here's my training loss data, and I'm not sure if the trend looks good or not:

Prosody tokenizer training loss: [loss curve image]

Content-style tokenizer training loss: [loss curve image]

I'm concerned about the training loss because:

  1. The codebook loss first decreases to ~0.01, then rises to ~30-60, and then decreases again. Is this a good sign?
  2. The reconstruction loss is highly unstable (for both the Chromagram and Whisper encoder features).

I see the same trend for content tokenizer training too. Thanks again for your great work!

josephwong14wkh avatar Jun 03 '25 04:06 josephwong14wkh

Hi @josephwong14wkh, I think the loss trend is reasonable. You can refer to issue #440 to see my training loss curves.

RMSnow avatar Jun 03 '25 05:06 RMSnow

Got it! Thank you very much!

josephwong14wkh avatar Jun 03 '25 07:06 josephwong14wkh