descript-audio-codec

Insight needed: Residual of the code embedding in the high-dimensional space is not decreasing.

Open jhauret opened this issue 6 months ago • 0 comments

Dear Authors,

Thank you for publishing your work and making your code available online. It is of great value to the audio community.

I was curious about how using more or fewer quantizers affects the distance between the continuous and quantized embeddings in the high-dimensional embedding space, so I wrote this code:

import dac
import torch
import torchaudio

# Download and load the pretrained 44 kHz, 8 kbps model
model_path = dac.utils.download(model_type="44khz", model_bitrate="8kbps")
model = dac.DAC.load(model_path)
model.eval()

audio, sr = torchaudio.load("./audio_to_i/fileid_1888.flac")

# Encode once, then quantize with 1 to 9 codebooks and compare to z
with torch.no_grad():
    z = model.encoder(audio.unsqueeze(0))
    for i in range(9):
        zq, codes, _, _, _ = model.quantizer(z, n_quantizers=i + 1)
        print(f"{i=} , {torch.norm(z-zq).item() = }")

And I was very surprised to see that the norm is increasing with i! Do you have any explanation?

I understand that the distance to the codebook entries is computed in the 8-d low-dimensional space, but shouldn't the 1024-d residual still get smaller as more RVQ scales are used?
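To make the question concrete, here is a minimal toy sketch in pure Python of a factorized residual quantizer in which the low-dimensional matching error shrinks at each stage while the high-dimensional residual norm grows. Everything in it is a hypothetical illustration, not DAC's implementation: the names `toy_rvq`, `W_IN`, `W_OUT` and `CODEBOOKS` are made up, the projections are shared across stages for brevity, and the lookup is a plain nearest-neighbour search with no normalization. It only shows that a shrinking high-dimensional residual is not guaranteed by construction when the lookup happens after a down-projection; it does not explain the numbers observed with the pretrained model.

```python
import math


def project_in(x, w_in):
    # Hypothetical down-projection: high-d vector -> 1-d scalar (dot product).
    return sum(a * b for a, b in zip(x, w_in))


def project_out(code, w_out):
    # Hypothetical up-projection: 1-d code -> high-d vector.
    return [code * w for w in w_out]


def toy_rvq(z, w_in, w_out, codebooks):
    """Run a toy factorized RVQ; return per-stage low-d errors and high-d residual norms."""
    residual = list(z)
    low_errors, high_norms = [], []
    for codebook in codebooks:
        z_e = project_in(residual, w_in)                    # low-d view of the residual
        code = min(codebook, key=lambda c: abs(z_e - c))    # nearest code in low-d
        z_q = project_out(code, w_out)                      # back to high-d
        residual = [r - q for r, q in zip(residual, z_q)]   # high-d residual update
        low_errors.append(abs(z_e - code))
        high_norms.append(math.sqrt(sum(r * r for r in residual)))
    return low_errors, high_norms


# Assumed toy values: 2-d "high" space, 1-d "low" space, two stages.
Z = [1.2, 0.0]
W_IN = [1.0, 0.0]
W_OUT = [1.0, 2.0]
CODEBOOKS = [[-1.0, 1.0], [-0.25, 0.25]]

low_errors, high_norms = toy_rvq(Z, W_IN, W_OUT, CODEBOOKS)
for stage, (low, high) in enumerate(zip(low_errors, high_norms), start=1):
    print(f"stage {stage}: low-d error = {low:.4f}, high-d residual norm = {high:.4f}")
```

In this toy, the up-projection writes into a direction that the down-projection never sees, so each stage improves the low-dimensional match while adding energy to the high-dimensional residual. Whether something comparable happens with the learned projections in the pretrained model is exactly what I would like to understand.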

Note: I have also attached the audio I used in this test, along with some reconstructions using different numbers of RVQ scales, and they sound good. Download link: audio_to_i.zip

jhauret, Aug 08 '24 15:08