descript-audio-codec
Insight needed: the residual of the code embedding in the high-dimensional space is not decreasing.
Dear Authors,
Thank you for publishing your work and making your code available online. It is of great value to the audio community.
I was curious about how using more or fewer quantizers affects the distance between the continuous and quantized embeddings in the high-dimensional embedding space, so I wrote this code:
import dac
import torch
import torchaudio

# Download and load the pretrained model
model_path = dac.utils.download(model_type="44khz", model_bitrate="8kbps")
model = dac.DAC.load(model_path)
audio, sr = torchaudio.load("./audio_to_i/fileid_1888.flac")
model.eval()

with torch.no_grad():
    # Continuous embedding produced by the encoder
    z = model.encoder(audio.unsqueeze(0))
    for i in range(9):
        # Quantize using only the first i + 1 quantizers
        zq, codes, _, _, _ = model.quantizer(z, n_quantizers=i + 1)
        print(f"{i=} , {torch.norm(z-zq).item() = }")
I was very surprised to see that the norm increases with i! Do you have any explanation?
I understand that the distance to the codebook entries is computed in the 8-d low-dimensional space, but shouldn't the 1024-d residual still get smaller the more RVQ scales we use?
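To make the setup I have in mind concrete, here is a minimal toy sketch in NumPy. It is not DAC's implementation: the function name `toy_factorized_rvq`, the random untrained projections and codebooks, and the sizes are all assumptions chosen for illustration. Each stage projects the current residual to a low-dimensional space, picks the nearest codebook entry there, projects that entry back up, and subtracts it from the high-dimensional residual.

```python
import numpy as np

def toy_factorized_rvq(z, in_projs, codebooks, out_projs):
    """Toy residual VQ whose code lookup happens in a low-dimensional space.

    z: (D,) vector; in_projs[i]: (d, D); codebooks[i]: (K, d); out_projs[i]: (D, d).
    Returns the high-dimensional residual norm after each stage.
    """
    residual = z.copy()
    norms = []
    for w_in, codebook, w_out in zip(in_projs, codebooks, out_projs):
        z_e = w_in @ residual                           # project residual down to d dims
        dists = np.sum((codebook - z_e) ** 2, axis=1)   # distances measured in d dims
        code = codebook[np.argmin(dists)]               # nearest codebook entry
        residual = residual - w_out @ code              # residual updated in D dims
        norms.append(float(np.linalg.norm(residual)))
    return norms

# Hypothetical sizes and random (untrained) weights, for illustration only
rng = np.random.default_rng(0)
D, d, K, n_stages = 1024, 8, 16, 9
z = rng.normal(size=D)
in_projs = [rng.normal(size=(d, D)) / np.sqrt(D) for _ in range(n_stages)]
codebooks = [rng.normal(size=(K, d)) for _ in range(n_stages)]
out_projs = [rng.normal(size=(D, d)) / np.sqrt(d) for _ in range(n_stages)]

for i, norm in enumerate(toy_factorized_rvq(z, in_projs, codebooks, out_projs)):
    print(f"stage {i + 1}: high-dim residual norm = {norm:.4f}")
```

In this toy version the nearest-neighbour search only minimizes the distance in the low-dimensional space, so nothing in the code itself forces the high-dimensional norm to shrink at every stage. My expectation was that a trained model would still show a decreasing norm.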
Note: I have also attached the audio I used in this test, along with some reconstructions using different numbers of RVQ scales, and they sound fine. Download link: audio_to_i.zip