[BUG]: Your implementation of S2A is not SoundStorm

Open xliu99 opened this issue 1 year ago • 4 comments

SoundStorm is a single model that models all codebooks hierarchically. It is not two models, where the first models only the first codebook and the second models the rest.

xliu99 avatar Nov 21 '24 23:11 xliu99

Please kindly refer to the AudioLM and SoundStorm papers for their implementation, which I understand is more than a single model. Thanks!

jiaqili3 avatar Nov 24 '24 03:11 jiaqili3

> Please kindly refer to the AudioLM and SoundStorm papers for their implementation, which I understand is more than a single model. Thanks!

In the SoundStorm paper, the semantic tokens are already obtained from AudioLM. Those AudioLM tokens are equivalent to the T2S model output in MaskGCT. However, their S2A model, which is SoundStorm itself, is indeed a single model that generates all RVQ layers hierarchically. You have probably confused AudioLM with a model that generates only the first RVQ codebook, which is why you split the S2A into two models.
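
As a rough illustration of the single-model, coarse-to-fine scheme described above, the sketch below fills in RVQ levels one at a time with one shared predictor. Everything in it is hypothetical: `toy_predict_level` is a deterministic stand-in for a real network, the codebook size and level count are arbitrary, and the masked iterative decoding that happens within each level is omitted.

```python
from typing import List


def toy_predict_level(semantic: List[int], prev_levels: List[List[int]],
                      level: int, codebook_size: int = 1024) -> List[int]:
    """Hypothetical stand-in for one network shared by all RVQ levels."""
    out = []
    for t, s in enumerate(semantic):
        # Condition on the semantic token and on all coarser levels at frame t.
        context = s + sum(codes[t] for codes in prev_levels)
        out.append((context * 31 + level) % codebook_size)
    return out


def generate_acoustic_tokens(semantic: List[int],
                             num_levels: int = 4) -> List[List[int]]:
    """Generate RVQ levels coarse-to-fine using a single predictor."""
    levels: List[List[int]] = []
    for q in range(num_levels):
        # The same predictor is called for every level; only `q` changes.
        levels.append(toy_predict_level(semantic, levels, q))
    return levels


tokens = generate_acoustic_tokens([5, 9, 2], num_levels=4)
print(len(tokens), len(tokens[0]))
```

The point of the sketch is only the control flow: one predictor, called once per level, with each call conditioned on the levels already generated.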

xliu99 avatar Nov 25 '24 00:11 xliu99

In fact, we used two models simply because it was easier to debug during the initial experimental stage (we only needed to generate the acoustic token layer to reconstruct speech). We also tried using one model, and there was no significant performance drop, so I don't think it makes much difference.
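
To make the comparison concrete, here is a minimal sketch in which the single-model and two-model layouts share the same coarse-to-fine loop and differ only in which predictor handles each level. All names are hypothetical and the predictors are deterministic toy stand-ins, not MaskGCT or Amphion code.

```python
from typing import Callable, List

Predictor = Callable[[List[int], List[List[int]], int], List[int]]


def toy_predictor(offset: int) -> Predictor:
    """Build a deterministic toy predictor (hypothetical stand-in for a network)."""
    def predict(semantic: List[int], prev_levels: List[List[int]],
                level: int) -> List[int]:
        return [(s + level + offset + sum(c[t] for c in prev_levels)) % 1024
                for t, s in enumerate(semantic)]
    return predict


def generate(semantic: List[int], num_levels: int,
             pick_model: Callable[[int], Predictor]) -> List[List[int]]:
    """Coarse-to-fine loop; `pick_model` decides which predictor handles each level."""
    levels: List[List[int]] = []
    for q in range(num_levels):
        levels.append(pick_model(q)(semantic, levels, q))
    return levels


shared = toy_predictor(0)
first_only, rest = toy_predictor(0), toy_predictor(0)

# Single-model layout: one predictor for every level.
single = generate([5, 9, 2], 4, lambda q: shared)
# Two-model layout: one predictor for level 0, another for the remaining levels.
split = generate([5, 9, 2], 4, lambda q: first_only if q == 0 else rest)
# Debugging use case: run only the first-level model.
first_layer_only = generate([5, 9, 2], 1, lambda q: first_only)
```

Here `single` and `split` coincide only because the toy predictors are identical by construction; with trained networks the outputs would differ. The split does make it easy to run and inspect the first level on its own, which matches the debugging motivation given above.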

HeCheng0625 avatar Dec 03 '24 06:12 HeCheng0625

Thank you very much for your answer.

xliu99 avatar Dec 04 '24 02:12 xliu99