audiolm-pytorch
Adapting AudioLM to support SingSong style accompaniment generation
Hi @lucidrains - thanks for your awesome work here. Great stuff as always.
I recently came across Google's new SingSong paper (https://arxiv.org/pdf/2301.12662.pdf), in which they adapt AudioLM to generate instrumental accompaniments conditioned on sung input vocals, and I was wondering if you (or anyone else 🙂) might have any practical advice on implementing the necessary adaptations.
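For anyone picking this up, here is a minimal conceptual sketch of the sequence-level setup as I understand it from the paper. Everything here is hypothetical and not from this repo or the paper's code: `model`, `singsong_style_loss`, and `sep_id` are made-up names, and the idea is simply that the vocal tokens act as a prefix while a causal transformer models the accompaniment tokens, with the loss masked over the prefix.

```python
import torch
import torch.nn.functional as F

def singsong_style_loss(model, vocal_tokens, accomp_tokens, sep_id):
    # vocal_tokens:  (batch, n_vocal) tokens derived from the input vocals
    # accomp_tokens: (batch, n_accomp) target accompaniment tokens
    # model: any causal decoder-only transformer returning (batch, seq, vocab) logits
    b, n_vocal = vocal_tokens.shape
    sep = torch.full((b, 1), sep_id, dtype = torch.long, device = vocal_tokens.device)

    # [vocal prefix] [separator] [accompaniment] as one sequence
    seq = torch.cat((vocal_tokens, sep, accomp_tokens), dim = -1)

    logits = model(seq[:, :-1])     # predict token t+1 from tokens <= t
    targets = seq[:, 1:].clone()
    targets[:, :n_vocal] = -100     # no loss on the vocal prefix / separator

    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index = -100)
```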
Also, to this end, would you happen to know if anyone has managed to train a decent soundstream model and made it publicly available yet?
Best, and thanks again for your work here, Shaun
Yeah, I can take care of the paper
> Also, to this end, would you happen to know if anyone has managed to train a decent soundstream model and made it publicly available yet?
Not yet, but I reckon we will, given my sources :smile:
yea, both this and spear-tts may be too complicated to fit in this repository
i think many audio researchers are forking the audiolm repository within google and extending it to their own work, due to its success
@lucidrains I am also interested in SingSong, and I am preparing to train a FineTransformer model first. I am wondering:

1. How much data is needed to train the FineTransformer model?
2. How many steps are needed?
3. Is the FineTransformerTrainer code available for training?

I revised part of the code, mostly the interface and dataloader parts, to adapt it to my own data. I have just started training the model, but I do not know whether this code is suitable for SingSong or MusicLM.
@Liujingxiu23 ah, you can't just skip to the fine transformer. this work resembles a Matryoshka doll. you will need to train soundstream and 2 transformers successfully before even arriving at the fine transformer, not to mention all the extra singsong networks.
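For anyone following along, here is the training order condensed from this repo's README. The class names are from the library itself, but the hyperparameters are just the README's illustrative values; check the current README for exact signatures, as they may have changed.

```python
from audiolm_pytorch import (
    SoundStream, SoundStreamTrainer,
    HubertWithKmeans, SemanticTransformer,
    CoarseTransformer, FineTransformer
)

# 1. the neural codec
soundstream = SoundStream(codebook_size = 1024, rq_num_quantizers = 8)
trainer = SoundStreamTrainer(
    soundstream,
    folder = '/path/to/audio',
    batch_size = 4,
    grad_accum_every = 8,
    data_max_length = 320 * 32,
    num_train_steps = 1_000_000
)
trainer.train()

# 2. semantic transformer over HuBERT k-means tokens
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)
semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size, dim = 1024, depth = 6
)

# 3. coarse transformer: semantic tokens -> first few codec quantizers
coarse_transformer = CoarseTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    codebook_size = 1024, num_coarse_quantizers = 3, dim = 512, depth = 6
)

# 4. only now the fine transformer: coarse codes -> remaining quantizers
fine_transformer = FineTransformer(
    num_coarse_quantizers = 3, num_fine_quantizers = 5,
    codebook_size = 1024, dim = 512, depth = 6
)
# each transformer has a matching trainer class (SemanticTransformerTrainer, etc.)
```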
this is why having open sourced foundation models is so important. without them, no one but internal google teams is able to carry out this research
@lucidrains I use codes generated by Facebook's Encodec model: 24kHz, 16 codebooks, 6 for coarse and 10 for fine. Is that ok?
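In case it helps, a small sketch of that split, assuming Facebook's `encodec` package (the 24 kHz model at 12 kbps yields 16 residual codebooks; `encode` returns frames of codes shaped `(batch, n_q, time)`). The 6/10 split below is just the grouping described above, not something prescribed by this repo.

```python
import torch
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(12.0)  # 12 kbps -> 16 residual codebooks

wav = torch.randn(1, 1, 24000)    # (batch, channels, samples), 1 second of audio

with torch.no_grad():
    encoded_frames = model.encode(wav)

# concatenate code frames along time: (batch, 16, time)
codes = torch.cat([codes for codes, _ in encoded_frames], dim = -1)

coarse_codes = codes[:, :6]   # first 6 quantizers -> coarse transformer targets
fine_codes   = codes[:, 6:]   # remaining 10 quantizers -> fine transformer targets
```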
And about the code: I understand the generation process, where codes are generated one by one in a loop, but I do not understand the training process. Why are the fine codes fed into the model during training? I do not see the corresponding "transformer-decoder" code. I mean, each fine code should only be able to see its preceding codes and its own coarse code during training. Which part of the code reflects this logic? I am sorry, my previous work did not involve transformer decoders, so I cannot easily tell.
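Not the repo's actual code, but a toy illustration of the mechanism being asked about here: during training the whole ground-truth sequence is fed in at once (teacher forcing), and a causal attention mask guarantees each position only attends to itself and earlier positions, so no code can "see" the codes that come after it.

```python
import torch

seq_len = 5
# boolean mask: True marks future positions that must be hidden
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype = torch.bool), diagonal = 1)

scores = torch.randn(seq_len, seq_len)                 # raw attention scores
scores = scores.masked_fill(causal_mask, float('-inf'))
attn = scores.softmax(dim = -1)

print(attn)  # row t has non-zero weight only on columns 0..t
# even though all codes are passed in together during training,
# position t can never attend to positions t+1, t+2, ...
# at generation time, the same model is simply run in a loop instead
```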
has anyone coded up the singsong implementation?
can you do it if I pay you, @lucidrains?
@mishav78 there haven't been better papers since?
this is the best. I listened to the Google demos. It works very well.