audiolm-pytorch
Adapting AudioLM to support SingSong style accompaniment generation
Hi @lucidrains - thanks for your awesome work here. Great stuff as always.
I recently came across Google's new SingSong paper (https://arxiv.org/pdf/2301.12662.pdf), in which they adapt AudioLM to generate instrumental accompaniments conditioned on sung input vocals, and I was wondering if you (or anyone else 🙂) might have any practical advice on implementing the necessary adaptations.
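For anyone picking this up, here is a minimal conceptual sketch of the sequence-level setup as I understand it from the paper. Everything here is hypothetical and not from this repo or the paper's code: `model`, `singsong_style_loss`, and `sep_id` are made-up names, and the idea is simply that the vocal tokens act as a prefix while a causal transformer models the accompaniment tokens, with the loss masked over the prefix.

```python
import torch
import torch.nn.functional as F

def singsong_style_loss(model, vocal_tokens, accomp_tokens, sep_id):
    # vocal_tokens:  (batch, n_vocal) tokens derived from the input vocals
    # accomp_tokens: (batch, n_accomp) target accompaniment tokens
    # model: any causal decoder-only transformer returning (batch, seq, vocab) logits
    b, n_vocal = vocal_tokens.shape
    sep = torch.full((b, 1), sep_id, dtype = torch.long, device = vocal_tokens.device)

    # [vocal prefix] [separator] [accompaniment] as one sequence
    seq = torch.cat((vocal_tokens, sep, accomp_tokens), dim = -1)

    logits = model(seq[:, :-1])     # predict token t+1 from tokens <= t
    targets = seq[:, 1:].clone()
    targets[:, :n_vocal] = -100     # no loss on the vocal prefix / separator

    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index = -100)
```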
Also, to this end, would you happen to know if anyone has managed to train a decent soundstream model and made it publicly available yet?
Best, and thanks again for your work here, Shaun
Yeah, I can take care of the paper
> Also, to this end, would you happen to know if anyone has managed to train a decent soundstream model and made it publicly available yet?
Not yet, but I reckon we will, given my sources :smile:
yea, both this and spear-tts may be too complicated to fit in this repository
i think many audio researchers are forking the audiolm repository within google and extending it to their own work, due to its success
@lucidrains I am also interested in SingSong, and I am preparing to train a FineTransformer model first. I am wondering:

1. How much data is needed to train the FineTransformer model?
2. How many steps are needed?
3. Is the FineTransformerTrainer code available for training?

I revised part of the code, mostly the interface and dataloader parts, to adapt it to my own data. I have just started training the model, but I do not know whether this code is suitable for SingSong or MusicLM.
@Liujingxiu23 ah, you can't just skip to the fine transformer. this work resembles a Matryoshka doll. you will need to train soundstream and 2 transformers successfully before even arriving at the fine transformer, not to mention all the extra singsong networks.
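For anyone following along, here is the training order condensed from this repo's README. The class names are from the library itself, but the hyperparameters are just the README's illustrative values; check the current README for exact signatures, as they may have changed.

```python
from audiolm_pytorch import (
    SoundStream, SoundStreamTrainer,
    HubertWithKmeans, SemanticTransformer,
    CoarseTransformer, FineTransformer
)

# 1. the neural codec
soundstream = SoundStream(codebook_size = 1024, rq_num_quantizers = 8)
trainer = SoundStreamTrainer(
    soundstream,
    folder = '/path/to/audio',
    batch_size = 4,
    grad_accum_every = 8,
    data_max_length = 320 * 32,
    num_train_steps = 1_000_000
)
trainer.train()

# 2. semantic transformer over HuBERT k-means tokens
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)
semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size, dim = 1024, depth = 6
)

# 3. coarse transformer: semantic tokens -> first few codec quantizers
coarse_transformer = CoarseTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    codebook_size = 1024, num_coarse_quantizers = 3, dim = 512, depth = 6
)

# 4. only now the fine transformer: coarse codes -> remaining quantizers
fine_transformer = FineTransformer(
    num_coarse_quantizers = 3, num_fine_quantizers = 5,
    codebook_size = 1024, dim = 512, depth = 6
)
# each transformer has a matching trainer class (SemanticTransformerTrainer, etc.)
```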
this is why having open sourced foundation models is so important. without them, no one but internal google teams is able to carry out this research
@lucidrains I use codes generated by Facebook's Encodec model: 24kHz, 16 codebooks, 6 for coarse and 10 for fine. Is that ok?
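In case it helps, a small sketch of that split, assuming Facebook's `encodec` package (the 24 kHz model at 12 kbps yields 16 residual codebooks; `encode` returns frames of codes shaped `(batch, n_q, time)`). The 6/10 split below is just the grouping described above, not something prescribed by this repo.

```python
import torch
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(12.0)  # 12 kbps -> 16 residual codebooks

wav = torch.randn(1, 1, 24000)    # (batch, channels, samples), 1 second of audio

with torch.no_grad():
    encoded_frames = model.encode(wav)

# concatenate code frames along time: (batch, 16, time)
codes = torch.cat([codes for codes, _ in encoded_frames], dim = -1)

coarse_codes = codes[:, :6]   # first 6 quantizers -> coarse transformer targets
fine_codes   = codes[:, 6:]   # remaining 10 quantizers -> fine transformer targets
```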
And about the code: I understand the generation process, where codes are generated one by one in a loop, but I do not understand the training process. Why are the fine codes fed into the model during training? I do not see the corresponding "transformer-decoder" code. I mean, each fine code should only be able to see its preceding codes and its own coarse code during training. Which part of the code reflects this logic? I am sorry, my previous work did not involve transformer decoders, so I cannot easily tell.
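Not the repo's actual code, but a toy illustration of the mechanism being asked about here: during training the whole ground-truth sequence is fed in at once (teacher forcing), and a causal attention mask guarantees each position only attends to itself and earlier positions, so no code can "see" the codes that come after it.

```python
import torch

seq_len = 5
# boolean mask: True marks future positions that must be hidden
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype = torch.bool), diagonal = 1)

scores = torch.randn(seq_len, seq_len)                 # raw attention scores
scores = scores.masked_fill(causal_mask, float('-inf'))
attn = scores.softmax(dim = -1)

print(attn)  # row t has non-zero weight only on columns 0..t
# even though all codes are passed in together during training,
# position t can never attend to positions t+1, t+2, ...
# at generation time, the same model is simply run in a loop instead
```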
has anyone coded up the singsong implementation?
can you do it if I pay you, @lucidrains?
@mishav78 there haven't been better papers since?
this is the best. I listened to the Google demos. It works very well.