Configuration for training with CLAP embeddings
I'm wondering if anyone has any configuration info they could share on training with CLAP embeddings?
I want to try the laion/larger_clap_music model from Huggingface, but it's really unclear to me how the project is supposed to be configured.
Any help greatly appreciated.
Just adding a bit more info: I managed to at least get to an attempt to load larger_clap_music using this config:
```yaml
conditioners:
  description:
    model: clap
    clap: # based on
      checkpoint: //reference/clap/larger_clap_music/pytorch_model.bin
      name: laion/larger_clap_music
      model_arch: 'HTSAT-base'
      enable_fusion: false
      sample_rate: 32000
      max_audio_length: 10
      audio_stride: 1
      dim: 512
      attribute: description
      normalize: true
      quantize: true # use RVQ quantization
      n_q: 12
      bins: 1024
      kmeans_iters: 50
      text_p: 0. # probability of using text embed at train time
      cache_path: null
```
But loading the state_dict fails with a laundry list of "Unexpected key(s)" errors. I also tried just pointing it to the folder (it complained that it was not a file) and to the config.json inside the HF download (which gave some kind of parse error).
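For anyone hitting the same wall, the mismatch is easy to see by just dumping the checkpoint's key names (a quick sketch; the path is a placeholder):

```python
import torch

# Quick sketch: list the key names stored in the HF checkpoint so they can be
# compared against what Audiocraft's CLAP conditioner expects to load.
state_dict = torch.load("larger_clap_music/pytorch_model.bin", map_location="cpu")
for key in sorted(state_dict.keys())[:20]:
    print(key)
# The HF naming scheme differs from the one Audiocraft expects, which is
# presumably where all the "Unexpected key(s)" come from.
```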
Okay, I can load larger_clap_music using the ClapModel (and ClapProcessor) from Huggingface, but not in Audiocraft. I see that Audiocraft is based on CLAP from the Laion repo... Does anybody know if there's a way to load the HF weights into the Laion model? Or has anybody hacked the HF ClapModel into Audiocraft, by any chance?
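For reference, loading it with transformers looks roughly like this (the prompt text, the random waveform, and the 48 kHz rate are just placeholders for illustration):

```python
import torch
from transformers import ClapModel, ClapProcessor

# Works outside Audiocraft: load the HF checkpoint with transformers.
model = ClapModel.from_pretrained("laion/larger_clap_music")
processor = ClapProcessor.from_pretrained("laion/larger_clap_music")

# Text embedding for a placeholder prompt.
text_inputs = processor(text=["an upbeat electronic track"], return_tensors="pt")
with torch.no_grad():
    text_embed = model.get_text_features(**text_inputs)

# Audio embedding; the HF CLAP feature extractor expects 48 kHz mono audio
# (a random tensor stands in for a real waveform here).
waveform = torch.randn(48000 * 10)
audio_inputs = processor(audios=waveform.numpy(), sampling_rate=48000, return_tensors="pt")
with torch.no_grad():
    audio_embed = model.get_audio_features(**audio_inputs)

print(text_embed.shape, audio_embed.shape)  # projected embeddings, one per input
```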
I worked out a way around loading the HF weights. Now what I'm wondering about is how to configure a text prompt for running test generations during training. My goal is to test the performance of training on CLAP audio embeddings and using text embeddings for inference.
Any help greatly appreciated.
In audiocraft, 'test generation' during training is a little bit tricky; it is done in the following code: https://github.com/facebookresearch/audiocraft/blob/69fea8b290ad1b4b40d28f92d1dfc0ab01dbab85/audiocraft/solvers/musicgen.py#L493
We may have to prepare a dataset for generation in the same way as the training data. As you may know, we can add metadata to each audio file with a .json file, as shown in the example here: https://github.com/facebookresearch/audiocraft/tree/main/dataset/example
If you don't need to do 'continuation generation' during training, dummy audio should be enough. In this case, you may have to:
- Prepare dummy audio and a metadata file for test generation
- Add the descriptions you want to use for test generation to the metadata file (.json) under the "description" tag (a rough sketch of both steps follows below)
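For example, something like this would create one dummy clip with its sidecar metadata (the paths, the silent audio and the extra fields are just assumptions for illustration; the linked example dataset shows the full set of fields):

```python
import json
import os

import torch
import torchaudio

# Sketch: one silent dummy clip plus a .json metadata file next to it,
# with the text prompt for test generation under the "description" key.
out_dir = "dataset/generate_dummy"  # placeholder path
os.makedirs(out_dir, exist_ok=True)

sample_rate = 32000
dummy = torch.zeros(1, sample_rate * 10)  # 10 s of mono silence
torchaudio.save(os.path.join(out_dir, "dummy_0.wav"), dummy, sample_rate)

meta = {
    "description": "An upbeat electronic track with a driving bassline",  # placeholder prompt
    "duration": 10.0,
    "sample_rate": sample_rate,
}
with open(os.path.join(out_dir, "dummy_0.json"), "w") as f:
    json.dump(meta, f)
```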
Thanks so much for the reply!
Digging around the solver code (as you pointed out), it did seem like the joint embedding might want a prompt, so I added some super simple metadata files. I haven't run it to the point of a test generation yet, but hopefully it works as expected. I haven't added any dummy audio at this point, but I think in the past it has just used the audio from the dataset... (I think...??)
Another "gotcha" that wasn't obvious to me at first is that dataset.valid.num_samples has to be >= the number of GPUs on the system. Makes sense, of course, but I crashed a few times before figuring it out.
Actually though... what determines when it will generate a sample output? I can see it running through train and valid steps, and it's saving checkpoints, but I don't seem to be getting audio. I also want audio sent to wandb, ideally... I do have wandb: with_media_logging: true set.
It seems that 'test generation' runs at the end of every epoch, the same as evaluation, which is defined in the BaseSolver class (the base class of every solver class). https://github.com/facebookresearch/audiocraft/blob/69fea8b290ad1b4b40d28f92d1dfc0ab01dbab85/audiocraft/solvers/base.py#L466
As shown in that method, you can first check whether your run goes through the self.should_run_stage('generate') statement. If not, it means 'test generation' is rejected there, and you can find which configuration causes the rejection.
And then, finally, audio saving is done in the above-mentioned method generate_audio after generating audio samples, at the following line: https://github.com/facebookresearch/audiocraft/blob/69fea8b290ad1b4b40d28f92d1dfc0ab01dbab85/audiocraft/solvers/musicgen.py#L562
Yes, I saw from another issue/comment that the "every" in the "generate" config refers to epochs, not steps. I had it set to 1000, thinking it meant steps, so I would have been waiting a while... heh. It's not always super clear when we're using steps (or "updates") and when we're using epochs.
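Just to put numbers on why it looked like nothing was happening, here's the back-of-the-envelope math (the updates-per-epoch value is only an assumed placeholder; the real value comes from the solver config):

```python
# generate.every counts epochs, not updates.
updates_per_epoch = 2000        # assumed placeholder value
generate_every = 1000           # what I had set, thinking it meant updates

updates_until_first_sample = generate_every * updates_per_epoch
print(f"first test generation after ~{updates_until_first_sample:,} updates")
# -> first test generation after ~2,000,000 updates
```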