Configuration for training with CLAP embeddings
I'm wondering if anyone has any configuration info they could share on training with CLAP embeddings?
I want to try the laion/larger_clap_music model from Huggingface, but it's really unclear to me how the project is supposed to be configured.
Any help greatly appreciated.
Just adding a bit more info: I managed to at least get to an attempt to load larger_clap_music using this config:
```yaml
conditioners:
  description:
    model: clap
    clap: # based on
      checkpoint: //reference/clap/larger_clap_music/pytorch_model.bin
      name: laion/larger_clap_music
      model_arch: 'HTSAT-base'
      enable_fusion: false
      sample_rate: 32000
      max_audio_length: 10
      audio_stride: 1
      dim: 512
      attribute: description
      normalize: true
      quantize: true # use RVQ quantization
      n_q: 12
      bins: 1024
      kmeans_iters: 50
      text_p: 0. # probability of using text embed at train time
      cache_path: null
```
But loading the state_dict fails with a laundry list of "Unexpected key(s)" errors. I also tried just pointing it to the folder (it complained that it was not a file) and to the config.json inside the HF download (which gave some kind of parse error).
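For anyone hitting the same wall, the mismatch is easy to see by just dumping the checkpoint's key names (a quick sketch; the path is a placeholder):

```python
import torch

# Quick sketch: list the key names stored in the HF checkpoint so they can be
# compared against what Audiocraft's CLAP conditioner expects to load.
state_dict = torch.load("larger_clap_music/pytorch_model.bin", map_location="cpu")
for key in sorted(state_dict.keys())[:20]:
    print(key)
# The HF naming scheme differs from the one Audiocraft expects, which is
# presumably where all the "Unexpected key(s)" come from.
```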
Okay, I can load larger_clap_music using the ClapModel (and ClapProcessor) from Huggingface, but not in Audiocraft. I see that Audiocraft is based on CLAP from the Laion repo... Does anybody know if there's a way to load the HF weights into the Laion model? Or has anybody hacked the HF ClapModel into Audiocraft, by any chance?
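For reference, loading it with transformers looks roughly like this (the prompt text, the random waveform, and the 48 kHz rate are just placeholders for illustration):

```python
import torch
from transformers import ClapModel, ClapProcessor

# Works outside Audiocraft: load the HF checkpoint with transformers.
model = ClapModel.from_pretrained("laion/larger_clap_music")
processor = ClapProcessor.from_pretrained("laion/larger_clap_music")

# Text embedding for a placeholder prompt.
text_inputs = processor(text=["an upbeat electronic track"], return_tensors="pt")
with torch.no_grad():
    text_embed = model.get_text_features(**text_inputs)

# Audio embedding; the HF CLAP feature extractor expects 48 kHz mono audio
# (a random tensor stands in for a real waveform here).
waveform = torch.randn(48000 * 10)
audio_inputs = processor(audios=waveform.numpy(), sampling_rate=48000, return_tensors="pt")
with torch.no_grad():
    audio_embed = model.get_audio_features(**audio_inputs)

print(text_embed.shape, audio_embed.shape)  # projected embeddings, one per input
```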
I worked out a way around loading the HF weights. Now what I'm wondering about is how to configure a text prompt for running test generations during training. My goal is to test the performance of training on CLAP audio embeddings and using text embeddings for inference.
Any help greatly appreciated.
In audiocraft, 'test generation' during training is a little bit tricky; it is done in the following code: https://github.com/facebookresearch/audiocraft/blob/69fea8b290ad1b4b40d28f92d1dfc0ab01dbab85/audiocraft/solvers/musicgen.py#L493
We may have to prepare a dataset for generation in the same way as the training data. As you may know, we can add metadata to each audio file with a .json file, as shown in the example here: https://github.com/facebookresearch/audiocraft/tree/main/dataset/example
If you don't need to do 'continuation generation' during training, dummy audio should be enough. In this case, you may have to:
- Prepare dummy audio and a metadata file for test generation
- Add the descriptions you want to use for test generation to the metadata file (.json) under the "description" tag (a rough sketch of both steps follows below)
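For example, something like this would create one dummy clip with its sidecar metadata (the paths, the silent audio and the extra fields are just assumptions for illustration; the linked example dataset shows the full set of fields):

```python
import json
import os

import torch
import torchaudio

# Sketch: one silent dummy clip plus a .json metadata file next to it,
# with the text prompt for test generation under the "description" key.
out_dir = "dataset/generate_dummy"  # placeholder path
os.makedirs(out_dir, exist_ok=True)

sample_rate = 32000
dummy = torch.zeros(1, sample_rate * 10)  # 10 s of mono silence
torchaudio.save(os.path.join(out_dir, "dummy_0.wav"), dummy, sample_rate)

meta = {
    "description": "An upbeat electronic track with a driving bassline",  # placeholder prompt
    "duration": 10.0,
    "sample_rate": sample_rate,
}
with open(os.path.join(out_dir, "dummy_0.json"), "w") as f:
    json.dump(meta, f)
```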
Thanks so much for the reply!
Digging around the solver code (as you pointed out), it did seem like the joint embedding might want a prompt, so I added some super simple metadata files. I haven't run it to the point of a test generation yet, but hopefully it works as expected. I haven't added any dummy audio at this point, but I think in the past it has just used the audio from the dataset... (I think...??)
Another "gotcha" that wasn't obvious to me at first is that dataset.valid.num_samples has to be >= the number of GPUs on the system. Makes sense, of course, but I crashed a few times before figuring it out.
Actually though... what determines when it will generate a sample output? I can see it running through train and valid steps, and it's saving checkpoints, but I don't seem to be getting audio. I also want audio sent to wandb, ideally... I do have wandb: with_media_logging: true set.
It seems that 'test generation' runs at the end of every epoch, the same as evaluation, which is defined in the BaseSolver class (the base class of every solver class). https://github.com/facebookresearch/audiocraft/blob/69fea8b290ad1b4b40d28f92d1dfc0ab01dbab85/audiocraft/solvers/base.py#L466
As shown in that method, you can first check whether your run goes through the self.should_run_stage('generate') statement. If not, it means 'test generation' is rejected there, and you can find which configuration causes the rejection.
And then, finally, audio saving is done in the above-mentioned method generate_audio after generating audio samples, at the following line: https://github.com/facebookresearch/audiocraft/blob/69fea8b290ad1b4b40d28f92d1dfc0ab01dbab85/audiocraft/solvers/musicgen.py#L562
Yes, I saw from another issue/comment that the "every" in the "generate" config refers to epochs, not steps. I had it set to 1000, thinking it meant steps, so I would have been waiting a while... heh. It's not always super clear when we're using steps (or "updates") and when we're using epochs.
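Just to put numbers on why it looked like nothing was happening, here's the back-of-the-envelope math (the updates-per-epoch value is only an assumed placeholder; the real value comes from the solver config):

```python
# generate.every counts epochs, not updates.
updates_per_epoch = 2000        # assumed placeholder value
generate_every = 1000           # what I had set, thinking it meant updates

updates_until_first_sample = generate_every * updates_per_epoch
print(f"first test generation after ~{updates_until_first_sample:,} updates")
# -> first test generation after ~2,000,000 updates
```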