
Add AudioLDM

Open sanchit-gandhi opened this issue 1 year ago • 2 comments

Original codebase: https://github.com/haoheliu/AudioLDM
Checkpoints: https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation/tree/main/ckpt

TODOs

UNet

  • [x] Convert UNet weights
  • [x] Add new modelling code
  • [x] Verify correctness

VAE

  • [x] Convert VAE weights
  • [x] Verify correctness

Scheduler

  • [x] Verify correctness

CLAP Text Embedding Model

  • [x] Convert CLAP weights
  • [x] Verify correctness

HiFiGAN Vocoder

  • [x] Convert HiFiGAN weights
  • [x] Verify correctness

Pipeline

  • [x] Verify correctness
  • [x] Tests

Docs

  • [x] Add and populate docs mdx file

sanchit-gandhi avatar Feb 03 '23 11:02 sanchit-gandhi

The documentation is not available anymore as the PR was closed or merged.

Implementation matches the original ✅. Tests and clean-up are still TODO.

sanchit-gandhi avatar Feb 20 '23 17:02 sanchit-gandhi

Gently pinging @williamberman 🙂

sanchit-gandhi avatar Feb 24 '23 17:02 sanchit-gandhi

Nice! All comments are small things. Looks basically good to go

williamberman avatar Feb 24 '23 19:02 williamberman

Not sure if anyone else has this issue, but just today I started getting the following error:

Traceback (most recent call last):
  File "/workspace/test.py", line 12, in <module>
    audio = pipeline(
  File "/workspace/env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/env/lib/python3.10/site-packages/diffusers/pipelines/audioldm/pipeline_audioldm.py", line 601, in __call__
    audio = self.mel_spectrogram_to_waveform(mel_spectrogram)
  File "/workspace/env/lib/python3.10/site-packages/diffusers/pipelines/audioldm/pipeline_audioldm.py", line 342, in mel_spectrogram_to_waveform
    waveform = self.vocoder(mel_spectrogram)
  File "/workspace/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/env/lib/python3.10/site-packages/transformers/models/speecht5/modeling_speecht5.py", line 3047, in forward
    hidden_states = self.conv_pre(hidden_states)
  File "/workspace/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/env/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/workspace/env/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 4128, 1, 64]

I did verify the shapes and they look correct: squeezing `torch.Size([1, 1, 4128, 64])` gives `torch.Size([1, 4128, 64])`.
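For anyone hitting the same shape mismatch: the mel spectrogram reaches the vocoder as a 4D tensor, while `Conv1d` only accepts 2D or 3D input. A minimal torch sketch of the squeeze the fixed pipeline performs (the `Conv1d` below is only an illustrative stand-in for the vocoder's `conv_pre`, not the real `SpeechT5HifiGan` layer):

```python
import torch
import torch.nn as nn

# Shapes taken from the traceback: the decoder emits a 4D mel spectrogram
# (batch, channels, frames, mel_bins), but Conv1d needs 3D batched input.
mel = torch.randn(1, 1, 4128, 64)

# Drop the singleton channel dimension: (1, 1, 4128, 64) -> (1, 4128, 64)
mel_3d = mel.squeeze(1)
assert mel_3d.shape == (1, 4128, 64)

# Illustrative stand-in for the vocoder's conv_pre: Conv1d runs over the
# time axis, so mel bins act as input channels -> transpose to (B, 64, T).
conv_pre = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=7, padding=3)
hidden = conv_pre(mel_3d.transpose(1, 2))
print(hidden.shape)  # -> torch.Size([1, 32, 4128])
```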

chavinlo avatar Feb 27 '23 05:02 chavinlo

Hey @chavinlo, you can fix this by installing transformers from the main branch (specifically this PR: https://github.com/huggingface/transformers/pull/21702). This PR is still a WIP, so it's not guaranteed to be stable until it's merged.

sanchit-gandhi avatar Feb 27 '23 09:02 sanchit-gandhi

> Hey @chavinlo, you can fix this by rebasing transformers onto main (specifically this commit huggingface/transformers#21702). This PR is still WIP so is not guaranteed to be stable until it's merged.

is it me, or was it already merged 5 days ago? https://github.com/huggingface/transformers/tree/main/src/transformers/models/speecht5

also thanks, installing from source fixed the issue

chavinlo avatar Feb 27 '23 12:02 chavinlo

> is it me or it has already been merged 5 days ago...

Indeed, the SpeechT5 fix was merged 5 days ago into transformers, but this PR for AudioLDM in diffusers is still a WIP! Thanks for holding tight!

sanchit-gandhi avatar Feb 27 '23 13:02 sanchit-gandhi

What would be the procedure for converting a waveform to a mel spectrogram and finally to proper latents? I suppose training AudioLDM with diffusers wouldn't be too different from fine-tuning SD Text2Image, right?

I am trying to convert the mel spectrogram into a latent using the VAE like stable diffusion, however I get different shapes from the ones that are generated in the pipeline.
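For reference, the broad recipe does mirror SD Text2Image fine-tuning: compute a log-mel spectrogram, encode it with the VAE, then train the UNet on noised latents. Below is a rough, shapes-only sketch; the random filterbank, the toy two-layer encoder, and all the numbers (hop length 160, 64 mel bins, 4x downsampling) are assumptions standing in for the real AudioLDM components, not the actual pipeline code:

```python
import torch
import torch.nn as nn

def waveform_to_logmel(waveform, n_fft=1024, hop=160, n_mels=64):
    """Toy log-mel front end (random mel filterbank as a stand-in for the
    real mel filters AudioLDM uses -- correct shapes, arbitrary values)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft, hop_length=hop, window=window,
                      return_complex=True).abs() ** 2      # (freq, frames)
    mel_fb = torch.rand(n_mels, spec.shape[0])             # hypothetical filterbank
    mel = mel_fb @ spec                                    # (n_mels, frames)
    return torch.log(torch.clamp(mel, min=1e-5))

# 5 s of 16 kHz audio -> roughly 500 mel frames at hop 160
wav = torch.randn(16000 * 5)
logmel = waveform_to_logmel(wav)                           # (64, 501)

# The pipeline treats the spectrogram as a 1-channel "image":
# (batch, 1, frames, n_mels), with time on the height axis.
x = logmel.T.unsqueeze(0).unsqueeze(0)                     # (1, 1, 501, 64)

# Stand-in for vae.encode(x).latent_dist.sample() * scaling_factor:
# two stride-2 convs give the usual 4x spatial downsampling.
toy_encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1),
    nn.Conv2d(16, 8, 3, stride=2, padding=1),
)
latents = toy_encoder(x)
print(latents.shape)   # 4x smaller in both axes than the input spectrogram
```

From there, a training loop would follow the usual diffusers text-to-image script: sample noise and timesteps, noise the latents with the scheduler, and regress the UNet's prediction with MSE.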

chavinlo avatar Feb 28 '23 23:02 chavinlo

@williamberman I think this was accidentally closed no?

patrickvonplaten avatar Mar 07 '23 11:03 patrickvonplaten

@patrickvonplaten yeah I think it happened when I merged the other PR

williamberman avatar Mar 14 '23 08:03 williamberman

Swapped `height` -> `audio_length_in_s` and slimmed down the number of fast/slow tests. Good to go on my end! Feel free to take a final look at the changes @patrickvonplaten @williamberman

sanchit-gandhi avatar Mar 17 '23 14:03 sanchit-gandhi
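To make the `height` -> `audio_length_in_s` change concrete, here is a back-of-the-envelope sketch of how a requested duration could map to a latent height. The numbers (16 kHz sampling rate, hop length 160, 4x VAE downsampling) are assumed AudioLDM-style settings for illustration, not values read from the merged code:

```python
# Hypothetical AudioLDM-style settings, for illustration only:
sampling_rate = 16_000   # Hz
hop_length = 160         # STFT hop -> 100 mel frames per second
vae_scale_factor = 4     # VAE spatial downsampling

def latent_frames(audio_length_in_s):
    """Map a requested duration to the latent 'height' the UNet sees."""
    mel_frames = int(audio_length_in_s * sampling_rate / hop_length)
    return mel_frames // vae_scale_factor

print(latent_frames(5.0))   # 500 mel frames -> 125 latent frames
```

Exposing `audio_length_in_s` rather than a raw `height` keeps the user-facing API in audio terms and hides this arithmetic inside the pipeline.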

@sanchit-gandhi think one last fast test is failing - could you check? :-)

patrickvonplaten avatar Mar 21 '23 14:03 patrickvonplaten

Test failures are unrelated - merging!

patrickvonplaten avatar Mar 23 '23 18:03 patrickvonplaten