diffusers
Add AudioLDM
Original codebase: https://github.com/haoheliu/AudioLDM
Checkpoints: https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation/tree/main/ckpt
TODOs
UNet
- [x] Convert UNet weights
- [x] Add new modelling code
- [x] Verify correctness
VAE
- [x] Convert VAE weights
- [x] Verify correctness
Scheduler
- [x] Verify correctness
CLAP Text Embedding Model
- [x] Convert CLAP weights
- [x] Verify correctness
HiFiGAN Vocoder
- [x] Convert HiFiGAN weights
- [x] Verify correctness
Pipeline
- [x] Verify correctness
- [x] Tests
Docs
- [x] Add and populate docs mdx file
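A minimal sketch to sanity-check that the converted components in the checklist above load and wire together (hedged: the pipeline class name `AudioLDMPipeline` and the Hub checkpoint id `cvssp/audioldm` are assumptions about the final state of this PR):

```python
# Hedged sketch: load the converted checkpoint and inspect the components
# from the checklist above. The checkpoint id "cvssp/audioldm" is an assumption.
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm")

print(type(pipe.text_encoder).__name__)  # CLAP text embedding model
print(type(pipe.unet).__name__)          # converted UNet
print(type(pipe.vae).__name__)           # converted VAE
print(type(pipe.vocoder).__name__)       # HiFi-GAN vocoder (SpeechT5HifiGan in transformers)
print(type(pipe.scheduler).__name__)     # noise scheduler
```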
The documentation is not available anymore as the PR was closed or merged.
Implementation matches the original ✅ Tests + clean-up TODO
Gently pinging @williamberman 🙂
Nice! All comments are small things. Looks basically good to go
Not sure if anyone else has this issue, but just today I started getting the following error:
Traceback (most recent call last):
File "/workspace/test.py", line 12, in <module>
audio = pipeline(
File "/workspace/env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/workspace/env/lib/python3.10/site-packages/diffusers/pipelines/audioldm/pipeline_audioldm.py", line 601, in __call__
audio = self.mel_spectrogram_to_waveform(mel_spectrogram)
File "/workspace/env/lib/python3.10/site-packages/diffusers/pipelines/audioldm/pipeline_audioldm.py", line 342, in mel_spectrogram_to_waveform
waveform = self.vocoder(mel_spectrogram)
File "/workspace/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/env/lib/python3.10/site-packages/transformers/models/speecht5/modeling_speecht5.py", line 3047, in forward
hidden_states = self.conv_pre(hidden_states)
File "/workspace/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/env/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/workspace/env/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 4128, 1, 64]
I did check whether the shapes are correct, and they look like this: Squeezing: torch.Size([1, 1, 4128, 64]) -> Result: torch.Size([1, 4128, 64])
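For context, a hedged illustration of the shape mismatch the traceback describes (the tensor sizes are taken from the report above; this only mirrors the kind of squeeze involved, the actual fix landed on the transformers side):

```python
import torch

# The VAE-decoded mel spectrogram is 4D, [batch, channels, frames, n_mels], but the
# SpeechT5 HiFi-GAN vocoder's conv1d expects a 2D/3D input such as [batch, frames, n_mels].
mel_spectrogram = torch.randn(1, 1, 4128, 64)

if mel_spectrogram.dim() == 4:
    mel_spectrogram = mel_spectrogram.squeeze(1)  # drop the singleton channel dim

print(mel_spectrogram.shape)  # torch.Size([1, 4128, 64]) - a valid vocoder input
```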
Hey @chavinlo, you can fix this by installing transformers from the main branch (specifically, you need this commit: https://github.com/huggingface/transformers/pull/21702). This PR is still a WIP, so it's not guaranteed to be stable until it's merged.
Is it just me, or has it already been merged 5 days ago? https://github.com/huggingface/transformers/tree/main/src/transformers/models/speecht5
Also, thanks - installing from source fixed the issue.
> Is it just me, or has it already been merged 5 days ago?

Indeed, the SpeechT5 fix was merged into transformers 5 days ago, but this PR for AudioLDM in diffusers is still a WIP! Thanks for holding tight!
What would be the procedure to convert a waveform to a mel spectrogram and finally to proper latents? I suppose training AudioLDM with diffusers wouldn't be too different from fine-tuning SD text-to-image, right?
I am trying to convert the mel spectrogram into a latent using the VAE, as in Stable Diffusion; however, I get shapes that differ from the ones generated in the pipeline.
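A rough sketch of that direction (all of this is hedged: `cvssp/audioldm`, the 16 kHz sample rate, and the mel/STFT parameters below are assumptions and have to match what the AudioLDM checkpoint was actually trained with, otherwise the latents and their shapes won't line up):

```python
# Hedged sketch: waveform -> log-mel spectrogram -> VAE latents.
# All mel/STFT parameters and the checkpoint id are assumptions; check the
# original codebase for the values the checkpoint was trained with.
import torch
import torchaudio
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("cvssp/audioldm", subfolder="vae")

waveform, sr = torchaudio.load("example.wav")           # hypothetical input file
waveform = waveform.mean(0, keepdim=True)                # mono
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=1024, hop_length=160, n_mels=64  # assumed values
)
log_mel = torch.log(mel_transform(waveform).clamp(min=1e-5))   # [1, 64, frames]

# The VAE treats the mel spectrogram like a 1-channel image: [batch, 1, frames, n_mels]
log_mel = log_mel.transpose(1, 2).unsqueeze(1)

with torch.no_grad():
    latents = vae.encode(log_mel).latent_dist.sample() * vae.config.scaling_factor
print(latents.shape)
```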
@williamberman I think this was accidentally closed no?
@patrickvonplaten yeah I think it happened when I merged the other PR
Swapped `height` -> `audio_length_in_s` and slimmed down the number of fast/slow tests. Good to go on my end! Feel free to take a final look at the changes @patrickvonplaten @williamberman
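For reference, a minimal generation call using the renamed argument (hedged: the checkpoint id and parameter values are illustrative only):

```python
# Hedged usage sketch: audio duration is requested in seconds via audio_length_in_s
# instead of a height. The checkpoint id "cvssp/audioldm" is an assumption.
import scipy.io.wavfile
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm")

audio = pipe(
    "A hammer is hitting a wooden surface",
    num_inference_steps=10,
    audio_length_in_s=5.0,
).audios[0]

# write the generated mono waveform to disk (16 kHz assumed)
scipy.io.wavfile.write("out.wav", rate=16_000, data=audio)
```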
@sanchit-gandhi I think one last fast test is failing - could you check? :-)
Test failures are unrelated - merging!