[Pipeline] Extending Stable Diffusion for generating videos
Since Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation has been out for some time now, it'd be cool to officially have it supported in Diffusers 🧨
The best part is the official repository (https://github.com/showlab/Tune-A-Video) itself builds on top of Diffusers.
Architecture-wise, the main change is to inflate the UNet to operate spatiotemporally. This is implemented in the `UNet3DConditionModel`.
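For context, the inflation in the reference implementation is fairly lightweight: the 2D convolutions are applied frame-by-frame by folding the frame axis into the batch axis, with temporal attention layers added on top. A minimal sketch of that idea (class name and exact reshapes here are illustrative, not the exact code in the repo):

```python
import torch
import torch.nn as nn


class InflatedConv2d(nn.Conv2d):
    """Apply a 2D convolution to a video tensor of shape (b, c, f, h, w)
    by folding the frame axis into the batch axis."""

    def forward(self, x):
        b, c, f, h, w = x.shape
        # (b, c, f, h, w) -> (b * f, c, h, w)
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = super().forward(x)
        # back to (b, c_out, f, h_out, w_out)
        _, c_out, h_out, w_out = x.shape
        return x.reshape(b, f, c_out, h_out, w_out).permute(0, 2, 1, 3, 4)


# e.g. an 8-frame latent video: (batch, channels, frames, height, width)
conv = InflatedConv2d(4, 320, kernel_size=3, padding=1)
out = conv(torch.randn(1, 4, 8, 64, 64))  # -> (1, 320, 8, 64, 64)
```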
@zhangjiewu will it be possible to publish a few weights on the Hugging Face Hub for the community to try out quickly? Happy to help with the process :)
We're more than happy to help if a community member wants to pick this up. As it will be the first end-to-end video pipeline in Diffusers, I'm very excited about it.
@sayakpaul I'd like to work on this.
Please feel free to proceed right away! We are more than happy to help.
Seems like there are a few pre-trained models here already: https://huggingface.co/Tune-A-Video-library
@sayakpaul what is the expected outcome?
My understanding is:
- We make the `TuneAVideoPipeline` and its dependency, `UNet3DConditionModel`, available via diffusers.
- We provide some trained `TuneAVideoPipeline`-compatible checkpoints for users to play around with.
- We add tests.

Let me know if I'm missing something.
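For reference, I imagine end usage would look roughly like this. The import paths follow the official repo's layout, the checkpoint name is only illustrative (see the Tune-A-Video-library org linked above for actual models), and the call signature may change once the pipeline lands in diffusers:

```python
import torch
# Hypothetical import paths, mirroring the official Tune-A-Video repo layout.
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline

# Load a fine-tuned inflated UNet from the Hub and pair it with the
# base Stable Diffusion weights it was tuned from.
unet = UNet3DConditionModel.from_pretrained(
    "Tune-A-Video-library/a-man-is-surfing",  # illustrative checkpoint name
    subfolder="unet",
    torch_dtype=torch.float16,
)
pipe = TuneAVideoPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16
).to("cuda")

# Generate a short clip; video_length is the number of frames.
video = pipe(
    "a panda is surfing",
    video_length=8,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=7.5,
).videos
```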
Yes, that is correct. However, for the pipeline to work, we first need to add the inflated classes of UNet and its related components, such as attention.
Ohh Right. Well, I'll start and open a draft PR.
@Abhinay1997 FYI, some findings based on my own experiments with Tune-A-Video:
- Using prior preservation loss (as implemented in https://github.com/bryandlee/Tune-A-Video/blob/main/train.py) helps a lot with the relevance of the output videos and in avoiding overfitting. This is not part of the official Tune-A-Video implementation, but it's something you might want to consider when building the training script.
- Using more training frames (I tried up to 32) resulted in a more stable loss and higher-quality output. So, definitely don't hardcode it to the 8 frames found in the official implementation.
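In case it helps, the prior-preservation term is the same DreamBooth-style regularization used elsewhere in diffusers' training scripts. A rough sketch, assuming the batch stacks instance samples first and prior (class) samples second:

```python
import torch
import torch.nn.functional as F


def prior_preservation_loss(model_pred, target, prior_loss_weight=1.0):
    """Split a doubled batch (instance samples first, prior samples second)
    and add a weighted MSE term for the prior half."""
    model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
    target, target_prior = torch.chunk(target, 2, dim=0)

    instance_loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
    prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")

    return instance_loss + prior_loss_weight * prior_loss
```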
Thank you for sharing your experience, definitely very helpful to keep in mind. However, for the first PR, we won't be considering the training script.
But when we do consider the training script, would you be up for contributing a PR?
@sayakpaul Yes, definitely. I'll keep an eye on the PR that @Abhinay1997 will open.
@jorgemcgomes thanks for the inputs. Will keep this in mind. I'm sure we'll need these details down the line.
Hi @sayakpaul, added a draft PR #2455.
So, a few follow-up questions:
- We'll be replacing `einops.rearrange` and `einops.repeat` with equivalent torch operations where applicable, correct? I see einops being commented out in other files too. Just wanted to confirm.
- The `BasicTransformerBlock` in Tune-A-Video's `attention.py` uses `SparseCausalAttention`, but diffusers already has a `BasicTransformerBlock` with `CrossAttention`. Can we make the `BasicTransformerBlock` configurable then? For now I am calling it `BasicSparseTransformerBlock` to differentiate.
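For context, my understanding of `SparseCausalAttention` is that it keeps the usual attention math but builds the key/value tokens from a sparse set of frames: the first frame plus the previous frame. A rough sketch of that frame-gathering step, with the einops calls written out as plain torch ops (function and tensor names are illustrative, not the exact code in the PR):

```python
import torch


def build_sparse_causal_kv(hidden_states, video_length):
    """Given per-frame tokens of shape ((b * f), n, d), build key/value inputs
    where each frame attends to frame 0 and its previous frame."""
    bf, n, d = hidden_states.shape
    b = bf // video_length

    # einops: rearrange(x, "(b f) n d -> b f n d", f=video_length)
    x = hidden_states.reshape(b, video_length, n, d)

    former = torch.arange(video_length) - 1
    former[0] = 0  # frame 0 has no previous frame, so it attends to itself
    first = torch.zeros(video_length, dtype=torch.long)

    # concatenate the "first frame" and "previous frame" tokens along the token axis
    kv = torch.cat([x[:, first], x[:, former]], dim=2)

    # einops: rearrange(kv, "b f n d -> (b f) n d")
    return kv.reshape(b * video_length, 2 * n, d)
```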
I replied on your PR :)
Hi folks, thank you for your great efforts in integrating Tune-A-Video into diffusers. We have made some updates to our implementation, resulting in improved consistency. We hope these changes will be helpful.
Please let me know if there is anything I can assist you with. :)
Thanks for letting me know @zhangjiewu. I'll cross check to see if I have missed anything. :)
@sayakpaul, @zhangjiewu I ran into an issue while testing the ported pipeline. I'm able to generate the output, but the result is noise.
Any advice for debugging? I guess I made a mistake while porting the `SparseCausalAttention` module. Will cross-verify that.
Warning! Flashing image: https://user-images.githubusercontent.com/24771261/222959721-dda8bc27-02fe-4f72-88a5-53ee68200056.gif
hi @Abhinay1997, have you been able to solve the problem? If not, could you share your code with me so that I can assist you further?
@zhangjiewu didn't get a chance :( Can you please have a quick glance at this: https://github.com/Abhinay1997/diffusers/blob/tune_a_video_port/src/diffusers/models/cross_attention.py#L673
For now, I was just trying to get it to work. Note that I have replaced einops with torch equivalents.
I'll try using your code as-is (except for the einops bits) and the older cross-attention module to validate.
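Roughly, the check I have in mind is a numerical parity test between the reference module and the ported one on the same random inputs. The class names and constructor arguments below are placeholders standing in for the actual reference and ported modules:

```python
import torch

torch.manual_seed(0)

# Placeholders: the original Tune-A-Video attention module vs. the ported diffusers one.
ref_attn = ReferenceSparseCausalAttention(query_dim=320, heads=8, dim_head=40).eval()
ported_attn = PortedSparseCausalAttention(query_dim=320, heads=8, dim_head=40).eval()
ported_attn.load_state_dict(ref_attn.state_dict())  # identical weights

video_length = 8
hidden_states = torch.randn(2 * video_length, 64, 320)  # ((b * f), tokens, dim)

with torch.no_grad():
    ref_out = ref_attn(hidden_states, video_length=video_length)
    ported_out = ported_attn(hidden_states, video_length=video_length)

print(torch.allclose(ref_out, ported_out, atol=1e-5))
```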
@zhangjiewu never mind. It was a silly mistake. Fixed it. Here's a sample with the ported code:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.