
[Pipeline] Extending Stable Diffusion for generating videos

Open sayakpaul opened this issue 2 years ago • 9 comments

Since Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation has been out for some time now, it'd be cool to have it officially supported in Diffusers 🧨

The best part is the official repository (https://github.com/showlab/Tune-A-Video) itself builds on top of Diffusers.

Architecture-wise, the main change is to inflate the UNet to operate spatiotemporally. This is implemented in the UNet3DConditionModel.
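For anyone new to the approach: "inflating" here roughly means the pretrained 2D blocks are applied frame-by-frame by folding the frame axis into the batch axis, with temporal attention added on top. A minimal sketch of the idea (the class name and shapes are illustrative, not the actual diffusers or Tune-A-Video API):

```python
import torch
import torch.nn as nn
from einops import rearrange


class InflatedConv2d(nn.Conv2d):
    """Pretrained 2D conv applied to a video tensor of shape (batch, channels, frames, h, w)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frames = x.shape[2]
        x = rearrange(x, "b c f h w -> (b f) c h w")  # fold frames into the batch dim
        x = super().forward(x)                        # reuse the 2D Stable Diffusion weights
        return rearrange(x, "(b f) c h w -> b c f h w", f=frames)


# quick shape check
conv = InflatedConv2d(4, 320, kernel_size=3, padding=1)
video_latents = torch.randn(1, 4, 8, 64, 64)  # 8 frames of 64x64 latents
print(conv(video_latents).shape)              # torch.Size([1, 320, 8, 64, 64])
```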

@zhangjiewu will it be possible to publish a few weights on the Hugging Face Hub for the community to try out quickly? Happy to help with the process :)

We're more than happy to help if a community member wants to pick this up. As it will be the first end-to-end video pipeline in Diffusers, I'm very excited about it.

sayakpaul avatar Feb 20 '23 10:02 sayakpaul

@sayakpaul I'd like to work on this.

Abhinay1997 avatar Feb 21 '23 05:02 Abhinay1997

Please feel free to proceed right away! We are more than happy to help.

sayakpaul avatar Feb 21 '23 05:02 sayakpaul

Seems like there are a few pre-trained models here already: https://huggingface.co/Tune-A-Video-library
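For anyone who wants to try those checkpoints before the diffusers port lands, here's a rough usage sketch based on the official repo's tuneavideo package (the repo id and exact arguments are illustrative assumptions; check the Hub org and the Tune-A-Video README for the real ones):

```python
import torch
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline

base_model = "CompVis/stable-diffusion-v1-4"           # base weights the UNet was inflated from
tuned_model = "Tune-A-Video-library/a-man-is-surfing"  # assumed checkpoint name, for illustration

# Load the fine-tuned spatiotemporal UNet and plug it into the pipeline.
unet = UNet3DConditionModel.from_pretrained(tuned_model, subfolder="unet", torch_dtype=torch.float16)
pipe = TuneAVideoPipeline.from_pretrained(base_model, unet=unet, torch_dtype=torch.float16).to("cuda")

video = pipe(
    "a panda is surfing",
    video_length=8,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=7.5,
).videos  # video tensor; see the official README for saving it as a GIF/grid
```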

sayakpaul avatar Feb 21 '23 05:02 sayakpaul

@sayakpaul what is the expected outcome ?

My understanding is:

  1. We make the TuneAVideoPipeline and its dependency, UNet3DConditionModel, available via diffusers.
  2. We provide some trained TuneAVideoPipeline compatible checkpoints for users to play around with.
  3. We add tests

Let me know if I'm missing something.

Abhinay1997 avatar Feb 21 '23 12:02 Abhinay1997

Yes, that is correct. However, for the pipeline to work, we first need to add the inflated classes of UNet and its related components, such as attention.

sayakpaul avatar Feb 21 '23 12:02 sayakpaul

Ohh right. Well, I'll start and open a draft PR.

Abhinay1997 avatar Feb 21 '23 12:02 Abhinay1997

@Abhinay1997 FYI, some findings based on my own experiments with Tune-A-Video:

  1. Using prior preservation loss (as implemented in https://github.com/bryandlee/Tune-A-Video/blob/main/train.py) helps a lot with the relevance of the output videos and in avoiding overfitting. This is not part of the official Tune-A-Video implementation, but it's something you might want to consider when building the training script (a minimal sketch of the loss follows this list).

  2. Using more training frames (I tried up to 32) resulted in a more stable loss and higher quality output. So, definitely don't hardcode it to the 8 frames found in the official implementation.
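Regarding point 1, in case it helps whoever picks up the training script later: prior preservation here follows the DreamBooth-style recipe, i.e. the denoising loss on the tuned video is combined with a loss on samples generated from the base model for a generic class prompt. A minimal sketch of the combined loss (not the exact code from the linked train.py):

```python
import torch
import torch.nn.functional as F


def loss_with_prior_preservation(noise_pred, noise_target, prior_loss_weight=1.0):
    """noise_pred / noise_target: (2B, C, F, H, W), where the first half of the batch holds
    the instance (tuned video) samples and the second half the class/prior samples."""
    instance_pred, prior_pred = noise_pred.chunk(2, dim=0)
    instance_target, prior_target = noise_target.chunk(2, dim=0)

    instance_loss = F.mse_loss(instance_pred.float(), instance_target.float())
    prior_loss = F.mse_loss(prior_pred.float(), prior_target.float())
    return instance_loss + prior_loss_weight * prior_loss
```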

jorgemcgomes avatar Feb 21 '23 17:02 jorgemcgomes

Thank you for sharing your experience, definitely very helpful to keep in mind. However, for the first PR, we won't be considering the training script.

But when we do consider the training script, would you be up for contributing a PR?

sayakpaul avatar Feb 21 '23 17:02 sayakpaul

@sayakpaul Yes, definitely. I'll keep an eye on the PR that @Abhinay1997 will open 👍

jorgemcgomes avatar Feb 21 '23 17:02 jorgemcgomes

@jorgemcgomes thanks for the inputs. Will keep this in mind. I'm sure we'll need these details down the line.

Abhinay1997 avatar Feb 22 '23 02:02 Abhinay1997

Hi @sayakpaul added a draft PR #2455.

So, a few follow-up questions:

  1. We'll be replacing einops.rearrange and einops.repeat with equivalent torch operations where applicable, correct? I see einops being commented out in other files too. Just wanted to confirm (see the mapping sketch after this list).

  2. The BasicTransformerBlock in TuneAVideo's attention.py uses SparseCausalAttention, but diffusers already has a BasicTransformerBlock with CrossAttention. Can we make the BasicTransformerBlock configurable then? For now I am calling it BasicSparseTransformerBlock to differentiate.
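Re: question 1, the two patterns that show up most in the Tune-A-Video code map to plain torch ops like this (shapes are illustrative):

```python
import torch

b, c, f, h, w = 2, 320, 8, 32, 32
x = torch.randn(b, c, f, h, w)

# einops.rearrange(x, "b c f h w -> (b f) c h w")
y = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)

# einops.rearrange(y, "(b f) c h w -> b c f h w", f=f)
x_back = y.reshape(b, f, c, h, w).permute(0, 2, 1, 3, 4)

# einops.repeat(mask, "b n -> (b f) n", f=f)
mask = torch.ones(b, 77)
mask_rep = mask.repeat_interleave(f, dim=0)
```

Re: question 2, the sparse-causal part only changes where the keys/values come from: frame t attends to frame 0 and frame t-1, while the queries still come from frame t itself. A sketch of the key/value gathering (hypothetical helper name, not the actual module):

```python
import torch


def sparse_causal_key_value(states, video_length):
    """states: ((b*f), seq, dim), with frames folded into the batch dimension."""
    bf, seq, dim = states.shape
    x = states.reshape(bf // video_length, video_length, seq, dim)

    first = x[:, [0] * video_length]                    # frame 0 for every position
    former = x[:, [0] + list(range(video_length - 1))]  # previous frame (frame 0 at t=0)

    kv = torch.cat([first, former], dim=2)              # concat along the sequence axis
    return kv.reshape(bf, 2 * seq, dim)
```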

Abhinay1997 avatar Feb 22 '23 03:02 Abhinay1997

I replied on your PR :)

sayakpaul avatar Feb 22 '23 03:02 sayakpaul

Hi folks, thank you for your great efforts in integrating Tune-A-Video into diffusers. We have made some updates to our implementation, resulting in improved consistency. We hope these changes will be helpful.

Please let me know if there is anything I can assist you with. :)

zhangjiewu avatar Feb 22 '23 08:02 zhangjiewu

Thanks for letting me know @zhangjiewu. I'll cross check to see if I have missed anything. :)

Abhinay1997 avatar Feb 22 '23 13:02 Abhinay1997

@sayakpaul, @zhangjiewu I ran into an issue while testing the ported pipeline. I'm able to generate output, but the result is noise.

Any advice for debugging? I guess I made a mistake while porting the SparseCausalAttention module. Will cross-verify that.
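One generic way to narrow this down is to run the original Tune-A-Video module and the ported one on identical inputs and compare the outputs layer by layer. A small helper sketch (nothing diffusers-specific, the names are made up):

```python
import torch


def compare_outputs(reference_module, ported_module, *inputs, atol=1e-5):
    """Run two modules on the same inputs and report how far apart the outputs are."""
    reference_module.eval()
    ported_module.eval()
    with torch.no_grad():
        ref_out = reference_module(*inputs)
        new_out = ported_module(*inputs)
    max_diff = (ref_out - new_out).abs().max().item()
    print(f"allclose: {torch.allclose(ref_out, new_out, atol=atol)}  max diff: {max_diff:.2e}")
    return max_diff
```

(Both modules need the same weights first, e.g. by copying one state_dict into the other, otherwise the comparison is meaningless.)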

Warning! Flashing image: https://user-images.githubusercontent.com/24771261/222959721-dda8bc27-02fe-4f72-88a5-53ee68200056.gif

Abhinay1997 avatar Mar 05 '23 12:03 Abhinay1997

hi @Abhinay1997, have you been able to solve the problem? If not, could you share your code with me so that I can assist you further?

zhangjiewu avatar Mar 09 '23 09:03 zhangjiewu

@zhangjiewu didn't get a chance :( Can you please have a quick glance at this: https://github.com/Abhinay1997/diffusers/blob/tune_a_video_port/src/diffusers/models/cross_attention.py#L673

For now, I was just trying to get it to work. Note that I have replaced einops with torch equivalents.

I'll try using your code as-is (except for the einops bits) and the older cross attention module to validate.

Abhinay1997 avatar Mar 09 '23 13:03 Abhinay1997

@zhangjiewu never mind. It was a silly mistake. Fixed it. Here's a sample with the ported code:

[sample GIF: ironman_surf]

Abhinay1997 avatar Mar 18 '23 16:03 Abhinay1997

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 12 '23 15:04 github-actions[bot]