
[Community] MotionCtrl SVD

a-r-r-o-w opened this pull request 1 year ago • 23 comments

What does this PR do?

This PR adds support for the Stable Video Diffusion version of MotionCtrl as a community pipeline. This is the continuation of #6844 to keep the changes clean. This version of MotionCtrl only supports camera control. For more details, you can check out the linked issue below.

Fixes #6688.

Colab: https://colab.research.google.com/drive/17xIdW-xWk4hCAIkGq0OfiJYUqwWSPSAz?usp=sharing
Paper: https://arxiv.org/abs/2312.03641
Project site: https://wzhouxiff.github.io/projects/MotionCtrl/
Authors: @wzhouxiff @jiangyzy @xinntao Tianshui Chen Menghan Xia Ping Luo Ying Shan

Update: MotionCtrl was just featured on Two Minute Papers. What a time to be alive!

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline?
  • [x] Did you read our philosophy doc (important for complex PRs)?
  • [x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@DN6 @sayakpaul

a-r-r-o-w avatar Feb 18 '24 00:02 a-r-r-o-w

Results (manually downscaled):

I've also taken the liberty of adding linear interpolation between the camera projection embeddings and the original hidden states in SVD. Apart from that, I believe the implementation is faithful to the original. As the results show, MotionCtrl with SVD currently allows panning, zooming, and other complex camera motions. With the lerp, we can get some more object motion (see results below; motionctrl_scale=0 is essentially plain SVD because the camera projection embeddings are not used, the only difference being the attn2 layer weights of TemporalBasicTransformerBlock, which were also trained and so differ from original SVD). If you want a stronger panning/zooming effect, set motionctrl_scale higher using pipe.unet.set_motionctrl_scale(0.8). Increasing camera_speed allows for faster panning/zooming.

motionctrl_scale=0 motionctrl_scale=0.2 motionctrl_scale=0.4
motionctrl_scale=0.6 motionctrl_scale=0.8 motionctrl_scale=1.0
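For reference, a minimal usage sketch. It assumes a clone of the re-motionctrl branch so the examples.community import resolves, and that the pipeline class lives in the same community file as the UNet; the camera_pose/camera_speed call arguments and the file names are illustrative assumptions, while set_motionctrl_scale and the (num_frames, 3, 4) pose shape come from this thread.

```python
import torch
from diffusers.utils import load_image, export_to_video

# Requires a clone of the re-motionctrl branch; this module is not shipped
# in the pip package (see the import discussion below).
from examples.community.pipeline_stable_video_motionctrl_diffusion import (
    StableVideoMotionCtrlDiffusionPipeline,
)

pipe = StableVideoMotionCtrlDiffusionPipeline.from_pretrained(
    "a-r-r-o-w/motionctrl-svd", torch_dtype=torch.float16
).to("cuda")

# 0.0 is essentially plain SVD; higher values lean harder on the camera
# projection embeddings (stronger panning/zooming, less object motion).
pipe.unet.set_motionctrl_scale(0.8)

image = load_image("input.png")  # hypothetical input image
camera_pose = torch.zeros(14, 3, 4)  # (num_frames, 3, 4) extrinsics; zeros are a placeholder
frames = pipe(image, camera_pose=camera_pose, camera_speed=1.0).frames[0]
export_to_video(frames, "output.mp4")
```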

a-r-r-o-w avatar Feb 18 '24 00:02 a-r-r-o-w

@sayakpaul I thought about moving this to research_projects, but the amount of code duplication was insane (2000+ lines), so I feel it is better here with the hacky UNet. I'd like your thoughts on two things: the use_legacy flag (which enables/disables the quant_conv layer, because some newer checkpoints like DragNUWA/MotionCtrl do not use it), and whether the from examples.community.pipeline_stable_video_motionctrl_diffusion import UNetSpatioTemporalConditionMotionCtrlModel part is okay (you can only use this UNet if you clone the repo; it is not possible if you're trying to use it directly after pip install diffusers).
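To make the concern concrete, a hedged sketch of the clone-only import:

```python
# Inside a clone of the repository this resolves, because examples/ sits on
# the import path relative to the repo root:
from examples.community.pipeline_stable_video_motionctrl_diffusion import (
    UNetSpatioTemporalConditionMotionCtrlModel,
)

# After a plain `pip install diffusers`, only the diffusers package itself is
# installed, so the same import raises ModuleNotFoundError.
```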

a-r-r-o-w avatar Feb 18 '24 00:02 a-r-r-o-w

At the moment, I use this in https://github.com/a-r-r-o-w/diffusers/blob/45b8a980b8513c18d796ff3bde5d9cdbec0a5d18/examples/community/pipeline_stable_video_motionctrl_diffusion.py#L161. If storing the attribute is not ideal, I think we could get rid of this change by using module.norm1.normalized_shape instead.
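A minimal sketch of that alternative (norm1 here is the block's nn.LayerNorm, whose normalized_shape tuple already records the inner dim):

```python
# Recover the inner dim from the existing LayerNorm instead of storing a
# new attribute on the block.
time_mix_inner_dim = module.norm1.normalized_shape[0]
```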

On Sun, 18 Feb 2024 at 10:37, Dejia Xu commented on this pull request, in src/diffusers/models/attention.py (https://github.com/huggingface/diffusers/pull/7005#discussion_r1493589892):

```diff
@@ -434,6 +434,7 @@ def __init__(
         cross_attention_dim: Optional[int] = None,
     ):
         super().__init__()
+        self.time_mix_inner_dim = time_mix_inner_dim
```

Hi, is this variable by chance being used elsewhere?

a-r-r-o-w avatar Feb 18 '24 09:02 a-r-r-o-w

@DN6 @sayakpaul Ready for another review :)

a-r-r-o-w avatar Mar 21 '24 08:03 a-r-r-o-w

There have been two requests so far for a training script. I will see what I can do, but I believe it'll be hard to replicate without the hardware, and I'm a little busy with work and other PRs. The authors mention that they have something cooking for object control along with camera control for SVD, so it might be wise to wait for that before working on a training script.

a-r-r-o-w avatar Mar 21 '24 08:03 a-r-r-o-w

Hi @jhj7905. From my testing, this looks like a vae issue. When you instantiate StableVideoMotionCtrlDiffusionPipeline, do you get a warning or error related to "quant_conv" or "use_legacy"? If so, it might be because you're using the older vae config. Please make sure you apply these changes before running the example code: https://huggingface.co/a-r-r-o-w/motionctrl-svd/commit/2251fc151b42048d4d18403747300cbd015a7122

We renamed the use_legacy config parameter to use_quant_conv. If you run without that modification, I believe the quant_conv layers get initialized with random/zero weights, which is what causes the bad quality. Please let me know if there are any other problems. Thanks!
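For reference, a minimal loading sketch. AutoencoderKLTemporalDecoder is the standard SVD vae class; that the re-motionctrl branch reads a use_quant_conv entry from vae/config.json is taken from this thread, and the rest is an assumption:

```python
from diffusers import AutoencoderKLTemporalDecoder

# With "use_quant_conv": false in the checkpoint's vae/config.json (the
# renamed flag), the quant_conv layer is never created, so no randomly
# initialized weights sneak in for checkpoints trained without it.
vae = AutoencoderKLTemporalDecoder.from_pretrained(
    "a-r-r-o-w/motionctrl-svd", subfolder="vae"
)
```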

a-r-r-o-w avatar Mar 26 '24 06:03 a-r-r-o-w

Could you verify that you're using this branch, i.e. that you installed diffusers with pip install git+https://github.com/a-r-r-o-w/diffusers@re-motionctrl or similar? It is required because changes were made to the autoencoder file here.

Your error hints at the quant_conv layers still being used (their weights are being expected) despite the config saying not to. I haven't been able to reproduce this with my branch, but that error does come up if you're using, say, the main branch or the PyPI release.

a-r-r-o-w avatar Mar 26 '24 06:03 a-r-r-o-w

@a-r-r-o-w Thank you for the reply. Installing diffusers with pip install git+https://github.com/a-r-r-o-w/diffusers@re-motionctrl solved the problem. Now I'm implementing the training code on top of yours.

jhj7905 avatar Mar 26 '24 07:03 jhj7905

Awesome, glad to know that worked!

Regarding training the SVD version: since a few projection layers are the only addition for the camera motion module, I went ahead and repurposed the Stable Diffusion training script last weekend. However, when I actually tried to train, 24/32 GB GPUs were not enough (out-of-memory errors), and I lack access to better compute for testing at the moment, which has put it on hold for me. It would be awesome if you're able to create it :) The idea you mentioned in our email thread is very cool, with lots of potential applications. I hope it's a success!

a-r-r-o-w avatar Mar 26 '24 07:03 a-r-r-o-w

@a-r-r-o-w Oh, cool! How about sharing your training code? I checked the SVD training code from this repo (https://github.com/pixeli99/SVD_Xtend/tree/main) and implemented it.

jhj7905 avatar Mar 26 '24 08:03 jhj7905

This is what I used too. Only minor changes are needed: copying the UNet modifications from here and freezing the remaining params. The problem is that I run into out-of-memory errors and can't verify the correctness of the script. I will put it in a PR some time in the near future when I am able to test on an A100.

a-r-r-o-w avatar Mar 26 '24 20:03 a-r-r-o-w

@a-r-r-o-w Hello, I have implemented training code based on this repo (https://github.com/pixeli99/SVD_Xtend/tree/main) and I have two questions. First, there is a do_classifier_free_guidance flag in the StableVideoMotionCtrlDiffusionPipeline class; is it necessary in the training code? Second, is camera_pose = camera_pose.repeat(2, 1, 1) right?

jhj7905 avatar Mar 28 '24 11:03 jhj7905

@jhj7905

First, there is a do_classifier_free_guidance flag in the StableVideoMotionCtrlDiffusionPipeline class; is it necessary in the training code?

You can use SVD without classifier-free guidance by setting max_guidance_scale to a value <= 1. You are correct, it should not be necessary in training (or inference).
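For example, a minimal sketch (min_guidance_scale/max_guidance_scale follow the standard SVD pipeline signature; pipe and image are reused from the usage sketch above):

```python
# With max_guidance_scale <= 1, no unconditional branch is run, so no
# classifier-free guidance is applied.
frames = pipe(
    image,
    min_guidance_scale=1.0,
    max_guidance_scale=1.0,
).frames[0]
```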

Second, is camera_pose = camera_pose.repeat(2, 1, 1) right?

Camera pose is a tensor of shape (num_frames, 3, 4). We repeat it along the first dimension because it has to be applied to both the unconditional and conditional latents, which is always the case when max_guidance_scale > 1. But I do see a mistake on my side here: the repeat should only happen when do_classifier_free_guidance is True, because otherwise there is no unconditional latent. I will fix this soon. Is this causing problems in training when classifier-free guidance is enabled?
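Something like this minimal sketch of the intended fix (names as used above):

```python
import torch

num_frames = 14
do_classifier_free_guidance = True  # i.e. max_guidance_scale > 1

camera_pose = torch.randn(num_frames, 3, 4)  # per-frame camera extrinsics
if do_classifier_free_guidance:
    # Duplicate so the poses cover both the unconditional and the
    # conditional latents.
    camera_pose = camera_pose.repeat(2, 1, 1)
```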

a-r-r-o-w avatar Mar 28 '24 12:03 a-r-r-o-w

@a-r-r-o-w @DN6 Hello, I finished implementing the training code, but the result was quite weird, so I started to debug. First, I found that the output video differs between the original repo (https://github.com/TencentARC/MotionCtrl/tree/svd, https://huggingface.co/TencentARC/MotionCtrl/tree/main) and your repo (https://github.com/a-r-r-o-w/diffusers/tree/re-motionctrl, https://huggingface.co/a-r-r-o-w/motionctrl-svd/tree/main) for the same input image.

jhj7905 avatar Apr 03 '24 11:04 jhj7905

Can you share an example comparing their output vs. what we have here? I'm on a bit of a vacation and am not carrying my personal laptop, but I can try debugging the difference in the implementation code-wise.

Have you made sure the same seed is used? It could also be that the order of operations that depend on the random generator differs between the two implementations.
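For instance, a minimal seeding sketch on the diffusers side (pipe and image as in the usage sketch above):

```python
import torch

# A fixed Generator makes the sampled noise reproducible across runs; the two
# implementations must also consume random numbers in the same order.
generator = torch.Generator(device="cuda").manual_seed(42)
frames = pipe(image, generator=generator).frames[0]
```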

a-r-r-o-w avatar Apr 03 '24 11:04 a-r-r-o-w

@a-r-r-o-w

I used the example images below with the same camera pose (Pan Down): 12_tmprb_li4d3, 13_tmp33ht_33x

Results from the original repo (https://github.com/TencentARC/MotionCtrl/tree/svd):
https://github.com/huggingface/diffusers/assets/21155309/5a6fcc72-ea04-481a-a69c-f6611b31b896
https://github.com/huggingface/diffusers/assets/21155309/44e60d7a-3d73-4d67-a56b-e5c11a2a3919

Results from your repo (https://github.com/a-r-r-o-w/diffusers/tree/re-motionctrl): animation_13_tmp33ht_33x_temp_0404, animation_12_tmprb_li4d3_temp_0404

Here is my environment:

  1. git clone -b re-motionctrl https://github.com/a-r-r-o-w/diffusers.git
  2. pip install git+https://github.com/a-r-r-o-w/diffusers@re-motionctrl
  3. Run the inference code (https://github.com/a-r-r-o-w/diffusers/tree/re-motionctrl/examples/research_projects/motionctrl_svd)

Thank you in advance

jhj7905 avatar Apr 04 '24 01:04 jhj7905

@a-r-r-o-w Did you solve it? I'm still debugging it.

jhj7905 avatar Apr 08 '24 01:04 jhj7905

@a-r-r-o-w Did you solve it? I'm still debugging it.

Hi. I'm on a bit of a vacation and am not carrying my personal laptop to test things out. Apologies for the delay... If you're able to find the mistake, please feel free to fork my branch and push changes. I should be freer in 2-3 days to figure out the problems.

a-r-r-o-w avatar Apr 08 '24 05:04 a-r-r-o-w

@a-r-r-o-w Oh, I see... Thank you for the reply, and have a good vacation!! If I find the mistake, I'll fork the branch!

jhj7905 avatar Apr 08 '24 06:04 jhj7905

@jhj7905 When I run the inference code I get an AttributeError: 'TemporalBasicTransformerBlock' object has no attribute 'time_mix_inner_dim'. Do you know how to solve it?

T0L0ve avatar Apr 10 '24 04:04 T0L0ve

@jhj7905 When I run the inference code I get an AttributeError: 'TemporalBasicTransformerBlock' object has no attribute 'time_mix_inner_dim'. Do you know how to solve it?

Did you install diffusers from my branch? I'm guessing that could be the issue. Try:

pip install git+https://github.com/a-r-r-o-w/diffusers@re-motionctrl

a-r-r-o-w avatar Apr 10 '24 04:04 a-r-r-o-w

@jhj7905 When I run the inference code I get an AttributeError: 'TemporalBasicTransformerBlock' object has no attribute 'time_mix_inner_dim'. Do you know how to solve it?

You can solve it by using pip install git+https://github.com/a-r-r-o-w/diffusers@re-motionctrl

jhj7905 avatar Apr 11 '24 00:04 jhj7905

@DN6 @asomoza @jhj7905 Requesting a review. I think this is ready to merge.

a-r-r-o-w avatar Apr 23 '24 07:04 a-r-r-o-w

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 14 '24 15:09 github-actions[bot]