diffusers
sliding window support for animatediff vid2vid pipeline
What does this PR do?
- adds support for sliding window contexts to the animatediff video2video pipeline
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline?
- [ ] Did you read our philosophy doc (important for complex PRs)?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@a-r-r-o-w
Before writing any code, I wanted to clarify the requirements, as my understanding of the sliding window technique in this context isn't clear.
I've looked at a few implementations in:
Questions:
- is it necessary to support stride lengths? that seems to only be required if we're going to later do frame interpolation, correct?
- let's say we use a context length of 16, with an overlap of 4 frames, and a total input video length of 60 frames. Is this the high-level pseudocode?
- generate a list of lists, with each inner list having 16 frames and a 4-frame overlap with the previous inner list. The first inner list will be frames with indices [0...15], the second will be [12...27] (with 12-15 overlapping), etc. (see the sketch after this list)
- iterate through the outer list, calling the vid2vid pipeline on each inner list to generate the resulting frames
- collate the final video by taking all the unique frames generated (the overlapping frames will be generated twice; do we just pick either result for the overlapping frames, do we have to combine the results somehow, or are we supposed to do 2 passes, with the output of the first pass feeding in as an input to the second pass?)
- depending on the motion model, we should have different defaults for the context length, right (16 for SD1.5 based models, 32 for AnimateDiffXL)? Should I just do some introspection to get the motion model and have a mapping dictionary?
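To make the windowing I have in mind concrete, here is a minimal sketch of just the index bookkeeping (plain Python, no diffusers API; the helper name is made up):

def make_windows(num_frames=60, context_length=16, overlap=4):
    # advance by context_length - overlap frames per window; clamp the last window
    step = context_length - overlap
    windows, start = [], 0
    while start < num_frames:
        end = min(start + context_length, num_frames)
        windows.append(list(range(start, end)))
        if end == num_frames:
            break
        start += step
    return windows

# make_windows() -> [0..15], [12..27], [24..39], [36..51], [48..59]
# (the last window is shorter; a real implementation might pad it or snap it back to 16 frames)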
cc @rmasiso since they were looking into it too here.
I will be referring to this implementation in this comment. Other implementations are the same or similar. The overall idea is to accumulate all generated samples and then average them out by dividing by the number of times each frame latent was processed. Different frames can be processed a different number of times due to how the voodoo-magic context_scheduler function works (it finally became understandable at my fourth glance).
is it necessary to support stride lengths? that seems to only be required if we're going to later do frame interpolation, correct?
I believe stride is necessary as it allows frames that are farther apart to remain temporally consistent. The code being referred to applies stride as powers of two, i.e. the i-th stride level uses a spacing of 2^(i - 1), I think. That is (see the short sketch after this list):
- if stride is 1, or context_step being allowed to be [1], we get windows like: [0, 1, 2, 3, 4, 5, 6, 7]
- if stride is 2, or context_step being allowed to be [1, 2], we get the above as well as windows like: [0, 2, 4, 6, 8, 10, 12, 14] (this improves temporal consistency between these frames)
- if stride is 3, or context_step being allowed to be [1, 2, 4], we get both of the above as well as windows like: [0, 4, 8, 12, 16, ...]
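To illustrate just those stride levels, a tiny sketch with toy numbers (32 frames, windows of 8; this is not the referenced implementation):

num_frames, context_length = 32, 8
for level in (1, 2, 3):
    spacing = 2 ** (level - 1)  # 1, 2, 4
    for start in range(0, num_frames, spacing * context_length):
        print(level, [start + i * spacing for i in range(context_length)])
# level 1 -> [0..7], [8..15], [16..23], [24..31]
# level 2 -> [0, 2, 4, ..., 14], [16, 18, ..., 30]
# level 3 -> [0, 4, 8, ..., 28]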
let's say we use a context length of 16, with an overlap of 4 frames, and a total input video length of 60 frames. Is this the high-level pseudocode?
do we just pick any result from the overlapping frames, or do we have to combine the results somehow, or are we supposed to do 2 passes with the overlapping frames,
We don't pick any specific generated latent for each frame; instead, we accumulate all latents for every frame and take the per-frame average. From my testing with the original code by ashen-sensored, this results in better generations than just taking any single generation for a frame. The last sampled latent for each frame is also almost good enough (there is some jumpiness/flickering), but averaging works better.
The high-level idea is mostly correct. Let's take a smaller example and understand what happens (I'm using num_frames=8, context_size=4 (aka max_motion_seq_length in config.json), overlap=2 and stride=2):
latents = ...  # tensor of shape (batch_size, num_latent_channels, num_frames, height, width)
latents_accumulated = ...  # zeros, same shape as latents
count_num_process_times = [0] * num_frames  # how many times each frame gets denoised
# pseudocode: indexing the frame dimension is written loosely here for brevity
for context_indices in [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 0, 1], [0, 2, 4, 6]]:
    current_latents = latents[context_indices]
    processed_latents = process_animatediff(latents)
    latents_accumulated[context_indices] += processed_latents
    count_num_process_times[context_indices] += 1
final_latents = latents_accumulated / count_num_process_times  # per-frame average
Notice there is a cyclic dependency between the [6, 7, 0, 1] frames. This can lead to some loss in quality; I'm not too sure, but I've read that it could be bad, and it makes sense intuitively: why should later frames affect earlier ones? The linked code also looks really confusing and could be simplified into something more people can understand at first glance, using one or two for-loops (to handle stride without ordered_halving or other tricks) and good variable naming; a rough sketch of that follows below.
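Something along these lines, for illustration only (this is not the referenced implementation; the function and argument names are made up):

def simple_context_scheduler(num_frames, context_length, overlap, max_stride_level=0):
    # Pass 1: contiguous overlapping windows that cover every frame at least once.
    step = context_length - overlap
    start = 0
    while start < num_frames:
        end = min(start + context_length, num_frames)
        yield list(range(start, end))
        if end == num_frames:
            break
        start += step
    # Pass 2 (optional): strided windows (spacing 2, 4, ...) so that frames that are
    # far apart also get denoised together, without ordered_halving or other tricks.
    for level in range(1, max_stride_level + 1):
        spacing = 2 ** level
        for offset in range(spacing):
            strided = list(range(offset, num_frames, spacing))
            for i in range(0, len(strided), context_length):
                chunk = strided[i : i + context_length]
                if len(chunk) > 1:
                    yield chunk

# list(simple_context_scheduler(8, 4, 2, max_stride_level=1)) ->
# [0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], then strided [0, 2, 4, 6] and [1, 3, 5, 7]

Note that, unlike the toy list above, this never yields a wrap-around window like [6, 7, 0, 1].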
depending on the motion model, we should have different defaults for the context length, right (16 for SD1.5 based models, 32 for AnimateDiffXL)? Should I just do some introspection to get the motion model and have a mapping dictionary?
context_length would just be motion_adapter.config.max_motion_seq_length from config.json if I understand correctly.
I think what the diffusers team would like to have is methods that can enable/disable long-context generation, with __call__ dispatching to the appropriate helper methods. Changing the implementation directly and adding extra parameters to __call__ would be confusing to newer users, especially because this technique is itself a little confusing. Also, the sliding window technique can be added to all AnimateDiff-related pipelines, not just vid2vid, I think (a rough sketch of that enable/disable pattern follows below).
cc @DN6 @sayakpaul
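For what it's worth, a minimal sketch of that enable/disable pattern; none of these method or helper names exist in diffusers, they are purely hypothetical:

class SlidingWindowMixin:
    # hypothetical mixin showing only the enable/disable + dispatch pattern
    def enable_sliding_window_inference(self, context_length=16, overlap=4, stride=1):
        # stash the settings; __call__ checks them and dispatches accordingly
        self._sliding_window_kwargs = dict(
            context_length=context_length, overlap=overlap, stride=stride
        )

    def disable_sliding_window_inference(self):
        self._sliding_window_kwargs = None

    def __call__(self, *args, **kwargs):
        if getattr(self, "_sliding_window_kwargs", None):
            # hypothetical helper implementing the accumulate-and-average loop
            return self._call_with_sliding_window(*args, **kwargs)
        return self._call_default(*args, **kwargs)  # hypothetical helper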
I think what the diffusers team would like to have is methods that can enable/disable long-context generation, with __call__ dispatching to the appropriate helper methods. Changing the implementation directly and adding extra parameters to __call__ would be confusing to newer users, especially because this technique is itself a little confusing. Also, the sliding window technique can be added to all AnimateDiff-related pipelines, not just vid2vid, I think.
Yeah, your understanding is correct. However, I will let @DN6 comment on it.
thanks for the really helpful context @a-r-r-o-w.
- in your sample code, on the first line, would the 'latents' variable be the output of some previous process? the latents from a single pass of all the frames?
- I see that the __call__ method has output_type and latents parameters which can be leveraged for this use case, so that makes sense
- I'll try to take a look at the stride length code first and get that going
in your sample code, on the first line, would the 'latents' variable be the output of some previous process? the latents from a single pass of all the frames?
latents will just be some random tensor (for txt2vid) or image/video-encoded latents (for img2vid/vid2vid) of shape (batch_size, num_channels, num_frames, height // vae_scale_factor, width // vae_scale_factor). These latents will be denoised based on the context_indices, averaged, and decoded to obtain the final video.
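For concreteness, a tiny sketch of preparing latents of that shape (the numbers are placeholders; for img2vid/vid2vid the tensor would come from VAE-encoding the input frames instead):

import torch

batch_size, num_channels, num_frames = 1, 4, 60
height = width = 512
vae_scale_factor = 8
latents = torch.randn(
    batch_size, num_channels, num_frames,
    height // vae_scale_factor, width // vae_scale_factor,
)  # txt2vid case: pure noise in the latent space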
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Cc @DN6
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
The variable current_latents is not used anywhere. Do I understand correctly that the next line should be processed_latents = process_animatediff(current_latents) instead of processed_latents = process_animatediff(latents)?
@JosefKuchar My bad, typo. That is correct
is this ready for a review?
Hi all. I took a look into some of the different approaches for longer video generation with AnimateDiff. The ones available in AnimateDiff-Evolved use the approach being discussed here: a sliding window that averages out the latents of overlapping frames. This seems to be inspired by the MultiDiffusion approach for generating panoramic images: https://github.com/huggingface/diffusers/blob/66f94eaa0c68a893b2aba1ec9f79ee7890786fba/src/diffusers/pipelines/stable_diffusion_panorama/pipeline_stable_diffusion_panorama.py#L763-L772
Except we apply it temporally rather than spatially.
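Roughly, the analogy looks like this (a sketch, not the panorama pipeline's actual code; frame_windows and denoise_one_step are placeholders):

import torch

def average_over_frame_windows(latents, frame_windows, denoise_one_step):
    # latents: (batch, channels, num_frames, height, width)
    # frame_windows: list of lists of frame indices from some context scheduler
    value = torch.zeros_like(latents)
    count = torch.zeros_like(latents)
    for frame_indices in frame_windows:
        # denoise only the frames in this window, then accumulate the result
        value[:, :, frame_indices] += denoise_one_step(latents[:, :, frame_indices])
        count[:, :, frame_indices] += 1
    # average overlapping frames; same value/count trick as the panorama pipeline
    return torch.where(count > 0, value / count, value)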
Another approach is FreeNoise, which also uses a sliding window, but applies it in the layers of the motion modules https://github.com/arthur-qiu/FreeNoise-AnimateDiff/blob/e01d82233c595ce22f1a5eba487911c345ce7b5b/animatediff/models/motion_module.py#L262-L280
FreeNoise seems like a more principled approach, and avoids relying on a "magic" context scheduler. I haven't compared the quality of FreeNoise vs Context Scheduler though.
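Very roughly, the windowing inside a motion module looks something like this (a sketch only, not the FreeNoise code; uniform averaging of overlaps here, whereas the original may weight them differently):

import torch

def windowed_temporal_attention(hidden_states, temporal_attn, window_size=16, stride=4):
    # hidden_states: (batch * height * width, num_frames, channels), layout assumed;
    # temporal_attn is a placeholder for the module's temporal self-attention block
    num_frames = hidden_states.shape[1]
    starts = list(range(0, max(num_frames - window_size, 0) + 1, stride))
    if starts[-1] + window_size < num_frames:
        starts.append(num_frames - window_size)  # make sure the tail frames are covered
    output = torch.zeros_like(hidden_states)
    counts = torch.zeros(num_frames, dtype=hidden_states.dtype, device=hidden_states.device)
    for start in starts:
        end = min(start + window_size, num_frames)
        # temporal self-attention is run only within this window of frames
        output[:, start:end] += temporal_attn(hidden_states[:, start:end])
        counts[start:end] += 1
    return output / counts.view(1, -1, 1)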
Additionally, the Context Scheduler approach can theoretically handle an infinitely long video sequence. A very long sequence of latents can be held in RAM, and only the context latents with a fixed length go through the forward pass of the model.
With FreeNoise, breaking the long sequence up into context latents only happens in the motion modules, so the other UNet layers have to deal with the longer sequence. We could do some work on the UNet Motion blocks to enable chunked inference, e.g. something similar to unet.enable_feed_forward_chunking but for latent sequences, or just enable chunked inference in the blocks by default.
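A minimal sketch of what chunking over the frame/sequence axis could look like (illustrative only, not an existing diffusers method; it only works for modules that act on each position independently):

import torch

def chunked_forward(module, hidden_states, chunk_size, dim):
    # split the long sequence along `dim`, run the module on each chunk,
    # and concatenate the results; peak memory scales with chunk_size
    chunks = hidden_states.split(chunk_size, dim=dim)
    return torch.cat([module(chunk) for chunk in chunks], dim=dim)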
Any update on the progress of this?
A better approach is to also go through the PRs sometimes to get a head start if a feature has been in motion to be shipped :-)
If @DN6 is busy with other things and does not have the bandwidth for this right now, I'll be happy to eventually pick this up on a weekend when I find time. But please feel free to PR if you'd like to take this up. AFAICT, there will not be too many code changes and most of it can be adapted directly from their repo. It'd be preferable to have enable_() and disable_() methods for doing it. It's also something the community has been using for a while so I think there should be discussions or improved implementations for this that you could try looking for.
Would like to work on this, but unfortunately I'm blocked by https://github.com/huggingface/diffusers/issues/7378#issue-2193812158. I believe that in order to work on this, I'd need to be able to decode + encode the latents independently.
Would love to see a FreeNoise implementation. I've hacked together an implementation of basic chunking above, but the results are not that great (only an openpose controlnet, no vid2vid). Unfortunately I don't have the skills for porting the original FreeNoise AnimateDiff implementation (changes here: https://github.com/arthur-qiu/FreeNoise-AnimateDiff/commit/9abf5ed9ac9a6efd06aa0a0ea60f2d9790ea72a5) to diffusers.
Ok, so I was able to port the FreeNoise AnimateDiff code to diffusers; results below (128 frames). @DN6 @a-r-r-o-w Shall I open a separate PR? Any hints for implementing chunked inference on the UNet motion blocks? 128 frames fits in 24 GB VRAM, 256 frames overflows (I think that is for a separate PR anyway; I'd love to implement it).
prompt "Animated man in a suit on a beach", using community animatediff controlnet pipeline, AnimateDiff lighting 4 steps version (5 step inference)
https://github.com/huggingface/diffusers/assets/12010072/53acb832-2e83-4cd3-a0e8-f76af87832b8
https://github.com/huggingface/diffusers/assets/12010072/b0bd3384-e84c-4e71-b817-bd720551ce22
@JosefKuchar Not a maintainer here, but I'd say please go for the PR :heart: Support for long-context generation has been available in Comfy and A1111 for a long time, and it has been on our minds to add support for this within diffusers for many months now. Thank you so much for taking the initiative! The community has been generating short films using these methods with the best models out there and has nailed down many tricks for consistent, high-quality generation; something we could definitely write guides about (perhaps @asomoza would be a great help for this). I'm happy to help resolve any conflicts that may come up with supporting both FreeInit and FreeNoise. Chunked UNet inference could be a separate thing to look at in the near future, yep.
Hi all. Really nice to see the initiative here!
I'll have bandwidth to take this up next week. @JosefKuchar since you've already started on FreeNoise I'll leave you to it and look into sliding window. I'll probably just follow this reference. I believe @a-r-r-o-w had included it when originally proposing the AnimateDiff PR, but we weren't 100% sure about adding something that wasn't fully understood at the time.
@JosefKuchar For chunking, you would need to look at the ResNet and Attention blocks in the MotionBlocks here. An example of chunking logic can be found here and here. If you feel it's a bit much to handle all at once, feel free to open a PR with just FreeNoise as-is and we can work on chunking in a follow-up.
New relevant work regarding long context generation: https://github.com/TMElyralab/MuseV/. Thought it might be interesting to share here since we're looking at similar things
Hi all. I have to prioritise some other work at the moment so will have to pause on working on sliding window for now. Will try to pick it up later, but if anyone wants to take a shot at it, feel free to do so and tag me for a review.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.