FIFO-Diffusion: Generating Infinite Videos from Text without Training through Rolling Video Denoising
Model/Pipeline/Scheduler description
The authors propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Their approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue. Specifically, at each denoising step, this method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail.
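The core loop can be sketched in a few lines. Below is a minimal, self-contained illustration of the diagonal-denoising queue, assuming a hypothetical `denoise_step` as a stand-in for the real model/scheduler call, and made-up values for the frame count `f` and latent shape; it is not the authors' implementation:

```python
import torch
from collections import deque

f = 16                       # frames the base video model denoises jointly (assumed)
num_output_frames = 128      # total frames to emit; can grow without bound
latent_shape = (4, 32, 32)   # (C, H, W) of one latent frame (assumed)

def denoise_step(frames: torch.Tensor, timesteps: torch.Tensor) -> torch.Tensor:
    # Stand-in for one model call: each frame is denoised by one step at its
    # own noise level. A real implementation would invoke the UNet + scheduler.
    return frames * 0.99

# Noise levels increase from head (nearly clean) to tail (pure noise).
timesteps = torch.linspace(1, 999, f).long()

# Initialize the queue with f noisy latent frames.
queue = deque(torch.randn(latent_shape) for _ in range(f))

outputs = []
while len(outputs) < num_output_frames:
    frames = torch.stack(list(queue))         # (f, C, H, W)
    frames = denoise_step(frames, timesteps)  # one step for every frame
    queue = deque(frames.unbind(0))
    outputs.append(queue.popleft())           # head frame is now fully denoised
    queue.append(torch.randn(latent_shape))   # fresh noise joins at the tail
```

Because a frame is dequeued as soon as it reaches the head, the loop can run for as many iterations as desired, which is what makes the method "infinite" in principle.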
However, diagonal denoising is a double-edged sword: frames near the tail can take advantage of cleaner frames ahead of them via forward reference, but processing frames at mixed noise levels introduces a discrepancy between training and inference. To close this gap while keeping the benefit of forward referencing, the authors introduce latent partitioning, which narrows the range of noise levels handled in each model call, and lookahead denoising, which ensures every frame is denoised while referencing cleaner frames.
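Continuing the sketch above (reusing its imports, `f`, `latent_shape`, and `denoise_step`), latent partitioning might look roughly like the following: the queue is lengthened to n * f frames covering n * f noise levels, but the model is still called on f frames at a time, so each call spans a much narrower band of noise levels. The partition count `n` is an assumption; lookahead denoising, omitted here, additionally overlaps these f-frame windows so every frame is denoised while referencing cleaner frames behind it:

```python
n = 4                                     # number of partitions (assumed)
timesteps = torch.linspace(1, 999, n * f).long()
queue = deque(torch.randn(latent_shape) for _ in range(n * f))

frames = torch.stack(list(queue))
for i in range(n):                        # one model call per partition
    block = slice(i * f, (i + 1) * f)
    frames[block] = denoise_step(frames[block], timesteps[block])
queue = deque(frames.unbind(0))
```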
The authors demonstrate promising results on existing pretrained text-to-video generation models such as VideoCrafter, Open-Sora Plan, and ZeroScope.
Open source status
- [X] The model implementation is available.
- [ ] The model weights are available (Only relevant if addition is not a scheduler).
Provide useful links for the implementation
Project page: https://jjihwan.github.io/projects/FIFO-Diffusion
Code: https://github.com/jjihwan/FIFO-Diffusion_public
arXiv: https://arxiv.org/abs/2405.11473
Contact: @jjihwan
Based on our testing, we found FIFO-Diffusion to be much slower than alternative methods for generating longer videos (in single-GPU inference cases, which is what Diffusers focuses on). For this reason, we will not be able to support this in core diffusers. Community pipelines would be nice to have though! cc @yiyixuxu
hey gonna work on this this week!
any news?
hey! sorry, i had actually completely forgotten about this :( if it's still open and i can ask for help when i feel stuck, i would love to contribute! thank you for your patience
Hey. A few months ago, I benchmarked various training-free approaches to generating longer videos, including FIFO (primarily on AnimateDiff and VideoCrafter). I found that the generation time was much longer compared to alternative methods like FreeNoise, but the generation quality was great. Ideally, it'd be nice to support this in core, but it will have to be done in a way that doesn't require modifying the pipeline or modeling implementations (such as with hooks: https://github.com/huggingface/diffusers/tree/main/src/diffusers/hooks); a rough sketch of that idea follows below. We can also consider it for modular diffusers (#9672), as that should hopefully make it easier to work with custom methods like this.
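To illustrate the "no pipeline/model changes" constraint, here is roughly what intercepting the denoiser with a plain PyTorch forward pre-hook could look like (this uses torch's stable hook API rather than the diffusers hooks module linked above; `pipe.unet`, the argument layout, and a model that accepts per-frame timesteps are all assumptions):

```python
import torch

def make_diagonal_timestep_hook(per_frame_timesteps: torch.Tensor):
    """Build a pre-hook that swaps the pipeline's scalar timestep for a
    per-frame tensor, as diagonal denoising requires, without editing the
    pipeline. Assumes the wrapped model accepts a per-frame timestep tensor."""
    def hook(module, args, kwargs):
        if "timestep" in kwargs:
            kwargs["timestep"] = per_frame_timesteps
        elif len(args) >= 2:
            args = (args[0], per_frame_timesteps, *args[2:])
        return args, kwargs
    return hook

# Hypothetical usage with a loaded pipeline `pipe`:
# handle = pipe.unet.register_forward_pre_hook(
#     make_diagonal_timestep_hook(timesteps), with_kwargs=True
# )
# ...run inference...
# handle.remove()
```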
Also, I don't know if FIFO has been tried out on the latest video models that employ 4-8x temporal compression. My intuition says it might be harder to make it work, since each latent frame encodes multiple pixel-space frames, and doing any kind of averaging might yield poorer generations (see #9389). Even if we can't implement it in a way that works with all models in core diffusers, we would be happy to accept and review any PRs for the community examples!
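As a back-of-envelope illustration of the temporal-compression concern (numbers assumed, using the causal-VAE frame mapping common in recent models):

```python
# With 4x temporal compression, one latent frame encodes a group of pixel
# frames, so assigning a noise level to (or blending) a single latent frame
# affects several output frames at once.
temporal_ratio = 4
num_latent_frames = 13
# Typical causal-VAE mapping: first latent frame -> 1 pixel frame,
# each subsequent latent frame -> `temporal_ratio` pixel frames.
num_pixel_frames = 1 + (num_latent_frames - 1) * temporal_ratio
print(num_pixel_frames)  # 49
```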