InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions
🔆 Introduction
InteractiveVideo is a user-centric framework for interactive video generation. It enables comprehensive editing through users' intuitive manipulation and achieves high-quality regional content control as well as precise motion control. Its main features are as follows:
1. Personalize A Video
(Demo frames) Prompts: "Purple Flowers." → "Purple Flowers, bee" → "the purple flowers are shaking, a bee is flying"
(Demo frames) Prompts: "1 Cat." → "1 Cat, butterfly" → "the small yellow butterfly is flying to the cat's face"
2. Fine-grained Video Editing
(Demo frames) Prompts: "flowers." → "flowers." → "windy, the flowers are shaking in the wind"
(Demo frames) Prompts: "1 Man." → "1 Man, rose." → "1 Man, smiling."
3. Powerful Motion Control
InteractiveVideo can perform precise motion control.
(Demo frames) Prompts: "1 man, dark light" → "the man is turning his body" → "the man is turning his body"
(Demo frames) Prompts: "1 beautiful girl with long black hair, and a flower on her head, clouds" → "the girl is turning gradually" → "the girl is turning gradually"
4. Character Dressing-Up
InteractiveVideo cooperates smoothly with LoRA and DreamBooth checkpoints, so many potential uses of this framework remain under-explored.
(Demo frames) Prompts: "Yae Miko" (Genshin Impact) → "Dressing Up" → "Dressing Up"
⚙️ Quick Start
1. Install Environment via Anaconda
# create a conda environment
conda create -n ivideo python=3.10
conda activate ivideo
# install requirements
pip install -r requirements.txt
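After installation, it is worth confirming that PyTorch can see your GPU before downloading the large checkpoints. A minimal sanity check, assuming requirements.txt installs PyTorch (check the file if your setup differs):

```python
# Quick check that the ivideo environment is ready for GPU inference.
# Assumes PyTorch is installed via requirements.txt.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```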
2. Prepare Checkpoints
You can use the following script to download all checkpoints:
python scripts/download_models.py
This will take a long time. You can also selectively download checkpoints by modifying "scripts/download_models.py" and "scripts/*.json"; please make sure that at least one checkpoint remains for each JSON file (a selective-download sketch follows the folder layout below). All checkpoints are listed as follows:
- Checkpoints for enjoying image-to-image generation
| Models | Types | Version | Checkpoints |
|---|---|---|---|
| StableDiffusion | - | v1.5 | Huggingface |
| StableDiffusion | - | turbo | Huggingface |
| KoHaKu | Animation | v2.1 | Huggingface |
| LCM-LoRA-StableDiffusion | - | v1.5 | Huggingface |
| LCM-LoRA-StableDiffusion | - | xl | Huggingface |
- Checkpoints for enjoying image-to-video generation
| Models | Types | Version | Checkpoints |
|---|---|---|---|
| StableDiffusion | - | v1.5 | Huggingface |
| PIA (UNet) | - | - | Huggingface |
| Dreambooth | MagicMixRealistic | v5 | Civitai |
| Dreambooth | RCNZCartoon3d | v10 | Civitai |
| Dreambooth | RealisticVision | - | Huggingface |
- Checkpoints for enjoying dragging images.
| Models | Types | Resolution | Checkpoints |
|---|---|---|---|
| StyleGAN-2 | Lions | 512 x 512 | Google Storage |
| StyleGAN-2 | Dogs | 1024 x 1024 | Google Storage |
| StyleGAN-2 | Horses | 256 x 256 | Google Storage |
| StyleGAN-2 | Elephants | 512 x 512 | Google Storage |
| StyleGAN-2 | Face (FFHQ) | 512 x 512 | NGC |
| StyleGAN-2 | Cat Face (AFHQ) | 512 x 512 | NGC |
| StyleGAN-2 | Car | 512 x 512 | CloudFront |
| StyleGAN-2 | Cat | 512 x 512 | CloudFront |
| StyleGAN-2 | Landmark (LHQ) | 256 x 256 | Google Drive |
You can also train and try your own customized models. Put your model into the "checkpoints" folder, which is organized as follows:
InteractiveVideo # project
|----checkpoints
|----|----drag # Drag
|----|----|----stylegan2_elephants_512_pytorch.pkl
|----|----i2i # Image-2-Image
|----|----|----lora
|----|----|----|----lcm-lora-sdv1-5.safetensors
|----|----i2v # Image-to-Video
|----|----|----unet
|----|----|----|----pia.ckpt
|----|----|----dreambooth
|----|----|----|----realisticVisionV51_v51VAE.safetensors
|----|----diffusion_body
|----|----|----stable-diffusion-v1-5
|----|----|----kohahu-v2-1
|----|----|----sd-turbo
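If you only need a subset of checkpoints, the sketch below shows how individual models could be fetched into this layout with the huggingface_hub package. The repo IDs and filenames here are assumptions for illustration (some repos may have moved); the authoritative list lives in "scripts/*.json" and "scripts/download_models.py".

```python
# Illustrative selective download into the "checkpoints" layout shown above.
# NOTE: repo IDs and filenames are assumptions; check scripts/*.json for the
# exact entries the project expects.
import os
from huggingface_hub import hf_hub_download, snapshot_download

# Diffusion body used by both image-to-image and image-to-video
# (repo may have moved on the Hub; adjust the repo_id if needed).
snapshot_download(
    repo_id="runwayml/stable-diffusion-v1-5",
    local_dir="checkpoints/diffusion_body/stable-diffusion-v1-5",
)

# LCM-LoRA for real-time image-to-image generation.
lora_path = hf_hub_download(
    repo_id="latent-consistency/lcm-lora-sdv1-5",
    filename="pytorch_lora_weights.safetensors",
    local_dir="checkpoints/i2i/lora",
)
# Rename to the filename the folder layout above expects.
os.rename(lora_path, "checkpoints/i2i/lora/lcm-lora-sdv1-5.safetensors")
```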
💫 Usage
1. Local demo
To run a local demo (recommended), use the following command:
python demo/main.py
You can also run our web demo locally with
python demo/main_gradio.py
In the following, we provide some instructions for a quick start.
2. Image-to-Image Generation
Input image-to-image text prompts, and click the "Confirm Text" button. The generation is real-time.
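The demo drives this step through the GUI. For readers curious what a single real-time image-to-image pass roughly looks like, here is an illustrative sketch using the diffusers library with the sd-turbo checkpoint downloaded above; it is not the project's actual code path, and the input path and parameters are assumptions.

```python
# Rough sketch of a one-step image-to-image pass with sd-turbo via diffusers.
# This is NOT the project's internal pipeline; paths/parameters are assumptions.
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "checkpoints/diffusion_body/sd-turbo", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("examples/flowers.png").resize((512, 512))  # hypothetical input
result = pipe(
    prompt="Purple Flowers, bee",
    image=init_image,
    strength=0.5,            # with 2 steps, this yields a single denoising step
    guidance_scale=0.0,      # sd-turbo is trained without classifier-free guidance
    num_inference_steps=2,
).images[0]
result.save("i2i_preview.png")
```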
3. Image-to-Video Generation
Input image-to-video text prompts and click the "Confirm Text" button. Then click the "Generate Video" button and wait a few seconds.
If the generated video is not satisfactory, you can customize it further with multimodal instructions. For example, draw butterflies on the image to show the model where they should appear.
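For context, InteractiveVideo's image-to-video stage is built around PIA (the pia.ckpt UNet listed above). The sketch below animates an image with the PIA integration that ships in recent diffusers releases; this is a separate integration rather than the project's own pipeline, and the repo IDs, paths, and prompts are assumptions.

```python
# Illustrative image-to-video with PIA via diffusers (>= 0.26), not the
# project's internal pipeline; repo IDs, paths, and prompts are assumptions.
import torch
from diffusers import MotionAdapter, PIAPipeline
from diffusers.utils import export_to_gif, load_image

adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
pipe = PIAPipeline.from_pretrained(
    "checkpoints/diffusion_body/stable-diffusion-v1-5",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

image = load_image("examples/cat.png").resize((512, 512))  # hypothetical input
output = pipe(
    image=image,
    prompt="the small yellow butterfly is flying to the cat's face",
    num_inference_steps=25,
)
export_to_gif(output.frames[0], "i2v_preview.gif")
```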
4. Drag Image
You can also drag images. First, you should choose a proper checkpoint in the "Drag Image" tab and click the "Drag Mode On" button. It will take a few minutes to prepare. Then you can draw masks, add points, and click the "start" button. Once the result is satisfactory, click the "stop" button.
😉 Citation
If the code and paper help your research, please kindly cite:
@article{zhang2024interactivevideo,
title={InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions},
author={Zhang, Yiyuan and Kang, Yuhao and Zhang, Zhixin and Ding, Xiaohan and Zhao, Sanyuan and Yue, Xiangyu},
year={2024},
eprint={2402.03040},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
🤗 Acknowledgements
Our codebase builds on Stable Diffusion, StreamDiffusion, DragGAN, PTI, and PIA. Thanks to the authors for sharing their awesome codebases!
📢 Disclaimer
We developed this repository for RESEARCH purposes, so it may only be used for personal, research, or non-commercial purposes.