SVD_Xtend
Stable Video Diffusion Training Code and Extensions.

:bulb: Highlight
- Finetuning SVD. See Part 1.
- Tracklet-Conditioned Video Generation. Building upon SVD, you can control the movement of objects using tracklets (per-frame bounding boxes). See Part 2.
Part 1: Training
Comparison
size=(512, 320), motion_bucket_id=127, fps=7, noise_aug_strength=0.00
generator=torch.manual_seed(111)
Init Image | Before Fine-tuning | After Fine-tuning |
---|---|---|
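The settings listed above map onto the sampling arguments of the Diffusers `StableVideoDiffusionPipeline`. The sketch below is one way such a comparison could be reproduced, assuming a CUDA GPU and a checkpoint that `from_pretrained` can load. `generate_frames`, `image_path`, and `model_path` are illustrative names and not part of this repository.

```python
from typing import Any, Dict

# Sampling settings copied from the comparison above.
# size=(512, 320) is interpreted as (width, height), matching the training flags.
COMPARISON_KWARGS: Dict[str, Any] = {
    "width": 512,
    "height": 320,
    "motion_bucket_id": 127,
    "fps": 7,
    "noise_aug_strength": 0.0,
}
SEED = 111


def generate_frames(image_path: str, model_path: str, seed: int = SEED):
    """Generate video frames from one init image (hypothetical helper).

    Heavy imports live inside the function so the settings above can be
    inspected without torch or diffusers installed.
    """
    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        model_path, torch_dtype=torch.float16
    )
    pipe.to("cuda")  # assumption: a CUDA device is available

    # PIL's resize takes (width, height).
    image = load_image(image_path).resize(
        (COMPARISON_KWARGS["width"], COMPARISON_KWARGS["height"])
    )
    generator = torch.manual_seed(seed)
    result = pipe(image, generator=generator, **COMPARISON_KWARGS)
    return result.frames[0]  # list of PIL images for the single input
```

Calling the helper once with the base weights and once with a fine-tuned checkpoint, using the same seed and init image, should give a before/after pair comparable to the table.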
Video Data Processing
Note that BDD100K is a driving video/image dataset, but it is not required for training; any video can be used. Please refer to the DummyDataset data reading logic. In short, you only need to modify self.base_folder, then arrange your videos in the following file structure:
self.base_folder
├── video_name1
│ ├── video_frame1
│ ├── video_frame2
│ ...
├── video_name2
│ ├── video_frame1
├── ...
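As a rough illustration of how a folder laid out this way could be read, the sketch below indexes each video directory and samples a contiguous clip from it. It is not the repository's DummyDataset; `list_video_frames` and `sample_clip` are hypothetical helpers, and the assumption that frames sort correctly by file name only holds for zero-padded names.

```python
import random
from pathlib import Path
from typing import Dict, List, Optional


def list_video_frames(base_folder: str) -> Dict[str, List[Path]]:
    """Map each video directory name to its frame paths, sorted by name."""
    videos: Dict[str, List[Path]] = {}
    for video_dir in sorted(Path(base_folder).iterdir()):
        if not video_dir.is_dir():
            continue
        # Lexicographic sort: assumes zero-padded frame names (e.g. 0001.jpg).
        frames = sorted(p for p in video_dir.iterdir() if p.is_file())
        if frames:
            videos[video_dir.name] = frames
    return videos


def sample_clip(
    frames: List[Path], num_frames: int, rng: Optional[random.Random] = None
) -> List[Path]:
    """Pick `num_frames` consecutive frames starting at a random offset."""
    if len(frames) < num_frames:
        raise ValueError(f"need {num_frames} frames, found {len(frames)}")
    rng = rng or random.Random()
    start = rng.randint(0, len(frames) - num_frames)
    return frames[start:start + num_frames]
```

A quick check such as `list_video_frames("/path/to/base_folder")` before launching training can reveal empty or misnamed video directories early.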
Training Configuration (on the BDD100K dataset)
This training configuration is for reference only. I set all parameters of the UNet to be trainable during training and used a learning rate of 1e-5.
accelerate launch train_svd.py \
--pretrained_model_name_or_path=/path/to/weight \
--per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
--max_train_steps=50000 \
--width=512 \
--height=320 \
--checkpointing_steps=1000 --checkpoints_total_limit=1 \
--learning_rate=1e-5 --lr_warmup_steps=0 \
--seed=123 \
--mixed_precision="fp16" \
--validation_steps=200
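To make the schedule implied by these flags concrete, the following sketch computes a few derived quantities. It assumes the flags behave as they do in the Diffusers example training scripts (`max_train_steps` counts optimizer steps, and `checkpoints_total_limit` caps how many checkpoints stay on disk); `RunConfig`, `summarize`, and `num_gpus` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RunConfig:
    # Defaults mirror the reference command above.
    per_gpu_batch_size: int = 1
    gradient_accumulation_steps: int = 1
    max_train_steps: int = 50_000
    checkpointing_steps: int = 1_000
    checkpoints_total_limit: int = 1
    validation_steps: int = 200


def summarize(cfg: RunConfig, num_gpus: int = 1) -> Dict[str, int]:
    """Derive schedule numbers; num_gpus is an assumption, not a script flag."""
    effective_batch = (
        cfg.per_gpu_batch_size * cfg.gradient_accumulation_steps * num_gpus
    )
    saves = cfg.max_train_steps // cfg.checkpointing_steps
    return {
        "effective_batch_size": effective_batch,
        "clips_seen": effective_batch * cfg.max_train_steps,
        "checkpoint_saves": saves,
        # Assumed semantics: only the newest checkpoints are retained.
        "checkpoints_kept": min(cfg.checkpoints_total_limit, saves),
        "validation_runs": cfg.max_train_steps // cfg.validation_steps,
    }
```

Under these assumptions, a single-GPU run sees 50,000 clips, writes a checkpoint 50 times while keeping only the latest one, and runs validation 250 times.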
Part 2: Tracklet2Video
Tracklet2Video
We have attempted to incorporate layout control on top of img2video, which makes the motion of objects more controllable, as demonstrated in the image below. The code and weights will be updated soon.
Note that we run SVD at a resolution of 512*320, so the generated videos look poor in quality (which is somewhat unfair to SVD). Our intention here is to demonstrate the effectiveness of tracklet control, and we will resolve the video quality issue as soon as possible.
Init Image | Gen Video by SVD | Gen Video by Ours |
---|---|---|
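Since the Tracklet2Video code has not been released, the snippet below is only a hypothetical picture of what a tracklet condition might look like as data: one bounding box per frame for a given object, normalized by the 512*320 generation resolution. The `Tracklet` class, its fields, and the `(x1, y1, x2, y2)` pixel convention are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention: (x1, y1, x2, y2) corners in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Tracklet:
    object_id: int
    boxes: List[Box]  # one box per generated frame

    def normalized(self, width: int = 512, height: int = 320) -> List[Box]:
        """Scale pixel boxes to [0, 1] relative to the frame size."""
        return [
            (x1 / width, y1 / height, x2 / width, y2 / height)
            for x1, y1, x2, y2 in self.boxes
        ]
```

For example, `Tracklet(0, [(128, 80, 256, 160)]).normalized()` yields one box covering the region from 25% to 50% of the frame in both dimensions.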
Methods
We use the Self-Tracking training from Boximator and the Instance-Enhancer from TrackDiffusion. For more details, please refer to the paper.
:label: TODO List
- [ ] Support text2video (WIP)
- [x] Support more conditional inputs, such as layout
:hearts: Acknowledgement
Our model builds on Diffusers and the Stable Video Diffusion model from Stability AI. Thanks for their great work!
Thanks to Boximator and GLIGEN for their awesome models.
:black_nib: Citation
If you find our work helpful for your research, please consider citing the following BibTeX entry.
@article{li2023trackdiffusion,
title={Trackdiffusion: Multi-object tracking data generation via diffusion models},
author={Li, Pengxiang and Liu, Zhili and Chen, Kai and Hong, Lanqing and Zhuge, Yunzhi and Yeung, Dit-Yan and Lu, Huchuan and Jia, Xu},
journal={arXiv preprint arXiv:2312.00651},
year={2023}
}