xmdpt
ICML 2024, Official Implementation of "Cross-view Masked Diffusion Transformers for Person Image Synthesis."
[Updated 2024/08/08] Code released.
PyTorch implementation of "Cross-view Masked Diffusion Transformers for Person Image Synthesis," ICML 2024.
Authors: Trung X. Pham, Zhang Kang, and Chang D. Yoo.
Introduction
X-MDPT (Cross-view Masked Diffusion Prediction Transformers) is the first diffusion-transformer-based framework for pose-guided human image generation. X-MDPT demonstrates exceptional scalability and performance, with FID, SSIM, and LPIPS metrics improving significantly as model size increases. Despite its straightforward design, the framework outperforms state-of-the-art approaches on the DeepFashion dataset in both training efficiency and inference speed. The compact 33MB model achieves an FID of 7.42, surpassing PoCoLD, the previously most efficient UNet-based latent diffusion approach (FID 8.07, 396MB), with $11\times$ fewer parameters. Our best model surpasses the SOTA pixel-based diffusion model PIDM using two-thirds of the parameters and achieves $5.43\times$ faster inference.
Efficiency Advantages
Comparisons with state-of-the-art methods
Consistent Targets
Setup Environment
We have tested with PyTorch 1.12 and CUDA 11.6, inside a Docker container.
conda create -n xmdpt python=3.8
conda activate xmdpt
pip install -r requirements.txt
Prepare Dataset
Download the DeepFashion dataset and process it into LMDB format for easy training and inference. Refer to PIDM (CVPR 2023) for this LMDB data. The data structure should be as follows:
datasets/
|-- [ 38]  deepfashion
|   |-- [6.4M]  train_pairs.txt
|   |-- [2.1M]  train.lst
|   |-- [817K]  test_pairs.txt
|   |-- [182K]  test.lst
|   |-- [4.0K]  256-256
|   |   |-- [8.0K]  lock.mdb
|   |   `-- [2.4G]  data.mdb
|   |-- [8.7M]  pose.rar
|   |-- [4.0K]  512-512
|   |   |-- [8.0K]  lock.mdb
|   |   `-- [8.4G]  data.mdb
|   `-- [4.0K]  pose
|       |-- [4.0K]  WOMEN
|       |   `-- [ 12K]  Shorts
|       |       `-- [4.0K]  id_00007890
|       |           `-- [ 900]  04_4_full.txt
|       `-- [4.0K]  MEN
...
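Before training, it can help to verify that the layout above is in place. The sketch below checks for the key files from the tree; `missing_files` and the `datasets` default are written for this example and are not part of the repo:

```python
# Sanity-check the DeepFashion LMDB layout described in this README.
# The relative paths mirror the directory tree above.
from pathlib import Path

EXPECTED_FILES = [
    "deepfashion/train_pairs.txt",
    "deepfashion/train.lst",
    "deepfashion/test_pairs.txt",
    "deepfashion/test.lst",
    "deepfashion/256-256/lock.mdb",
    "deepfashion/256-256/data.mdb",
    "deepfashion/512-512/lock.mdb",
    "deepfashion/512-512/data.mdb",
]

def missing_files(root="datasets"):
    """Return the expected dataset files that are absent under `root`."""
    return [rel for rel in EXPECTED_FILES if not (Path(root) / rel).exists()]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing dataset files:", ", ".join(missing))
    else:
        print("DeepFashion layout looks complete.")
```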
Training
CUDA_VISIBLE_DEVICES=0 bash run_train.sh
By default, a checkpoint is saved every 10k steps. You can use these checkpoints for inference as described below.
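The checkpoint cadence can be sketched as follows. This is an illustration of the 10k-step schedule only; the actual saving logic lives in the training script invoked by run_train.sh:

```python
# Sketch of the default checkpoint cadence: one checkpoint every 10k steps.
CKPT_EVERY = 10_000

def checkpoint_steps(total_steps, every=CKPT_EVERY):
    """Return the steps at which a checkpoint would be written."""
    return list(range(every, total_steps + 1, every))
```

For example, a 300k-step run (the step count used for the pretrained models in this README) would write 30 checkpoints, at steps 10000, 20000, ..., 300000.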
Inference
Download all checkpoints and the VAE (with only the decoder fine-tuned), and put them in the locations specified in infer_xmdpt.py.
For the DeepFashion test set, run the following:
CUDA_VISIBLE_DEVICES=0 python infer_xmdpt.py
It will save the output image samples to the test_img folder of this repo.
For an arbitrary image, run the following (not yet implemented):
CUDA_VISIBLE_DEVICES=0 python infer_xmdpt.py --image_path test.png
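Since the arbitrary-image path is not yet implemented, the sketch below shows a minimal argparse interface matching the command above. Only the `--image_path` flag comes from this README; the description and default behavior are assumptions, not the repo's actual CLI:

```python
import argparse

def build_parser():
    # `--image_path` mirrors the flag in the command above; help text and
    # defaults are illustrative assumptions.
    parser = argparse.ArgumentParser(description="X-MDPT inference")
    parser.add_argument("--image_path", default=None,
                        help="Optional source image; omit to run on the DeepFashion test set.")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.image_path)
```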
Pretrained Models
All of our models were trained and tested on a single A100 (80GB) GPU.
| Model | Step | Resolution | FID | Params | Inference Time | Link |
|---|---|---|---|---|---|---|
| X-MDPT-S | 300k | 256x256 | 7.42 | 33.5M | 1.1s | Link |
| X-MDPT-B | 300k | 256x256 | 6.72 | 131.9M | 1.3s | Link |
| X-MDPT-L | 300k | 256x256 | 6.60 | 460.2M | 3.1s | Link |
| VAE | - | - | - | - | - | Link |
Expected Outputs
Citation
If X-MDPT is useful or relevant to your research, please kindly recognize our contributions by citing our paper:
@inproceedings{pham2024crossview,
title={Cross-view Masked Diffusion Transformers for Person Image Synthesis},
author={Trung X. Pham and Kang Zhang and Chang D. Yoo},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=jEoIkNkqyc}
}
Acknowledgements
This work was supported by the Institute for Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korean government (MSIT) (No. 2021-0-01381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments) and (No. 2022-0-00184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics).
Helpful Repo
Thanks to the authors of the nice works MDT (ICCV 2023) and PIDM (CVPR 2023) for publishing their code.