Awesome Generative Image Composition
A curated list of resources, including papers, datasets, and relevant links, pertaining to generative image composition (object insertion), which aims to generate a plausible composite image given a background image (with an optional bounding box) and one or a few foreground images of a specific object.
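For concreteness, the inputs and output of the task can be illustrated with the trivial cut-and-paste baseline that generative methods aim to improve on. Below is a minimal sketch with a hypothetical `naive_composite` helper and placeholder file paths; it simply resizes the foreground into the bounding box, with none of the harmonization, relighting, or novel-view synthesis that generative composition provides.

```python
from PIL import Image

def naive_composite(background_path, foreground_path, bbox):
    """Paste a foreground object into a background at a bounding box.

    bbox is (x1, y1, x2, y2) in background-pixel coordinates. This is the
    trivial baseline that generative composition methods aim to improve on:
    it ignores lighting, geometry, and boundary blending.
    """
    bg = Image.open(background_path).convert("RGB")
    fg = Image.open(foreground_path).convert("RGBA")  # alpha = object mask
    x1, y1, x2, y2 = bbox
    fg = fg.resize((x2 - x1, y2 - y1))
    bg.paste(fg, (x1, y1), mask=fg)  # use the alpha channel as the paste mask
    return bg

# Placeholder paths; any background/foreground pair with a box works.
composite = naive_composite("background.jpg", "foreground.png", (100, 80, 260, 240))
composite.save("composite.png")
```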
Contributing
Contributions are welcome. If you wish to contribute, feel free to send a pull request. If you have suggestions for new sections to be included, please raise an issue and discuss before sending a pull request.
Table of Contents
- Survey
- Evaluation Metrics
- Test Set
- Leaderboard
- Papers
- Related Topics
- Other Resources
Survey
A brief review of generative image composition is included in the following survey on image composition:
Li Niu, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, Liqing Zhang: "Making Images Real Again: A Comprehensive Survey on Deep Image Composition." arXiv preprint arXiv:2106.14490 (2021). [arXiv]
Evaluation Metrics
The leaderboard below reports foreground fidelity (CLIP score, DINO score, FID), background preservation (LSSIM, LPIPS), and overall image quality (FID, quality score QS).
Test Set
- COCOEE (within-domain, single-ref): 500 background images from the MSCOCO validation set. Each background image has a bounding box and a foreground image from the MSCOCO training set.
- TF-ICON test benchmark (cross-domain, single-ref): 332 samples. Each sample consists of a background image, a foreground image, a user mask, and a text prompt.
- FOSCom (within-domain, single-ref): 640 background images from the Internet. Each background image has a manually annotated bounding box and a foreground image from the MSCOCO training set.
- DreamEditBench (within-domain, multi-ref): 220 background images and 30 unique foreground objects from 15 categories.
- MureCom (within-domain, multi-ref): 640 background images and 96 unique foreground objects from 32 categories.
Leaderboard
The training set is open, and the test set is the COCOEE benchmark.
| Method | Foreground CLIP↑ | Foreground DINO↑ | Foreground FID↓ | Background LSSIM↑ | Background LPIPS↓ | Overall FID↓ | Overall QS↑ |
|---|---|---|---|---|---|---|---|
| Inpaint&Paste | - | - | 8.0 | - | - | 3.64 | 72.07 |
| SDEdit | 85.02 | 55.38 | 9.77 | 0.630 | 0.344 | 6.42 | 75.20 |
| PBE | 84.84 | 52.52 | 6.24 | 0.823 | 0.116 | 3.18 | 77.80 |
| ObjectStitch | 85.97 | 61.12 | 6.86 | 0.825 | 0.116 | 3.35 | 76.86 |
| AnyDoor | 89.7 | 70.16 | 10.5 | 0.870 | 0.109 | 3.60 | 76.18 |
| ControlCom | 88.31 | 63.67 | 6.28 | 0.826 | 0.114 | 3.19 | 77.84 |
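For intuition about the foreground columns, CLIP↑ and DINO↑ are embedding similarities between the reference foreground and the foreground region of the generated composite, computed with the openai/clip-vit-base-patch32 and facebook/dino-vits16 checkpoints listed in the evaluation instructions below. The following is a minimal sketch under that assumption; the helper names, placeholder image paths, and ×100 scaling are illustrative, and the official script's cropping and masking protocol may differ.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, ViTImageProcessor, ViTModel

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = ViTModel.from_pretrained("facebook/dino-vits16")
dino_proc = ViTImageProcessor.from_pretrained("facebook/dino-vits16")

@torch.no_grad()
def clip_score(img_a: Image.Image, img_b: Image.Image) -> float:
    # Cosine similarity of CLIP image embeddings, scaled by 100 (assumed
    # to match the leaderboard's convention).
    feats = clip.get_image_features(**clip_proc(images=[img_a, img_b], return_tensors="pt"))
    feats = F.normalize(feats, dim=-1)
    return (feats[0] @ feats[1]).item() * 100

@torch.no_grad()
def dino_score(img_a: Image.Image, img_b: Image.Image) -> float:
    # Cosine similarity of DINO ViT-S/16 CLS-token features.
    out = dino(**dino_proc(images=[img_a, img_b], return_tensors="pt"))
    feats = F.normalize(out.last_hidden_state[:, 0], dim=-1)
    return (feats[0] @ feats[1]).item() * 100

# ref: reference foreground; gen: composite cropped to the bounding box (placeholder paths).
ref = Image.open("foreground.png").convert("RGB")
gen = Image.open("composite_crop.png").convert("RGB")
print(clip_score(ref, gen), dino_score(ref, gen))
```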
Evaluating Your Results

1. Install dependencies:

   - Begin by installing the dependencies listed in `requirements.txt`.
   - Additionally, install Segment Anything.

2. Clone the repository and download pretrained models:

   - Clone this repository and ensure you have a `checkpoints` folder.
   - Download the following pretrained models into the `checkpoints` folder:
     - openai/clip-vit-base-patch32: used for CLIP score and FID score calculations.
     - ViT-H SAM model: used to estimate foreground masks for reference images and generated composites.
     - facebook/dino-vits16: used in DINO score computation.
     - coco2017_gmm_k20: used to compute the overall quality score.

   The resulting folder structure should resemble the following:

   ```
   checkpoints/
   ├── clip-vit-base-patch32
   ├── coco2017_gmm_k20
   ├── dino-vits16
   └── sam_vit_h_4b8939.pth
   ```

3. Prepare the COCOEE benchmark and your results:

   - Prepare the COCOEE benchmark alongside your generated composite results. Ensure that your composite images have filenames corresponding to the background images of the COCOEE dataset, as illustrated below:

     ```
     results/
     ......
     ├── 000002228519_GT.png
     ├── 000002231413_GT.png
     ├── 900100065455_GT.png
     └── 900100376112_GT.png
     ```

   - Modify the paths accordingly in the `run.sh` file. If you have downloaded the cache file mentioned earlier, please ignore `cocodir`.
   - Execute the following command:

     ```
     sh run.sh
     ```
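For reference, the foreground-mask estimation in step 2 can be reproduced with the standard segment-anything API. The sketch below is an illustrative assumption: it prompts SAM with the composite's bounding box via a hypothetical `foreground_mask` helper, whereas the official `run.sh` pipeline may prompt SAM differently.

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

# Load the ViT-H checkpoint downloaded into checkpoints/ (step 2 above).
sam = sam_model_registry["vit_h"](checkpoint="checkpoints/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def foreground_mask(image: Image.Image, bbox) -> np.ndarray:
    """Estimate a binary foreground mask inside bbox = (x1, y1, x2, y2)."""
    predictor.set_image(np.array(image.convert("RGB")))
    masks, scores, _ = predictor.predict(box=np.array(bbox), multimask_output=True)
    return masks[scores.argmax()]  # keep the highest-scoring mask proposal

# Placeholder composite path and box for illustration.
mask = foreground_mask(Image.open("composite.png"), (100, 80, 260, 240))
```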
Papers
Object-to-Object
- Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Daniel Aliaga: "IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation." CVPR (2024) [arXiv]
- Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, Hengshuang Zhao: "AnyDoor: Zero-shot Object-level Image Customization." CVPR (2024) [arXiv] [code] [demo]
- Vishnu Sarukkai, Linden Li, Arden Ma, Christopher Re, Kayvon Fatahalian: "Collage Diffusion." WACV (2024) [pdf] [code]
- Ziyang Yuan, Mingdeng Cao, Xintao Wang, Zhongang Qi, Chun Yuan, Ying Shan: "CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models." arXiv preprint arXiv:2310.19784 (2023) [arXiv] [code]
- Bo Zhang, Yuxuan Duan, Jun Lan, Yan Hong, Huijia Zhu, Weiqiang Wang, Li Niu: "ControlCom: Controllable Image Composition using Diffusion Model." arXiv preprint arXiv:2308.10040 (2023) [arXiv] [code] [demo]
- Xin Zhang, Jiaxian Guo, Paul Yoo, Yutaka Matsuo, Yusuke Iwasawa: "Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model." arXiv preprint arXiv:2306.07596 (2023) [arXiv] [code]
- Roy Hachnochi, Mingrui Zhao, Nadav Orzech, Rinon Gal, Ali Mahdavi-Amiri, Daniel Cohen-Or, Amit Haim Bermano: "Cross-domain Compositing with Pretrained Diffusion Models." arXiv preprint arXiv:2302.10167 (2023) [arXiv] [code]
- Shilin Lu, Yanzhu Liu, Adams Wai-Kin Kong: "TF-ICON: Diffusion-based Training-free Cross-domain Image Composition." ICCV (2023) [pdf] [code]
- Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen: "Paint by Example: Exemplar-based Image Editing with Diffusion Models." CVPR (2023) [arXiv] [code] [demo]
- Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, Daniel Aliaga: "ObjectStitch: Generative Object Compositing." CVPR (2023) [arXiv] [code]
- Sumith Kulal, Tim Brooks, Alex Aiken, Jiajun Wu, Jimei Yang, Jingwan Lu, Alexei A. Efros, Krishna Kumar Singh: "Putting People in Their Place: Affordance-Aware Human Insertion into Scenes." CVPR (2023) [paper] [code]
Token-to-Object
- Lingxiao Lu, Bo Zhang, Li Niu: "DreamCom: Finetuning Text-guided Inpainting Model for Image Composition." arXiv preprint arXiv:2309.15508 (2023) [arXiv] [code]
- Tianle Li, Max Ku, Cong Wei, Wenhu Chen: "DreamEdit: Subject-driven Image Editing." TMLR (2023) [arXiv] [code]
Related Topics
Foreground: 3D; Background: image
- Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht: "Scene-Conditional 3D Object Stylization and Composition." arXiv preprint arXiv:2312.12419 (2023) [arXiv] [code]
Foreground: 3D; Background: 3D
- Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, Federico Tombari: "InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes." arXiv preprint arXiv:2401.05335 (2024) [arXiv]
- Rahul Goel, Dhawal Sirikonda, Saurabh Saini, PJ Narayanan: "Interactive Segmentation of Radiance Fields." CVPR (2023) [arXiv] [code]
- Rahul Goel, Dhawal Sirikonda, Rajvi Shah, PJ Narayanan: "FusedRF: Fusing Multiple Radiance Fields." CVPR Workshop (2023) [arXiv]
- Verica Lazova, Vladimir Guzov, Kyle Olszewski, Sergey Tulyakov, Gerard Pons-Moll: "Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation." WACV (2023) [arXiv]
- Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, Gang Zeng: "Compressible-composable NeRF via Rank-residual Decomposition." NeurIPS (2022) [arXiv] [code]
- Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, Zhaopeng Cui: "Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering." ICCV (2021) [arXiv] [code]
Foreground: video; Background: image
- Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang: "ActAnywhere: Subject-Aware Video Background Generation." arXiv preprint arXiv:2401.10822 (2024) [arXiv]
Foreground: video; Background: video
- Jiaqi Guo, Sitong Su, Junchen Zhu, Lianli Gao, Jingkuan Song: "Training-Free Semantic Video Composition via Pre-trained Diffusion Model." arXiv preprint arXiv:2401.09195 (2024) [arXiv]
- Donghoon Lee, Tomas Pfister, Ming-Hsuan Yang: "Inserting Videos into Videos." CVPR (2019) [pdf]