X2-VLM
All-In-One VLM: Image + Video + Transfer to Other Languages / Domains (TPAMI 2023)
X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
X2-VLM, with its modular architecture, performs best at both base and large scale on image-text and video-text tasks, offering a good trade-off between performance and model scale. The modular design also makes X2-VLM highly transferable to other languages and domains. For example, by simply replacing the text encoder with XLM-R, X2-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training.
- Jun 2023: Release official PyTorch implementation and checkpoints
- Nov 2022: Release preprint on arXiv
Features
- Support several backbones
- vision encoder: beit / clip-vit / swin-transformer
- text encoder: bert / roberta
- Support apex O1 / O2 for pre-training
- Read from and write to HDFS
- Distributed training across nodes for both pre-training and fine-tuning
Please read the code for more details.
Requirements
- Install python3 environment
pip3 install -r requirements.txt
- Download raw images from corresponding websites
- Download the JSON files we provide, which contain image read paths and captions and/or bbox annotations
- If running pre-training scripts:
- install Apex
- download pre-trained models for parameter initialization
- image encoder: beit2
- text encoder: bert
Pretrain
# X2-VLM pretrain
python3 run.py --task "pretrain_DIY" --dist "all" --config "configs/pretrain/x2vlm_base_4m.yaml" --output_dir "output/tmp"
# CCLM multilingual multimodal pretrain
python3 run.py --task "pretrain_DIY" --dist "all" --config "configs/pretrain/multilingual_cclm_x2vlm_base.yaml" --checkpoint "path/to/x2vlm_base_1b.th" --output_dir "output/tmp"
See run.py and configs/pretrain for more details.
Data
All datasets we used are publicly available. Please prepare the pre-training data yourself. Read dataset/pretrain_dataset.py (specifically ImageTextJsonDataset and RegionTextJsonDataset) to see what format is required.
The processed COCO & VG annotations can be downloaded here.
Checkpoints
Please make sure all parameters are loaded correctly.
X2VLM-base (4M)
X2VLM-large (4M)
X2VLM-base (1B)
X2VLM-large (1B)
CCLM-X2VLM-base
CCLM-X2VLM-large
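One way to act on the note about parameters being loaded correctly is to compare the parameter names the model expects with the names stored in the checkpoint. The sketch below is a minimal illustration, not code from this repository: the helper name `compare_state_dict_keys` and the example key names are hypothetical, and it works on plain lists of names so it runs without PyTorch installed.

```python
from typing import Iterable, List, Tuple


def compare_state_dict_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) parameter names.

    missing:    names the model expects but the checkpoint lacks
    unexpected: names in the checkpoint that the model does not use
    """
    model_set = set(model_keys)
    ckpt_set = set(checkpoint_keys)
    missing = sorted(model_set - ckpt_set)
    unexpected = sorted(ckpt_set - model_set)
    return missing, unexpected


if __name__ == "__main__":
    # Hypothetical key names, for illustration only.
    model_keys = ["vision_encoder.weight", "text_encoder.weight", "head.bias"]
    checkpoint_keys = ["vision_encoder.weight", "text_encoder.weight", "old_head.bias"]

    missing, unexpected = compare_state_dict_keys(model_keys, checkpoint_keys)
    print("missing from checkpoint:", missing)
    print("unexpected in checkpoint:", unexpected)
```

With PyTorch, the two inputs would typically be `model.state_dict().keys()` and the keys of the loaded checkpoint dictionary. Alternatively, `model.load_state_dict(state, strict=False)` returns an object with `missing_keys` and `unexpected_keys` fields that carry the same information. Both lists should be empty, or contain only names you expect to differ, such as a freshly initialized task head.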
Finetune
Data
All datasets are publicly available. Some datasets can be downloaded here.
Checkpoints, Configs and Logs
We have released all of the code. For now, however, we provide only some of the fine-tuned checkpoints, together with their training configs and logs.
vqa-base
vqa-large
captioning-large
refcoco-bbox-large
It takes time for us to retrieve our previous training logs. If you need more, please submit a GitHub issue and we will respond to your request.
coco-retrieval-base-rerun
coco-retrieval-large-rerun
Examples
# train
python3 run.py --task "vqa" --dist "all" --config "configs/finetune/vqa2_large.yaml" --checkpoint "x2vlm_ckpts_2release/x2vlm_large_1b.th" --output_dir "output/tmp"
python3 run.py --task "refcoco_bbox" --dist "all" --config "configs/finetune/refcoco_grounding_large.yaml" --checkpoint "x2vlm_ckpts_2release/x2vlm_large_1b.th" --output_dir "output/tmp"
python3 run.py --task "coco_captioning_mlm" --dist "all" --config "configs/finetune/coco_captioning_large.yaml" --checkpoint "x2vlm_ckpts_2release/x2vlm_large_1b.th" --output_dir "output/tmp"
We release all of the training code. Specify "--task" and "--config" to fine-tune on other tasks. See run.py for details.
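Since every command above follows the same pattern, a small wrapper can assemble the argument list for a given task and config. This is a sketch under stated assumptions: `build_run_command` is a hypothetical helper that is not part of the repository, and it uses only the flags that appear in the commands above (`--task`, `--dist`, `--config`, `--checkpoint`, `--output_dir`). Check run.py for the task names it actually accepts.

```python
import shlex
from typing import List, Optional


def build_run_command(
    task: str,
    config: str,
    output_dir: str,
    checkpoint: Optional[str] = None,
    dist: str = "all",
) -> List[str]:
    """Assemble the argument list for a run.py invocation."""
    cmd = ["python3", "run.py", "--task", task, "--dist", dist, "--config", config]
    if checkpoint is not None:
        # Pre-training from scratch may omit the checkpoint flag.
        cmd += ["--checkpoint", checkpoint]
    cmd += ["--output_dir", output_dir]
    return cmd


if __name__ == "__main__":
    cmd = build_run_command(
        task="vqa",
        config="configs/finetune/vqa2_large.yaml",
        checkpoint="x2vlm_ckpts_2release/x2vlm_large_1b.th",
        output_dir="output/tmp",
    )
    # Print a shell-safe string; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when a path contains spaces, and the list can be passed directly to `subprocess.run`.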
Citation
If you find this repository useful, please consider giving it a ⭐ or citing:
@article{zeng2022x,
title={X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks},
author={Zeng, Yan and Zhang, Xinsong and Li, Hang and Wang, Jiawei and Zhang, Jipeng and Zhou, Wangchunshu},
journal={arXiv preprint arXiv:2211.12402},
year={2022}
}
@article{zeng2022cross,
title={Cross-view language modeling: Towards unified cross-lingual cross-modal pre-training},
author={Zeng, Yan and Zhou, Wangchunshu and Luo, Ao and Zhang, Xinsong},
journal={arXiv preprint arXiv:2206.00621},
year={2022}
}
@article{xvlm,
title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
journal={arXiv preprint arXiv:2111.08276},
year={2021}
}
Contact
For issues using this code, please submit a GitHub issue.