Wenhui Wang
Hi @PeterDykas, Thanks for the question. We did three forward passes, one each for images, texts, and image-text pairs, since the different modalities use different maximum sequence lengths.
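For illustration, here is a minimal sketch of one training step with three separate forward passes. The `model` interface, batch keys, and loss output are assumptions for the example, not our actual training code:

```python
def training_step(model, image_batch, text_batch, pair_batch, optimizer):
    """Illustrative only: one optimizer step with three forward passes,
    since each modality is padded to its own maximum sequence length."""
    # Forward pass 1: image-only batch (masked image modeling).
    loss_img = model(image=image_batch["pixels"],
                     image_mask=image_batch["mask"])["loss"]

    # Forward pass 2: text-only batch (masked language modeling),
    # padded to the text-only max length.
    loss_txt = model(text=text_batch["input_ids"],
                     text_mask=text_batch["mask"])["loss"]

    # Forward pass 3: image-text pairs, padded to the shorter
    # caption max length used for paired data.
    loss_pair = model(image=pair_batch["pixels"],
                      text=pair_batch["input_ids"],
                      text_mask=pair_batch["mask"])["loss"]

    loss = loss_img + loss_txt + loss_pair
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```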
Hi @NieShenRuc, you can follow BERT/UniLM/RoBERTa to process the text-only data (English Wikipedia and BookCorpus), and then split it into several small files. Or you can directly use our stage-2...
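For example, a rough sketch of sharding a cleaned corpus (one document per line) into small files; the paths and shard size below are placeholders, not our preprocessing script:

```python
from pathlib import Path

def split_corpus(input_path, output_dir, docs_per_shard=100_000):
    """Illustrative sketch: split one large text file (one document per line,
    already cleaned BERT/RoBERTa-style) into smaller shard files."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    shard_idx, buffer = 0, []
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            buffer.append(line)
            if len(buffer) >= docs_per_shard:
                (output_dir / f"shard_{shard_idx:05d}.txt").write_text(
                    "".join(buffer), encoding="utf-8")
                shard_idx, buffer = shard_idx + 1, []
    if buffer:  # flush the final partial shard
        (output_dir / f"shard_{shard_idx:05d}.txt").write_text(
            "".join(buffer), encoding="utf-8")
```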
Hi @linhuixiao, We released the base and large models for efficient use. In our paper, we report the performance of the giant model. The base and large models we released, although...
Hi @jinxixiang, May I know the batch size you used for training? Maybe you can also remove the contrastive loss on the VL-FFN to keep things simple.
From your TensorBoard logs, I found vl_i2t and vl_t2i. They can slightly improve the model, but they are not very important.
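Concretely, removing the contrastive loss just means leaving the vl_i2t / vl_t2i terms out of the total loss. A hedged sketch with assumed loss names (not the repo's actual loss code):

```python
def combine_losses(losses, use_vl_contrastive=False):
    """Illustrative only: sum the pretraining losses, optionally adding the
    image-to-text / text-to-image contrastive terms computed on the VL-FFN
    (the vl_i2t / vl_t2i curves in TensorBoard)."""
    total = losses["mlm"] + losses["itm"]  # core objectives (assumed names)
    if use_vl_contrastive:
        total = total + losses["vl_i2t"] + losses["vl_t2i"]
    return total
```

Dropping the optional terms simplifies training at the cost of only a small difference in quality.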
Hi @liuxuannan, We have done some architecture explorations and found that decoupling the attention parameters brings improvements, so we released the models with the better architecture. Please refer to Section 11...
The pretraining is still mask-then-predict; it is not two modalities trained with two separate transformer models. We also use the image-text pairs to train the model to learn the alignments of...
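A rough sketch of the mask-then-predict idea on an image-text pair, with an assumed `model` interface and a hypothetical `mask_token_id` (not our actual implementation):

```python
import torch
import torch.nn.functional as F

def mask_then_predict_step(model, pair_batch, mask_token_id, mask_prob=0.15):
    """Illustrative only: one shared model sees the image-text pair, a subset
    of text tokens is masked, and the loss is predicting those tokens, so
    cross-modal alignment is learned inside a single transformer."""
    input_ids = pair_batch["input_ids"].clone()                    # (B, L)
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    labels = input_ids.masked_fill(~mask, -100)                    # ignore unmasked
    input_ids = input_ids.masked_fill(mask, mask_token_id)

    logits = model(image=pair_batch["pixels"], text=input_ids)     # (B, L, V)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)
```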
Hi @xinghua-qu, Please refer to Section 11 at [BEiT-3 Supp](https://openaccess.thecvf.com/content/CVPR2023/supplemental/Wang_Image_as_a_CVPR_2023_supplemental.pdf). We have done some architecture explorations and found that we can decouple attention parameters while still maintaining the ability to...
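For intuition only, here is a sketch of what decoupled attention parameters could look like: separate Q/K/V projections per modality, while attention still runs over the joint sequence. The actual design is described in Section 11 of the supplementary; the module below is an assumption-laden illustration, not the released architecture:

```python
import math
import torch
import torch.nn as nn

class DecoupledSelfAttention(nn.Module):
    """Illustrative sketch: separate Q/K/V projections for image and text
    tokens (decoupled parameters), while attention is still computed over the
    concatenated sequence so the two modalities keep interacting."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Decoupled parameters: one QKV projection per modality.
        self.qkv_image = nn.Linear(dim, 3 * dim)
        self.qkv_text = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, Li, D), text_tokens: (B, Lt, D)
        b = image_tokens.size(0)
        qkv_i = self.qkv_image(image_tokens)    # per-modality projection
        qkv_t = self.qkv_text(text_tokens)
        qkv = torch.cat([qkv_i, qkv_t], dim=1)  # joint image+text sequence
        q, k, v = qkv.chunk(3, dim=-1)

        def split_heads(x):
            return x.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, -1, self.num_heads * self.head_dim)
        return self.proj(out)
```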
@ImKeTT, could you provide your training command?
Hi, COCO and VG are easy to download. For SBU, CC3M and CC12M, you can refer to https://github.com/rom1504/img2dataset.
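For example, CC3M can be fetched with img2dataset's Python entry point. The input file and column names below are placeholders, and the exact options may change between versions, so please check the img2dataset README:

```python
from img2dataset import download  # pip install img2dataset

# Illustrative call, assuming a CC3M-style TSV with "url" and "caption"
# columns; see the img2dataset README for the current options.
download(
    url_list="cc3m.tsv",
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",
    output_folder="cc3m_shards",
    image_size=256,
    processes_count=16,
    thread_count=64,
)
```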