soho
[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
What about the VD (visual dictionary)?
Where is the link to the pre-trained SOHO model based on ResNet-101?
Hi, I conducted the pretraining with ResNet18 + a 3-layer transformer using in-domain data (without the MVM loss). I can get a similar result on the VQA downstream task, around 66.5 accuracy. But...

Hi, what is the MVM accuracy of your pretrained model? I only got about 30% during pretraining and wanted to know if that is normal.
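For reference, here is a minimal sketch of how MVM (Masked Visual Modeling) accuracy is typically measured: the fraction of masked grid positions whose predicted visual-dictionary index matches the ground-truth index. The function name and tensor shapes below are assumptions for illustration, not SOHO's actual code.

```python
import torch

def mvm_accuracy(logits: torch.Tensor,
                 target_indices: torch.Tensor,
                 mask: torch.Tensor) -> torch.Tensor:
    """Accuracy of Masked Visual Modeling over the masked positions only.

    logits:         (N, K) predicted scores over the K dictionary entries.
    target_indices: (N,)   ground-truth dictionary index per grid position.
    mask:           (N,)   bool, True where the region was masked out.
    """
    preds = logits.argmax(dim=1)                    # predicted dictionary index
    correct = (preds == target_indices) & mask      # count only masked positions
    return correct.sum().float() / mask.sum().clamp(min=1)
```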
Hi, would you release a tool for visualizing the visual dictionary?
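Until an official tool is released, one rough way to visualize the dictionary is to map every spatial feature of an image to its nearest dictionary entry and render the resulting index map. The sketch below assumes access to the CNN grid features and the learned dictionary embeddings; names and shapes are hypothetical.

```python
import torch
import matplotlib.pyplot as plt

def visualize_vd_assignments(feature_map: torch.Tensor,
                             dictionary: torch.Tensor) -> None:
    """Show which visual-dictionary index each spatial location maps to.

    feature_map: (C, H, W) grid features from the CNN backbone.
    dictionary:  (K, C)    learned dictionary embeddings.
    """
    C, H, W = feature_map.shape
    flat = feature_map.permute(1, 2, 0).reshape(-1, C)       # (H*W, C)
    indices = torch.cdist(flat, dictionary).argmin(dim=1)    # nearest entry per cell
    index_map = indices.reshape(H, W).cpu().numpy()

    plt.imshow(index_map, cmap="tab20")                      # one color per VD entry
    plt.colorbar(label="dictionary index")
    plt.title("Visual dictionary assignments")
    plt.show()
```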
Hi, many thanks for sharing SOHO. In README.md, I can only find how to pretrain and how to train a VQA model. However, there is no instruction to train or evaluate...
Thanks for your great code. This is impressive work that may inspire many others to follow it. Do you plan to release the training configurations and scripts of the...
In the paper, there is a Visual Dictionary (VD) that re-represents the query image features, but the SOHO_direct_VD class (SOHO/models/necks/utils.py) only operates on the image with torch.argmax in the code, which is...
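For context, an argmax-based visual-dictionary lookup usually amounts to nearest-embedding quantization with a straight-through gradient, as sketched below. The function name, shapes, and the omitted moving-average dictionary update are assumptions for illustration, not the repository's actual implementation.

```python
import torch

def vd_lookup(features: torch.Tensor,
              dictionary: torch.Tensor):
    """Quantize flattened grid features against a visual dictionary.

    features:   (N, C) flattened image-grid features.
    dictionary: (K, C) dictionary embeddings.
    Returns the quantized features and the chosen dictionary indices.
    """
    # Nearest entry by L2 distance; equivalently an argmax over negative
    # distances, which is where the torch.argmax call would come from.
    distances = torch.cdist(features, dictionary)            # (N, K)
    indices = torch.argmin(distances, dim=1)                 # (N,)
    quantized = dictionary[indices]                          # (N, C)

    # Straight-through estimator: forward pass uses the quantized vectors,
    # while gradients flow back to the original features.
    quantized = features + (quantized - features).detach()
    return quantized, indices
```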