VQA object tags are different from image feature

Open kehanlu opened this issue 3 years ago • 5 comments

Hi, I am currently working on VQA datasets. The Oscar-base VQA fine-tuning script in VinVL_MODEL_ZOO.md uses --data_label_type mask, so it reads the text data from train2014_qla_mrcnn.json, downloaded from https://biglmdiag.blob.core.windows.net/vinvl/datasets/vqa

I found that the object tags in train2014_qla_mrcnn.json differ from those in the prediction.tsv downloaded from the pre-extracted COCO 2014 Train/Val Image Features (~50G), although the img_features lengths are the same.
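For anyone who wants to quantify this mismatch, here is a minimal sketch that computes a per-image Jaccard overlap between the two tag sources. The field names are assumptions, not confirmed by the repository docs: it assumes each JSON record has an `img_id` and a tag string under `o`, and that each TSV line is `image_id<TAB>json` with an `objects` list whose entries carry a `class` name. The helper names (`words`, `jaccard`, `compare_tags`) are hypothetical, and tags are compared word by word, so multi-word tags are split.

```python
import json
from typing import Dict, Iterable, Set, Tuple


def words(tag_string: str) -> Set[str]:
    """Split a space- or semicolon-separated tag string into lowercase words."""
    return set(tag_string.replace(";", " ").lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def tags_from_tsv_line(line: str) -> Tuple[str, Set[str]]:
    """Parse one TSV line. ASSUMED layout: image_id<TAB>{"objects": [{"class": ...}]}."""
    img_id, payload = line.rstrip("\n").split("\t", 1)
    objects = json.loads(payload)["objects"]
    return img_id, words(" ".join(obj["class"] for obj in objects))


def compare_tags(qla_records: Iterable[dict], tsv_lines: Iterable[str]) -> Dict[str, float]:
    """Return {image_id: Jaccard overlap} for images present in both sources."""
    tsv_tags = dict(tags_from_tsv_line(line) for line in tsv_lines if line.strip())
    scores: Dict[str, float] = {}
    for rec in qla_records:
        img_id = str(rec["img_id"])  # ASSUMED key name
        if img_id in tsv_tags:
            scores[img_id] = jaccard(words(rec["o"]), tsv_tags[img_id])  # "o" is ASSUMED
    return scores


# Tiny in-memory example; real use would pass json.load(...) and an open TSV file.
example_qla = [{"img_id": 1, "o": "dog frisbee grass"}]
example_tsv = ['1\t{"objects": [{"class": "dog"}, {"class": "grass"}, {"class": "tree"}]}']
example_scores = compare_tags(example_qla, example_tsv)
```

A score of 1.0 means the two sources agree on the tag vocabulary for that image; lower values show how far the two tag sets diverge.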

Because the script uses --img_feature_type faster_r-cnn and --data_label_type mask, I guess the input object tags (text) come from Mask R-CNN while the image features come from Faster R-CNN.

Can you explain this design choice? Do you have experimental results for --img_feature_type faster_r-cnn combined with --data_label_type faster?

Thanks!

kehanlu avatar Mar 22 '21 09:03 kehanlu

Excuse me, may I ask whether you have these files train+val2014_qla_mrcnn.json, test2015_qla_mrcnn.json and test-dev2015_qla_mrcnn.json? I found these files are missing, making it difficult for inference and official evaluation.

yangapku avatar Mar 23 '21 11:03 yangapku

> Excuse me, may I ask whether you have these files train+val2014_qla_mrcnn.json, test2015_qla_mrcnn.json and test-dev2015_qla_mrcnn.json? I found these files are missing, making it difficult for inference and official evaluation.

No, they aren't provided in DOWNLOAD. I think we will have to create them ourselves somehow.

kehanlu avatar Mar 29 '21 11:03 kehanlu

In the closed issue #13, I noticed that the author described how to generate the Mask R-CNN-based object labels. I tried to reproduce the labels on the VQA training images. My generated labels are similar to the released image labels but still differ in places. I'm not sure whether these generated labels can reproduce the same VQA scores.

yangapku avatar Apr 06 '21 09:04 yangapku

I have exactly the same question. I am confused about which image features are used for VQA fine-tuning: predictions.tsv (VinVL features), --img_feature_type (faster_r-cnn), or --data_label_type (Mask R-CNN? See https://github.com/microsoft/Oscar/issues/13#issuecomment-645809973). Have you figured it out? Many thanks!

CCYChongyanChen avatar Oct 08 '21 17:10 CCYChongyanChen

Same question

shizhediao avatar Jan 27 '22 01:01 shizhediao