bottom-up-attention
What features are used to train the VQA model? Do you use only the 2048-dimensional features?
In your code, the image_id, image_h, image_w, num_boxes, boxes, and features are extracted and saved. But in your paper, it seems that only the features are used to represent the image. Do you use embeddings of the predicted classes or the bounding boxes to train the VQA model?
No, we didn't use the class labels or the bounding boxes. I ran some initial experiments along those lines, but performance didn't change much. Mostly we used the boxes just for visualization.
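
For later readers: per the answer above, only the 2048-dimensional `features` array is fed to the VQA model; `boxes` are kept mainly for visualization. Below is a minimal sketch of decoding the saved TSV records and keeping just the features, assuming the field layout named in the question (`image_id, image_w, image_h, num_boxes, boxes, features`) with the array fields base64-encoded; the filename is a placeholder:

```python
import base64
import csv
import sys

import numpy as np

# The array fields can be large, so raise the csv field size limit.
csv.field_size_limit(sys.maxsize)

FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

def load_features(tsv_path):
    """Yield (image_id, features) pairs; features has shape (num_boxes, 2048)."""
    with open(tsv_path) as f:
        reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES)
        for item in reader:
            num_boxes = int(item['num_boxes'])
            # Decode only the 2048-d per-region features; boxes are skipped,
            # since they are not used as model input.
            features = np.frombuffer(
                base64.b64decode(item['features']), dtype=np.float32
            ).reshape(num_boxes, 2048)
            yield item['image_id'], features

# Example usage (placeholder path):
# for image_id, features in load_features('trainval_features.tsv'):
#     ...  # feed `features` to the VQA model's image encoder
```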