
Visual Question Answering: Datasets, Algorithms, and Future Challenges


Metadata

  • Authors: Kushal Kafle and Christopher Kanan
  • Organization: Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology
  • Paper: https://arxiv.org/pdf/1610.01465.pdf
  • Journal: Computer Vision and Image Understanding, 2017


Types of Questions

  • Object recognition - What is in the image?
  • Object detection - Are there any cats in the image?
  • Attribute classification - What color is the cat?
  • Scene classification - Is it sunny?
  • Counting - How many cats are in the image?
  • Spatial relation - What is between the cat and the sofa?
  • Common sense reasoning - Why is the girl crying?

VQA vs. Object Detection and Recognition

  • Label ambiguity: The label depends on the task.
  • No understanding of the role of an object within a larger context.
  • This is in contrast with VQA.

VQA vs. Image Captioning

  • Automatic evaluation metrics are not consistent with human judgment.
  • Many captions are applicable to one image.
  • Dense image captioning addresses the above problem but omits important relationships between objects.
  • Image captioning is task-agnostic, while VQA questions have specific and unambiguous answers, making VQA more amenable to automatic evaluation than image captioning.

Dataset Statistics

Evaluation Metrics
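
The metric most widely reported is the consensus accuracy introduced with the VQA dataset: each question has ten human answers, and a prediction is scored by how many annotators gave it, with three agreeing annotators yielding full credit. Below is a minimal sketch of the commonly used simplified form (the official metric additionally averages over annotator subsets):

```python
# Simplified form of the VQA-dataset consensus accuracy:
# Acc(ans) = min(#annotators who gave ans / 3, 1).
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators agree with the prediction -> accuracy 2/3.
print(vqa_accuracy("cat", ["cat", "cat"] + ["dog"] * 8))  # 0.666...
```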

Algorithms

  • Methods:
    • Baseline models
    • Bayesian and Question-Aware Models
    • Attention Based Models
    • Bilinear Pooling Methods
    • Compositional VQA Models
  • What methods and techniques work better?
    • ResNet-101 image features outperform VGG-16 features.
    • Use spatial attention, but attention alone does not appear to be sufficient.
    • Bayesian and compositional architectures do not significantly improve over comparable models.
    • Neural Module Network (NMN) models do not outperform comparable non-compositional models, but perform well on positional reasoning in the SHAPES dataset.
  • Important problems:
    • VQA models suffer from severe language biases in the training dataset: the predicted answer can change when the question is rephrased.
    • Simple models that do not use attention (combining multiple global image features from VGG-19, ResNet-101, and ResNet-152 via both element-wise multiplication and addition) have been shown to exceed earlier models that used complex attention mechanisms (a sketch of both fusion and attention follows this list).
    • Attention does not ensure good VQA performance, but incorporating attention appears to improve performance over the same model without attention.
    • Machine attention sometimes differs from human attention. This may be because the regions the model learns to attend to are discriminative due to biases in the dataset, not because they are where the algorithm should attend.
    • Existing VQA benchmarks are not sufficient to evaluate whether an algorithm has 'solved' VQA.
    • Future datasets need to be larger and less biased.
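
As a concrete reference for the fusion and attention points above, here is a minimal PyTorch sketch of both families: a joint-embedding baseline that fuses a global CNN image feature with a GRU question encoding by element-wise multiplication, and a variant that attends over a grid of region features. This is illustrative code under assumed dimensions (2048-d ResNet-style features, a 1000-answer classifier), not an implementation from the survey.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingVQA(nn.Module):
    """Baseline: fuse a global image feature with a question encoding
    by element-wise multiplication, then classify over the top answers."""
    def __init__(self, vocab_size, num_answers, img_dim=2048, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300, padding_idx=0)
        self.gru = nn.GRU(300, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)

    def encode_question(self, tokens):
        _, h = self.gru(self.embed(tokens))   # h: (1, B, hidden)
        return h.squeeze(0)                   # (B, hidden)

    def forward(self, img_feats, tokens):
        # img_feats: (B, img_dim) global features from a pretrained CNN
        q = self.encode_question(tokens)
        v = torch.tanh(self.img_proj(img_feats))
        return self.classifier(q * v)         # element-wise fusion

class SpatialAttentionVQA(JointEmbeddingVQA):
    """Variant: question-guided soft attention over R region features
    replaces the single global image feature."""
    def __init__(self, vocab_size, num_answers, img_dim=2048, hidden=1024):
        super().__init__(vocab_size, num_answers, img_dim, hidden)
        self.att = nn.Linear(hidden, 1)

    def forward(self, grid_feats, tokens):
        # grid_feats: (B, R, img_dim) features for R spatial regions
        q = self.encode_question(tokens)                   # (B, hidden)
        v = torch.tanh(self.img_proj(grid_feats))          # (B, R, hidden)
        scores = self.att(v * q.unsqueeze(1)).squeeze(-1)  # (B, R)
        alpha = F.softmax(scores, dim=1)                   # attention map
        attended = (alpha.unsqueeze(-1) * v).sum(dim=1)    # (B, hidden)
        return self.classifier(q * attended)

# Smoke test with random inputs (batch of 2, 36 regions, 8-token questions).
model = SpatialAttentionVQA(vocab_size=10000, num_answers=1000)
logits = model(torch.randn(2, 36, 2048), torch.randint(1, 10000, (2, 8)))
print(logits.shape)  # torch.Size([2, 1000])
```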

Another VQA Survey:
