
Visual Question Answering: Datasets, Algorithms, and Future Challenges


Metadata

  • Authors: Kushal Kafle and Christopher Kanan
  • Organization: Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology
  • Paper: https://arxiv.org/pdf/1610.01465.pdf
  • Journal: Computer Vision and Image Understanding, 2017


Types of Questions

  • Object recognition - What is in the image?
  • Object detection - Are there any cats in the image?
  • Attribute classification - What color is the cat?
  • Scene classification - Is it sunny?
  • Counting - How many cats are in the image?
  • Spatial relation - What is between the cat and the sofa?
  • Common sense reasoning - Why is the girl crying?

VQA vs. Object Detection and Recognition

  • Label ambiguity: The label depends on the task.
  • No understanding of the role of an object within a larger context.
  • This is in contrast with VQA.

VQA vs. Image Captioning

  • Automatic evaluation metrics are not consistent with human judgment.
  • Many captions are applicable to one image.
  • Dense image captioning addresses the above problem but omits important relationships between objects.
  • Image captioning is task-agnostic, while VQA questions have specific and unambiguous answers, making VQA more amenable to automatic evaluation than image captioning.

Dataset Statistics

Evaluation Metrics
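
The metric most widely reported is the consensus accuracy introduced with the VQA dataset: each question has ten human answers, and a prediction is scored by how many annotators gave it, with three agreeing annotators yielding full credit. Below is a minimal sketch of the commonly used simplified form (the official metric additionally averages over annotator subsets):

```python
# Simplified form of the VQA-dataset consensus accuracy:
# Acc(ans) = min(#annotators who gave ans / 3, 1).
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators agree with the prediction -> accuracy 2/3.
print(vqa_accuracy("cat", ["cat", "cat"] + ["dog"] * 8))  # 0.666...
```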

Algorithms

  • Methods:
    • Baseline models
    • Bayesian and Question-Aware Models
    • Attention Based Models
    • Bilinear Pooling Methods
    • Compositional VQA Models
  • What methods and techniques work better?
    • ResNet-101 image features outperform VGG-16 features.
    • Use spatial attention, but attention alone does not appear to be sufficient.
    • Bayesian and compositional architectures do not significantly improve over comparable models.
    • Neural Module Network (NMN) models do not outperform comparable non-compositional models, but perform well on positional reasoning in the SHAPES dataset.
  • Important problems:
    • VQA models suffer from severe language biases in the training dataset: the predicted answer can change when the question is rephrased.
    • Simple models that do not use attention (combining multiple global image features from VGG-19, ResNet-101, and ResNet-152 via both element-wise multiplication and addition) have been shown to exceed earlier models that used complex attention mechanisms (a sketch of both fusion and attention follows this list).
    • Attention does not ensure good VQA performance, but incorporating attention appears to improve performance over the same model without attention.
    • Machine attention sometimes differs from human attention. This may be because the regions the model learns to attend to are discriminative due to biases in the dataset, not because they are where the algorithm should attend.
    • Existing VQA benchmarks are not sufficient to evaluate whether an algorithm has 'solved' VQA.
    • Future datasets need to be larger and less biased.
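
As a concrete reference for the fusion and attention points above, here is a minimal PyTorch sketch of both families: a joint-embedding baseline that fuses a global CNN image feature with a GRU question encoding by element-wise multiplication, and a variant that attends over a grid of region features. This is illustrative code under assumed dimensions (2048-d ResNet-style features, a 1000-answer classifier), not an implementation from the survey.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingVQA(nn.Module):
    """Baseline: fuse a global image feature with a question encoding
    by element-wise multiplication, then classify over the top answers."""
    def __init__(self, vocab_size, num_answers, img_dim=2048, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300, padding_idx=0)
        self.gru = nn.GRU(300, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)

    def encode_question(self, tokens):
        _, h = self.gru(self.embed(tokens))   # h: (1, B, hidden)
        return h.squeeze(0)                   # (B, hidden)

    def forward(self, img_feats, tokens):
        # img_feats: (B, img_dim) global features from a pretrained CNN
        q = self.encode_question(tokens)
        v = torch.tanh(self.img_proj(img_feats))
        return self.classifier(q * v)         # element-wise fusion

class SpatialAttentionVQA(JointEmbeddingVQA):
    """Variant: question-guided soft attention over R region features
    replaces the single global image feature."""
    def __init__(self, vocab_size, num_answers, img_dim=2048, hidden=1024):
        super().__init__(vocab_size, num_answers, img_dim, hidden)
        self.att = nn.Linear(hidden, 1)

    def forward(self, grid_feats, tokens):
        # grid_feats: (B, R, img_dim) features for R spatial regions
        q = self.encode_question(tokens)                   # (B, hidden)
        v = torch.tanh(self.img_proj(grid_feats))          # (B, R, hidden)
        scores = self.att(v * q.unsqueeze(1)).squeeze(-1)  # (B, R)
        alpha = F.softmax(scores, dim=1)                   # attention map
        attended = (alpha.unsqueeze(-1) * v).sum(dim=1)    # (B, hidden)
        return self.classifier(q * attended)

# Smoke test with random inputs (batch of 2, 36 regions, 8-token questions).
model = SpatialAttentionVQA(vocab_size=10000, num_answers=1000)
logits = model(torch.randn(2, 36, 2048), torch.randint(1, 10000, (2, 8)))
print(logits.shape)  # torch.Size([2, 1000])
```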

Another VQA Survey:
