Jaemin Cho
As you can see in the [`__getitem__`](https://github.com/j-min/VL-T5/blob/main/VL-T5/src/vqa_data.py#L143-L173) of the Dataset class, `vis_feats` is a 2048-dim feature from Faster R-CNN, and `boxes` are the 4-point coordinates of the bounding boxes. The Faster R-CNN features...
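To make the shapes concrete, here is a minimal sketch of that `__getitem__` structure. The field names `vis_feats` and `boxes` follow the repo; the random arrays and the choice of 36 regions are stand-ins for real Faster R-CNN outputs, not the repo's actual loading code.

```python
import numpy as np

class VQADatasetSketch:
    """Hypothetical sketch of the item structure returned by the Dataset."""

    def __init__(self, n_boxes=36, feat_dim=2048):
        self.n_boxes = n_boxes
        self.feat_dim = feat_dim

    def __getitem__(self, idx):
        # 2048-dim region features, one row per detected box
        vis_feats = np.random.rand(self.n_boxes, self.feat_dim).astype(np.float32)
        # (x1, y1, x2, y2) box coordinates, typically normalized to [0, 1]
        boxes = np.random.rand(self.n_boxes, 4).astype(np.float32)
        return {"vis_feats": vis_feats, "boxes": boxes}

item = VQADatasetSketch()[0]
print(item["vis_feats"].shape)  # (36, 2048)
print(item["boxes"].shape)      # (36, 4)
```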
I'm afraid you didn't load the pretrained checkpoint properly. Please check out [`load_checkpoint`](https://github.com/j-min/VL-T5/blob/main/VL-T5/src/vqa.py#L91) used in vqa.py, which is defined in [`trainer_base.py`](https://github.com/j-min/VL-T5/blob/cafc314de831ec5c9fcf5b05e91d3f162712836d/VL-T5/src/trainer_base.py#L171).
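As a sanity check, the kind of bookkeeping a checkpoint loader has to get right is matching the checkpoint's state-dict keys to the model's (e.g., stripping a DataParallel-style `module.` prefix) and reporting what failed to load. This is a hypothetical illustration of those checks, not the repo's actual `load_checkpoint`:

```python
def load_state_dict_sketch(model_keys, ckpt_state):
    """Align checkpoint keys with model keys; return what loaded and what didn't.
    (Hypothetical helper for illustration only.)"""
    # Strip a DataParallel-style "module." prefix if present
    cleaned = {k[len("module."):] if k.startswith("module.") else k: v
               for k, v in ckpt_state.items()}
    loaded = {k: v for k, v in cleaned.items() if k in model_keys}
    missing = sorted(set(model_keys) - set(cleaned))
    unexpected = sorted(set(cleaned) - set(model_keys))
    return loaded, missing, unexpected

model_keys = {"encoder.weight", "decoder.weight"}
ckpt = {"module.encoder.weight": 1, "module.decoder.weight": 2, "module.extra": 3}
loaded, missing, unexpected = load_state_dict_sketch(model_keys, ckpt)
print(sorted(loaded))       # ['decoder.weight', 'encoder.weight']
print(missing, unexpected)  # [] ['extra']
```

If `missing` is non-empty after loading, the checkpoint was probably not applied to the model you think it was.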
I created [a google colab](https://colab.research.google.com/github/j-min/VL-T5/blob/main/inference_example.ipynb) for custom image processing. Hope this helps.
Yes, the py-bottom-up-attention repo is compatible with the Hugging Face Transformers LXMERT demo. VCR questions (https://visualcommonsense.com/explore/?im=2519) have a different format than VQA, for example, person grounding and multiple choice. So I don't think...
> Looking at the VL-T5 paper, it seems like the decoder generates text in an autoregressive manner, i.e. it predicts the probability of future text tokens (among all the tokens...
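Yes, that is the standard autoregressive setup. A minimal sketch of greedy decoding, with a toy scoring function standing in for the real decoder (the token ids and the `toy_logits` model are made up for illustration):

```python
def greedy_decode(step_logits_fn, bos=0, eos=1, max_len=10):
    """Autoregressive decoding sketch: at each step the model scores every
    vocabulary token given the current prefix, and we append the argmax."""
    tokens = [bos]
    for _ in range(max_len):
        logits = step_logits_fn(tokens)  # scores over the whole vocab
        next_tok = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_tok)
        if next_tok == eos:  # stop once the end-of-sequence token is emitted
            break
    return tokens

# Toy "model": prefers token 2 until the prefix has 3 tokens, then emits EOS (id 1)
def toy_logits(prefix):
    return [0.0, 1.0, 0.5] if len(prefix) >= 3 else [0.0, 0.5, 1.0]

print(greedy_decode(toy_logits))  # [0, 2, 2, 1]
```

In practice beam search or sampling replaces the argmax step, but the prefix-conditioned loop is the same.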
`scores` are from the [VQA evaluation](https://visualqa.org/evaluation.html). Many VQA methods train models by directly regressing the soft scores (e.g., [lxmert](https://github.com/airsplay/lxmert/blob/master/src/tasks/vqa_model.py)). But in our text-generation-based method, I just used [score = 1...
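For reference, a simplified version of the VQA soft score is `min(#matching human answers / 3, 1)`. (The official metric additionally averages this over all 9-annotator subsets of the 10 human answers; that averaging is omitted here for brevity.)

```python
from collections import Counter

def vqa_soft_score(answer, human_answers):
    """Simplified VQA soft score: an answer gets full credit once at least
    3 of the human annotators gave it, partial credit below that."""
    count = Counter(human_answers)[answer]
    return min(count / 3.0, 1.0)

humans = ["yes"] * 8 + ["no"] * 2  # 10 human annotations per question
print(vqa_soft_score("yes", humans))            # 1.0
print(round(vqa_soft_score("no", humans), 3))   # 0.667
```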
You don't have to use the `vqa: ` prefix, especially if you finetune on enough data.
Yes, you can check out VCR for such a setting. You also might want to check [Visual7W](https://arxiv.org/pdf/1511.03416.pdf) and how models tackle these datasets.
Could you please check the version of the transformers package? With `transformers==4.2.1` (mentioned in requirements.txt), both tokenizers yield the same results:
```bash
I <extra_id_0> you.
[27, 32099, 25, 5, 1]
I...
```
It's probably because the pretraining objective for text generation (span prediction) always involves short target text. I guess zero-shot captioning might not work well. You would need to tune the...
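To see why the targets are short, here is a minimal sketch of T5-style span prediction on a single span (real span corruption samples multiple spans at random; this toy version takes the span position as an argument):

```python
def span_corrupt(tokens, span_start, span_len):
    """T5-style span prediction sketch: replace one span with a sentinel.
    The target is just the sentinel plus the masked span, which is why
    pretraining never exposes the decoder to long target texts."""
    sentinel = "<extra_id_0>"
    inp = tokens[:span_start] + [sentinel] + tokens[span_start + span_len:]
    tgt = [sentinel] + tokens[span_start:span_start + span_len]
    return inp, tgt

tokens = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(tokens, 2, 2)
print(" ".join(inp))  # the quick <extra_id_0> jumps over the lazy dog
print(" ".join(tgt))  # <extra_id_0> brown fox
```

Captions are much longer than these few-token targets, which is why generation length likely needs tuning before zero-shot captioning works.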