new image bbox

Open · 21157651 opened this issue 3 years ago · 13 comments

How do I get the 'bbox' field in BriVL/BriVL-code-inference/data/jsonls/example.jsonl?

21157651 — Aug 27 '21

The bboxes in these examples contain 100 ROIs each. How can Faster R-CNN be used to detect that many objects?

knaffe — Aug 31 '21

I used Detectron2 with the mask_rcnn_R_50_FPN_3x.yaml weights to get 100 candidate bboxes, but the coordinates are not exactly the same as those in example.jsonl. So I'd like to know whether the object detector used in this project could be provided, for complete reproduction.
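
For reference, a minimal sketch of the kind of Detectron2 setup I mean (not my exact script; the score threshold here is illustrative):

```python
# Extract up to 100 candidate boxes with Detectron2's model zoo weights.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.0  # keep low-confidence boxes too
cfg.TEST.DETECTIONS_PER_IMAGE = 100          # cap at 100 candidates

predictor = DefaultPredictor(cfg)
image = cv2.imread("example.jpg")
instances = predictor(image)["instances"]
boxes = instances.pred_boxes.tensor.cpu().numpy()  # (N, 4) as x1, y1, x2, y2
print(boxes.shape)
```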

MischaQI — Sep 01 '21

BriVL uses the Bottom-Up Attention model as its object detector; this model can be obtained from BriVL-BUA-applications.

chuhaojin — Sep 01 '21

By the way, I have tested the AIC-ICC validation set against BriVL-API 1.0, but the retrieval result is very low (Recall@1 < 1%). I used your released retrieval code together with Faiss vector retrieval, but still got disappointing results. Could you release more details about this experiment from the paper?
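
The Faiss part of my pipeline looks roughly like this (a simplified sketch; the feature file names are placeholders, and I assume 5 captions per image in dataset order):

```python
# Text-to-image retrieval with Faiss: cosine similarity via inner
# product on L2-normalized features, then Recall@1.
import numpy as np
import faiss

img_feats = np.load("img_feats.npy").astype("float32")  # (N_img, d)
txt_feats = np.load("txt_feats.npy").astype("float32")  # (5 * N_img, d)
faiss.normalize_L2(img_feats)
faiss.normalize_L2(txt_feats)

index = faiss.IndexFlatIP(img_feats.shape[1])
index.add(img_feats)
_, top1 = index.search(txt_feats, 1)  # nearest image for each caption

gt_img = np.arange(len(txt_feats)) // 5  # ground-truth image per caption
print(f"t2i Recall@1: {(top1[:, 0] == gt_img).mean():.2%}")
```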

knaffe — Sep 01 '21

> BriVL uses the Bottom-Up Attention model as its object detector; this model can be obtained from BriVL-BUA-applications.

Hi, I used BriVL-BUA-applications to get the bboxes. I modified the extract-bua-caffe-r101.yaml file, changing MAX_BOXES from 45 to 100, but the coordinates are not exactly the same as those in example.jsonl, so I am a little confused. I don't know what went wrong while using BriVL-BUA-applications. Could you take an image from example.jsonl and generate its bboxes with BriVL-BUA-applications? Thank you very much!

zgj-gutou — Sep 03 '21

> > BriVL uses the Bottom-Up Attention model as its object detector; this model can be obtained from BriVL-BUA-applications.
>
> Hi, I used BriVL-BUA-applications to get the bboxes. I modified the extract-bua-caffe-r101.yaml file, changing MAX_BOXES from 45 to 100, but the coordinates are not exactly the same as those in example.jsonl, so I am a little confused. I don't know what went wrong while using BriVL-BUA-applications. Could you take an image from example.jsonl and generate its bboxes with BriVL-BUA-applications? Thank you very much!

Due to differences in library versions or machines, the bounding-box results will vary slightly; this does not affect the performance of BriVL. You can also compute the IoU between the two sets of bounding boxes to verify their correctness.
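
For example, a quick check could look like this (a minimal sketch, not code from this repo; boxes are assumed to be in x1, y1, x2, y2 format):

```python
# Compare two sets of boxes by matching each extracted box to its
# best-overlapping counterpart in example.jsonl.
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-8)

def mean_best_iou(boxes_mine, boxes_ref):
    return np.mean([max(iou(a, b) for b in boxes_ref) for a in boxes_mine])
```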

chuhaojin — Sep 03 '21

We just fixed a bug: the image size in cfg/test.yml must be changed to 380. Please pay attention to this when using BriVL; sorry for the inconvenience.

chuhaojin — Sep 03 '21

> > BriVL uses the Bottom-Up Attention model as its object detector; this model can be obtained from BriVL-BUA-applications.
>
> Hi, I used BriVL-BUA-applications to get the bboxes. I modified the extract-bua-caffe-r101.yaml file, changing MAX_BOXES from 45 to 100, but the coordinates are not exactly the same as those in example.jsonl, so I am a little confused. I don't know what went wrong while using BriVL-BUA-applications. Could you take an image from example.jsonl and generate its bboxes with BriVL-BUA-applications? Thank you very much!

I can reproduce the same bboxes as those in example.jsonl.

troilus-canva — Sep 05 '21

> > > BriVL uses the Bottom-Up Attention model as its object detector; this model can be obtained from BriVL-BUA-applications.
> >
> > Hi, I used BriVL-BUA-applications to get the bboxes. I modified the extract-bua-caffe-r101.yaml file, changing MAX_BOXES from 45 to 100, but the coordinates are not exactly the same as those in example.jsonl, so I am a little confused. I don't know what went wrong while using BriVL-BUA-applications. Could you take an image from example.jsonl and generate its bboxes with BriVL-BUA-applications? Thank you very much!
>
> I can reproduce the same bboxes as those in example.jsonl.

Hello, how did you do that? Can you tell me what you changed in the extract-bua-caffe-r101.yaml file? Thank you!

zgj-gutou — Sep 06 '21

> > > > BriVL uses the Bottom-Up Attention model as its object detector; this model can be obtained from BriVL-BUA-applications.
> > >
> > > Hi, I used BriVL-BUA-applications to get the bboxes. I modified the extract-bua-caffe-r101.yaml file, changing MAX_BOXES from 45 to 100, but the coordinates are not exactly the same as those in example.jsonl, so I am a little confused. I don't know what went wrong while using BriVL-BUA-applications. Could you take an image from example.jsonl and generate its bboxes with BriVL-BUA-applications? Thank you very much!
> >
> > I can reproduce the same bboxes as those in example.jsonl.
>
> Hello, how did you do that? Can you tell me what you changed in the extract-bua-caffe-r101.yaml file? Thank you!

I didn't change anything except the device, from cuda to cpu, since I'm running it on a Mac, and then ran the command mentioned in the README: `python3 bbox_extractor.py --img_path ../BriVL/data/imgs/baike_14014334_0.jpg --out_path test_data/test1.npz`.
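
To inspect what bbox_extractor.py wrote out, something like this works (the stored key names are a guess; check npz.files for the actual ones):

```python
# Peek into the .npz produced by bbox_extractor.py.
import numpy as np

npz = np.load("test_data/test1.npz")
print(npz.files)           # names of the stored arrays
boxes = npz[npz.files[0]]  # e.g. the (100, 4) bbox array
print(boxes.shape, boxes[:3])
```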

troilus-canva — Sep 07 '21

> By the way, I have tested the AIC-ICC validation set against BriVL-API 1.0, but the retrieval result is very low (Recall@1 < 1%). I used your released retrieval code together with Faiss vector retrieval, but still got disappointing results. Could you release more details about this experiment from the paper?

Hi, I got similar results to yours on the AIC-ICC validation set (30,000 images with 5 captions each): i2t R@1: 1.57%, t2i R@1: 0.48%. After going into the details, I found that the model does produce some reasonable results, e.g.: [screenshot: Screenshot_from_2021-10-19_17-42-47]

The highlighted text at the bottom-left is the query text, and the ground-truth image is above the text. The three images on the right are the top-3 images matched by the model. However, as the example shows, the model only matches the words "裙子" (skirt) and "女孩" (girl) and ignores the other information, which severely affects the recall.

Moreover, I found another paper (https://arxiv.org/abs/2109.04699v2) that ran the same evaluation on the AIC-ICC dataset. They mention that they conducted their experiments on the "test subset" of AIC-ICC, which contains only 10,000 samples, and the results they report for the WenLan model are similar to those in the WenLan paper. But the validation set contains 30,000 images and 150,000 captions. [screenshot: E-CLIP_dataset_detail]

Could the authors @chuhaojin provide more details about the test set and any pre-processing procedures? Many thanks!
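
For reference, the recall computation I mean looks roughly like this (a sketch, assuming the 5 captions of each image are stored consecutively in dataset order):

```python
# Recall@k for both directions from a caption-by-image similarity matrix.
import numpy as np

def recall_at_k(sim, k=1):
    """sim: (5 * N_img, N_img) text-to-image similarity matrix."""
    n_txt, n_img = sim.shape
    gt = np.arange(n_txt) // 5                 # ground-truth image per caption
    top_t2i = np.argsort(-sim, axis=1)[:, :k]  # top-k images per caption
    t2i = np.mean([gt[i] in top_t2i[i] for i in range(n_txt)])
    # i2t: an image is a hit if any of its 5 captions is in its top-k.
    top_i2t = np.argsort(-sim.T, axis=1)[:, :k]
    i2t = np.mean([np.isin(top_i2t[j], np.arange(5 * j, 5 * j + 5)).any()
                   for j in range(n_img)])
    return i2t, t2i
```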

Qiulin-W — Oct 19 '21

@Qiulin-W @knaffe @chuhaojin The following results were obtained on the AIC-ICC validation set using the code of this repo. I can confirm that my processed jsonl files are exactly the same as the file provided in the example. [screenshot: results]

This result is far inferior to the one in the paper. Any suggestions?

@huang-xx @knaffe Sorry, I don't know more about the evaluation details of the BriVL model. You can ask the student in the Model Development Group (@moonlitt, who is in charge of this part) for more details.

chuhaojin — Nov 18 '21

@moonlitt Hello, my evaluation results (i2t R@1: 1.09%; t2i R@1: 0.37%) on the AIC-ICC validation set (I used 30,000 samples) are also far from the results in the paper. Could you please share the evaluation code as a reference?

jim4399266 — Nov 29 '21