
About test results

Open jwyang opened this issue 8 years ago • 8 comments

Hi, I just ran the test code using your trained resnet101 model on the test set and got the following numbers on the object detection task:

Mean AP = 0.0146
Weighted Mean AP = 0.1799
Mean Detection Threshold = 0.328

The mean AP (1.46%) is far from the number (10.2%) you reported in the table at the bottom of the README, while the weighted mean AP is a bit higher than the number you reported. I am wondering whether there is a typo in your table.

thanks!

jwyang avatar Sep 11 '17 02:09 jwyang

The mean and weighted mean numbers should be much closer than your results - the only difference is correction for class imbalance. Are you still having an issue with this?
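For illustration, a minimal sketch of that weighting, using made-up per-class AP values and ground-truth counts rather than numbers from this evaluation:

import numpy as np

aps = np.array([0.30, 0.10, 0.05])   # hypothetical per-class AP for three classes
npos = np.array([500, 50, 5])        # hypothetical ground-truth counts per class

print('Mean AP = {:.4f}'.format(np.mean(aps)))                            # every class counts equally
print('Weighted Mean AP = {:.4f}'.format(np.average(aps, weights=npos)))  # frequent classes count more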

peteanderson80 avatar Sep 27 '17 21:09 peteanderson80

Hi, Peter,

I checked the evaluation code again. The mean AP is computed by averaging over all 1600 entries in aps, that is:

print('Mean AP = {:.4f}'.format(np.mean(aps)))

and the weighted mean AP is computed via:

print('Weighted Mean AP = {:.4f}'.format(np.average(aps, weights=weights)))

Since only a fraction of the 1600 categories actually appear in the test set (231 in my run), aps contains many zeros, so mean(aps) is inevitably low.

I guess you reported the mean AP after excluding all categories with npos = 0, i.e., averaging only over the remaining entries? When I did that, I got 10.11%, which is very close to your reported number.
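A minimal sketch of the two averages, using made-up numbers shaped like this case (1600 categories, 231 of them present in the test set):

import numpy as np

num_classes, num_present = 1600, 231
aps = np.zeros(num_classes)
aps[:num_present] = 0.101            # pretend every present class has AP ~ 10.1%
npos = np.zeros(num_classes, dtype=int)
npos[:num_present] = 1               # each present class has at least one ground-truth box

print('Mean AP over all classes    = {:.4f}'.format(np.mean(aps)))            # ~0.0146
print('Mean AP where npos > 0 only = {:.4f}'.format(np.mean(aps[npos > 0])))  # 0.1010

Consistent with this, 0.0146 × (1600 / 231) is roughly 0.101, i.e. the gap between the two reported numbers is about the ratio of total to present categories.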

jwyang avatar Sep 28 '17 01:09 jwyang

Hi, Peter, I am running the training scripts myself (with fewer GPUs). What was the final training loss at iteration 380K when you trained the model? If possible, could you please provide a training curve or the training log file? Thanks a lot!

yuzcccc avatar Oct 09 '17 01:10 yuzcccc

Hi @jwyang,

Sorry I haven't responded sooner. We did not exclude zeros in our calculation. There seems to be some difference in the validation set being used, because our 5000-image validation set produced no categories with npos = 0 in our evaluation.

Maybe something went wrong with the dataset preprocessing? To help compare, I've added the eval.log file from our evaluation to the repo. If it helps, I can also add our preprocessed data/cache/vg_1600-400-20_val_gt_roidb.pkl file.
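For reference, a rough way to count npos per class from that cache file, assuming the py-faster-rcnn-style roidb format (a pickled list of dicts with a 'gt_classes' array); adjust if the cache layout differs:

import pickle
import numpy as np

with open('data/cache/vg_1600-400-20_val_gt_roidb.pkl', 'rb') as f:
    roidb = pickle.load(f)   # on Python 3, a Python-2 pickle may need encoding='latin1'

# assuming gt_classes holds 1-based class indices (0 = background), count classes 1..1600
all_gt = np.concatenate([entry['gt_classes'] for entry in roidb])
counts = np.bincount(all_gt, minlength=1601)
print('classes with npos == 0: {}'.format(int(np.sum(counts[1:1601] == 0))))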

peteanderson80 avatar Oct 15 '17 01:10 peteanderson80

Hi, @peteanderson80 ,

thanks a lot for replying and for sharing the log file. Yeah, it is very weird to me. I compared the 5000 validation images and they are the same. I will re-pull your code and re-generate the xml files to see whether I can get the same number as yours. I will let you know when I have the results.

thanks again for your help!

jwyang avatar Oct 15 '17 02:10 jwyang

Hi @yuzcccc,

I don't have the original log file, but I've added an example log file from training with a single GPU for 16K iterations, which should give some indication of the expected training loss. From memory, I think the final training loss was around 4.0 (compared to about 4.8 at iteration 16300 in the example log file).

peteanderson80 avatar Oct 15 '17 04:10 peteanderson80

Thanks @jwyang for investigating. I have shared our pickled datasets so you can see if you get the same:

peteanderson80 avatar Oct 15 '17 04:10 peteanderson80

@jwyang So what makes your accuracy lower than the reported one? I used the maskrcnn-benchmark code to train/test on the same splits and only got 2.24% mAP (IoU 0.5).

jayleicn avatar Dec 17 '18 01:12 jayleicn