caffe-jacinto
Support for RoiPooling Layer
Hello, I see there are layers for SSD-like OD models. Is there any plan to include a RoiPooling layer to run RCNN-like models?
Hi,
Can you explain more about your target application? Why do you think "RCNN-like models" will be better than "SSD-like OD models"?
"RCNN-like models" have better accuracy for small objects like joint prediction of pedestrian & cars. Also if we reduce the region proposals to < 50, they are as efficient as SSD. SSD has much higher number of mboxes for the same mAP so the performance on CPU is not great. On GPU, that will not be the case. I am looking for 20-30 fps level of performance. I am targeting faster-RNN at the moment.
At this stage I am trying to understand the pros and cons of various meta architectures. So I have more questions than answers.
(1) If we compare FasterRCNN and RFCN, RFCN seems to be simpler, and yet comparable in accuracy. So isn't RFCN a better choice?
(2) Have you looked at RefineDet (a variant of SSD)? https://arxiv.org/pdf/1711.06897.pdf RefineDet seems to have higher overall AP, as well as higher AP for small objects, compared to SSD, FasterRCNN, R-FCN and RetinaNet.
(3) RetinaNet is quite close - so isn't that a good choice as well?
I am currently trying to run these OD models on a TDA2px EVM board (TI) at ~22 fps.
- I am fine with either of them (RCNN-based, two-step methods). Both of them will require a RoiPooling layer.
- RefineDet #bboxes = 16320. This will be a big hit on performance, since I am running the detection layer on the DSP (see the sketch after this list).
- I have yet to evaluate RetinaNet, but I think it will be comparable to SSD in terms of performance (it is an SSD-like variant).
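As a rough check of that 16320 number, here is a back-of-the-envelope count, assuming the RefineDet512 configuration from the paper (anchors on four feature maps with strides 8, 16, 32 and 64, three aspect ratios per location). It is only a sketch, not a reproduction of the implementation:

```python
# Back-of-the-envelope anchor count for RefineDet512 vs. a pruned RPN.
# Assumed configuration: four detection feature maps with strides 8/16/32/64
# at a 512x512 input, three aspect ratios per anchor location.

input_size = 512
strides = [8, 16, 32, 64]          # feature maps tapped for detection
anchors_per_location = 3           # aspect ratios {0.5, 1.0, 2.0}

refinedet_boxes = sum((input_size // s) ** 2 * anchors_per_location for s in strides)
print("RefineDet512 anchors:", refinedet_boxes)   # -> 16320, the number quoted above

# A two-step detector prunes RPN proposals before the per-ROI head runs,
# e.g. keeping only the top-k boxes after NMS.
rpn_keep_topk = 50
print("Boxes seen by the second-stage head:", rpn_keep_topk)
```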
I have some more questions. Please bear with me.
- Can I ask what performance and input resolution you expect for SSD?
- Could you explain why RefineDet will have more #bboxes compared to SSD?
- In two-step methods as well, the first RPN step has a lot of bboxes. Why would these methods be faster than SSD?
No problem. Here are the details:
- ~25-30 fps at 1024x512 input resolution with 0.5-0.6 mAP.
- I looked at the architecture. It shows a higher number of bboxes. Typically, increasing the number of bboxes helps with mAP.
- In the RPN step we can prune the bboxes to, say, 50; the detection layer will then evaluate only those bboxes. In the case of an SSD-like architecture, #bboxes is roughly (#aspect_ratios) * (#anchor_locations) * (#feature-map taps), and the detection layer needs to process all of these bboxes (see the sketch below).
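To make that formula concrete, here is a rough count using the standard SSD300 head configuration from the SSD paper; the exact feature-map sizes and boxes-per-location below are assumptions for illustration:

```python
# Rough #bbox count for an SSD-style detector using the formula above:
# boxes ~= sum over detection heads of (grid cells) * (anchors per location).
# Assumed configuration: the standard SSD300 heads from the SSD paper.

feature_map_sizes = [38, 19, 10, 5, 3, 1]     # spatial size of each tapped feature map
boxes_per_location = [4, 6, 6, 6, 4, 4]       # aspect-ratio/scale combinations per cell

total_boxes = sum(fs * fs * b for fs, b in zip(feature_map_sizes, boxes_per_location))
print("SSD300 prior boxes:", total_boxes)     # -> 8732

# The SSD detection-output layer has to decode, score and NMS all of these,
# whereas a two-step head only sees the few proposals that survive the RPN.
```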
Regarding the third point above:
My question was that the RPN stage also has a large number of bboxes to process, i.e. create the boxes, then sort/select, NMS and then another sort/select.
Is that complexity smaller than in the case of SSD? Why would that be?
Please correct me if my understanding is not up to the mark: for Faster-RCNN, class-specific prediction is done only on the RPN output boxes (~50); in the case of SSD it is done on all the boxes. A sketch of that pruning step is below.
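Here is a minimal Python/NumPy sketch of the RPN pruning pipeline described above (score, sort/select top-k, NMS, another top-k). The thresholds, top-k values and box format are illustrative assumptions, not values taken from caffe-jacinto:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS. boxes: (N, 4) as [x1, y1, x2, y2]; returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best remaining box with all others still in the queue.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

def prune_proposals(boxes, scores, pre_nms_topk=2000, post_nms_topk=50):
    """Sort/select, NMS, then another sort/select, as in a typical RPN."""
    order = scores.argsort()[::-1][:pre_nms_topk]
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores)[:post_nms_topk]
    return boxes[keep], scores[keep]
```

Only the boxes returned by `prune_proposals` are seen by the class-specific second stage, while an SSD detection-output layer must decode and score every prior box.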
Here is an interesting read: http://openaccess.thecvf.com/content_cvpr_2017/papers/Huang_SpeedAccuracy_Trade-Offs_for_CVPR_2017_paper.pdf
In the case of SSD/RetinaNet, score and bbox predictions are computed using regular convolution layers, which can be accelerated well. For two-stage methods, the ROI pooling and FC layers need to be computed for each pruned box, and their execution time is relatively higher than that of convolutions. You may need to consider these trade-offs while making a decision.
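To illustrate the per-box cost being discussed, below is a toy NumPy version of ROI max pooling. The shapes, the `spatial_scale` value and the ROI format `[x1, y1, x2, y2]` in image pixels are assumptions for illustration; this is not the actual layer implementation:

```python
import numpy as np

def roi_max_pool(feature_map, rois, pooled_h=7, pooled_w=7, spatial_scale=1.0 / 16):
    """feature_map: (C, H, W); rois: (N, 4); returns (N, C, pooled_h, pooled_w).

    Unlike a shared convolution, this loop (and the FC layers that follow it in
    Fast/Faster-RCNN heads) runs once per pruned box.
    """
    C, H, W = feature_map.shape
    out = np.zeros((len(rois), C, pooled_h, pooled_w), dtype=feature_map.dtype)
    for n, (x1, y1, x2, y2) in enumerate(rois):
        # Project the ROI from image coordinates onto the feature map and clamp.
        fx1 = min(max(int(np.floor(x1 * spatial_scale)), 0), W - 1)
        fy1 = min(max(int(np.floor(y1 * spatial_scale)), 0), H - 1)
        fx2 = min(max(int(np.ceil(x2 * spatial_scale)), fx1 + 1), W)
        fy2 = min(max(int(np.ceil(y2 * spatial_scale)), fy1 + 1), H)
        roi = feature_map[:, fy1:fy2, fx1:fx2]
        rh, rw = roi.shape[1], roi.shape[2]
        # Divide the ROI into a pooled_h x pooled_w grid and max-pool each bin.
        ys = np.linspace(0, rh, pooled_h + 1).astype(int)
        xs = np.linspace(0, rw, pooled_w + 1).astype(int)
        for i in range(pooled_h):
            for j in range(pooled_w):
                y0, y1b = ys[i], max(ys[i + 1], ys[i] + 1)
                x0, x1b = xs[j], max(xs[j + 1], xs[j] + 1)
                out[n, :, i, j] = roi[:, y0:y1b, x0:x1b].max(axis=(1, 2))
    return out
```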
Please see Table 2 in the RefineDet paper, where they compare several detectors. https://arxiv.org/pdf/1711.06897.pdf
RetinaNet500 seems to be competitive [34.4 AP and 14.7 AP (small)] with RefineDet512 and also with Faster-RCNN by G-RMI. It is also much better than SSD512.
Based on this I would conclude that RetinaNet500 is a good choice.
(RetinaNet800 has more complexity, and the models marked with a + use multi-scale testing, so they are not an apples-to-apples comparison.)