Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Metadata
- Authors: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
- Organization: Microsoft Research
- Conference: NIPS 2015
- Paper: https://arxiv.org/pdf/1506.01497.pdf
Faster R-CNN
Faster R-CNN with VGG-16
Two modules:
- Region Proposal Network: An FCN (fully convolutional network) that proposes regions. (Serves as the "attention" for Fast R-CNN.)
- Fast R-CNN [1]: A classifier that uses proposed regions.
Region Proposal Networks (RPN)
- Input: An image of any size.
- Output: A set of rectangular object proposals, each with an objectness score.
- Slide a small network over the last conv. feature map.
- Input: n x n spatial window of conv. feature map. (n = 3 in this paper)
- Each spatial window is projected to a lower-dimensional feature vector (512-d for VGG-16), then fed into two sibling fully-connected layers: a box-regression layer (reg) and a box-classification layer (cls).
- This architecture is naturally implemented with an n × n conv. layer followed by two sibling 1 × 1 conv. layers (for reg and cls, respectively).
- For each spatial (sliding) window, multiple region proposals (boxes) are predicted simultaneously, where the maximum number of possible proposals per location is denoted k.
- The reg layer outputs 4k values (the coordinates of the k boxes).
- The cls layer outputs 2k scores that estimate the probability of object / not-object for each of the k boxes.
- The k proposals are parameterized relative to k reference boxes, which we call anchors.
- Anchor: An anchor is centered at the sliding window in question and is associated with a scale and aspect ratio. (This paper uses 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a conv. feature map of size W × H (typically ∼2,400), there are W × H × k anchors in total.)
- RPN is translation-invariant. (This guarantees that the same proposal is generated if an object is translated.)
- The conv. output layer is (4 + 2) × 9-dimensional in the case of k = 9 anchors.
- Including the feature-projection layer, the proposal layers have 3 × 3 × 512 × 512 + 512 × (4 + 2) × 9 ≈ 2.4 × 10^6 parameters.
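As a sketch of the anchor design above (the function name is ours, not the paper's; scales and ratios follow the paper's defaults of 128²/256²/512² pixel areas and 1:2, 1:1, 2:1 ratios):

```python
# Sketch: generate the k = 3 scales x 3 ratios = 9 anchors centered at one
# sliding-window position. A full RPN would repeat this at every one of the
# W x H feature-map locations, giving W x H x 9 anchors.

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) boxes for every scale/ratio pair."""
    anchors = []
    for s in scales:           # anchor area is s * s
        for r in ratios:       # r = height / width
            w = s / r ** 0.5   # chosen so that w * h = s^2 and h / w = r
            h = s * r ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors(cx=7.5, cy=7.5)
print(len(anchors))  # 9 anchors per location
```

Each anchor keeps the area of its scale while the ratio reshapes it, which is what lets a single-size 3 × 3 sliding window reference multi-scale boxes.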
- Multi-scale anchors as regression references:
- A pyramid of anchors: Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios.
- It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size.
- The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.
- Loss function:
- Binary label (being an object or not) for each anchor.
- Positive label:
- The anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box.
- An anchor that has an IoU overlap higher than 0.7 with any ground-truth box.
- Note that a single ground-truth box may assign positive labels to multiple anchors.
- Negative label: The anchor's IoU ratio is lower than 0.3 for all ground-truth boxes.
- Anchors that are neither positive nor negative do not contribute to the training objective.
- Minimize the multi-task loss in Fast R-CNN:
- i: the index of an anchor in a mini-batch; p_{i}: the predicted probability of anchor i being an object; p_{i}^{*}: the ground-truth label (1 if the anchor is positive, 0 if negative); t_{i}: a vector of the 4 parameterized coordinates of the predicted bounding box; t_{i}^{*}: the parameterized coordinates of the ground-truth box associated with a positive anchor.
- L_{cls}: log-loss; L_{reg}: smoothed L1 loss.
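Written out, the multi-task loss from the paper is (p_i^* is the ground-truth label, 1 for positive anchors and 0 otherwise, so the regression term is active only for positive anchors; N_cls and N_reg are normalization terms and λ a balancing weight):

```latex
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{\mathrm{cls}}} \sum_i L_{\mathrm{cls}}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{\mathrm{reg}}} \sum_i p_i^* \, L_{\mathrm{reg}}(t_i, t_i^*)
```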
- Parameterizations of the 4 coordinates:
- x, y, w, and h denote the box’s center coordinates and its width and height.
- Variables x, x_{a}, and x^{*} are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h).
- Can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
- Bounding-box regression: The features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.
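A minimal sketch of this parameterization and its inverse (the function names are ours; the formulas are the paper's: t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a)):

```python
import math

# Sketch of the box parameterization: encode a box relative to an anchor as
# t = (tx, ty, tw, th), and decode a prediction back into a box.

def encode(box, anchor):
    x, y, w, h = box          # center coordinates, width, height
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya,
            wa * math.exp(tw), ha * math.exp(th))

gt = (110.0, 95.0, 60.0, 40.0)
anchor = (100.0, 100.0, 128.0, 128.0)
t = encode(gt, anchor)
print(decode(t, anchor))  # recovers the ground-truth box (up to float rounding)
```

Normalizing offsets by the anchor size and using log-space for width/height keeps the regression targets roughly scale-free, which is what lets one regressor per anchor shape work from fixed-size 3 × 3 features.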
- Training:
- Follow the "image-centric" sampling strategy [1].
- It is possible to optimize the loss over all anchors, but this would bias toward negative samples, as they dominate.
- Randomly sample 256 anchors in an image with a positive:negative ratio of 1:1. Pad the mini-batch with negatives if there are fewer than 128 positive anchors.
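The labeling and sampling rules above can be sketched as follows (thresholds 0.7/0.3 and the 256-anchor, 1:1 mini-batch follow the paper; the helper names and corner-coordinate box format are ours):

```python
import random

# Sketch of RPN anchor labeling and mini-batch sampling.
# Boxes are (x1, y1, x2, y2) corner coordinates.

def iou(a, b):
    """Intersection-over-Union of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """1 = positive, 0 = negative, -1 = ignored (does not contribute)."""
    labels = []
    for a in anchors:
        best = max(iou(a, g) for g in gt_boxes)
        labels.append(1 if best >= hi else (0 if best < lo else -1))
    # the highest-IoU anchor for each ground-truth box is always positive
    for g in gt_boxes:
        i = max(range(len(anchors)), key=lambda i: iou(anchors[i], g))
        labels[i] = 1
    return labels

def sample_minibatch(labels, size=256, rng=random):
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    pos = rng.sample(pos, min(len(pos), size // 2))
    neg = rng.sample(neg, min(len(neg), size - len(pos)))
    return pos, neg

labels = label_anchors([(0, 0, 10, 10), (50, 50, 60, 60)], [(0, 0, 10, 10)])
print(labels)  # [1, 0]
```

When there are fewer than 128 positives, `size - len(pos)` lets the negatives fill the rest of the mini-batch, matching the padding rule above.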
- Adopt 4-step alternating training:
- Step 1: Initialize the RPN with an ImageNet-pretrained model and fine-tune it end-to-end for the region proposal task.
- Step 2: Train a separate detection network (also initialized with an ImageNet-pretrained model) with Fast R-CNN, using the proposals generated by the step-1 RPN.
- At this point the two networks do not share conv. layers.
- Step 3: Use the detector network to initialize RPN training, but fix the shared conv. layers and fine-tune only the layers unique to the RPN. The two networks now share conv. layers.
- Step 4: Keeping the shared conv. layers fixed, fine-tune the layers unique to Fast R-CNN.
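The weight-sharing flow of the four steps can be sketched with toy networks (dicts of named weight blobs; the `train` helper just records which step updated a layer and is a stand-in, not the paper's code):

```python
import copy

# Toy sketch of 4-step alternating training: what is initialized from what,
# and which layers each step updates.

def train(net, layers, step):
    for name in layers:
        net[name] = net[name] + [step]   # record that this step updated the layer

imagenet = {"shared": ["imagenet"]}

# Step 1: fine-tune an ImageNet-initialized RPN end-to-end.
rpn = {**copy.deepcopy(imagenet), "rpn_head": []}
train(rpn, ["shared", "rpn_head"], step=1)

# Step 2: train a separate Fast R-CNN detector on step-1 proposals
# (fresh ImageNet init; no shared conv layers yet).
det = {**copy.deepcopy(imagenet), "det_head": []}
train(det, ["shared", "det_head"], step=2)

# Step 3: re-initialize the RPN from the detector, freeze the shared conv
# layers, and fine-tune only the RPN-specific layers.
rpn["shared"] = det["shared"]            # now truly shared
train(rpn, ["rpn_head"], step=3)

# Step 4: with the shared conv layers still fixed, fine-tune the detector head.
train(det, ["det_head"], step=4)

print(rpn["shared"] is det["shared"])    # True: a single set of conv layers
```

After steps 3 and 4 the frozen `shared` blob is the same object in both networks, which is the point of the alternation: one backbone serves both the proposal and detection heads.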
- Implementation and hyperparameter details are provided in the paper.
Reference
- [1] Fast R-CNN by Ross Girshick. ICCV 2015.