Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Metadata
- Authors: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
- Organization: Microsoft Research
- Conference: NIPS 2015
- Paper: https://arxiv.org/pdf/1506.01497.pdf
Faster R-CNN
Faster R-CNN with VGG-16
Two modules:
- Region Proposal Network: An FCN (fully convolutional network) that proposes regions. (Serves as the "attention" for Fast R-CNN.)
- Fast R-CNN [1]: A classifier that uses proposed regions.
Region Proposal Networks (RPN)
- Input: An image of any size.
- Output: A set of rectangular object proposals, each with an objectness score.
- Slide a small network over the last conv. feature map.
- Input: n x n spatial window of conv. feature map. (n = 3 in this paper)
- Each spatial window is projected to a lower-dimensional feature vector (512-d for VGG-16), then fed into two sibling fully-connected layers: a box-regression layer (reg) and a box-classification layer (cls).
- This architecture is naturally implemented with an n × n conv. layer followed by two sibling 1 × 1 conv. layers (for reg and cls, respectively).
- For each spatial (sliding) window, multiple region proposals (boxes) are predicted simultaneously, where the maximum number of possible proposals per location is denoted k.
- The reg layer outputs 4k values (the coordinates of the k boxes).
- The cls layer outputs 2k scores that estimate the probability of object / not-object for each of the k boxes.
- The k proposals are parameterized relative to k reference boxes, which we call anchors.
- Anchor: An anchor is centered at the sliding window in question and is associated with a scale and aspect ratio. (This paper uses 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a conv. feature map of size W × H (typically ∼2,400), there are W × H × k anchors in total.)
- RPN is translation-invariant. (This guarantees that the same proposal is generated if an object is translated.)
- The conv. output layer is (4 + 2) × 9-dimensional in the case of k = 9 anchors.
- Including the feature-projection layer, the proposal layers have 3 × 3 × 512 × 512 + 512 × (4 + 2) × 9 ≈ 2.4 × 10^6 parameters.
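As a sketch of the anchor design above (the function name is ours, not the paper's; scales and ratios follow the paper's defaults of 128²/256²/512² pixel areas and 1:2, 1:1, 2:1 ratios):

```python
# Sketch: generate the k = 3 scales x 3 ratios = 9 anchors centered at one
# sliding-window position. A full RPN would repeat this at every one of the
# W x H feature-map locations, giving W x H x 9 anchors.

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) boxes for every scale/ratio pair."""
    anchors = []
    for s in scales:           # anchor area is s * s
        for r in ratios:       # r = height / width
            w = s / r ** 0.5   # chosen so that w * h = s^2 and h / w = r
            h = s * r ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors(cx=7.5, cy=7.5)
print(len(anchors))  # 9 anchors per location
```

Each anchor keeps the area of its scale while the ratio reshapes it, which is what lets a single-size 3 × 3 sliding window reference multi-scale boxes.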
- Multi-scale anchors as regression references:
- A pyramid of anchors: Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios.
- It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size.
- The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.
- Loss function:
- Binary label (being an object or not) for each anchor.
- Positive label:
- The anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box.
- An anchor that has an IoU overlap higher than 0.7 with any ground-truth box.
- Note that a single ground-truth box may assign positive labels to multiple anchors.
- Negative label: The anchor's IoU ratio is lower than 0.3 for all ground-truth boxes.
- Anchors that are neither positive nor negative do not contribute to the training objective.
- Minimize the multi-task loss in Fast R-CNN:
- i: the index of an anchor in a mini-batch; p_{i}: the predicted probability of anchor i being an object; p_{i}^{*}: the ground-truth label (1 if the anchor is positive, 0 if negative); t_{i}: a vector of the 4 parameterized coordinates of the predicted bounding box; t_{i}^{*}: the parameterized coordinates of the ground-truth box associated with a positive anchor.
- L_{cls}: log-loss; L_{reg}: smoothed L1 loss.
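Written out, the multi-task loss from the paper is (p_i^* is the ground-truth label, 1 for positive anchors and 0 otherwise, so the regression term is active only for positive anchors; N_cls and N_reg are normalization terms and λ a balancing weight):

```latex
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{\mathrm{cls}}} \sum_i L_{\mathrm{cls}}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{\mathrm{reg}}} \sum_i p_i^* \, L_{\mathrm{reg}}(t_i, t_i^*)
```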
- Parameterizations of the 4 coordinates:
- x, y, w, and h denote the box’s center coordinates and its width and height.
- Variables x, x_{a}, and x^{*} are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h).
- Can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
- Bounding-box regression: The features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.
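A minimal sketch of this parameterization and its inverse (the function names are ours; the formulas are the paper's: t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a)):

```python
import math

# Sketch of the box parameterization: encode a box relative to an anchor as
# t = (tx, ty, tw, th), and decode a prediction back into a box.

def encode(box, anchor):
    x, y, w, h = box          # center coordinates, width, height
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya,
            wa * math.exp(tw), ha * math.exp(th))

gt = (110.0, 95.0, 60.0, 40.0)
anchor = (100.0, 100.0, 128.0, 128.0)
t = encode(gt, anchor)
print(decode(t, anchor))  # recovers the ground-truth box (up to float rounding)
```

Normalizing offsets by the anchor size and using log-space for width/height keeps the regression targets roughly scale-free, which is what lets one regressor per anchor shape work from fixed-size 3 × 3 features.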
- Training:
- Follow the "image-centric" sampling strategy [1].
- It is possible to optimize the loss over all anchors, but this would bias toward negative samples, as they dominate.
- Randomly sample 256 anchors in an image with a positive:negative ratio of 1:1. Pad the mini-batch with negatives if there are fewer than 128 positive anchors.
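The labeling and sampling rules above can be sketched as follows (thresholds 0.7/0.3 and the 256-anchor, 1:1 mini-batch follow the paper; the helper names and corner-coordinate box format are ours):

```python
import random

# Sketch of RPN anchor labeling and mini-batch sampling.
# Boxes are (x1, y1, x2, y2) corner coordinates.

def iou(a, b):
    """Intersection-over-Union of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """1 = positive, 0 = negative, -1 = ignored (does not contribute)."""
    labels = []
    for a in anchors:
        best = max(iou(a, g) for g in gt_boxes)
        labels.append(1 if best >= hi else (0 if best < lo else -1))
    # the highest-IoU anchor for each ground-truth box is always positive
    for g in gt_boxes:
        i = max(range(len(anchors)), key=lambda i: iou(anchors[i], g))
        labels[i] = 1
    return labels

def sample_minibatch(labels, size=256, rng=random):
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    pos = rng.sample(pos, min(len(pos), size // 2))
    neg = rng.sample(neg, min(len(neg), size - len(pos)))
    return pos, neg

labels = label_anchors([(0, 0, 10, 10), (50, 50, 60, 60)], [(0, 0, 10, 10)])
print(labels)  # [1, 0]
```

When there are fewer than 128 positives, `size - len(pos)` lets the negatives fill the rest of the mini-batch, matching the padding rule above.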
- Adopt 4-step alternating training:
- Step 1: Initialize the RPN with an ImageNet-pretrained model and fine-tune it end-to-end for the region proposal task.
- Step 2: Train a separate detection network (also initialized with an ImageNet-pretrained model) with Fast R-CNN, using the proposals generated by the step-1 RPN.
- At this point the two networks do not share conv. layers.
- Step 3: Use the detector network to initialize RPN training, but fix the shared conv. layers and fine-tune only the layers unique to the RPN. The two networks now share conv. layers.
- Step 4: Keeping the shared conv. layers fixed, fine-tune the layers unique to Fast R-CNN.
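The weight-sharing flow of the four steps can be sketched with toy networks (dicts of named weight blobs; the `train` helper just records which step updated a layer and is a stand-in, not the paper's code):

```python
import copy

# Toy sketch of 4-step alternating training: what is initialized from what,
# and which layers each step updates.

def train(net, layers, step):
    for name in layers:
        net[name] = net[name] + [step]   # record that this step updated the layer

imagenet = {"shared": ["imagenet"]}

# Step 1: fine-tune an ImageNet-initialized RPN end-to-end.
rpn = {**copy.deepcopy(imagenet), "rpn_head": []}
train(rpn, ["shared", "rpn_head"], step=1)

# Step 2: train a separate Fast R-CNN detector on step-1 proposals
# (fresh ImageNet init; no shared conv layers yet).
det = {**copy.deepcopy(imagenet), "det_head": []}
train(det, ["shared", "det_head"], step=2)

# Step 3: re-initialize the RPN from the detector, freeze the shared conv
# layers, and fine-tune only the RPN-specific layers.
rpn["shared"] = det["shared"]            # now truly shared
train(rpn, ["rpn_head"], step=3)

# Step 4: with the shared conv layers still fixed, fine-tune the detector head.
train(det, ["det_head"], step=4)

print(rpn["shared"] is det["shared"])    # True: a single set of conv layers
```

After steps 3 and 4 the frozen `shared` blob is the same object in both networks, which is the point of the alternation: one backbone serves both the proposal and detection heads.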
- Implementation and hyperparameter details are provided in the paper.
Reference
- [1] Fast R-CNN by Ross Girshick. ICCV 2015.