Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Open howardyclo opened this issue 6 years ago • 0 comments


  • Authors: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
  • Organization: Microsoft Research
  • Conference: NIPS 2015
  • Paper:

Faster R-CNN

Faster R-CNN with VGG-16

Two modules:

  • Region Proposal Network (RPN): An FCN (fully convolutional network) that proposes regions. (It serves as the "attention" for Fast R-CNN.)
  • Fast R-CNN [1]: A classifier that uses proposed regions.

Region Proposal Networks (RPN)

  • Input: An image of any size.
  • Output: A set of rectangular object proposals, each with an objectness score.
  • Slide a small network over the last conv. feature map.
    • Input: n x n spatial window of conv. feature map. (n = 3 in this paper)
    • Each spatial window is mapped to a lower-dimensional feature vector (512-d for VGG-16), which is then fed into two sibling FC layers: a box-regression layer (reg) and a box-classification layer (cls).
    • This architecture is naturally implemented with an n × n conv. layer followed by two sibling 1 × 1 conv. layers (for reg and cls, respectively).
    • For each spatial (sliding) window, multiple region proposals (boxes) are predicted simultaneously; the maximum number of proposals per location is denoted k.
      • The reg layer has 4k outputs (the coordinates of k boxes).
      • The cls layer has 2k outputs (object / not-object scores for each of the k boxes).
      • The k proposals are parameterized relative to k reference boxes, called anchors.
      • Anchor: An anchor is centered at the sliding window in question and is associated with a scale and an aspect ratio. This paper uses 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a conv. feature map of size W × H (typically ~2,400 locations), there are W × H × k anchors in total.
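The anchor layout described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the function names are hypothetical, the scales (128, 256, 512 pixels), ratios (1:2, 1:1, 2:1) and stride of 16 are the defaults I believe the paper uses for VGG-16 but which these notes do not state, and centering each anchor in the middle of its feature-map cell is an assumption.

```python
import math

def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchor shapes as (w, h) pairs.

    Each shape has area scale**2 and aspect ratio h / w = ratio.
    Scales and ratios are assumed defaults, not taken from these notes.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s / math.sqrt(r), s * math.sqrt(r)))
    return shapes

def all_anchors(feat_w, feat_h, stride=16):
    """Place every anchor shape at each feature-map location.

    Returns (cx, cy, w, h) boxes in input-image pixels. Using the cell
    center as the anchor center is an assumption of this sketch.
    """
    shapes = base_anchors()
    boxes = []
    for j in range(feat_h):
        for i in range(feat_w):
            cx, cy = (i + 0.5) * stride, (j + 0.5) * stride
            for w, h in shapes:
                boxes.append((cx, cy, w, h))
    return boxes

anchors = all_anchors(feat_w=60, feat_h=40)  # 60 x 40 = 2,400 locations
print(len(anchors))  # W * H * k = 60 * 40 * 9 = 21600
```

With a 60 × 40 feature map (2,400 locations) this gives W × H × k = 21,600 anchors, all produced from a single-scale feature map.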
  • RPN is translation-invariant: if an object in the image is translated, the proposal translates with it, and the same function predicts the proposal at either location.
    • The output layer is (4 + 2) × 9-dimensional in the case of k = 9 anchors.
    • Including the feature projection layer, the proposal layers have 3 × 3 × 512 × 512 + 512 × (4 + 2) × 9 ≈ 2.4 × 10^6 parameters (for VGG-16).
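The parameter count quoted above is easy to check. The helper below is a hypothetical name for illustration; it counts weights only and ignores biases, as the quoted formula does.

```python
def rpn_param_count(n=3, in_channels=512, mid_channels=512, k=9):
    """Weights in the RPN head (biases ignored), for VGG-16 by default."""
    # n x n conv that maps each sliding window to a mid_channels-d feature
    shared = n * n * in_channels * mid_channels
    # sibling 1 x 1 convs: 4k regression outputs + 2k classification outputs
    heads = mid_channels * (4 + 2) * k
    return shared, heads, shared + heads

shared, heads, total = rpn_param_count()
print(shared, heads, total)  # 2359296 27648 2386944
```

Almost all of the roughly 2.4 × 10^6 weights sit in the 3 × 3 conv; the two output layers add fewer than 30,000.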
  • Multi-scale anchors as regression references:
    • A pyramid of anchors: Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios.
    • It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size.
    • The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.
  • Loss function:
    • Binary label (being an object or not) for each anchor.
    • Positive label:
      • The anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box.
      • An anchor that has an IoU overlap higher than 0.7 with any ground-truth box.
      • Note that a single ground-truth box may assign positive labels to multiple anchors.
    • Negative label: The anchor's IoU ratio is lower than 0.3 for all ground-truth boxes.
    • Anchors that are neither positive nor negative do not contribute to the training objective.
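    • Minimize a multi-task loss following Fast R-CNN: L({p_{i}}, {t_{i}}) = (1/N_{cls}) Σ_{i} L_{cls}(p_{i}, p_{i}^{*}) + λ (1/N_{reg}) Σ_{i} p_{i}^{*} L_{reg}(t_{i}, t_{i}^{*})
    • i: index of an anchor in a mini-batch; p_{i}: predicted probability that anchor i is an object; p_{i}^{*}: ground-truth label (1 for a positive anchor, 0 for a negative one); t_{i}: vector of the 4 parameterized coordinates of the predicted bounding box; t_{i}^{*}: the same parameterization of the ground-truth box associated with a positive anchor.
    • L_{cls}: log loss over the two classes (object vs. not object); L_{reg}: smooth L1 loss, which the p_{i}^{*} factor activates only for positive anchors. N_{cls} and N_{reg} are normalization terms and λ is a balancing weight.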
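The labeling rules for the loss (highest-IoU anchor per ground-truth box, IoU above 0.7, IoU below 0.3 for all boxes) can be sketched as follows. The function names, the (x1, y1, x2, y2) box format, and the use of -1 for ignored anchors are assumptions of this sketch, not details from the paper.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row, default=0.0)
        if best > pos_thresh:
            labels.append(1)       # IoU > 0.7 with some ground-truth box
        elif best < neg_thresh:
            labels.append(0)       # IoU < 0.3 with every ground-truth box
        else:
            labels.append(-1)      # in between: excluded from the loss
    # The anchor(s) with the highest IoU for each ground-truth box are
    # positive even if that IoU is below pos_thresh.
    for g in range(len(gt_boxes)):
        col_best = max((row[g] for row in overlaps), default=0.0)
        if col_best > 0:
            for idx, row in enumerate(overlaps):
                if row[g] == col_best:
                    labels[idx] = 1
    return labels

gt = [(0, 0, 10, 10)]
print(label_anchors([(0, 0, 10, 10), (0, 0, 10, 5), (0, 0, 10, 2), (20, 20, 30, 30)], gt))
# [1, -1, 0, 0]
```

The second rule matters when no anchor reaches 0.7 for some ground-truth box: that box still gets at least one positive anchor.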
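  • Parameterizations of the 4 coordinates:
    • t_{x} = (x − x_{a}) / w_{a}, t_{y} = (y − y_{a}) / h_{a}, t_{w} = log(w / w_{a}), t_{h} = log(h / h_{a}); the targets t^{*} are defined the same way with x^{*}, y^{*}, w^{*}, h^{*} in place of x, y, w, h.
    • x, y, w, and h denote the box’s center coordinates and its width and height.
    • Variables x, x_{a}, and x^{*} are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h).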
    • Can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
    • Bounding-box regression: The features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors is learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.
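As a small illustration of regressing from an anchor box to a nearby ground-truth box, the sketch below encodes a box relative to an anchor and decodes it back. It assumes the standard R-CNN-style parameterization (center offsets scaled by the anchor size, log-space width and height ratios); `encode` and `decode` are hypothetical names, and boxes are (cx, cy, w, h) tuples.

```python
import math

def encode(box, anchor):
    """Targets (t_x, t_y, t_w, t_h) of `box` relative to `anchor`."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Inverse of encode: recover (cx, cy, w, h) from offsets and anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 100.0, 128.0, 128.0)
gt = (116.0, 92.0, 256.0, 64.0)
t_star = encode(gt, anchor)       # regression target for this anchor
recovered = decode(t_star, anchor)  # round-trips back to gt (up to float error)
print(t_star)
```

Because the offsets are normalized by the anchor's own width and height, one regressor per anchor shape can work with fixed-size 3 × 3 features while still covering boxes of very different sizes.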
  • Training:
    • Follow the "image-centric" sampling strategy [1].
    • It is possible to optimize the loss over all anchors, but this would bias training toward negative samples, since they dominate.
    • Randomly sample 256 anchors per image with a positive:negative ratio of up to 1:1. If there are fewer than 128 positive anchors, pad the mini-batch with negative ones.
    • Adopt 4-step alternating training:
      • Step 1: Train the RPN, initialized with an ImageNet-pretrained model and fine-tuned end-to-end for the region proposal task.
      • Step 2: Train a separate Fast R-CNN detection network (also initialized with an ImageNet-pretrained model) using the proposals generated by the step-1 RPN. At this point the two networks do not share conv. layers.
      • Step 3: Use the detector network to initialize RPN training, but fix the shared conv. layers and fine-tune only the layers unique to the RPN. The two networks now share conv. layers.
      • Step 4: Finally, keeping the shared conv. layers fixed, fine-tune the layers unique to Fast R-CNN.
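The mini-batch sampling rule from the training notes (256 anchors per image, at most half positive, padded with negatives) can be sketched like this. `sample_minibatch` is a hypothetical helper; the label convention (1 positive, 0 negative, anything else ignored) is an assumption of the sketch.

```python
import random

def sample_minibatch(labels, batch_size=256, pos_fraction=0.5, rng=random):
    """Pick anchor indices for one image's RPN mini-batch.

    labels: 1 = positive, 0 = negative, any other value = ignored.
    Returns (positive_indices, negative_indices).
    """
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    # Take at most batch_size * pos_fraction positives (128 by default).
    num_pos = min(len(pos), int(batch_size * pos_fraction))
    # Fill the rest of the batch with negatives, which pads the batch
    # when positives are scarce.
    num_neg = min(len(neg), batch_size - num_pos)
    return rng.sample(pos, num_pos), rng.sample(neg, num_neg)

labels = [1] * 30 + [0] * 1000 + [-1] * 50
pos_idx, neg_idx = sample_minibatch(labels, rng=random.Random(0))
print(len(pos_idx), len(neg_idx))  # 30 226
```

With only 30 positives available, the remaining 226 slots are filled with negatives, so the batch still has 256 anchors.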
  • Implementation and hyperparameter details are provided in the paper.


Further Reading

howardyclo avatar Dec 01 '18 10:12 howardyclo