
Can someone clarify the anchor box concept used in Yolo?

Open hbzhang opened this issue 6 years ago • 66 comments

I know this might be too simple for many of you, but I cannot seem to find good literature that clearly and definitively illustrates the idea and concept of the anchor box in YOLO (v1, v2, and v3). Thanks!

hbzhang avatar Mar 26 '18 18:03 hbzhang

Here's a quick explanation based on what I understand (which might be wrong but hopefully gets the gist of it). After doing some clustering studies on ground-truth labels, it turns out that most bounding boxes have certain height-width ratios. So instead of directly predicting a bounding box, YOLOv2 (and v3) predict offsets from a predetermined set of boxes with particular height-width ratios - those predetermined boxes are the anchor boxes.

vkmenon avatar Mar 26 '18 19:03 vkmenon

Anchors are initial sizes (width, height), some of which (those closest to the object size) will be resized to the object size, using some outputs from the neural network (final feature map): https://github.com/pjreddie/darknet/blob/6f6e4754ba99e42ea0870eb6ec878e48a9f7e7ae/src/yolo_layer.c#L88-L89

  • x[...] - outputs of the neural network

  • biases[...] - anchors

  • b.w and b.h are the resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3, anchors (width, height) are the sizes of objects on the image after it is resized to the network size (width= and height= in the cfg-file).

In Yolo v2, anchors (width, height) are the sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).
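
To make it concrete, here is a minimal numpy sketch of this decoding (not the darknet code itself; the anchor and the raw outputs are hypothetical example values, using the v3 convention where anchors are in network-input pixels):

```python
import numpy as np

# Hypothetical numbers: one anchor and one pair of raw network outputs.
net_w, net_h = 416, 416
anchor_w, anchor_h = 116.0, 90.0      # one of the default yolov3 anchors
tw, th = 0.2, -0.1                    # raw outputs x[...] for this box

# Mirrors yolo_layer.c: exp() turns the raw output into a positive scale
# factor on the anchor, so the net only adjusts the nearest anchor's size.
b_w = np.exp(tw) * anchor_w / net_w   # normalized width (fraction of input)
b_h = np.exp(th) * anchor_h / net_h   # normalized height
print(b_w, b_h)                       # multiply by image size to draw
```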

AlexeyAB avatar Mar 26 '18 19:03 AlexeyAB

Thanks!

hbzhang avatar Mar 28 '18 01:03 hbzhang

For YoloV2 (5 anchors) and YoloV3 (9 anchors), is it advantageous to use more anchors? For example, if I have one class (face), should I stick with the default number of anchors, or could I potentially get higher IoU with more?

spinoza1791 avatar Apr 11 '18 14:04 spinoza1791

For YoloV2 (5 anchors) and YoloV3 (9 anchors), is it advantageous to use more anchors? For example, if I have one class (face), should I stick with the default number of anchors, or could I potentially get higher IoU with more?

I was wondering the same. The more anchors used, the higher the IoU; see https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807. However, when you detect a single class whose objects mostly share the same aspect ratio (like faces), I don't think increasing the number of anchors will raise the IoU by much, while the computational overhead will increase significantly.
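
If you want to check this on your own labels rather than guess, compare the average IoU between each ground-truth (width, height) and its closest anchor for different anchor counts. A rough numpy sketch, assuming `boxes` is an Nx2 array of label sizes and `kmeans_wh` is a placeholder for whatever clustering tool you use:

```python
import numpy as np

def wh_iou(box, anchors):
    """IoU between one (w, h) box and each anchor, assuming shared centers."""
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    union = box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def avg_iou(boxes, anchors):
    """Mean IoU of every label box with its best-matching anchor."""
    return float(np.mean([wh_iou(b, anchors).max() for b in boxes]))

# boxes: Nx2 array of (w, h) from your labels.
# for k in (3, 5, 7, 9):
#     anchors = kmeans_wh(boxes, k)          # hypothetical clustering helper
#     print(k, avg_iou(boxes, anchors))      # watch for diminishing returns
```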

CageCode avatar Apr 18 '18 09:04 CageCode

I used YOLOv2 to detect some industrial meter boards a few weeks ago and tried the same idea spinoza1791 and CageCode referred to. I needed high accuracy but also wanted to stay close to real time, so I changed the number of anchors (YOLOv2 -> 5), but training crashed after about 1800 iterations, so I might be missing something there.

fkoorc avatar Apr 22 '18 12:04 fkoorc

@AlexeyAB How do you get the initial anchor box dimensions after clustering? The widths and heights after clustering are all numbers less than 1, but anchor box dimensions can be greater or less than 1. How do you get the anchor box dimensions?

frozenscrypt avatar Sep 10 '18 02:09 frozenscrypt

Anchors are initial sizes (width, height), some of which (those closest to the object size) will be resized to the object size, using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c, lines 88 to 89 in 6f6e475:

```c
b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
```

* `x[...]` - outputs of the neural network

* `biases[...]` - anchors

* `b.w` and `b.h` are the resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3, anchors (width, height) are the sizes of objects on the image after it is resized to the network size (width= and height= in the cfg-file).

In Yolo v2, anchors (width, height) are the sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

great explanation bro. thank you.

saiteja011 avatar Sep 21 '18 16:09 saiteja011

Sorry, the phrase is still unclear: "In Yolo v2 anchors (width, height) are sizes of objects relative to the final feature map". What are the "final feature map" sizes? For yolo-voc.2.0.cfg the input image size is 416x416 and anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52. I understand that each pair represents an anchor width and height, centered in each of the 13x13 cells. The last anchor is 16.62 (width?) x 10.52 (height?); what units are they in? Can somebody explain literally with this example? And has anyone uploaded code for deriving the best anchors from a given dataset with k-means?

andyrey avatar Nov 05 '18 14:11 andyrey

I think there may be some confusion about the anchors. In yolo2 the anchor sizes are based on the final feature map (13x13), as you said, so the values are measured in grid cells. But in yolo3 the author changed the anchor sizes to be based on the initial network input size. As the author said: "In YOLOv3 anchor sizes are actual pixel values. This simplifies a lot of stuff and was only a little bit harder to implement." Hope I am not missing anything :)
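
To put numbers on the unit difference with your example (a sketch, not darknet code; the conversion factor follows from 416/13 = 32):

```python
# yolo-voc.2.0.cfg anchors are in 13x13 grid-cell units; with a 416x416
# input each cell covers 32 pixels, so the largest anchor (16.62, 10.52)
# expressed in v3-style pixel units would be:
cell = 416 // 13           # 32 pixels per grid cell
w_px = 16.62 * cell        # ~531.8 px, wider than the 416 input; the
h_px = 10.52 * cell        # ~336.6 px   exp() offsets shrink it per object
print(w_px, h_px)
```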

fkoorc avatar Nov 09 '18 04:11 fkoorc

Dears,

is it necessary to compute the anchor values before training to improve the model?

I am building my own dataset to detect 6 classes using tiny YOLOv2, and I used the command below to get the anchor values. Do I need to change the width and height here if I am also changing them in the cfg file?

Are the anchors below acceptable, or are the values too large? And what does num_of_clusters 9 mean?

```
....\build\darknet\x64>darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416

num_of_clusters = 9, width = 416, height = 416
read labels from 8297 images
loaded image: 2137 box: 7411
Wrong label: data/obj/IMG_0631.txt - j = 0, x = 1.332292, y = 1.399537, width = 0.177083, height = 0.412037
loaded image: 2138 box: 7412
calculating k-means++ ...

avg IoU = 59.41 %

Saving anchors to the file: anchors.txt
anchors = 19.2590,25.4234, 42.6678,64.3841, 36.4643,117.4917, 34.0644,235.9870, 47.0470,171.9500, 220.3569,59.5293, 48.2070,329.3734, 99.0149,240.3936, 165.5850,351.2881
```
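
Also, I see a "Wrong label" line in the output, which means one annotation has out-of-range normalized coordinates (x = 1.33 > 1). A quick sketch to scan darknet-format label files for such boxes (each line is `class x_center y_center width height`, all in [0, 1]; the glob path is just an example, adjust it to your layout):

```python
import glob

# Flag any label whose normalized coordinates fall outside [0, 1],
# like the data/obj/IMG_0631.txt warning above.
for path in glob.glob("data/obj/*.txt"):
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            parts = line.split()
            if len(parts) != 5:
                continue
            if not all(0.0 <= float(v) <= 1.0 for v in parts[1:]):
                print(f"{path}:{lineno}: bad box {line.strip()}")
```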

jalaldev1980 avatar Nov 14 '18 16:11 jalaldev1980

Computing the anchor values first makes training converge faster, but it is not strictly necessary. Tiny YOLO is not very accurate; if you can, I suggest you use YOLOv2.

fkoorc avatar Nov 21 '18 04:11 fkoorc

@jalaldev1980 I'm trying to guess: where did you get this calc_anchors flag in your command line? I didn't find it in YOLO-2; maybe it is in YOLO-3?

andyrey avatar Nov 21 '18 05:11 andyrey

@jalaldev1980 I'm trying to guess: where did you get this calc_anchors flag in your command line? I didn't find it in YOLO-2; maybe it is in YOLO-3?

./darknet detector calc_anchors your_obj.data -num_of_clusters 9 -width 416 -height 416

developer0hye avatar Nov 22 '18 04:11 developer0hye

check the "How to improve object detection" section at https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

for tiny YOLO, check the comments at https://github.com/pjreddie/darknet/issues/911

let me know if you find any other resources or advice


jalaldev1980 avatar Nov 23 '18 06:11 jalaldev1980

Can someone provide some insights into YOLOv3's time complexity if we change the number of anchors?

NadimSKanaan avatar Nov 26 '18 07:11 NadimSKanaan

Hi guys,

I understand that yolo3 employs 9 anchors, but three layers are used to generate yolo targets. Does this mean each yolo target layer should have 3 anchors at each feature point according to its scale, as in FPN, or do we need to match all 9 anchors with one ground truth on all 3 yolo output layers?

CoinCheung avatar Jan 10 '19 04:01 CoinCheung

I use a single set of 9 anchors for all 3 layers in the cfg file; it works fine. I believe this set is for one base scale and is handled for the other 2 layers somewhere in the framework code. Let someone correct me if I am wrong.
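
Checking the stock yolov3.cfg: it repeats the same anchors line in all three [yolo] layers, and each layer's mask selects which 3 of the 9 anchors that scale uses, so every layer reads its own subset rather than a rescaled copy. The first (stride-32) layer looks like this, while the stride-16 layer uses mask = 3,4,5 and the stride-8 layer mask = 0,1,2:

```
[yolo]
mask = 6,7,8
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
```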

andyrey avatar Jan 10 '19 09:01 andyrey

Anchors are initial sizes (width, height), some of which (those closest to the object size) will be resized to the object size, using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c, lines 88 to 89 in 6f6e475:

```c
b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
```

  • x[...] - outputs of the neural network
  • biases[...] - anchors
  • b.w and b.h are the resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3, anchors (width, height) are the sizes of objects on the image after it is resized to the network size (width= and height= in the cfg-file).

In Yolo v2, anchors (width, height) are the sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

Thanks, but if yolo v3 anchors are the sizes of objects on the image resized to the network size, why do darknet's yolov3 config files https://github.com/pjreddie/darknet/blob/master/cfg/yolov3-voc.cfg and https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg have different input sizes (416 and 608) but use the same anchor sizes?

weiaicunzai avatar Jan 14 '19 07:01 weiaicunzai

@weiaicunzai You are right, the 2 cfg files with different input sizes (416 and 608) have the same anchor box sizes. It seems to be a mistake. As for me, I use a utility to find anchors specific to my dataset; it increases accuracy.

andyrey avatar Jan 14 '19 07:01 andyrey

Hi, I have an anchor question please. If I did not misunderstand the paper, there is also a positive/negative mechanism in yolov3, but only for the confidence loss, since xywh and classification rely only on the best match. Thus the xywh and classification losses are computed between the ground truth and its one associated match. As for the confidence, the division into positives and negatives is based on an IoU value. My question is: is this IoU computed between the ground truth and the anchors, or between the ground truth and the predictions that are computed from the anchors and the model outputs (the offsets the model generates)?

CoinCheung avatar Jan 14 '19 08:01 CoinCheung

Say I have a situation where all the objects I need to detect are the same size, 30x30 pixels, on an image that is 295x295 pixels. How would I go about calculating the best anchors for yolo v2 to use during training?

Sauraus avatar Jan 15 '19 21:01 Sauraus

@Sauraus There is a special Python program (see the AlexeyAB reference on GitHub) which calculates the 5 best anchors based on your dataset's variety (for YOLO-2). Very easy to use. Then replace the anchors string in your cfg file with the new values. If your objects are all the same size, it will probably give you a set of nearly identical pairs.

andyrey avatar Jan 16 '19 06:01 andyrey

@andyrey are you referring to this: https://github.com/AlexeyAB/darknet/blob/master/scripts/gen_anchors.py by any chance?

Sauraus avatar Jan 16 '19 16:01 Sauraus

@Sauraus: Yes, I used this for YOLO-2 with the command: `python gen_anchors.py -filelist train.txt -output_dir ./ -num_clusters 5`

and for the 9 anchors of YOLO-3 I used the C-language darknet: `darknet3.exe detector calc_anchors obj.data -num_of_clusters 9 -width 416 -height 416 -showpause`
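
Under the hood both utilities do roughly the same thing: k-means on the label (width, height) pairs with 1 - IoU as the distance. A simplified sketch of that clustering (not the exact gen_anchors.py code; `boxes` is assumed to be an Nx2 array of label sizes):

```python
import numpy as np

def iou_wh(boxes, anchors):
    """Pairwise IoU between Nx2 boxes and Kx2 anchors, centers aligned."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """k-means on (w, h) pairs using 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        nearest = iou_wh(boxes, anchors).argmax(axis=1)  # best anchor per box
        updated = np.array([boxes[nearest == i].mean(axis=0)
                            if np.any(nearest == i) else anchors[i]
                            for i in range(k)])
        if np.allclose(updated, anchors):
            break
        anchors = updated
    return anchors[np.argsort(anchors.prod(axis=1))]     # sort by area

# Usage: boxes = (w, h) pairs from your labels, e.g. scaled to 416x416.
# print(kmeans_anchors(boxes, k=9))
```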

andyrey avatar Jan 16 '19 16:01 andyrey

Is anyone facing an issue with YOLOv3 prediction where occasionally the bounding box centre is negative, or the overall bounding box height/width exceeds the image size?

pkhigh avatar Feb 15 '19 14:02 pkhigh

Yes and it's driving me crazy.

Is anyone facing an issue with YOLOv3 prediction where occasionally the bounding box centre is negative, or the overall bounding box height/width exceeds the image size?

Sauraus avatar Feb 18 '19 21:02 Sauraus

I think it is hard for a bounding box to fit your target precisely; there is always some deviation, the question is just how large the error is. If the error is very large, maybe you should check your training and test data. Still, there are many possible causes; maybe you can post your picture?
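
On the boxes that spill outside the image: since the exp() in the decoding can make a box wider or taller than the input, a common workaround is simply clipping the decoded box before drawing. A minimal sketch, assuming a normalized (center, size) box (hypothetical helper, not darknet code):

```python
def clip_box(cx, cy, bw, bh):
    """Clip a normalized (center, size) box to the [0, 1] image area."""
    x1 = max(cx - bw / 2, 0.0)
    y1 = max(cy - bh / 2, 0.0)
    x2 = min(cx + bw / 2, 1.0)
    y2 = min(cy + bh / 2, 1.0)
    return x1, y1, x2, y2  # corner coordinates; scale by image size to draw
```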

fkoorc avatar Feb 19 '19 08:02 fkoorc

Anchors are initial sizes (width, height), some of which (those closest to the object size) will be resized to the object size, using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c, lines 88 to 89 in 6f6e475:

```c
b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
```

  • x[...] - outputs of the neural network
  • biases[...] - anchors
  • b.w and b.h are the resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3, anchors (width, height) are the sizes of objects on the image after it is resized to the network size (width= and height= in the cfg-file).

In Yolo v2, anchors (width, height) are the sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

Can someone clarify why we take the exponential of the predicted widths and heights? Why not just multiply the anchor dimensions by the raw predictions instead of taking the exponential first?

atulshanbhag avatar Feb 19 '19 09:02 atulshanbhag

Extremely useful discussion - thanks all - I have been trying to understand Azure Cognitive Services / Microsoft Custom Vision object detection, and had been wondering where their exported anchor values came from. It's now fairly clear they do transfer learning off YOLO.

jtlz2 avatar Feb 28 '19 12:02 jtlz2