darknet
Training YOLOv3 with own dataset
Hi everyone, Has anyone had success with training YOLOv3 for their own datasets? If so, could you help sort out some questions for me:
For me, I have a 5-class object detection problem. In the .cfg file, I have changed the number of classes and the number of filters to 3*(num_classes+5) = 30 in 3 different places. I can initiate the training, but the loss blows up to start with and I am seeing a bunch of nans in the output message (see snippet).
Here are my questions:
- Did you need to change the anchor box sizes and/or the number of anchors?
- Did you need to create the labels differently than for YOLO v2?
Thanks!
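The filters arithmetic mentioned above can be sketched as a quick sanity check (nothing darknet-specific, just the formula from the question):

```python
# Filters for each [convolutional] layer directly before a [yolo] layer:
# one box prediction per anchor at that scale, each carrying
# 4 box coords + 1 objectness score + one score per class.
def yolo_filters(num_classes, anchors_per_scale=3):
    return anchors_per_scale * (num_classes + 5)

print(yolo_filters(5))   # → 30, the 5-class problem from the question
print(yolo_filters(80))  # → 255, the COCO default
```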
No, you don't need to change your training set. You need to calculate your anchors as you did for YOLOv2, but multiply by 32 (and round). Then split the anchors among the layers: if you have 9 anchors you can split them 3 ways, but decide based on size. Each anchor contributes 5 + the number of classes filters. I got OK results with the default anchors, but you could recompute. Remember your anchor calculation should be at the same scale as the input size of the network.
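The procedure described above can be sketched as follows. This is only an illustration of the arithmetic (the helper names are made up, and the YOLOv2 anchor values below are just examples): YOLOv2 anchors are in grid-cell units, so multiplying by the stride of 32 converts them to pixels, and sorting by area lets the smallest anchors go to the highest-resolution detection layer.

```python
# Convert YOLOv2-style anchors (grid-cell units) to YOLOv3-style pixels.
def scale_anchors(v2_anchors, stride=32):
    return [(round(w * stride), round(h * stride)) for w, h in v2_anchors]

# Split anchors into equal groups by area, smallest first.
def split_by_size(anchors, groups=3):
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1])
    n = len(ordered) // groups
    return [ordered[i * n:(i + 1) * n] for i in range(groups)]

# Example YOLOv2 anchors (illustrative values only):
v2 = [(0.57, 0.67), (1.87, 2.06), (3.33, 5.47), (7.88, 3.52), (9.77, 9.17)]
print(scale_anchors(v2))
```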
@sonalambwani Just wait about 1000 iterations and the nans will disappear: https://github.com/AlexeyAB/darknet/issues/504#issuecomment-377290060
- You can re-calculate anchors, but it is not necessary. You can calculate anchors for Yolo v3 using this fork: https://github.com/AlexeyAB/darknet
and this command (if your cfg-file has width=416 and height=416):
darknet.exe detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -heigh 416
You can use these anchors in your cfg-file (without multiplying by 32).
- You can use the same labels as for Yolo v2
@AlexeyAB Hello, but after waiting about 1000 iterations, the nans still appear:
Hi, I am trying to do Training YOLO on VOC.
Below is the command I am using: ./darknet detector train cfg/voc.data cfg/yolov3-voc.cfg darknet53.conv.74
But the nans keep increasing. Is this normal or an issue?
Loaded: 0.000063 seconds
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: nan, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: nan, .5R: -nan, .75R: -nan, count: 0
3296: -nan, nan avg, 0.001000 rate, 0.416401 seconds, 3296 images
I have the same issue. The error shows up in the second yolo layer. Did you solve this problem?
same tooooo
@ss199302
If only some lines show nan then training is going well, but if all lines show nan then training has gone wrong.
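That rule of thumb can be checked mechanically. Below is a rough sketch that counts how many of the per-scale region lines in a darknet training log contain nan; the line format is taken from the log snippet quoted earlier in this thread, and the sample lines here are abbreviated and invented for illustration.

```python
# Fraction of region lines (one per detection scale) that contain nan.
# A few nan lines are expected (a scale with count: 0 saw no objects);
# a ratio of 1.0 means the training has diverged.
def nan_ratio(log_lines):
    region = [l for l in log_lines if "Avg IOU" in l]
    bad = [l for l in region if "nan" in l.lower()]
    return len(bad) / len(region) if region else 0.0

log = [
    "Region 82 Avg IOU: 0.61, Class: 0.72, Obj: 0.55, No Obj: 0.01, count: 3",
    "Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: nan, count: 0",
    "Region 106 Avg IOU: 0.43, Class: 0.66, Obj: 0.38, No Obj: 0.02, count: 5",
]
print(nan_ratio(log))  # one of three scales is nan: fine
```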
@AlexeyAB As you suggested, I am now training with my new dataset with the default COCO anchor boxes. I am training from "Scratch", i.e., no initialization with the pretrained convolutional weights as you have done in https://github.com/AlexeyAB/darknet/issues/504#issuecomment-377290060
For me, I see nans even after 2500 iterations. The loss (after starting off really high) has dropped to a reasonable range, but there is more fluctuation in the loss between mini-batches.
Have you, or anyone else here, noticed similar behavior?
"For me, I see nans even after 2500 iterations."
- All lines have nan values, or only some lines?
- How many classes and images are in your dataset? And what tool did you use for labeling?
- What batch and subdivisions do you use?
- Do you use random=1?
- Do you train using multi-GPU?
It's just a few lines with nans.
Used an in-house tool for labeling.
batch=16, subdivisions = 16
Not sure about random=1. Where do I check/set that??
It's a single GPU.
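For context on the random=1 question above: that key lives in the [yolo] sections of the cfg-file and enables random input-resize augmentation during training. A minimal sketch for checking its value per [yolo] section, assuming a standard yolov3-style cfg layout (this is my own helper, not part of darknet):

```python
# Report the random= setting in each [yolo] section of a darknet cfg.
# Returns one value per [yolo] section; None if the key is absent.
def yolo_random_flags(cfg_text):
    flags, in_yolo = [], False
    for line in cfg_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            if in_yolo:
                flags.append(None)  # previous [yolo] had no random= key
            in_yolo = line == "[yolo]"
        elif in_yolo and line.startswith("random"):
            flags.append(int(line.split("=")[1]))
            in_yolo = False
    if in_yolo:
        flags.append(None)
    return flags

cfg = "[net]\nwidth=416\n[yolo]\nclasses=5\nrandom=1\n[yolo]\nclasses=5\nrandom=0\n"
print(yolo_random_flags(cfg))  # → [1, 0]
```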
@AlexeyAB "How many classes and images in your dataset? And what tool did you use for labeling?"
5 classes, ~17k images in the training set.
@sonalambwani Looks like normal output of training.
You have batch and subdivisions of 16. That means one image per mini-batch, and depending on the density of objects in your images, it's possible that no object will be assigned to a given layer, which leads to nan. It also depends on whether the ground truths are similar to the anchors: if they are all very small or all very large, you may not detect them in the layers for very large or very small objects.
So I agree with @AlexeyAB that this looks normal. Can you reduce the subdivisions so there are more images per mini-batch?
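The batch/subdivisions arithmetic behind this suggestion: darknet loads batch/subdivisions images per forward/backward pass (the "mini-batch") and updates the weights once per full batch. A one-liner to compare the settings discussed in this thread:

```python
# Images processed together in one forward/backward pass.
def images_per_minibatch(batch, subdivisions):
    return batch // subdivisions

print(images_per_minibatch(16, 16))  # → 1, the setting questioned above
print(images_per_minibatch(64, 16))  # → 4
print(images_per_minibatch(64, 8))   # → 8
```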
I have the same issue with batch=64, subdivisions=8.
I have followed this instruction, but I still don't understand whether I should change the anchors in yolo-obj.cfg for my own dataset.
@ndg123 Thank you for your suggestions. I am now testing with batch = 64 and subdiv=16. Right off the bat, I see fewer nans. There are a few, but it's looking better.
Per my training on a custom dataset: if not all of them are nans, it is fine. Since there are 3 different scales, a nan means that at some scale no object was detected. You could try a different input image size, or divide into 2 or 4 different scales instead of 3; then the number of nans should change.
@AlexeyAB Thanks for your reply, but I can't test anything.
@AlexeyAB How should I write the command darknet.exe detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -heigh 416 on an Ubuntu build of darknet?
@ss199302 ./darknet detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -heigh 416
I am trying to run calc_anchors in Linux using what @TheMikeyR says, and it returns to the command line immediately and gives no output. Is it supposed to print the anchors to stdout? I'm new to C. Where can I find the code this command runs?
Also, I'm training on my own data, and the bounding boxes in my training data are all the exact same size, and they are all squares. Do I still need to specify more than one anchor?
Is it possible to detect signatures (or any handwritten area) in printed receipts using YOLO? Which would be the best cfg file for this, and any suggestions before I start?
@brieh try Alexey repo https://github.com/AlexeyAB/darknet Here is the code https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L839
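For those who can't run the linked detector.c code: calc_anchors is essentially k-means clustering over the (width, height) of the training boxes. Below is a simplified Python sketch of that idea using plain Euclidean k-means; the real code in AlexeyAB's detector.c differs in its details (notably the distance metric), so treat this only as an illustration.

```python
import random

# Simplified anchor calculation: k-means over box (w, h) pairs in pixels.
# Not AlexeyAB's implementation; just the underlying idea.
def calc_anchors(boxes, k=9, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        # assign each box to its nearest center
        clusters = [[] for _ in range(k)]
        for w, h in boxes:
            i = min(range(k),
                    key=lambda j: (w - centers[j][0]) ** 2 + (h - centers[j][1]) ** 2)
            clusters[i].append((w, h))
        # move each center to the mean of its cluster (keep it if empty)
        centers = [
            ((sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
             if c else centers[i])
            for i, c in enumerate(clusters)
        ]
    return sorted(centers, key=lambda wh: wh[0] * wh[1])
```

On the single-anchor question above: if every ground-truth box really is the same square size, the clusters collapse to one point, which is consistent with the intuition that multiple anchors only help when box shapes vary.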
@TheMikeyR Thanks. I was using the pjreddie fork.
@AlexeyAB Hello! I use this command, ./darknet detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -heigh 416, to get anchors, but it doesn't return anything.
@ss199302 Same for me. Have you found a solution?
@ss199302 @spenceryue97 did you create the labels (*.txt) files first?
@ss199302 @spenceryue97 and you're definitely using AlexeyAB's fork?
I never got it working. I didn't want to switch to AlexeyAB's fork because we've modified our fork of pjreddie's repo. I tried copy-pasting the clustering code from AlexeyAB's detector.c into mine and remaking, but it still gave no output.
@sonalambwani Yes
@brieh I'm using pjreddie's repo
@spenceryue97 @brieh You can just get AlexeyAB's fork, run calc_anchors, and then copy the numbers into your cfg in pjreddie's repo.