darknet
Training YOLOv3 with own dataset
Hi everyone, Has anyone had success with training YOLOv3 for their own datasets? If so, could you help sort out some questions for me:
For me, I have a 5-class object detection problem. In the .cfg file, I have changed the number of classes and the number of filters to 3*(num_classes+5) = 30 in 3 different places. I can initiate the training, but the loss blows up to start with and I am seeing a bunch of nans in the output message (see snippet).
Here are my questions:
- Did you need to change the anchor box sizes and/or the number of anchors?
- Did you need to create the labels differently than for YOLO v2?
Thanks!
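The filters arithmetic mentioned above can be sketched as a quick sanity check (nothing darknet-specific, just the formula from the question):

```python
# Filters for each [convolutional] layer directly before a [yolo] layer:
# one box prediction per anchor at that scale, each carrying
# 4 box coords + 1 objectness score + one score per class.
def yolo_filters(num_classes, anchors_per_scale=3):
    return anchors_per_scale * (num_classes + 5)

print(yolo_filters(5))   # → 30, the 5-class problem from the question
print(yolo_filters(80))  # → 255, the COCO default
```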
No, you don't need to change your training set. You need to calculate your anchors as you did for YOLOv2, but multiply by 32 (and round). Then split the anchors among the layers: if you have 9 anchors you can split them 3 ways, but decide based on size. Each anchor contributes 5 + the number of classes filters. I got OK results with the default anchors, but you could recompute. Remember your anchor calculation should be at the same scale as the input size of the network.
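The procedure described above can be sketched as follows. This is only an illustration of the arithmetic (the helper names are made up, and the YOLOv2 anchor values below are just examples): YOLOv2 anchors are in grid-cell units, so multiplying by the stride of 32 converts them to pixels, and sorting by area lets the smallest anchors go to the highest-resolution detection layer.

```python
# Convert YOLOv2-style anchors (grid-cell units) to YOLOv3-style pixels.
def scale_anchors(v2_anchors, stride=32):
    return [(round(w * stride), round(h * stride)) for w, h in v2_anchors]

# Split anchors into equal groups by area, smallest first.
def split_by_size(anchors, groups=3):
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1])
    n = len(ordered) // groups
    return [ordered[i * n:(i + 1) * n] for i in range(groups)]

# Example YOLOv2 anchors (illustrative values only):
v2 = [(0.57, 0.67), (1.87, 2.06), (3.33, 5.47), (7.88, 3.52), (9.77, 9.17)]
print(scale_anchors(v2))
```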
@sonalambwani Just wait about 1000 iterations and the nans will disappear: https://github.com/AlexeyAB/darknet/issues/504#issuecomment-377290060
- You can re-calculate anchors, but it is not necessary. You can calculate anchors for Yolo v3 using this fork: https://github.com/AlexeyAB/darknet
and this command (if your cfg-file has width=416 and height=416):
darknet.exe detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -heigh 416
You can use these anchors in your cfg-file (without multiplying by 32).
- You can use the same labels as for Yolo v2
@AlexeyAB Hello, but after waiting about 1000 iterations, the nans still appear:
Hi, I am trying to do Training YOLO on VOC.
Below is the command I am using: ./darknet detector train cfg/voc.data cfg/yolov3-voc.cfg darknet53.conv.74
But the nans keep increasing. Is this normal or an issue?
Loaded: 0.000063 seconds
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: nan, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: nan, .5R: -nan, .75R: -nan, count: 0
3296: -nan, nan avg, 0.001000 rate, 0.416401 seconds, 3296 images
I have the same issue. The error shows up in the second yolo layer. Did you solve this problem?
same tooooo
@ss199302
If only some lines show nan then training is going well, but if all lines show nan then training has gone wrong.
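That rule of thumb can be checked mechanically. Below is a rough sketch that counts how many of the per-scale region lines in a darknet training log contain nan; the line format is taken from the log snippet quoted earlier in this thread, and the sample lines here are abbreviated and invented for illustration.

```python
# Fraction of region lines (one per detection scale) that contain nan.
# A few nan lines are expected (a scale with count: 0 saw no objects);
# a ratio of 1.0 means the training has diverged.
def nan_ratio(log_lines):
    region = [l for l in log_lines if "Avg IOU" in l]
    bad = [l for l in region if "nan" in l.lower()]
    return len(bad) / len(region) if region else 0.0

log = [
    "Region 82 Avg IOU: 0.61, Class: 0.72, Obj: 0.55, No Obj: 0.01, count: 3",
    "Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: nan, count: 0",
    "Region 106 Avg IOU: 0.43, Class: 0.66, Obj: 0.38, No Obj: 0.02, count: 5",
]
print(nan_ratio(log))  # one of three scales is nan: fine
```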
@AlexeyAB As you suggested, I am now training with my new dataset with the default COCO anchor boxes. I am training from "Scratch", i.e., no initialization with the pretrained convolutional weights as you have done in https://github.com/AlexeyAB/darknet/issues/504#issuecomment-377290060
For me, I see nans even after 2500 iterations. The loss (after starting off really high) has dropped to a reasonable range, but there is more fluctuation in the loss between mini-batches.
Have you, or anyone else here, noticed similar behavior?
"For me, I see nans even after 2500 iterations."
- All lines have nan values, or only some lines?
- How many classes and images are in your dataset? And what tool did you use for labeling?
- What batch and subdivisions do you use?
- Do you use random=1?
- Do you train using multi-GPU?
It's just a few lines with nans.
Used an in-house tool for labeling.
batch=16, subdivisions = 16
Not sure about random=1. Where do I check/set that??
It's a single GPU.
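For context on the random=1 question above: that key lives in the [yolo] sections of the cfg-file and enables random input-resize augmentation during training. A minimal sketch for checking its value per [yolo] section, assuming a standard yolov3-style cfg layout (this is my own helper, not part of darknet):

```python
# Report the random= setting in each [yolo] section of a darknet cfg.
# Returns one value per [yolo] section; None if the key is absent.
def yolo_random_flags(cfg_text):
    flags, in_yolo = [], False
    for line in cfg_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            if in_yolo:
                flags.append(None)  # previous [yolo] had no random= key
            in_yolo = line == "[yolo]"
        elif in_yolo and line.startswith("random"):
            flags.append(int(line.split("=")[1]))
            in_yolo = False
    if in_yolo:
        flags.append(None)
    return flags

cfg = "[net]\nwidth=416\n[yolo]\nclasses=5\nrandom=1\n[yolo]\nclasses=5\nrandom=0\n"
print(yolo_random_flags(cfg))  # → [1, 0]
```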
@AlexeyAB "How many classes and images in your dataset? And what tool did you use for labeling?"
5 classes, ~17k images in the training set.
@sonalambwani Looks like normal output of training.
You have batch and subdivisions of 16. That means one image per mini-batch, and depending on the density of objects in your images, it's possible that no object will be assigned to a given layer, which leads to nan. It also depends on whether the ground truths are similar to the anchors: if they are all very small or all very large, you may not detect them in the layers for very large or very small objects.
So I agree with @AlexeyAB that this looks normal. Can you reduce the subdivisions so there are more images per mini-batch?
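The batch/subdivisions arithmetic behind this suggestion: darknet loads batch/subdivisions images per forward/backward pass (the "mini-batch") and updates the weights once per full batch. A one-liner to compare the settings discussed in this thread:

```python
# Images processed together in one forward/backward pass.
def images_per_minibatch(batch, subdivisions):
    return batch // subdivisions

print(images_per_minibatch(16, 16))  # → 1, the setting questioned above
print(images_per_minibatch(64, 16))  # → 4
print(images_per_minibatch(64, 8))   # → 8
```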
I have the same issue with batch=64, subdivisions=8.
I have followed this instruction, but I still don't understand whether I should change the anchors in yolo-obj.cfg for my own dataset.
@ndg123 Thank you for your suggestions. I am now testing with batch = 64 and subdiv=16. Right off the bat, I see fewer nans. There are a few, but it's looking better.
Per my training on a custom dataset: if not all of them are nans, it is fine. Since there are 3 different scales, a nan means that at some scale no object was detected. You could try a different input image size, or divide into 2 or 4 different scales instead of 3; then the number of nans should change.
@AlexeyAB Thanks for your reply, but I can't test anything.
@AlexeyAB How should I write the command darknet.exe detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -heigh 416 on an Ubuntu build of darknet?
@ss199302 ./darknet detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -heigh 416
I am trying to run calc_anchors in Linux using what @TheMikeyR says, and it returns to the command line immediately and gives no output. Is it supposed to print the anchors to stdout? I'm new to C. Where can I find the code this command runs?
Also, I'm training on my own data, and the bounding boxes in my training data are all the exact same size, and they are all squares. Do I still need to specify more than one anchor?
Is it possible to detect signatures (or any handwritten area) in printed receipts using YOLO? Which would be the best cfg file for this, and any suggestions before I start?
@brieh try Alexey repo https://github.com/AlexeyAB/darknet Here is the code https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L839
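For those who can't run the linked detector.c code: calc_anchors is essentially k-means clustering over the (width, height) of the training boxes. Below is a simplified Python sketch of that idea using plain Euclidean k-means; the real code in AlexeyAB's detector.c differs in its details (notably the distance metric), so treat this only as an illustration.

```python
import random

# Simplified anchor calculation: k-means over box (w, h) pairs in pixels.
# Not AlexeyAB's implementation; just the underlying idea.
def calc_anchors(boxes, k=9, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        # assign each box to its nearest center
        clusters = [[] for _ in range(k)]
        for w, h in boxes:
            i = min(range(k),
                    key=lambda j: (w - centers[j][0]) ** 2 + (h - centers[j][1]) ** 2)
            clusters[i].append((w, h))
        # move each center to the mean of its cluster (keep it if empty)
        centers = [
            ((sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
             if c else centers[i])
            for i, c in enumerate(clusters)
        ]
    return sorted(centers, key=lambda wh: wh[0] * wh[1])
```

On the single-anchor question above: if every ground-truth box really is the same square size, the clusters collapse to one point, which is consistent with the intuition that multiple anchors only help when box shapes vary.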
@TheMikeyR Thanks. I was using the pjreddie fork.
@AlexeyAB Hello! I use this command, ./darknet detector calc_anchors data/voc.data -num_of_clusters 9 -width 416 -heigh 416, to get anchors, but it doesn't return anything.
@ss199302 Same for me. Have you found a solution?
@ss199302 @spenceryue97 did you create the labels (*.txt) files first?
@ss199302 @spenceryue97 and you're definitely using AlexeyAB's fork?
I never got it working. I didn't want to switch to AlexeyAB's fork because we've modified our fork of pjreddie's repo. I tried copy-pasting the clustering code from AlexeyAB's detector.c into mine and remaking, but it still gave no output.
@sonalambwani Yes
@brieh I'm using pjreddie's repo
@spenceryue97 @brieh You can just get AlexeyAB's fork, run calc_anchors, and then copy the numbers into your cfg in pjreddie's repo.