
too much time in get_evaluation_bboxes in yolo v3

Open beomgonyu opened this issue 3 years ago • 42 comments

It takes a very long time: in get_evaluation_bboxes, the code below takes more than 10 hours to run.

    for idx in range(batch_size):
        nms_boxes = non_max_suppression(
            bboxes[idx],
            iou_threshold=iou_threshold,
            threshold=threshold,
            box_format=box_format,
        )
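For context, here is a minimal pure-Python sketch of the kind of suppression loop involved (the box format [class_pred, score, x, y, w, h] and the helper names are illustrative assumptions, not the repo's exact code). The loop is roughly quadratic in the number of boxes that survive the confidence threshold, which is why a near-zero objectness threshold early in training makes evaluation crawl:

```python
# Sketch of per-class NMS (assumed box format: [class_pred, score, x, y, w, h]).
# Cost grows roughly O(n^2) in the number of boxes over the confidence
# threshold, so thousands of surviving candidates per image are very slow.

def iou_xywh(a, b):
    """IoU of two boxes given as (cx, cy, w, h) — a simplified stand-in."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(bboxes, iou_threshold, threshold):
    # Drop boxes below the confidence threshold first — this is the
    # step that fails to prune anything when the threshold is too low.
    bboxes = [b for b in bboxes if b[1] > threshold]
    bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)
    kept = []
    while bboxes:
        best = bboxes.pop(0)
        kept.append(best)
        bboxes = [
            b for b in bboxes
            if b[0] != best[0] or iou_xywh(b[2:], best[2:]) < iou_threshold
        ]
    return kept
```

With 10647 candidates all passing the threshold, the inner filtering runs against every remaining box on each iteration, hence the hours-long evaluation.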

beomgonyu avatar Jun 23 '21 06:06 beomgonyu

I have the same issue. It progressively gets slower for some reason.

ckyrkou avatar Jun 24 '21 13:06 ckyrkou

Yes, I am also experiencing this; stuck at batch 0:

100%|██████████████████████████████████████████████████████████████████████████| 26/26 [00:26<00:00,  1.03s/it, loss=55]
100%|████████████████████████████████████████████████████████████████████████| 26/26 [00:22<00:00,  1.14it/s, loss=51.8]
100%|████████████████████████████████████████████████████████████████████████| 26/26 [00:23<00:00,  1.13it/s, loss=49.9]
100%|████████████████████████████████████████████████████████████████████████| 26/26 [00:22<00:00,  1.16it/s, loss=49.2]
100%|███████████████████████████████████████████████████████████████████████████████████| 26/26 [00:07<00:00,  3.64it/s]
Class accuracy is: 9.126985%
No obj accuracy is: 0.085230%
Obj accuracy is: 99.735451%
  0%|                                                                                            | 0/26 [00:00<?, ?it/s]eval batch :  0

guruprasaad123 avatar Jun 24 '21 16:06 guruprasaad123

The way I interpret this is that all candidate boxes are over the threshold, so the evaluation takes forever. This might happen because of a very low threshold, or because the objectness scores are very high at the beginning. If you look, the no-obj accuracy is very low, which means that all boxes are passed as containing an object. I do not know whether proper bias/weight initialization can fix this, or whether increasing the threshold can. One thing that I tried is to do the evaluation only after 10 epochs, by which point the values have stabilized and do not lead to as many positive boxes.

ckyrkou avatar Jun 25 '21 06:06 ckyrkou

@ckyrkou Thank you for the reply. Currently:

  • I am doing the evaluation after 20 epochs
  • I have increased NMS_IOU_THRESH to 0.75

I am still getting 10647 bounding boxes, as below:

Class accuracy is: 35.317459%
No obj accuracy is: 6.079705%
Obj accuracy is: 69.444443%
  0%|                                                                                            | 0/26 [00:00<?, ?it/s]
nme  0
bboxes ,  10647

Any thoughts?

guruprasaad123 avatar Jun 25 '21 09:06 guruprasaad123

The No obj accuracy is still very low. You need to change CONF_THRESHOLD for that. In the original config it is set to 0.05. I used CONF_THRESHOLD = 0.4. You can try that.
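To see the scale of the effect, here is a toy count (the scores are a made-up uniform spread, not model outputs; 10647 is the YOLOv3 candidate count at 416×416 mentioned above) of how many boxes survive each confidence threshold and therefore enter NMS:

```python
# Illustrative only: how many of the 10647 candidate boxes survive a
# given confidence threshold. Scores are a synthetic uniform spread.
scores = [i / 10647 for i in range(10647)]

survivors_low = sum(s > 0.05 for s in scores)   # CONF_THRESHOLD = 0.05
survivors_high = sum(s > 0.4 for s in scores)   # CONF_THRESHOLD = 0.4
print(survivors_low, survivors_high)
```

Under this synthetic spread, raising the threshold from 0.05 to 0.4 removes thousands of boxes before the quadratic suppression step ever runs; with a real early-training model whose objectness scores are all high, the difference is even more dramatic.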

ckyrkou avatar Jun 25 '21 10:06 ckyrkou

@ckyrkou Thank you, I tried with CONF_THRESHOLD = 0.6 and it worked alright. @beomgonyu, can you please try this and see if it works? :-)

guruprasaad123 avatar Jun 26 '21 10:06 guruprasaad123

@guruprasaad123 Good to hear! Did you manage to reproduce the accuracies reported in the repo for pascal_voc?

ckyrkou avatar Jun 26 '21 11:06 ckyrkou

@ckyrkou I tried to reproduce the accuracy, which is > 78 for pascal_voc, but I couldn't get to that level as of now. This is what I am getting after 20 epochs:

Class accuracy is: 54.754784%
No obj accuracy is: 100.000000%
Obj accuracy is: 0.000000%

MAP: 0.0

and I am still running the script; if I get any improvement in accuracy, I will let you know for sure.

guruprasaad123 avatar Jun 26 '21 11:06 guruprasaad123

Thanks. I tried running it for 100 epochs, achieving up to 46 mAP. I was wondering if running for more would increase performance. I noticed that the parameters in the video are different from what is actually in the repo.

ckyrkou avatar Jun 26 '21 11:06 ckyrkou

@ckyrkou Cool, of course the parameters are different, I noticed that too. I was also wondering what the ideal parameters would be to reach mAP > 78. I am also running for more than 100 epochs; if I get an improvement I will let you know for sure.

guruprasaad123 avatar Jun 26 '21 11:06 guruprasaad123

Try suppressing the number of boxes by amending the second line in the non_max_suppression function to bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)[:max_boxes]. Also, evaluate after a couple of epochs so that the model has had the chance to converge a little bit first. I was running evaluation every 20 epochs for the 100examples.csv file.
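A sketch of that amendment (the function name cap_boxes and the max_boxes default are illustrative; in the repo the change is just the amended sort line inside non_max_suppression):

```python
# Sketch of capping candidate boxes before suppression. max_boxes is a
# hypothetical parameter added to the repo's non_max_suppression; the
# assumed box format is [class_pred, score, x, y, w, h].
def cap_boxes(bboxes, threshold, max_boxes=1024):
    # Keep only boxes over the confidence threshold, then keep at most
    # the max_boxes highest-scoring ones — the amended line:
    # bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)[:max_boxes]
    bboxes = [box for box in bboxes if box[1] > threshold]
    return sorted(bboxes, key=lambda x: x[1], reverse=True)[:max_boxes]
```

This bounds the quadratic suppression step at max_boxes inputs regardless of how many candidates pass the confidence threshold, at the cost of possibly dropping low-scoring true positives.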

aningineer avatar Jul 17 '21 07:07 aningineer

@aningineer Thank you, I will surely try that out!

guruprasaad123 avatar Jul 18 '21 10:07 guruprasaad123

Thanks, that works @aningineer; I had to set max_boxes = 1024.

guruprasaad123 avatar Jul 22 '21 03:07 guruprasaad123

Did you eventually manage to get the reported over-70% mAP?

ckyrkou avatar Jul 27 '21 11:07 ckyrkou

Nope, not yet; I am still trying, @ckyrkou.

guruprasaad123 avatar Jul 27 '21 11:07 guruprasaad123

Same here. I haven't been able to reproduce the 78% as described.

aningineer avatar Jul 27 '21 13:07 aningineer

I solved this issue by skipping the evaluation early in training. It seems there are so many predicted boxes at first that evaluation takes a very long time; once training has settled, evaluation works well.

beomgonyu avatar Jul 28 '21 00:07 beomgonyu

@beomgonyu What final mAP did you get for Pascal VOC?

ckyrkou avatar Aug 06 '21 16:08 ckyrkou

Hey there, I'm trying to train on 100 examples. I was stuck with the same problem, and at least now it works. Still, none of the accuracies seem to change and the mAP is fixed at 0. Any idea why?

SimoDB90 avatar Aug 20 '21 11:08 SimoDB90

@SimoDB90,

Did you notice that there are differences between the code in the video and the repository? For example, in the config file.

Since I read that there were problems, I trained for 50 epochs without checking the validation and saved the checkpoint file. Then I went back to training using that file as pre-trained weights, and now I can check accuracy (mAP).

At the moment, as long as my loss has not dropped to near 0, I keep training to try to reach mAP 0.78.

I also added: scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=3, verbose=True)
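For context on that line: unlike most PyTorch schedulers, ReduceLROnPlateau is stepped with the monitored metric, once per epoch. A minimal sketch of the wiring (the model, optimizer, and plateaued loss value below are placeholders, not the repo's training loop):

```python
import torch

# Placeholder model/optimizer, standing in for YOLOv3 and its optimizer.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=3
)

for epoch in range(6):
    # ... training pass would go here ...
    val_loss = 1.0  # placeholder: a validation loss that has plateaued
    # ReduceLROnPlateau must be stepped with the monitored value.
    scheduler.step(val_loss)

# After `patience` epochs without improvement, the LR is multiplied by `factor`.
print(optimizer.param_groups[0]["lr"])
```

With factor=0.1 and patience=3, a stalled validation loss cuts the learning rate by 10x, which can help squeeze out extra mAP late in training.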

JoaoCH avatar Aug 21 '21 14:08 JoaoCH

Yes, I've noticed, and I used the repository to clean up my code. But I didn't find any impactful difference aside from CONF_THRESHOLD. In the repository it is 0.05, but below 0.6 (even with max_boxes = 1024) the training is painfully slow. And after 10 epochs, the mAP is something like 0.0, or e-5. I'm trying to rerun on train.csv and test.csv, but I'm pretty sure that even on 100 examples or 8 examples the mAP should go up to 0.9 after a few epochs. The fact that it doesn't is driving me insane, because I don't know why the sum of TP is always a tensor of 0. My loss is very often NaN, but I really can't understand why. I'm testing with a conf_threshold of 0.5.

I suspect there is some problem in the code, but if the same code is working for you, I don't know what more to do.

SimoDB90 avatar Aug 21 '21 15:08 SimoDB90

@SimoDB90, tonight I will upload to Gmail and share a checkpoint where I have a loss around 1.50, and you can start from there.

JoaoCH avatar Aug 21 '21 16:08 JoaoCH

Thank you a lot! Just one thing: are the utils functions right? Or is there something wrong in the repository?

SimoDB90 avatar Aug 21 '21 16:08 SimoDB90

I haven't checked them in detail, but I think they are fine; at least the model is converging, despite taking its time:

  • 9 min per epoch
  • batch_size = 8 (I only have 6 GB of VRAM; anything more fails)

JoaoCH avatar Aug 21 '21 16:08 JoaoCH

I'm working on Google Colab Pro... my GTX 960 can only run to test whether the code works :-) Just to know, how many epochs did you need before seeing a non-zero value of mAP?

SimoDB90 avatar Aug 21 '21 16:08 SimoDB90

I read in this thread that there were some problems at the beginning of training, so I trained without checking the accuracy, and only after 50 epochs did I start to check it.

I think that by the time the loss was below 20 I already had positive values; it's an estimate, because I didn't check at the beginning.

JoaoCH avatar Aug 21 '21 16:08 JoaoCH

Fine, thank you. Maybe I stopped the training too early, then.

SimoDB90 avatar Aug 21 '21 16:08 SimoDB90

Well, I tried to train on 100examples.csv for 100 epochs. It never converged: always mAP 0.0 and obj_accuracy 0%, while noobj_accuracy is always 100%. I don't know where, but I suppose there is a problem in the code. I compared carefully with the repository and there are no differences. I tried with conf_threshold 0.6, map_iou_threshold 0.5, nms_iou_threshold 0.45, learning rate 1e-5, and 0 weight decay.

SimoDB90 avatar Aug 21 '21 23:08 SimoDB90

Good morning,

Stay calm, everything will resolve itself; it must be one of those bugs that sometimes come up. I'm on vacation and couldn't upload the checkpoint; the hotel's WiFi is slow and the file is big. I will try to upload it later.

Update: the checkpoint file is 740 MB. I can't upload it on the hotel WiFi, too slow :( At the end of next week I will be home and will upload it.

By any chance, did you try to train with a smaller batch size and in full precision (fp32)?
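On the fp32 point: assuming the training loop uses torch.cuda.amp mixed precision (as the repo's train script appears to), a full-precision debugging run just means disabling the scaler and autocast. A minimal sketch (the model, data, and USE_AMP flag here are placeholders, not the repo's code), which can help rule out fp16 overflow as the source of NaN losses:

```python
import torch

USE_AMP = False  # False = full fp32, useful when debugging NaN losses

# Placeholder model and data, standing in for the detector and loader.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP)  # no-op when disabled

x = torch.randn(8, 4)
y = torch.randn(8, 1)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=USE_AMP):  # runs in fp32 when disabled
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # behaves like plain backward() when disabled
scaler.step(optimizer)
scaler.update()
```

Keeping the scaler/autocast calls but gating them on one flag means the same loop works in both modes, so the comparison isolates precision as the variable.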

JoaoCH avatar Aug 22 '21 10:08 JoaoCH

@SimoDB90, try using these pre-trained weights (loss = 1.4) to continue your training (batch_size=8, lr=1e-6): https://drive.google.com/file/d/1utjhWJ-KB11MsWNhWsE_J3xsh9QDMsLL/view?usp=sharing

I did some tests and got the best mAP with CONF_THRESHOLD = 0.05, as it is in the config file in the GitHub repository.

I'm wondering if it's worth continuing training until I have a smaller loss. How cool would it be to do this with a vision transformer!

JoaoCH avatar Aug 27 '21 15:08 JoaoCH