Machine-Learning-Collection
Too much time spent in get_evaluation_bboxes in YOLOv3
It takes a very long time. In get_evaluation_bboxes, the loop below takes more than 10 hours to run:

```python
for idx in range(batch_size):
    nms_boxes = non_max_suppression(
        bboxes[idx],
        iou_threshold=iou_threshold,
        threshold=threshold,
        box_format=box_format,
    )
```
I have the same issue. It progressively gets slower for some reason.
Yes, I am also experiencing this; it's stuck at batch 0.
100%|██████████████████████████████████████████████████████████████████████████| 26/26 [00:26<00:00, 1.03s/it, loss=55]
100%|████████████████████████████████████████████████████████████████████████| 26/26 [00:22<00:00, 1.14it/s, loss=51.8]
100%|████████████████████████████████████████████████████████████████████████| 26/26 [00:23<00:00, 1.13it/s, loss=49.9]
100%|████████████████████████████████████████████████████████████████████████| 26/26 [00:22<00:00, 1.16it/s, loss=49.2]
100%|███████████████████████████████████████████████████████████████████████████████████| 26/26 [00:07<00:00, 3.64it/s]
Class accuracy is: 9.126985%
No obj accuracy is: 0.085230%
Obj accuracy is: 99.735451%
0%| | 0/26 [00:00<?, ?it/s]eval batch : 0
The way I interpret this is that all candidate boxes are over the threshold, so the evaluation takes forever. This can happen because of a very low threshold, or because at the beginning of training the objectness scores are all very high. If you look, the no-obj accuracy is very low, which means nearly every box is passed on as containing an object. I do not know whether proper bias/weight initialization would fix this, or whether increasing the threshold is enough. One thing that I tried is to run the evaluation only after 10 epochs, when the scores have stabilized and no longer produce so many positive boxes (see the sketch below for why the threshold matters here).
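To make the threshold effect concrete, here is a minimal illustration (not the repo's code) of the filtering that happens before NMS, using the same [class_pred, objectness, x, y, w, h] box format discussed in this thread:

```python
# Illustration only: why a low confidence threshold makes early evaluation so slow.
def filter_by_confidence(bboxes, threshold):
    # Keep only boxes whose objectness score exceeds the threshold.
    return [box for box in bboxes if box[1] > threshold]

# With an untrained network most objectness scores sit well above 0.05, so a
# threshold of 0.05 keeps essentially all 10647 candidates per image, and the
# pairwise NMS loop that follows becomes extremely slow. Raising the threshold
# (or evaluating only after some epochs) shrinks this list dramatically.
```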
@ckyrkou Thank you for the reply. Currently I am:
- running the evaluation only after 20 epochs
- increasing NMS_IOU_THRESH to 0.75

and I am still getting 10647 bounding boxes, as below:
Class accuracy is: 35.317459%
No obj accuracy is: 6.079705%
Obj accuracy is: 69.444443%
0%| | 0/26 [00:00<?, ?it/s]
nme 0
bboxes , 10647
Any thoughts?
The No obj accuracy is still very low. You need to change CONF_THRESHOLD for that. In the original config it is set to 0.05. I used CONF_THRESHOLD = 0.4. You can try that.
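For reference, the value lives in the repo's config.py; a small excerpt with the suggested change (assumed layout, with the other threshold names mentioned in this thread included only for context):

```python
# config.py (excerpt, assumed layout)
CONF_THRESHOLD = 0.4    # repo default is 0.05; higher values pass far fewer boxes to NMS
MAP_IOU_THRESH = 0.5    # IoU threshold used when computing mAP
NMS_IOU_THRESH = 0.45   # IoU threshold used inside non_max_suppression
```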
@ckyrkou Thank you, I tried with CONF_THRESHOLD = 0.6 and it was working alright.
@beomgonyu could you please try this and see if it works? :-)
@guruprasaad123 Good to hear! Did you manage to reproduce the accuracies reported in the repo for Pascal VOC?
@ckyrkou I tried to reproduce the reported accuracy of > 78 mAP for Pascal VOC, but I couldn't get to that level yet. This is what I am getting after 20 epochs:
Class accuracy is: 54.754784%
No obj accuracy is: 100.000000%
Obj accuracy is: 0.000000%
MAP: 0.0
I am still running the script; if I get any improvement in accuracy I will let you know for sure.
Thanks. I tried running it for 100 epochs and reached up to 46 mAP. I was wondering if running for longer would increase performance. I noticed that the parameters in the video are different from what is actually in the repo.
@ckyrkou Cool, of course the parameters are different, I noticed that too. I was also wondering what the ideal parameters would be to reach mAP > 78. I am also running for more than 100 epochs; if I get an improvement I will let you know for sure.
Try suppressing the number of boxes by amending the second line of the non_max_suppression function to `bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)[:max_boxes]`. Also, evaluate only after a couple of epochs so that the model has had the chance to converge a little first. I was running the evaluation every 20 epochs for the 100examples.csv file.
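For context, a hedged sketch of how that amendment could sit inside non_max_suppression. The function body below is a reconstruction of the repo's list-based NMS, not a verbatim copy, and it assumes intersection_over_union is the IoU helper defined alongside it in utils.py; max_boxes is the new parameter being suggested:

```python
import torch

def non_max_suppression(bboxes, iou_threshold, threshold, box_format="midpoint", max_boxes=1024):
    # bboxes: list of predictions, each [class_pred, prob_score, x, y, w, h]
    bboxes = [box for box in bboxes if box[1] > threshold]

    # Amended line: sort by score and keep only the top max_boxes candidates,
    # which bounds the cost of the pairwise loop below even when the confidence
    # threshold lets thousands of boxes through.
    bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)[:max_boxes]

    bboxes_after_nms = []
    while bboxes:
        chosen_box = bboxes.pop(0)
        bboxes = [
            box
            for box in bboxes
            if box[0] != chosen_box[0]
            or intersection_over_union(  # IoU helper assumed from the repo's utils.py
                torch.tensor(chosen_box[2:]),
                torch.tensor(box[2:]),
                box_format=box_format,
            ) < iou_threshold
        ]
        bboxes_after_nms.append(chosen_box)
    return bboxes_after_nms
```

With max_boxes = 1024 (the value reported to work later in this thread), the per-image NMS cost stays bounded no matter how many raw predictions clear the threshold.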
@aningineer Thank you, I will surely try that out!
Thanks, that works @aningineer. I had to set max_boxes = 1024.
Did you eventually manage to get the reported mAP of over 70%?
Nope, not yet, I am still trying @ckyrkou.
Same here. Haven't been able to reproduce 78% as described.
I solved this issue by skipping the evaluation early in training. It seems there are so many predicted boxes at the start that the evaluation takes a very long time. Once training is steady, the evaluation works fine.
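One way to express that in the training loop: a sketch under the assumption that train.py iterates over epochs and calls get_evaluation_bboxes / mean_average_precision from utils (names follow the repo as far as I can tell; the epoch thresholds are just values used in this thread):

```python
import config
from utils import get_evaluation_bboxes, mean_average_precision

EVAL_START_EPOCH = 20   # skip evaluation until the model has partially converged
EVAL_INTERVAL = 10      # then evaluate every 10 epochs

for epoch in range(config.NUM_EPOCHS):
    train_fn(train_loader, model, optimizer, loss_fn, scaler, scaled_anchors)

    if epoch >= EVAL_START_EPOCH and epoch % EVAL_INTERVAL == 0:
        # Early in training nearly every anchor clears CONF_THRESHOLD, so NMS over
        # ~10k boxes per image makes this step take hours; deferring it avoids that.
        pred_boxes, true_boxes = get_evaluation_bboxes(
            test_loader,
            model,
            iou_threshold=config.NMS_IOU_THRESH,
            anchors=config.ANCHORS,
            threshold=config.CONF_THRESHOLD,
        )
        mapval = mean_average_precision(
            pred_boxes,
            true_boxes,
            iou_threshold=config.MAP_IOU_THRESH,
            box_format="midpoint",
            num_classes=config.NUM_CLASSES,
        )
        print(f"MAP: {mapval.item()}")
```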
@beomgonyu What final mAP did you get for Pascal VOC?
Hey there, I'm trying to train on 100 examples. I was stuck with the same problem and at least now it works. Still, none of the accuracies seem to change and the mAP is fixed at 0. Any idea why?
@SimoDB90,
Did you notice that there are differences between the code in the video and the repository? For example in the config file.
Since I read that there were problems, I trained for 50 epochs without checking the validation and saved the checkpoint file. Then I went back to training using that file as pre-trained weights, and now I can check the accuracy (mAP).
At the moment, as long as my loss has not dropped close to zero, I keep training to try to reach mAP 0.78.
I also added:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=3, verbose=True)
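A note on how that scheduler could be driven each epoch: ReduceLROnPlateau needs a monitored metric passed to step(). A minimal sketch, assuming train_fn is modified to return the mean loss over the epoch (if it doesn't already):

```python
# After creating the scheduler as in the line above:
for epoch in range(num_epochs):
    mean_loss = train_fn(train_loader, model, optimizer, loss_fn, scaler, scaled_anchors)
    # If the monitored loss stops improving for `patience` epochs,
    # the learning rate is multiplied by `factor` (0.1 here).
    scheduler.step(mean_loss)
```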
Yes, I've noticed, and I used the repository to clean up my code. But I didn't find any impactful difference aside from CONF_THRESHOLD. In the repository it is 0.05, but below 0.6 (even with max_boxes = 1024) the training is painfully slow. And after 10 epochs the mAP is something like 0.0, or on the order of e-5. I'm trying to rerun on train.csv and test.csv, but I'm pretty sure that even on 100 examples or 8 examples the mAP should go up to 0.9 after a few epochs. The fact that it doesn't is driving me insane, because I don't know why the sum of TP is always a tensor of zeros. My loss is very often NaN, and I really can't understand why. I'm testing with a conf_threshold of 0.5.
I suspect there is some problem in the code, but if the same code is working for you, I don't understand what more to do.
@SimoDB90, tonight I will upload to Gmail and share a checkpoint where I have a loss around 1.50, and you can start from there.
Thank you a lot! Just one thing: are the utils functions right, or is there something wrong in the repository?
I haven't checked it in detail, but I think they are fine; at least the model is converging, despite taking its time:
- ~9 minutes per epoch
- batch size = 8 (I only have 6 GB of VRAM; with more than that it fails)
I'm working on Google Colab Pro... my GTX 960 can only run it to test whether the code works :-) Just to know, how many epochs did you need before seeing a non-zero mAP?
I read in this thread that there were some problems at the beginning of training, so I trained without checking the accuracy and only started checking it after 50 epochs.
I think that once the loss was below 20 I already started to get positive values; that is an estimate, because I didn't check it at the beginning.
Fine, thank you. Maybe I stopped the training too early then.
Well, I tried to train on 100examples.csv for 100 epochs. It never converged: always mAP 0.0 and obj accuracy 0%, while no-obj accuracy is always 100%. I don't know where, but I suppose there is a problem in the code. I compared against the repository carefully and there are no differences. I tried with a conf_threshold of 0.6, a map_iou_threshold of 0.5, an nms_iou_threshold of 0.45, a learning rate of 1e-5, and zero weight decay.
Good morning,
stay calm, everything will resolve itself; it must be one of those bugs that sometimes come up. I'm on vacation and I couldn't upload the checkpoint, the hotel WiFi is slow and the file is big. I will try to upload it later.
Update: the checkpoint file is 740 MB. I can't upload it on the hotel WiFi, it's too slow :( At the end of next week I will be home and I will upload it.
By any chance, did you try to train with a smaller batch size and in full precision (fp32)?
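If anyone wants to rule out mixed precision as the source of the NaN losses mentioned above, here is a sketch of a plain fp32 training step. It assumes the loop structure used in this repo's train_fn (three prediction scales, one loss term per scale); treat the names as illustrative:

```python
# Plain fp32 training step: no torch.cuda.amp autocast and no GradScaler.
for x, y in train_loader:
    x = x.to(device)
    y0, y1, y2 = y[0].to(device), y[1].to(device), y[2].to(device)

    out = model(x)
    loss = (
        loss_fn(out[0], y0, scaled_anchors[0])
        + loss_fn(out[1], y1, scaled_anchors[1])
        + loss_fn(out[2], y2, scaled_anchors[2])
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```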
@SimoDB90, try using these pre-trained weights (loss = 1.4) to continue your training (batch_size = 8, lr = 1e-6):
https://drive.google.com/file/d/1utjhWJ-KB11MsWNhWsE_J3xsh9QDMsLL/view?usp=sharing
I did some tests and got the best mAP with CONF_THRESHOLD = 0.05, as it is set in the config file in the GitHub repository.
I'm wondering if it's worth continuing training until the loss is smaller. How cool would it be to do this with a vision transformer!