tfyolo

loss nan

Open tak-s opened this issue 3 years ago • 7 comments

Thank you for providing a useful repository.

I ran this training code on TF 2.4:

python train.py --train_annotations_dir ../data/voc/voc_train.txt --test_annotations_dir ../data/voc/voc_test.txt --class_name_dir ../data/voc/voc.names --multi_gpus 2

After 5k iterations, the loss becomes nan...

Could you please share the training parameters and results you obtained with this repo?

tak-s avatar Mar 01 '21 04:03 tak-s

@tak-s Hey, if the loss becomes nan after the first epoch but before the warmup epochs finish, you can try decreasing the learning rate.
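For reference, warmup in YOLO-style trainers is usually a linear ramp from a small initial value up to a peak learning rate, and lowering that peak is where this fix applies. A minimal sketch, assuming illustrative parameter names (`warmup_steps`, `warmup_max_lr`) that may not match the ones actually exposed in config.py:

```python
def warmup_lr(step, warmup_steps, warmup_max_lr, init_lr=1e-6):
    """Linearly ramp the learning rate during warmup.

    If the loss turns nan while step < warmup_steps, try a smaller
    warmup_max_lr (e.g. 1e-3 -> 1e-4) before changing anything else.
    """
    if step < warmup_steps:
        return init_lr + (warmup_max_lr - init_lr) * step / warmup_steps
    return warmup_max_lr
```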

As for performance, I tried it on MNIST data with 1000 training samples and 1000 test samples; the model reaches 95 mAP compared to 93 mAP for YOLOv3. But when I tried it on the VOC dataset without pretraining, the current v5 reaches 23 mAP compared to 40 mAP for YOLOv3. I'm trying to figure it out, but I've been too busy with daily work recently.

A possible reason is:

  • The loss weights may not be well tuned: currently the classification loss is multiplied by the number of classes to balance it against the IoU loss and confidence loss, which may not be ideal (sketched below).
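To make that weighting concrete, here is a rough sketch of the kind of balancing described above; the function name and weighting are illustrative, not the repo's actual loss code:

```python
def total_loss(iou_loss, conf_loss, cls_loss, num_classes):
    """Combine the three YOLO loss terms.

    Scaling the classification term by num_classes (as described above)
    keeps it comparable to the IoU and confidence terms, but the factor
    may be too aggressive on datasets with many classes such as VOC.
    """
    return iou_loss + conf_loss + num_classes * cls_loss
```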

Any suggestions from your side are also welcome.

LongxingTan avatar Mar 02 '21 01:03 LongxingTan

Was the VOC dataset obtained with get_voc.sh when you got mAP = 0.23? And were the parameters in train.py left at the defaults, i.e., did you just run "python train.py"?

I would like to start by reproducing your training results.

tak-s avatar Mar 03 '21 07:03 tak-s

@tak-s yes, that's right.

  1. run get_voc.sh first
  2. python dataset/prepare_data.py
  3. python train.py

But I checked my result, and I think I only ran 20 epochs for that run due to my limited computation power. The default number of epochs in config.py is 30. I also just committed my latest local version in case I had missed anything.

LongxingTan avatar Mar 03 '21 07:03 LongxingTan


Hello author, thank you for the information provided, but when I ran train.py on the COCO dataset the loss still becomes nan. Do you have any suggestions?

gongkecun avatar Apr 12 '21 13:04 gongkecun

@gongkecun

Hi, if the loss is still NaN, I suggest checking the following common steps:

  • Make sure the input data contains no NaN values and no out-of-range boxes, and that its format matches what the preprocessing expects; if NaN appears in the first epoch, it may be caused by the data or by a learning rate that is too large (a quick data check is sketched below).
  • If the data is fine and NaN appears after the first epoch but before the warmup epochs finish, lower the maximum learning rate of the warmup phase.
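As a concrete example of the first check, here is a minimal sketch that scans a txt annotation file for NaN or degenerate boxes. It assumes each line looks like `image_path x1,y1,x2,y2,class ...`; that format is an assumption and may differ from what prepare_data.py actually writes:

```python
import math

def check_annotations(annotation_file):
    """Return line numbers whose boxes contain NaN or invalid coordinates.

    Assumes each line looks like: image_path x1,y1,x2,y2,class ...
    Adjust the parsing if prepare_data.py writes a different format.
    """
    bad_lines = []
    with open(annotation_file) as f:
        for line_no, line in enumerate(f, 1):
            for box in line.split()[1:]:
                values = [float(v) for v in box.split(",")]
                x1, y1, x2, y2 = values[:4]
                if any(math.isnan(v) for v in values) or x2 <= x1 or y2 <= y1:
                    bad_lines.append(line_no)
                    break
    return bad_lines

print(check_annotations("../data/voc/voc_train.txt"))
```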

LongxingTan avatar Apr 13 '21 01:04 LongxingTan

It feels like the code has a bug; does the author plan to fix it? @LongxingTan

sunpeng981712364 avatar Apr 25 '21 12:04 sunpeng981712364

Thanks to the author for providing a good TensorFlow YOLOv5 implementation. There is a mistake in the IoU calculation which leads to an incorrect IoU loss and a nan loss value. I filed PR #4 that fixes this issue.
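For context, nan from an IoU term usually comes from a division by a zero (or negative) union area or from unclamped intersections. A generic numerically stable version looks like the sketch below; this only illustrates the usual failure mode and is not the actual content of PR #4:

```python
import tensorflow as tf

def safe_iou(boxes1, boxes2, eps=1e-9):
    """Compute IoU for [x1, y1, x2, y2] boxes with guards against nan.

    Clamping the intersection at zero and adding eps to the union keeps
    the division well-defined even for degenerate or disjoint boxes.
    """
    inter_x1 = tf.maximum(boxes1[..., 0], boxes2[..., 0])
    inter_y1 = tf.maximum(boxes1[..., 1], boxes2[..., 1])
    inter_x2 = tf.minimum(boxes1[..., 2], boxes2[..., 2])
    inter_y2 = tf.minimum(boxes1[..., 3], boxes2[..., 3])
    inter = tf.maximum(inter_x2 - inter_x1, 0.0) * tf.maximum(inter_y2 - inter_y1, 0.0)
    area1 = (boxes1[..., 2] - boxes1[..., 0]) * (boxes1[..., 3] - boxes1[..., 1])
    area2 = (boxes2[..., 2] - boxes2[..., 0]) * (boxes2[..., 3] - boxes2[..., 1])
    union = area1 + area2 - inter
    return inter / (union + eps)
```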

vbvg2008 avatar May 06 '21 23:05 vbvg2008