tensorflow-yolov3

Train loss: nan Test loss: nan Saving

Open juanmanuelrq opened this issue 4 years ago • 8 comments

Hi, I was training and the loss became nan:

` => Epoch: 977 Time: 2020-03-03 10:50:58 Train loss: nan Test loss: nan Saving ./checkpoint/yolov3_test_loss=nan.ckpt ...
0it [00:00, ?it/s]
=> Epoch: 978 Time: 2020-03-03 10:51:11 Train loss: nan Test loss: nan Saving ./checkpoint/yolov3_test_loss=nan.ckpt ...
0it [00:00, ?it/s]
=> Epoch: 979 Time: 2020-03-03 10:51:30 Train loss: nan Test loss: nan Saving ./checkpoint/yolov3_test_loss=nan.ckpt ...
0it [00:00, ?it/s]
=> Epoch: 980 Time: 2020-03-03 10:51:48 Train loss: nan Test loss: nan Saving ./checkpoint/yolov3_test_loss=nan.ckpt ...
0it [00:00, ?it/s]
=> Epoch: 981 Time: 2020-03-03 10:52:03 Train loss: nan Test loss: nan Saving ./checkpoint/yolov3_test_loss=nan.ckpt ...

my config.py file

#! /usr/bin/env python
# coding=utf-8
#================================================================
#   Copyright (C) 2019 * Ltd. All rights reserved.
#   Editor      : VIM
#   File name   : config.py
#   Author      : YunYang1994
#   Created date: 2019-02-28 13:06:54
#   Description :
#================================================================

from easydict import EasyDict as edict

__C = edict()
# Consumers can get config by: from config import cfg
cfg = __C

# YOLO options
__C.YOLO = edict()

# Set the class name
__C.YOLO.CLASSES = "./data/classes/class.names"
__C.YOLO.ANCHORS = "./data/anchors/basline_anchors.txt"
__C.YOLO.MOVING_AVE_DECAY = 0.9995
__C.YOLO.STRIDES = [8, 16, 32]
__C.YOLO.ANCHOR_PER_SCALE = 3
__C.YOLO.IOU_LOSS_THRESH = 0.5
__C.YOLO.UPSAMPLE_METHOD = "resize"
__C.YOLO.ORIGINAL_WEIGHT = "./checkpoint/yolov3_coco.ckpt"
__C.YOLO.DEMO_WEIGHT = "./checkpoint/yolov3_coco_demo.ckpt"

# Train options
__C.TRAIN = edict()

__C.TRAIN.ANNOT_PATH = "./data/dataset/visdrone_train.txt"
__C.TRAIN.BATCH_SIZE = 6
__C.TRAIN.INPUT_SIZE = [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
__C.TRAIN.DATA_AUG = True
__C.TRAIN.LEARN_RATE_INIT = 1e-4
__C.TRAIN.LEARN_RATE_END = 1e-6
__C.TRAIN.WARMUP_EPOCHS = 2
__C.TRAIN.FISRT_STAGE_EPOCHS = 20
__C.TRAIN.SECOND_STAGE_EPOCHS = 20000
__C.TRAIN.INITIAL_WEIGHT = "./checkpoint/yolov3_coco_demo.ckpt"

# TEST options
__C.TEST = edict()

__C.TEST.ANNOT_PATH = "./data/dataset/visdrone_test.txt"
__C.TEST.BATCH_SIZE = 2
__C.TEST.INPUT_SIZE = 544
__C.TEST.DATA_AUG = False
__C.TEST.WRITE_IMAGE = True
__C.TEST.WRITE_IMAGE_PATH = "./data/detection/"
__C.TEST.WRITE_IMAGE_SHOW_LABEL = True
__C.TEST.WEIGHT_FILE = "./checkpoint/yolov3_test_loss=9.2099.ckpt-5"
__C.TEST.SHOW_LABEL = True
__C.TEST.SCORE_THRESHOLD = 0.3
__C.TEST.IOU_THRESHOLD = 0.45

`

juanmanuelrq • Mar 03 '20 15:03
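For readers wondering how `LEARN_RATE_INIT`, `LEARN_RATE_END` and `WARMUP_EPOCHS` might interact, here is a minimal sketch of a linear warmup followed by cosine decay. Whether the training script uses exactly this schedule is an assumption, and the function name `learning_rate` and the step counts are hypothetical, chosen only for illustration.

```python
import math


def learning_rate(step, warmup_steps, total_steps, lr_init=1e-4, lr_end=1e-6):
    """Linear warmup up to lr_init, then cosine decay down to lr_end."""
    if step < warmup_steps:
        return step / warmup_steps * lr_init
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_init - lr_end) * (1 + math.cos(progress * math.pi))


# Hypothetical step counts; the real values depend on dataset and batch size.
steps_per_epoch = 100
warmup_steps = 2 * steps_per_epoch   # WARMUP_EPOCHS = 2, as in the posted config
total_steps = 40 * steps_per_epoch   # hypothetical total, not taken from the config

for step in (0, 100, 200, 2100, 4000):
    print(step, learning_rate(step, warmup_steps, total_steps))
```

Under this kind of schedule the rate peaks right at the end of warmup, so lowering `LEARN_RATE_INIT` lowers the largest step the optimizer ever takes.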

Maybe try reducing the learning rate first. If that doesn't work, look for errors in your code and dataset.

qncsn2016 • Mar 16 '20 13:03
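While debugging, it can help to stop as soon as the loss turns non-finite instead of continuing to save `loss=nan` checkpoints. The following is a minimal, framework-independent sketch; the helper names `should_save_checkpoint` and `first_bad_epoch` are hypothetical and not part of the repository, and the loss values are made up for illustration.

```python
import math


def should_save_checkpoint(train_loss, test_loss):
    """Return True only when both epoch losses are finite numbers."""
    return math.isfinite(train_loss) and math.isfinite(test_loss)


def first_bad_epoch(history):
    """Return the first epoch whose train or test loss is NaN/inf, else None.

    `history` is a list of (epoch, train_loss, test_loss) tuples.
    """
    for epoch, train_loss, test_loss in history:
        if not should_save_checkpoint(train_loss, test_loss):
            return epoch
    return None


# Made-up loss values, for illustration only.
history = [(1, 6.27, 6.31), (2, 6.21, 6.25), (3, float("nan"), float("nan"))]
print(first_bad_epoch(history))
```

Knowing the first bad epoch narrows the search: a NaN at the very start points more towards data or configuration, while a NaN after many good epochs points more towards the learning rate or a rare bad sample.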

@juanmanuelrq Have you solved this problem? I'm training on the VOC dataset and I get test loss = NaN, while the train loss looks reasonable.

llmpass • Mar 20 '20 06:03

This indicates a problem with your train txt file. What format are you using? It should be Filepath x1,y1,x2,y2 with no headers. @llmpass @juanmanuelrq

Theriyadh • Apr 09 '20 05:04
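The format described above can be checked mechanically before training. Below is a rough sketch that assumes one image per line, whitespace-separated boxes, and an optional fifth value per box (for example a class id); that fifth value and the function name `check_annotation_line` are assumptions made for illustration. Image paths containing spaces would need different parsing.

```python
import math


def check_annotation_line(line):
    """Return a list of problems found in one annotation line (empty if none)."""
    parts = line.split()
    if len(parts) < 2:
        return ["expected an image path followed by at least one box"]
    problems = []
    for box in parts[1:]:
        fields = box.split(",")
        if len(fields) not in (4, 5):
            problems.append(f"{box!r}: expected 4 or 5 comma-separated values")
            continue
        try:
            values = [float(v) for v in fields]
        except ValueError:
            problems.append(f"{box!r}: non-numeric value")
            continue
        if not all(math.isfinite(v) for v in values):
            problems.append(f"{box!r}: NaN or infinite value")
            continue
        x1, y1, x2, y2 = values[:4]
        if x1 < 0 or y1 < 0:
            problems.append(f"{box!r}: negative coordinate")
        if x2 <= x1 or y2 <= y1:
            problems.append(f"{box!r}: zero or negative width/height")
    return problems


# Made-up lines for illustration; the first is well-formed, the others are not.
samples = [
    "images/0001.jpg 10,20,110,220",
    "images/0002.jpg 50,60,50,160",
    "Filepath x1,y1,x2,y2",
]
for line in samples:
    print(line, "->", check_annotation_line(line))
```

Zero-area boxes are worth flagging because box-overlap computations divide by box areas, which in some loss implementations can produce NaN.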

> This indicates a problem with your train txt file. What format are you using? It should be Filepath x1,y1,x2,y2 with no headers. @llmpass @juanmanuelrq

I got Train loss: nan, Test loss: nan at the very beginning of training, but the format of my train.txt is the same as you described.

MC1016 • Jul 22 '20 02:07

@juanmanuelrq Have you resolved the issue?

I have the same problem.

all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.2686-nan.ckpt-1"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.2071-nan.ckpt-2"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.1809-nan.ckpt-3"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.1537-nan.ckpt-4"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.1885-nan.ckpt-5"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=6.1779-nan.ckpt-6"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-7"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-8"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-9"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-10"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-11"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-12"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-13"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-14"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-15"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-16"
all_model_checkpoint_paths: "Pedestrian_yolov3_loss=nan-nan.ckpt-17"

I am training on one class, and the dataset is about 7000 images.

MuhammadAsadJaved • Aug 07 '20 03:08
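A checkpoint list like the one above can be scanned for the last usable checkpoint. This sketch assumes the two numbers after `loss=` are the train loss and the test loss, in that order, which the thread does not confirm; `parse_checkpoint` and `last_finite_train` are hypothetical helper names. Negative or exponent-formatted losses are not handled.

```python
import math
import re

# Assumed layout: <prefix>loss=<train>-<test>.ckpt-<epoch>
PATTERN = re.compile(r"loss=([^-]+)-(.+)\.ckpt-(\d+)")


def parse_checkpoint(name):
    """Return (epoch, train_loss, test_loss), or None if the name does not match."""
    match = PATTERN.search(name)
    if match is None:
        return None
    return int(match.group(3)), float(match.group(1)), float(match.group(2))


def last_finite_train(names):
    """Return the parsed entry with the highest epoch whose train loss is finite."""
    best = None
    for name in names:
        parsed = parse_checkpoint(name)
        if parsed is not None and math.isfinite(parsed[1]):
            if best is None or parsed[0] > best[0]:
                best = parsed
    return best


# Three names copied from the list posted above.
names = [
    "Pedestrian_yolov3_loss=6.1885-nan.ckpt-5",
    "Pedestrian_yolov3_loss=6.1779-nan.ckpt-6",
    "Pedestrian_yolov3_loss=nan-nan.ckpt-7",
]
print(last_finite_train(names))
```

Read this way, the first value stays finite through checkpoint 6 while the second is nan from checkpoint 1 onwards, which would suggest two separate problems rather than one.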

I read the following issues and solved the problem:
https://github.com/YunYang1994/tensorflow-yolov3/issues/294
https://github.com/YunYang1994/tensorflow-yolov3/issues/350
https://github.com/YunYang1994/tensorflow-yolov3/issues/170
https://github.com/YunYang1994/tensorflow-yolov3/issues/149

qncsn2016 • Aug 07 '20 04:08

@qncsn2016 Thank you so much.

MuhammadAsadJaved • Aug 07 '20 07:08

> @juanmanuelrq Have you solved this problem? I'm training on the VOC dataset and I get test loss = NaN, while the train loss looks reasonable.

Hello, I ran into the same problem. I wanted to train on the VOC dataset from scratch instead of starting from the COCO pre-trained weights. Test_loss = nan appeared in the first epoch when I retrained on VOC. How did you solve the problem? Thank you very much.

yjsdut • Dec 01 '20 02:12