Loss is NaN while training both baseline and ASFF (batch size 16 on 4 V100s)

Open kingthreestones opened this issue 2 years ago • 0 comments

Hello, I am having trouble with training: the loss turns to NaN. I am training both the baseline and ASFF on 4 V100s with a batch size of 16, as in your paper. Here is my command:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=10266 main.py --cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 4 --checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --log_dir log/COCO -s 608

The cfg: (image)

The tensorboard: (image)

The log: (image)

Please help me! Thank you!

kingthreestones, Jul 05 '22 02:07