ASFF
Loss becomes NaN when training both the baseline and ASFF (batch size 16 on 4 V100s)
Hello, I'm having trouble with training: the loss turns into NaN. I am training both the baseline and ASFF on 4 V100 GPUs with a batch size of 16, as described in your paper. Here is my command:
python -m torch.distributed.launch --nproc_per_node=4 --master_port=10266 main.py --cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 4 --checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --log_dir log/COCO -s 608
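To narrow down when the loss diverges, one option is to scan the per-iteration loss values for the first one that is not finite. The following is a minimal sketch that assumes the loss values have already been pulled out of the training log as a list of floats. The helper name `first_nonfinite_step` is hypothetical and is not part of the ASFF code.

```python
import math
from typing import Iterable, Optional


def first_nonfinite_step(losses: Iterable[float]) -> Optional[int]:
    """Return the index of the first loss that is NaN or +/-inf, or None if all are finite.

    Hypothetical helper: it assumes the losses were already extracted from the log.
    """
    for step, value in enumerate(losses):
        # math.isfinite returns False for NaN, inf and -inf.
        if not math.isfinite(value):
            return step
    return None


if __name__ == "__main__":
    # Made-up example values; replace them with numbers taken from the training log.
    example = [12.3, 9.8, 7.1, float("nan"), float("nan")]
    print(first_nonfinite_step(example))  # prints 3
```

Knowing the exact iteration makes it easier to line up the failure with the learning-rate schedule or with a specific batch.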
The cfg:
The TensorBoard output:
The log:
Please help me! Thank you!