
EfficientDet Training loss=nan

Open · emitoooo opened this issue 5 years ago · 24 comments

I'm training a custom model with EfficientDet D0, but after 700 steps the loss becomes NaN. Has anyone else run into this? TensorFlow 2.3.0 on a GTX 1060 with CUDA 10.1.

Here's my training overview: [screenshot]

I am using the default config file parameters: image size 512x512 and batch_size 1.

emitoooo avatar Aug 25 '20 17:08 emitoooo

I have the same problem with EfficientDet D4.

TolgaBkm avatar Aug 27 '20 08:08 TolgaBkm

One of my viewers had the same problem; changing the batch size from 1 to 4 solved it.

TannerGilbert avatar Aug 27 '20 13:08 TannerGilbert

I also have the same problem. I've tried D1 and D2. [screenshot] It also seems that the network is not learning at all. I have a custom .tfrecord dataset; the config file is attached.

pipeline_d2.config.txt

koolvn avatar Aug 28 '20 05:08 koolvn

I have also tried increasing the batch size to 4, but I still got loss=NaN after around 50k steps of training.

TolgaBkm avatar Aug 28 '20 11:08 TolgaBkm

Try reducing the learning rate; that should solve the problem. These models are usually trained with a large batch size, e.g. 128 for EfficientDet D3. When you lower that parameter (e.g. 128 -> 4), make sure to also lower the learning rate, since your gradients are a lot noisier. I would divide the learning rate by the same factor.
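
As a rough sketch of that adjustment (the numbers are illustrative, not taken from any config attached in this thread; the TF2 model zoo EfficientDet configs ship with a learning_rate_base around 0.08 for batch size 128, but verify against your own file), the learning-rate block in pipeline.config would be scaled like this:

```
optimizer {
  momentum_optimizer {
    learning_rate {
      cosine_decay_learning_rate {
        learning_rate_base: 0.0025     # e.g. 0.08 * (4 / 128)
        total_steps: 300000
        warmup_learning_rate: 0.00025  # scaled by the same factor; keep it below learning_rate_base
        warmup_steps: 2500
      }
    }
    momentum_optimizer_value: 0.9
  }
  use_moving_average: false
}
```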

geeheim avatar Aug 28 '20 12:08 geeheim

> Try reducing the learning rate; that should solve the problem. These models are usually trained with a large batch size, e.g. 128 for EfficientDet D3. When you lower that parameter (e.g. 128 -> 4), make sure to also lower the learning rate, since your gradients are a lot noisier. I would divide the learning rate by the same factor.

@geeheim Thank you for the advice! I never thought the learning rate could cause a problem like that. Now I'm training EfficientDet D1 on a GTX 1080 Ti with batch size = 6 and LR = 0.003 with warmup. It looks promising. [screenshot]

Here's my pipeline config in case anyone ever needs it.

koolvn avatar Aug 28 '20 15:08 koolvn

> Try reducing the learning rate; that should solve the problem. These models are usually trained with a large batch size, e.g. 128 for EfficientDet D3. When you lower that parameter (e.g. 128 -> 4), make sure to also lower the learning rate, since your gradients are a lot noisier. I would divide the learning rate by the same factor.
>
> @geeheim Thank you for the advice! I never thought the learning rate could cause a problem like that. Now I'm training EfficientDet D1 on a GTX 1080 Ti with batch size = 6 and LR = 0.003 with warmup. It looks promising.
>
> Here's my pipeline config in case anyone ever needs it.

I am getting a similar issue right now; I reduced the learning rate and training seems to be working well. Those blue lines in the TensorBoard logs, are they test metrics? How do I get them while training? I am only able to get training metrics (I am using COCO metrics). Also, after training I only get a single evaluation point. How do I get this graph?

aafaqin avatar Oct 12 '20 11:10 aafaqin

@aafaqin If you're using TF2, you should run a separate evaluation process to get those blue lines in TensorBoard. Use something like this:

# eval
python models/research/object_detection/model_main_tf2.py --pipeline_config_path=$PIPELINE_CONFIG --model_dir=$MODEL_DIR --checkpoint_dir=$MODEL_DIR --alsologtostderr

koolvn avatar Oct 14 '20 10:10 koolvn

> @aafaqin If you're using TF2, you should run a separate evaluation process to get those blue lines in TensorBoard. Use something like this:
>
> # eval
> python models/research/object_detection/model_main_tf2.py --pipeline_config_path=$PIPELINE_CONFIG --model_dir=$MODEL_DIR --checkpoint_dir=$MODEL_DIR --alsologtostderr

@koolvn I am still not able to see those blue lines. Do they appear only after a certain number of steps or epochs? Also, which logdir should I use for TensorBoard, /train or /eval? I can see two log directories.

adityap27 avatar Nov 23 '20 04:11 adityap27

@adityap27 Are you running the evaluation task in parallel with training? Yes, it evaluates every 1000 steps, as far as I can remember. You also have to specify the evaluation dataset in your config file. [screenshot]
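
For reference, a minimal sketch of what those blocks in pipeline.config typically look like (the paths are placeholders, not the values from the screenshot):

```
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}

eval_input_reader {
  label_map_path: "PATH_TO_BE_CONFIGURED/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/val.record"
  }
}
```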

koolvn avatar Nov 23 '20 06:11 koolvn

> @aafaqin Are you running the evaluation task in parallel with training? Yes, it evaluates every 1000 steps, as far as I can remember. You also have to specify the evaluation dataset in your config file.

My config file matches yours, and it's the same for me: it evaluates every 1000 steps. I ran the eval job after training finished (serially); I will try running it in parallel. Also, currently with /train as the logdir I get the training loss in TensorBoard, and with /eval I only get the eval mAP; I can't get the eval loss. So which logdir did you use with TensorBoard? /train?

adityap27 avatar Nov 23 '20 06:11 adityap27

@adityap27 You have to run the eval task in parallel (I also tried running the eval task after training finished, but it gave me results only for the last step). I actually use $MODEL_DIR as the --logdir in TensorBoard; /eval and /train are subfolders of $MODEL_DIR.

koolvn avatar Nov 23 '20 07:11 koolvn

> /eval and /train are subfolders of $MODEL_DIR

Thanks @koolvn! Running the eval job in parallel and using $MODEL_DIR as the --logdir in TensorBoard both worked for me; I get nice graphs now.

adityap27 avatar Nov 23 '20 11:11 adityap27

@koolvn What was the size of your dataset? I am training my model in Colab and getting a loss of 0.036 with the same configuration mentioned above, after 5000 steps. But when I visualize the output, it's not able to detect anything or draw bounding boxes. Any help?

Rishav-hub avatar Dec 28 '20 09:12 Rishav-hub

@Rishav-hub Sorry, I can't remember the size of my dataset. 5000 steps is probably not enough; try training for 50k steps. Also use a validation dataset to monitor metrics, especially recall.

koolvn avatar Dec 28 '20 10:12 koolvn

> @Rishav-hub Sorry, I can't remember the size of my dataset. 5000 steps is probably not enough; try training for 50k steps. Also use a validation dataset to monitor metrics, especially recall.

But the loss also stops decreasing after 8k steps; it stays between 0.04 and 0.03.

Rishav-hub avatar Dec 28 '20 10:12 Rishav-hub

I'm trying to train EfficientDet D0 512x512, but it doesn't seem to converge.

batch_size: 4
learning_rate_base: 0.00249999994412064
total_steps: 300000
warmup_learning_rate: 0.0010000000474974513
warmup_steps: 2500

[screenshot]

any tips?

dademiller360 avatar Feb 08 '21 07:02 dademiller360

@dademiller360 Seems like the learning rate is too high for your batch size. Try dividing the LR by 10 or 100.

koolvn avatar Feb 08 '21 07:02 koolvn

> @dademiller360 Seems like the learning rate is too high for your batch size. Try dividing the LR by 10 or 100.

Thank you @koolvn, dividing the LR by 10 seems to have solved it. [screenshot]

dademiller360 avatar Feb 12 '21 09:02 dademiller360

> Try reducing the learning rate; that should solve the problem. These models are usually trained with a large batch size, e.g. 128 for EfficientDet D3. When you lower that parameter (e.g. 128 -> 4), make sure to also lower the learning rate, since your gradients are a lot noisier. I would divide the learning rate by the same factor.

How do you know that 128 is optimal for D3? How many images for D0? I have only 28 examples in my whole dataset, and I'm using batch_size=1 and learning_rate=0.001. The loss jumps a lot and doesn't converge at all.

BogoK avatar Apr 24 '21 20:04 BogoK

> How do you know that 128 is optimal for D3? How many images for D0? I have only 28 examples in my whole dataset, and I'm using batch_size=1 and learning_rate=0.001. The loss jumps a lot and doesn't converge at all.

Usually we want the batch size to be as big as possible. In your case the loss jumps and the algorithm doesn't converge because your learning rate is very large relative to your small batch size. Try a smaller LR, something like 1e-6 or even less. The other option is to increase the batch size and augment your data.

Here's a good explanation
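
If you go the augmentation route, this is roughly how augmentation is declared in train_config; the options and values below mirror what the stock EfficientDet configs use, but treat them as an illustrative sketch and adapt them to your own pipeline file:

```
train_config {
  # ... other train_config fields left unchanged ...
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_scale_crop_and_pad_to_square {
      output_size: 512
      scale_min: 0.1
      scale_max: 2.0
    }
  }
}
```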

koolvn avatar Apr 25 '21 09:04 koolvn

> How do you know that 128 is optimal for D3? How many images for D0? I have only 28 examples in my whole dataset, and I'm using batch_size=1 and learning_rate=0.001. The loss jumps a lot and doesn't converge at all.
>
> Usually we want the batch size to be as big as possible. In your case the loss jumps and the algorithm doesn't converge because your learning rate is very large relative to your small batch size. Try a smaller LR, something like 1e-6 or even less. The other option is to increase the batch size and augment your data.
>
> Here's a good explanation

Thank you. If I go lower than 0.001, my mAP is 0 for each of the learning rates I tested. What effect does the "steps" parameter have on training?

BogoK avatar Apr 25 '21 18:04 BogoK

Same problem here, and it was solved when the LR was reduced to 1e-5. My batch_size is 4 and it trains on 4 GPUs.

lihe07 avatar Jan 02 '22 12:01 lihe07

Check this tutorial: link

jidd20118 avatar Jun 23 '24 18:06 jidd20118