
NaN loss during training.

Open · hongrui16 opened this issue on May 13, 2020 · 6 comments

Use standard file utilities to get mtimes.
ERROR:tensorflow:Model diverged with loss = NaN.
E0513 16:02:25.483811 139845663950592 basic_session_run_hooks.py:760] Model diverged with loss = NaN.
ERROR:tensorflow:Error recorded from training_loop: NaN loss during training.
E0513 16:02:25.939540 139845663950592 error_handling.py:75] Error recorded from training_loop: NaN loss during training.
WARNING:tensorflow:Reraising captured error
W0513 16:02:25.939864 139845663950592 error_handling.py:135] Reraising captured error

I changed the following parameters: h.learning_rate = 0.08 => 0.001 and h.lr_warmup_init = 0.008 => 0.0001.

It did not work.

@fsx950223 I use the latest version of the master branch.

hongrui16 avatar May 13 '20 16:05 hongrui16
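For anyone trying the same change, here is a minimal sketch of lowering those two hyperparameters programmatically. It assumes the repo's hparams_config module and its get_efficientdet_config helper; the model name is only an example.

```python
# Hedged sketch: lower the learning-rate schedule before building the model.
# Assumes google/automl EfficientDet's hparams_config module is importable.
import hparams_config

config = hparams_config.get_efficientdet_config('efficientdet-d0')  # example model
config.learning_rate = 0.001    # default: 0.08
config.lr_warmup_init = 0.0001  # default: 0.008
print(config.learning_rate, config.lr_warmup_init)

# The same override can usually be passed on the training command line, e.g.
#   --hparams="learning_rate=0.001,lr_warmup_init=0.0001"
```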

Estimator Dump Hook could help you.

fsx950223 avatar May 14 '20 01:05 fsx950223

Thank you. @fsx950223 What did you mean by 'Estimator Dump Hook could help you.'? Could you explain more specifically? P.S. I use TF 1.15.

hongrui16 avatar May 14 '20 01:05 hongrui16

Refer to the TF1 debug tutorial.

fsx950223 avatar May 14 '20 01:05 fsx950223
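A minimal sketch of what that TF1 debugging workflow looks like for an Estimator (TF 1.15, tfdbg hooks). The toy model_fn and input_fn below are placeholders for illustration, not the EfficientDet ones.

```python
# Hedged sketch: attach tfdbg hooks to a tf.estimator.Estimator (TF 1.x)
# so you can find the first tensor that goes inf/NaN.
import tensorflow as tf
from tensorflow.python import debug as tf_debug


def model_fn(features, labels, mode):
    # Toy regression; stands in for the real detection model_fn.
    loss = tf.reduce_mean(tf.square(features['x'] - labels))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)


def input_fn():
    # Toy dataset; stands in for the real input pipeline.
    return tf.data.Dataset.from_tensors(({'x': [1.0]}, [2.0])).repeat(64).batch(4)


estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir='/tmp/dbg_model')

# Option 1: dump all watched tensors to disk for offline inspection.
dump_hook = tf_debug.DumpingDebugHook('/tmp/tfdbg_dumps')
estimator.train(input_fn=input_fn, hooks=[dump_hook])

# Option 2: interactive CLI; inside it, `run -f has_inf_or_nan` stops at the
# first step that produces an inf/NaN tensor.
# cli_hook = tf_debug.LocalCLIDebugHook()
# estimator.train(input_fn=input_fn, hooks=[cli_hook])
```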

Hi, I met the same issue when training D2 on a custom dataset with batch_size == 4. I tried h.learning_rate = 0.08 => 0.001/0.01 and h.lr_warmup_init = 0.008 => 0.0001/0.001, but it did not work. Have you solved it?

elv-xuwen avatar May 21 '20 22:05 elv-xuwen

Check your bounding box annotations to see if there are zero (or even negative) area boxes, which might have been created by mistake.

Several months ago, I encountered this NaN loss error when I used the TensorFlow Object Detection API, and eventually found that a few of my bounding boxes had zero area. This kind of error is hard to spot with annotation tools like labelImg. My practice is to gather all XML annotations into CSV file(s) (or, equivalently, a DataFrame) and check with, for example, (df['area'].values > 0).all()

wenh06 avatar May 24 '20 11:05 wenh06
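A minimal sketch of that sanity check, assuming Pascal VOC-style XML annotations; the directory path and field names are only examples.

```python
# Hedged sketch: collect box areas from VOC-style XML files and flag any
# zero- or negative-area boxes, which can trigger NaN losses during training.
import glob
import xml.etree.ElementTree as ET

import pandas as pd

rows = []
for xml_path in glob.glob('annotations/*.xml'):  # example path
    root = ET.parse(xml_path).getroot()
    for obj in root.findall('object'):
        box = obj.find('bndbox')
        xmin, ymin = float(box.find('xmin').text), float(box.find('ymin').text)
        xmax, ymax = float(box.find('xmax').text), float(box.find('ymax').text)
        rows.append({'file': xml_path,
                     'label': obj.find('name').text,
                     'area': (xmax - xmin) * (ymax - ymin)})

df = pd.DataFrame(rows)
bad = df[df['area'] <= 0]
print('all boxes valid:', (df['area'].values > 0).all())
print(bad)  # inspect (and fix or drop) any degenerate boxes
```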

Depending on your GPU, try setting mixed_precision: false in the config YAML file. This fixed it for me on Google Colab, which does not seem to provide GPUs with compute capability >= 7.0 and therefore does not normally benefit from mixed precision anyway. More info: https://www.tensorflow.org/guide/mixed_precision?hl=en

landskris avatar May 03 '21 16:05 landskris
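A hedged sketch of checking whether the runtime GPU actually supports mixed precision before leaving it enabled; it assumes TF >= 2.3 for get_device_details, and the config-key name follows the comment above.

```python
# Hedged sketch: query the GPU's compute capability (the mixed-precision guide
# recommends >= 7.0) and decide whether to keep mixed_precision enabled.
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
details = tf.config.experimental.get_device_details(gpus[0]) if gpus else {}
cc = details.get('compute_capability')  # e.g. (7, 5) for a T4, (3, 7) for a K80
use_mixed_precision = cc is not None and cc >= (7, 0)
print('compute capability:', cc, '-> mixed_precision:', use_mixed_precision)

# If this prints False, set `mixed_precision: false` in your config YAML
# (or the equivalent hparams override) before training.
```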