
NanLossDuringTrainingError: NaN loss during training.

Open wonchulSon opened this issue 3 years ago • 0 comments

INFO:tensorflow:Restoring parameters from /root/mobile-deeplab-v3-plus/datasets/people_segmentation/exp/deeplab-v3-plus/train/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /root/mobile-deeplab-v3-plus/datasets/people_segmentation/exp/deeplab-v3-plus/train/model.ckpt.
INFO:tensorflow:loss = 1.0444846, step = 0
INFO:tensorflow:cross_entropy = 0.6935922, learning_rate = 1e-04, total_loss = 1.0444846, train_mean_iou = 0.32585338, train_pixel_accuracy = 0.4993496
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "run.py", line 552, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run.py", line 539, in main
    train()
  File "run.py", line 433, in train
    tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    return self
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/context.py", line 357, in _mode
    yield
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 5229, in get_controller
    yield g
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 5037, in get_controller
    yield default
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 5229, in get_controller
    yield g
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/context.py", line 357, in _mode
    yield
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 5229, in get_controller
    yield g
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 4247, in device
    yield
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1320, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 753, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

I am struggling to run this code on a person segmentation dataset. Training fails with the 'NaN loss during training' error shown above, even though I did not edit any parameters. How can I fix it?
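One thing that may be worth ruling out, since NaN loss in segmentation training is often caused by label masks whose pixel values fall outside the expected class ids: a minimal diagnostic sketch follows. It is not a confirmed fix for this repository. The helper `find_invalid_labels` is hypothetical, and `num_classes=2` and `ignore_label=255` are assumptions about a binary person/background setup, not values taken from `run.py`.

```python
from collections import Counter


def find_invalid_labels(mask, num_classes=2, ignore_label=255):
    """Count pixel values in a 2-D label mask that are not valid class ids.

    mask is a nested list of ints (rows of pixels). num_classes and
    ignore_label are assumed values, not read from the training config.
    """
    counts = Counter(value for row in mask for value in row)
    return {
        value: n
        for value, n in counts.items()
        if not (0 <= value < num_classes) and value != ignore_label
    }


# Hypothetical mask that stores foreground as 128 instead of class id 1.
mask = [
    [0, 0, 128],
    [0, 128, 128],
]
invalid = find_invalid_labels(mask)
print(invalid)  # {128: 3}
```

If the result is non-empty for the masks in the dataset, remapping those values to valid class ids before building the training records would be the first thing to try; if it is empty, the cause is more likely elsewhere (for example the learning rate or the input normalization).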

wonchulSon, Apr 01 '21 06:04