
NaN loss during training (again)

Open gchlodzinski opened this issue 4 years ago • 3 comments

Hi, training the SMALL model works fine, but the BASE model ends up with a NaN loss. I tried decreasing the learning rate to 1e-4, but it did not help (and it could not, since the error happens during the warmup phase, when the learning rate is still very low). The failure can occur randomly within the first few steps (even after the first one). Please advise. Here is my training log:

26/1000000 = 0.0%, SPS: 0.4, ELAP: 1:05, ETA: 28 days, 21:44:53 - loss: 47.1119
27/1000000 = 0.0%, SPS: 0.4, ELAP: 1:06, ETA: 28 days, 10:18:42 - loss: 46.3502
28/1000000 = 0.0%, SPS: 0.4, ELAP: 1:08, ETA: 27 days, 23:43:51 - loss: 46.1481
29/1000000 = 0.0%, SPS: 0.4, ELAP: 1:09, ETA: 27 days, 13:46:58 - loss: 45.7326
30/1000000 = 0.0%, SPS: 0.4, ELAP: 1:10, ETA: 27 days, 4:30:34 - loss: 45.5664
31/1000000 = 0.0%, SPS: 0.4, ELAP: 1:12, ETA: 26 days, 19:48:18 - loss: 45.1209
32/1000000 = 0.0%, SPS: 0.4, ELAP: 1:13, ETA: 26 days, 11:41:27 - loss: 44.8707
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Error recorded from training_loop: NaN loss during training.
Traceback (most recent call last):
  File "run_pretraining.py", line 385, in <module>
    main()
  File "run_pretraining.py", line 381, in main
    args.model_name, args.data_dir, **hparams))
  File "run_pretraining.py", line 344, in train_or_eval
    max_steps=config.num_train_steps)
  File "/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/six.py", line 703, in reraise
    raise value
  File "/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/six.py", line 703, in reraise
    raise value
  File "/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/tensorflow_core/python/training/monitored_session.py", line 1426, in run
    run_metadata=run_metadata))
  File "/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
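For reference, a minimal sketch of how learning-rate and warmup overrides can be passed to run_pretraining.py through its --hparams JSON argument. The hparam names below (model_size, learning_rate, num_warmup_steps) and all paths/names are assumptions for illustration, so check them against configure_pretraining.py in your checkout:

# Sketch only: launch BASE pretraining with a lower learning rate and a longer
# warmup. Hparam names and placeholder paths are assumptions; verify against
# configure_pretraining.py in the electra repo before running.
import json
import subprocess

hparams = {
    "model_size": "base",       # assumed hparam name for the BASE configuration
    "learning_rate": 1e-4,      # the value already tried in this report
    "num_warmup_steps": 20000,  # assumed name; a longer warmup sometimes delays or avoids divergence
}

subprocess.run(
    [
        "python3", "run_pretraining.py",
        "--data-dir", "/path/to/pretrain_tfrecords",  # placeholder path
        "--model-name", "electra_base_debug",         # placeholder model name
        "--hparams", json.dumps(hparams),
    ],
    check=True,
)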

gchlodzinski commented Jun 11 '20 19:06

I submitted the same issue before (#36), and I haven't found a solution. I think there may be a numerically unstable function in the code.
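If it is a numerically unstable op, one way to narrow it down is to make the graph fail at the first tensor that produces a NaN/Inf instead of at the generic NaN-loss hook. A minimal sketch, assuming TF 1.15-style graph mode on CPU/GPU (it may not work inside a TPU training loop); guarded_loss is a hypothetical helper, not part of the repo, that you would apply to the total loss inside model_fn:

# Minimal sketch, assuming TF 1.15 graph mode; guarded_loss is hypothetical.
import tensorflow.compat.v1 as tf

def guarded_loss(loss):
    # Raise immediately (with this message) if the total loss itself is NaN/Inf.
    loss = tf.debugging.check_numerics(loss, message="total_loss")
    # Also attach checks to every floating-point tensor in the current graph,
    # so the error names the op that first produced the bad value.
    # This is slow, so only enable it for a few debugging steps.
    check_op = tf.add_check_numerics_ops()
    with tf.control_dependencies([check_op]):
        return tf.identity(loss)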

tomohideshibata commented Jun 12 '20 15:06

It still exists....

snowood1 commented May 03 '21 07:05

Did you use the openwebtext dataset or a custom one? @gchlodzinski

RuanVisser commented Sep 15 '22 12:09