electra
NaN loss during training (again)
Hi, training the SMALL model works fine, but the BASE model ends up with a NaN loss. I tried decreasing the learning rate to 1e-4, but that did not help (and it could not, since the divergence happens during the warmup phase, while the learning rate is still very low). It can occur randomly within the first couple of steps (even after the first one). Please advise. Here is my training log (a debugging sketch follows it):
26/1000000 = 0.0%, SPS: 0.4, ELAP: 1:05, ETA: 28 days, 21:44:53 - loss: 47.1119
27/1000000 = 0.0%, SPS: 0.4, ELAP: 1:06, ETA: 28 days, 10:18:42 - loss: 46.3502
28/1000000 = 0.0%, SPS: 0.4, ELAP: 1:08, ETA: 27 days, 23:43:51 - loss: 46.1481
29/1000000 = 0.0%, SPS: 0.4, ELAP: 1:09, ETA: 27 days, 13:46:58 - loss: 45.7326
30/1000000 = 0.0%, SPS: 0.4, ELAP: 1:10, ETA: 27 days, 4:30:34 - loss: 45.5664
31/1000000 = 0.0%, SPS: 0.4, ELAP: 1:12, ETA: 26 days, 19:48:18 - loss: 45.1209
32/1000000 = 0.0%, SPS: 0.4, ELAP: 1:13, ETA: 26 days, 11:41:27 - loss: 44.8707
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Error recorded from training_loop: NaN loss during training.
Traceback (most recent call last):
File "run_pretraining.py", line 385, in
I submitted the same issue before (#36), and I still haven't found a solution. I suspect there is a numerically unstable function somewhere in the code.
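If it is a stability issue, one classic culprit (a guess on my part, not something I have confirmed in this codebase) is computing `log(softmax(x))` by hand, which underflows for large logits. The log-sum-exp trick, or the fused `tf.nn.log_softmax`, avoids it:

```python
import tensorflow as tf

def naive_log_softmax(logits):
    # softmax(x) can round to exactly 0.0 for large logits, and
    # log(0.0) = -inf, which then turns into NaN in the gradients.
    return tf.math.log(tf.nn.softmax(logits))

def stable_log_softmax(logits):
    # Log-sum-exp trick: log softmax is shift-invariant, so subtracting
    # the max keeps every exp() in a safe numerical range.
    shifted = logits - tf.reduce_max(logits, axis=-1, keepdims=True)
    return shifted - tf.math.log(
        tf.reduce_sum(tf.exp(shifted), axis=-1, keepdims=True))

logits = tf.constant([[1000.0, 0.0]])  # extreme logit triggers underflow
print(naive_log_softmax(logits))       # [[0., -inf]] -> poisons the loss
print(stable_log_softmax(logits))      # [[0., -1000.]], finite
print(tf.nn.log_softmax(logits))       # fused op, also stable
```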
It still exists...
Did you use the openwebtext dataset or a custom one? @gchlodzinski