seq2seq-couplet

NaN error after running to 660900

Open fjibj opened this issue 5 years ago • 3 comments

2019-08-06 00:52:38.476421: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7fb3e960d900 = {0, 1} Found Inf or NaN global norm.
Traceback (most recent call last):
  File "/root/anaconda3/envs/fjpy36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/root/anaconda3/envs/fjpy36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/anaconda3/envs/fjpy36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had Inf values
	 [[{{node VerifyFinite/CheckNumerics}} = CheckNumericsT=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
	 [[{{node clip_by_global_norm/mul_1/_301}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2818_clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

fjibj, Aug 06 '19 05:08
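For context, the message "Found Inf or NaN global norm." comes from the finiteness check that TensorFlow 1.x performs inside `tf.clip_by_global_norm`: the global norm is the square root of the sum of squared gradient entries, so a single Inf or NaN gradient makes the whole norm non-finite. The sketch below is a plain-Python illustration of that behaviour, not the TensorFlow implementation and not this repository's code; the function names mirror the TensorFlow op only for readability, and gradients are modelled as plain lists of floats (an assumption made for the example).

```python
import math


def global_norm(grads):
    """Square root of the sum of squared entries over all gradient lists."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients so that their global norm is at most clip_norm.

    Raises ValueError if the global norm is Inf or NaN, which is the
    situation reported in the traceback above.
    """
    norm = global_norm(grads)
    if not math.isfinite(norm):
        raise ValueError("Found Inf or NaN global norm.")
    # Leaves gradients unchanged when norm <= clip_norm.
    scale = clip_norm / max(norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, norm


if __name__ == "__main__":
    clipped, norm = clip_by_global_norm([[3.0], [4.0]], clip_norm=1.0)
    print(norm, clipped)
    try:
        clip_by_global_norm([[1.0], [float("inf")]], clip_norm=5.0)
    except ValueError as err:
        print("error:", err)
```

Because clipping happens only after the norm has been computed, it cannot rescue a step whose gradients have already overflowed; the check simply reports it.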

Same question. Has anyone solved this?

milk-bottle-liyu, Oct 11 '19 02:10

I hit the same problem after running to 696100. Does anyone know how to fix it?

2020-02-11 16:21:58.843161: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f2456e15a00 = {0, 1} Found Inf or NaN global norm.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had Inf values
	 [[{{node VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm/global_norm)]]
	 [[{{node clip_by_global_norm/mul_1/_159}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2818_clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

HavenTong, Feb 12 '20 03:02

I have occasionally run into the same problem as well. The usual workaround is to resume training from a checkpoint.

wb14123, May 05 '22 14:05
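The workaround mentioned in the last comment (resuming from a checkpoint) can be sketched as a loop that keeps the last good state and rolls back to it when a step produces a non-finite loss. This is a framework-free illustration under stated assumptions, not the repository's training code: `train_with_rollback`, `step_fn`, `checkpoint_every`, and `max_rollbacks` are all hypothetical names, and the in-memory copy stands in for whatever checkpoint files a real run would write and restore.

```python
import copy
import math


def train_with_rollback(state, step_fn, num_steps,
                        checkpoint_every=100, max_rollbacks=3):
    """Run step_fn for num_steps, resuming from the last checkpoint on NaN/Inf.

    step_fn(state, step) is a hypothetical callable returning
    (new_state, loss). Returns the final state and the rollback count.
    """
    saved_step, saved_state = 0, copy.deepcopy(state)
    rollbacks = 0
    step = 0
    while step < num_steps:
        new_state, loss = step_fn(state, step)
        if not math.isfinite(loss):
            if rollbacks >= max_rollbacks:
                raise RuntimeError("too many non-finite losses; giving up")
            rollbacks += 1
            # Discard the bad step and resume from the last checkpoint.
            step, state = saved_step, copy.deepcopy(saved_state)
            continue
        state = new_state
        step += 1
        if step % checkpoint_every == 0:
            saved_step, saved_state = step, copy.deepcopy(state)
    return state, rollbacks


if __name__ == "__main__":
    seen_failure = []

    def flaky_step(state, step):
        # Simulate one transient NaN at step 7.
        if step == 7 and not seen_failure:
            seen_failure.append(step)
            return state, float("nan")
        return {"w": state["w"] + 1.0}, 1.0

    print(train_with_rollback({"w": 0.0}, flaky_step, 10, checkpoint_every=5))
```

The `max_rollbacks` limit is there because a rollback only helps when the failure is transient; if the same step fails every time, the loop stops instead of retrying forever.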