BERT-BiLSTM-CRF-NER

Found Inf or NaN global norm

Open hanyaqian opened this issue 5 years ago • 7 comments

I keep running into "Found Inf or NaN global norm". What should I do?

INFO:tensorflow:Saving checkpoints for 0 into ./output/result_dir/model.ckpt.
2019-04-01 11:26:15.232850: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f1ba9624600 = {1, 0} Found Inf or NaN global norm.
INFO:tensorflow:Error recorded from training_loop: Found Inf or NaN global norm. : Tensor had NaN values
   [[node VerifyFinite/CheckNumerics (defined at /disk1/hanyaqian/code/work15_bert_cpr/youdao_cpr/bert/optimization.py:74)  = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm/global_norm)]]

Caused by op u'VerifyFinite/CheckNumerics', defined at:
  File "run_classifier_cpr.py", line 785, in <module>
    tf.app.run()
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_classifier_cpr.py", line 712, in main
    estimator.train(input_fn=train_input_fn, max_steps=next_checkpoint)
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
    saving_listeners=saving_listeners
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2195, in _call_model_fn
    features, labels, mode, config)
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)

hanyaqian avatar Apr 01 '19 03:04 hanyaqian

It looks like you are not running my code directly, and it is unclear what you changed, so I can't tell what the problem is.

macanv avatar Apr 01 '19 05:04 macanv

One more thing: I have not tested this in a Python 2.7 environment.

macanv avatar Apr 01 '19 05:04 macanv

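I ran into the same problem (though not in the same program). I have been debugging with tfdbg, which helps locate NaN values in a program, and it turned out that a spot I had overlooked was dividing 0 by 0, which produced the NaN. Hope this helps.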

njusq avatar Apr 14 '19 13:04 njusq
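
The 0/0 failure mode described above often shows up when averaging over a mask that happens to be all zeros. The sketch below is a plain-Python illustration under that assumption, not code from this repository; `masked_mean` and `safe_masked_mean` are hypothetical names. Plain Python raises ZeroDivisionError on 0.0 / 0.0, whereas IEEE-754 tensor arithmetic yields NaN, so the unguarded version imitates the NaN explicitly.

```python
import math


def masked_mean(values, mask):
    """Unguarded masked average: returns NaN when the mask is all zeros.

    Plain Python would raise ZeroDivisionError on 0.0 / 0.0, so the NaN that
    IEEE-754 tensor arithmetic produces is imitated explicitly here.
    """
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    if count == 0:
        return float("nan")  # what 0.0 / 0.0 evaluates to in tensor math
    return total / count


def safe_masked_mean(values, mask, eps=1e-8):
    """Guarded masked average: the denominator is never smaller than eps."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / max(count, eps)


losses = [0.5, 1.5, 2.0]
empty_mask = [0.0, 0.0, 0.0]  # e.g. a batch in which every position is padding
print(math.isnan(masked_mean(losses, empty_mask)))  # True
print(safe_masked_mean(losses, empty_mask))         # 0.0
```

Clamping the denominator is only one possible guard; whether an empty mask should yield 0 or be skipped entirely depends on the model.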

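It runs fine for me on TensorFlow 1.9, but on TensorFlow 1.13 it keeps reporting "Found Inf or NaN global norm". Apart from changing the file paths, I made no other changes to the code, which is strange. The error points to the line (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0). I adjusted learning_rate and it still fails.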

geibeile avatar Mar 18 '20 12:03 geibeile
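
To make the error message easier to interpret, here is a minimal plain-Python sketch of what clipping by global norm computes. It is a reimplementation for illustration, not TensorFlow's code: gradients are modelled as lists of floats, the ValueError only imitates the check seen in the log, and `non_finite_indices` is a hypothetical helper. It shows that a single NaN or Inf entry in any gradient is enough to make the global norm non-finite.

```python
import math


def global_norm(grads):
    """Square root of the sum of squares over every element of every gradient."""
    return math.sqrt(sum(x * x for g in grads for x in g))


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by clip_norm / max(global_norm, clip_norm)."""
    norm = global_norm(grads)
    if not math.isfinite(norm):
        # Imitates the failure in the log: a non-finite norm cannot be clipped.
        raise ValueError("Found Inf or NaN global norm.")
    scale = clip_norm / max(norm, clip_norm)
    return [[x * scale for x in g] for g in grads], norm


def non_finite_indices(grads):
    """Indices of gradients that contain at least one NaN or Inf entry."""
    return [i for i, g in enumerate(grads)
            if not all(math.isfinite(x) for x in g)]


healthy = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm = clip_by_global_norm(healthy, clip_norm=1.0)
print(norm)  # 5.0

broken = [[3.0, 0.0], [float("nan"), 4.0]]
print(non_finite_indices(broken))  # [1]
```

Because the norm is computed on the raw gradients before the learning rate is applied, a gradient that is already non-finite at the first step would still trigger the check after learning_rate is changed. In that situation, finding which gradient is non-finite is likely a more useful next step.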

Have you solved this problem? I hit a NaN at this same step while working on a different task, also after switching TensorFlow versions.

HqWu-HITCS avatar Apr 23 '20 09:04 HqWu-HITCS

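I haven't solved it either. It is caused by a TensorFlow version issue. I suggest switching to the PyTorch version, which has better compatibility. Hope this helps.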

geibeile avatar Apr 23 '20 09:04 geibeile

Got it, thanks a lot.

HqWu-HITCS avatar Apr 23 '20 09:04 HqWu-HITCS