NLDF
Crash while training
Hi,
I'm trying to train my own model with your implementation, but I keep hitting a gradient error that produces NaN or Inf values.
Below is the log from when training crashes:
2018-10-31 02:12:13.561786: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7fa89e206200 = {1, 0} Found Inf or NaN global norm.
Traceback (most recent call last):
File "TrainingModel.py", line 112, in <module>
model.label_holder: label_flip})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[{{node VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm/global_norm)]]
Caused by op u'VerifyFinite/CheckNumerics', defined at:
File "TrainingModel.py", line 42, in <module>
grads, _ = tf.clip_by_global_norm(tf.gradients(model.Loss_Mean, tvars), max_grad_norm)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/clip_ops.py", line 259, in clip_by_global_norm
"Found Inf or NaN global norm.")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/numerics.py", line 45, in verify_tensor_all_finite
verify_input = array_ops.check_numerics(t, message=msg)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[{{node VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm/global_norm)]]
Could you comment on this? I would also like to know the TensorFlow version and more details about the training environment used for the pre-built model.
Thanks.
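One common cause of a "Found Inf or NaN global norm" crash is a cross-entropy loss hitting log(0) once the network's outputs saturate at exactly 0 or 1; the -inf propagates into the gradients and trips CheckNumerics. This is only a guess at the cause here, but a minimal NumPy sketch (illustration only, not the NLDF code) shows how clamping predictions with a small epsilon keeps the loss finite:

```python
import numpy as np

def cross_entropy(pred, label, eps=1e-7):
    # Clamp predictions away from exactly 0 and 1 so log() stays finite.
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(label * np.log(pred) + (1 - label) * np.log(1 - pred))

pred = np.array([0.0, 1.0, 0.5])   # saturated sigmoid outputs
label = np.array([1.0, 0.0, 1.0])

with np.errstate(divide="ignore"):
    naive = -np.mean(label * np.log(pred) + (1 - label) * np.log(1 - pred))

print(np.isfinite(naive))                       # False: log(0) blows up
print(np.isfinite(cross_entropy(pred, label)))  # True: clamped version
```

If the loss in TrainingModel.py takes a log of raw sigmoid outputs, adding a similar epsilon (or using a numerically stable logits-based loss such as `tf.nn.sigmoid_cross_entropy_with_logits`) may avoid the NaN gradients.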
I need to check what causes this NaN issue, and will reply to you later.
Also, are there any constraints on the training dataset when training our own model? For example, must the target mask area be larger than a certain proportion of the label image?
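If such a constraint turns out to matter, one way to enforce it is a simple pre-filter that skips samples whose mask is nearly empty or nearly fills the image. The thresholds below are hypothetical (not taken from NLDF); this is just a sketch of the idea:

```python
import numpy as np

def mask_fraction_ok(mask, min_frac=0.01, max_frac=0.99):
    """Return True if the salient region covers a reasonable fraction of
    the label image. min_frac/max_frac are illustrative guesses, not
    values confirmed by the NLDF authors."""
    frac = np.count_nonzero(mask) / mask.size
    return min_frac <= frac <= max_frac

empty = np.zeros((224, 224), dtype=np.uint8)  # all-background mask
print(mask_fraction_ok(empty))                # False: would be skipped
```

All-background or all-foreground labels are exactly the cases where a cross-entropy loss can saturate, so filtering them out before training is a cheap sanity check.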
I also meet this problem. I used the MSRA10K dataset to train the model, but the loss becomes NaN at the third epoch. Have you solved this problem?