Error when training with Dice Coefficient
Hi,
I get an error when I try training with the dice coefficient as the cost function. I noticed there was a new commit on this a couple of days ago, so I suspect it's a bug in the code. Would you know roughly where this might be?
```
InvalidArgumentError                      Traceback (most recent call last)
/home/proj/tf_unet/tf_unet/unet.pyc in train(self, data_provider, output_path, training_iters, epochs, dropout, display_step, restore)
    424
    425             if step % display_step == 0:
--> 426                 self.output_minibatch_stats(sess, summary_writer, step, batch_x, util.crop_to_shape(batch_y, pred_shape))
    427
    428             total_loss += loss

/home/proj/tf_unet/tf_unet/unet.pyc in output_minibatch_stats(self, sess, summary_writer, step, batch_x, batch_y)
    467                               feed_dict={self.net.x: batch_x,
    468                                          self.net.y: batch_y,
--> 469                                          self.net.keep_prob: 1.})
    470         summary_writer.add_summary(summary_str, step)
    471         summary_writer.flush()

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
    764     try:
    765       result = self._run(None, fetches, feed_dict, options_ptr,
--> 766                          run_metadata_ptr)
    767       if run_metadata:
    768         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
    962     if final_fetches or final_targets:
    963       results = self._do_run(handle, final_targets, final_fetches,
--> 964                              feed_dict_string, options, run_metadata)
    965     else:
    966       results = []

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1012     if handle is None:
   1013       return self._do_call(_run_fn, self._session, feed_dict, fetch_list,
-> 1014                            target_list, options, run_metadata)
   1015     else:
   1016       return self._do_call(_prun_fn, self._session, handle, feed_dict,

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
   1032     except KeyError:
   1033       pass
-> 1034     raise type(e)(node_def, op, message)
   1035
   1036   def _extend_graph(self):

InvalidArgumentError: Nan in summary histogram for: norm_grads
     [[Node: norm_grads = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](norm_grads/tag, Variable_37/read)]]

Caused by op u'norm_grads', defined at:
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py", line 3, in

InvalidArgumentError (see above for traceback): Nan in summary histogram for: norm_grads
     [[Node: norm_grads = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](norm_grads/tag, Variable_37/read)]]
```
The error only gets hit after some number of iterations. It seems to get hit after fewer iterations when I use the Adam optimizer rather than momentum, but that might just be specific to my case. After enough iterations I get the error regardless of the optimizer. The same training/testing data works fine if I use cross entropy as the cost function.
Quick update: found the issue. There is a bug in layers.py, in pixel_wise_softmax_2 and pixel_wise_softmax.
If the output_map is too large, exponential_map goes to infinity, which produces NaNs when the cost function is calculated.
The following change fixes it, although we might want to find a better clipping value: replace `exponential_map = tf.exp(output_map)` with `exponential_map = tf.exp(tf.clip_by_value(output_map, -np.inf, 50))`.
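For reference, a minimal sketch of how the patched function could look (the actual body of pixel_wise_softmax_2 in layers.py may differ, and the threshold of 50 is just a first guess):

```python
import numpy as np
import tensorflow as tf

def pixel_wise_softmax_2(output_map):
    # Clip the logits before exponentiating so tf.exp cannot overflow to inf,
    # which otherwise propagates NaNs into the dice cost and gradient summaries.
    exponential_map = tf.exp(tf.clip_by_value(output_map, -np.inf, 50.0))
    # Normalize over the class/channel axis (assuming a [batch, ny, nx, n_class] tensor).
    normalize = tf.reduce_sum(exponential_map, 3, keep_dims=True)
    return exponential_map / normalize
```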
BTW thanks for providing the tf_unet code. It has been very helpful! :)
Thanks for reporting this. I'm just wondering why the output_map gets so large.
Yeah, I'm wondering the same thing. I just noticed that I still get garbage results when training on my data (with cross entropy I was getting something more reasonable).
I have no idea why the output_map gets so large; I plan on looking into it some more a little later. Would you happen to have any ideas or theories to look into?
I have also encountered this issue. Using a smaller learning rate helped, so maybe it's just an exploding gradient.
Maybe. Another thing I noticed is that the original code uses both channels together to calculate the dice coefficient. When I use only one of the channels, the values I get end up being better (see the sketch below).
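A rough sketch of what I mean by using only one channel, assuming the prediction and labels are [batch, nx, ny, 2] tensors with channel 1 as the foreground; the helper name and the epsilon are just for illustration:

```python
import tensorflow as tf

def single_channel_dice_loss(prediction, y, eps=1e-5):
    # Use only the foreground channel instead of summing over both channels,
    # which mixes background and foreground into a single score.
    pred_fg = prediction[:, :, :, 1]
    true_fg = y[:, :, :, 1]
    intersection = tf.reduce_sum(pred_fg * true_fg)
    union = tf.reduce_sum(pred_fg) + tf.reduce_sum(true_fg)
    dice = (2.0 * intersection + eps) / (union + eps)
    return 1.0 - dice  # value to minimize
```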
This is a typical overflow/underflow issue when computing sum(exp(x)). Searching for 'log sum exp' on the web will give some explanation. The trick is to divide/multiply by the same constant before the exp function.
Or you can use tf.reduce_logsumexp, or refer to the source code of that function.
@weiliu620 thanks for the hint. I'm going to look into this
@weiliu620 following the lines from here referred to in your SO question, we would just have to subtract the result of tf.reduce_max inside the tf.exp call, right?
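Concretely, I'm thinking of something like this sketch (just an illustration, not the actual layers.py code): subtracting the per-pixel max over the class axis leaves the softmax unchanged because the constant cancels in the ratio, but keeps tf.exp in a safe range.

```python
import tensorflow as tf

def stable_pixel_wise_softmax(output_map):
    # Subtract the per-pixel maximum over the class axis; the subtracted
    # constant cancels out in the normalization, so the softmax value is
    # unchanged, but tf.exp can no longer overflow to inf.
    max_per_pixel = tf.reduce_max(output_map, 3, keep_dims=True)
    exponential_map = tf.exp(output_map - max_per_pixel)
    normalize = tf.reduce_sum(exponential_map, 3, keep_dims=True)
    return exponential_map / normalize
```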