finetune_alexnet_with_tensorflow

Fine-Tuning Fails With Exception Between Epoch 1 and Epoch 2

Open shashankiyer opened this issue 5 years ago • 1 comments

I have been trying to use this code to fine-tune the network to classify images from the Cifar10 dataset. However, I get the following error:

Traceback (most recent call last):
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: fc6/weights_0
  [[{{node fc6/weights_0}} = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](fc6/weights_0/tag, fc6/weights/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "finetune.py", line 202, in <module>
    keep_prob: 1.})
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: fc6/weights_0
  [[node fc6/weights_0 (defined at finetune.py:137) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](fc6/weights_0/tag, fc6/weights/read)]]

Caused by op 'fc6/weights_0', defined at:
  File "finetune.py", line 137, in <module>
    tf.summary.histogram(var.name, var)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/summary/summary.py", line 187, in histogram
    tag=tag, values=values, name=scope)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 284, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/shashankiyer/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: fc6/weights_0
  [[node fc6/weights_0 (defined at finetune.py:137) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](fc6/weights_0/tag, fc6/weights/read)]]

These are the lines in the code that cause this:

# Add gradients to summary
for gradient, var in grads_and_vars:
    tf.summary.histogram(var.name + '/gradient', gradient)

# Add the variables we train to the summary
for var in var_list:
    tf.summary.histogram(var.name, var)
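
Note that the histogram op only reports the problem: it fails because fc6/weights already contains NaN by the time the summary is evaluated. As a rough sketch of how one might find which variables have gone bad, the plain-Python helper below scans named lists of values for NaN or infinity. The function name `find_non_finite` and the `snapshot` dictionary are hypothetical and are not part of finetune.py; in a real run the values would come from evaluating the variables in the session. TensorFlow 1.x also offers `tf.check_numerics` for a similar check inside the graph.

```python
import math


def find_non_finite(named_values):
    """Return the names whose values contain NaN or +/-inf."""
    bad = []
    for name, values in named_values.items():
        if any(not math.isfinite(v) for v in values):
            bad.append(name)
    return bad


# Hypothetical snapshot of flattened weights, keyed by variable name.
snapshot = {
    "fc6/weights": [0.1, float("nan"), -0.3],
    "fc7/weights": [0.2, 0.0, 0.5],
}
print(find_non_finite(snapshot))
```

Running a check like this after each training step would show the step at which the weights first become non-finite, which is usually earlier than the point where the summary op fails.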

I am running TensorFlow 1.12.0. Any pointers would be greatly appreciated.

shashankiyer, Nov 27 '18 01:11

NaN values are almost always a hint that your learning rate is too high. Try decreasing it to, e.g., 1e-3 or 1e-4.
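
To illustrate why a large learning rate can end in NaN, here is a minimal plain-Python sketch, not taken from finetune.py. It runs gradient descent on the toy function f(w) = w**2; the function name `gradient_descent` and the two learning rates are chosen purely for illustration. With a step size that is too large, each update overshoots, the weight grows until it overflows to infinity, and the next update computes inf - inf, which is NaN.

```python
import math


def gradient_descent(learning_rate, steps=2000, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                # derivative of w**2
        w = w - learning_rate * grad  # standard update rule
    return w


diverged = gradient_descent(learning_rate=1.5)    # overshoots on every step
converged = gradient_descent(learning_rate=1e-3)  # shrinks w towards 0

print(math.isnan(diverged))
print(abs(converged) < 1.0)
```

A real network diverges in a less tidy way, but the mechanism is the same, which is why lowering the learning rate is a sensible first thing to try.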

kratzert, Mar 07 '19 09:03