Memory issues in train.py - Exiting training at self._traceback = tf_stack.extract_stack()

Open TNemes-3141 opened this issue 5 years ago • 14 comments

System information

  • What is the top-level directory of the model you are using: tensorflow/models

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 64bit

  • TensorFlow installed from (source or binary): binary

  • TensorFlow version (use command below): gpu-1.15

  • Have I written custom code: Yes. I added these lines in order to get the training started at all:

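# Let TensorFlow grow its GPU memory allocation on demand
# instead of reserving all of the GPU memory up front.
from tensorflow import ConfigProto
from tensorflow import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
# Creating the session applies the config and installs it as the default session.
session = InteractiveSession(config=config)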
  • Bazel version: N/A
  • CUDA/cuDNN version: CUDA 10.0 / cuDNN v7.6.5
  • GPU model and memory: Nvidia GeForce GTX 1650 with 4GB dedicated memory
  • Python version: Python 3.6.8 64bit AMD64
  • Environment: Virtualenv with pip

Exact command to reproduce:

python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v2_coco.config

Describe the problem

After adding the lines above to /models-master/research/object_detection/legacy/train.py, training started on my GPU, which has only 4 GB of memory (which is why I reduced batch_size to 1). After approximately 1400 iterations, however, training stops with weird errors whose stack traces all end at the following frame:

File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

This is the first error reported, and it appears to be the root error that causes the crash:

(0) Invalid argument: Nan in summary histogram for:
ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance

I have tried several different solutions for this, but none of them worked. The full output and the .config file are below. Is the problem caused by my graphics card running out of memory, or could something else be the cause?

Logs and errors:

Console output:

INFO:tensorflow:global step 1474: loss = 7.0426 (0.075 sec/step)
I0518 12:43:10.650701 12884 learning.py:507] global step 1474: loss = 7.0426 (0.075 sec/step)
INFO:tensorflow:global step 1475: loss = 7.8581 (0.076 sec/step)
I0518 12:43:10.726234 12884 learning.py:507] global step 1475: loss = 7.8581 (0.076 sec/step)
INFO:tensorflow:global step 1476: loss = 10.0487 (0.105 sec/step)
I0518 12:43:10.831919 12884 learning.py:507] global step 1476: loss = 10.0487 (0.105 sec/step)
INFO:tensorflow:Error reported to Coordinator: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
         [[FeatureExtractor/MobilenetV2/expanded_conv_1/project/weights/read/_63]]
  (1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance':
  File "object_detection/legacy/train.py", line 191, in <module>
    tf.app.run()
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "object_detection/legacy/train.py", line 187, in main
    graph_hook_fn=graph_rewriter_fn)
  File "C:\Users\nemes\Documents\models-master\research\object_detection\legacy\trainer.py", line 355, in train
    model_var.op.name, model_var))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\summary\summary.py", line 179, in histogram
    tag=tag, values=values, name=scope)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\ops\gen_logging_ops.py", line 329, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
Traceback (most recent call last):
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
    return fn(*args)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[{{node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance}}]]
         [[FeatureExtractor/MobilenetV2/expanded_conv_1/project/weights/read/_63]]
  (1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[{{node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 297, in stop_on_exception
    yield
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495, in run
    self.run_loop()
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1045, in run_loop
    [self._sv.summary_op, self._sv.global_step])
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
    run_metadata_ptr)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
    run_metadata)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
         [[FeatureExtractor/MobilenetV2/expanded_conv_1/project/weights/read/_63]]
  (1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance':
  File "object_detection/legacy/train.py", line 191, in <module>
    tf.app.run()
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "object_detection/legacy/train.py", line 187, in main
    graph_hook_fn=graph_rewriter_fn)
  File "C:\Users\nemes\Documents\models-master\research\object_detection\legacy\trainer.py", line 355, in train
    model_var.op.name, model_var))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\summary\summary.py", line 179, in histogram
    tag=tag, values=values, name=scope)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\ops\gen_logging_ops.py", line 329, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

I0518 12:43:10.883035 11372 coordinator.py:219] Error reported to Coordinator: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
         [[FeatureExtractor/MobilenetV2/expanded_conv_1/project/weights/read/_63]]
  (1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance':
  File "object_detection/legacy/train.py", line 191, in <module>
    tf.app.run()
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "object_detection/legacy/train.py", line 187, in main
    graph_hook_fn=graph_rewriter_fn)
  File "C:\Users\nemes\Documents\models-master\research\object_detection\legacy\trainer.py", line 355, in train
    model_var.op.name, model_var))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\summary\summary.py", line 179, in histogram
    tag=tag, values=values, name=scope)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\ops\gen_logging_ops.py", line 329, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
Traceback (most recent call last):
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
    return fn(*args)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[{{node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance}}]]
         [[FeatureExtractor/MobilenetV2/expanded_conv_1/project/weights/read/_63]]
  (1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[{{node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 297, in stop_on_exception
    yield
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495, in run
    self.run_loop()
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1045, in run_loop
    [self._sv.summary_op, self._sv.global_step])
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
    run_metadata_ptr)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
    run_metadata)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
         [[FeatureExtractor/MobilenetV2/expanded_conv_1/project/weights/read/_63]]
  (1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance':
  File "object_detection/legacy/train.py", line 191, in <module>
    tf.app.run()
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "object_detection/legacy/train.py", line 187, in main
    graph_hook_fn=graph_rewriter_fn)
  File "C:\Users\nemes\Documents\models-master\research\object_detection\legacy\trainer.py", line 355, in train
    model_var.op.name, model_var))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\summary\summary.py", line 179, in histogram
    tag=tag, values=values, name=scope)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\ops\gen_logging_ops.py", line 329, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

INFO:tensorflow:global step 1477: loss = 9.4083 (0.077 sec/step)
I0518 12:43:11.339294 12884 learning.py:507] global step 1477: loss = 9.4083 (0.077 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
I0518 12:43:11.352624 12884 learning.py:785] Finished training! Saving model to disk.
C:\tensorflow\lib\site-packages\tensorflow_core\python\summary\writer\writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened.
  warnings.warn("Attempting to use a closed FileWriter. "
Traceback (most recent call last):
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
    return fn(*args)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[{{node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance}}]]
         [[FeatureExtractor/MobilenetV2/expanded_conv_1/project/weights/read/_63]]
  (1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[{{node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "object_detection/legacy/train.py", line 191, in <module>
    tf.app.run()
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "object_detection/legacy/train.py", line 187, in main
    graph_hook_fn=graph_rewriter_fn)
  File "C:\Users\nemes\Documents\models-master\research\object_detection\legacy\trainer.py", line 417, in train
    saver=saver)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\contrib\slim\python\slim\learning.py", line 790, in train
    ignore_live_threads=ignore_live_threads)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 839, in stop
    ignore_live_threads=ignore_live_threads)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "c:\users\nemes\appdata\local\programs\python\python36\lib\site-packages\six.py", line 703, in reraise
    raise value
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 297, in stop_on_exception
    yield
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\coordinator.py", line 495, in run
    self.run_loop()
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\training\supervisor.py", line 1045, in run_loop
    [self._sv.summary_op, self._sv.global_step])
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
    run_metadata_ptr)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
    run_metadata)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
         [[FeatureExtractor/MobilenetV2/expanded_conv_1/project/weights/read/_63]]
  (1) Invalid argument: Nan in summary histogram for: ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance
         [[node ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance (defined at C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'ModelVars/FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/BatchNorm/moving_variance':
  File "object_detection/legacy/train.py", line 191, in <module>
    tf.app.run()
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\tensorflow\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "object_detection/legacy/train.py", line 187, in main
    graph_hook_fn=graph_rewriter_fn)
  File "C:\Users\nemes\Documents\models-master\research\object_detection\legacy\trainer.py", line 355, in train
    model_var.op.name, model_var))
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\summary\summary.py", line 179, in histogram
    tag=tag, values=values, name=scope)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\ops\gen_logging_ops.py", line 329, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "C:\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
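
The loss values right before the crash are easy to miss among the stack traces above. Here is a minimal sketch that pulls the step/loss pairs out of the console output. The helper name parse_losses is hypothetical (it is not part of the Object Detection API), and the regular expression only assumes the "global step N: loss = X" format seen in this log:

```python
import re

# Matches both the "INFO:tensorflow:" and the "I0518 ..." variant of each line.
STEP_RE = re.compile(r"global step (\d+): loss = (\S+)")


def parse_losses(log_text):
    """Return {step: loss}; the doubled log lines collapse into one entry."""
    losses = {}
    for match in STEP_RE.finditer(log_text):
        losses[int(match.group(1))] = float(match.group(2))
    return losses


# The step lines copied from the console output above.
sample = """\
INFO:tensorflow:global step 1474: loss = 7.0426 (0.075 sec/step)
I0518 12:43:10.650701 12884 learning.py:507] global step 1474: loss = 7.0426 (0.075 sec/step)
INFO:tensorflow:global step 1475: loss = 7.8581 (0.076 sec/step)
INFO:tensorflow:global step 1476: loss = 10.0487 (0.105 sec/step)
INFO:tensorflow:global step 1477: loss = 9.4083 (0.077 sec/step)
"""

for step, loss in sorted(parse_losses(sample).items()):
    print(step, loss)
```

For the lines above this gives losses of 7.0426, 7.8581, 10.0487 and 9.4083 for steps 1474 to 1477, so the loss is between 7 and 10 and not settling when the NaN shows up.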

.config-file:

# SSD with Mobilenet v2 configuration for MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  ssd {
    num_classes: 20
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 1
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "C:/Users/nemes/Documents/ssd_mobilenet_v2_coco/model.ckpt"
  fine_tune_checkpoint_type:  "detection"
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "C:/Users/nemes/Documents/data/train.record"
  }
  label_map_path: "C:/Users/nemes/Documents/data/label_map.pbtxt"
}

eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "C:/Users/nemes/Documents/data/eval.record"
  }
  label_map_path: "C:/Users/nemes/Documents/data/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
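
As a side note on the template comment about num_steps in train_config: here is a minimal sketch of what the configured learning rate looks like over the run. It assumes the schedule is the usual exponential decay, initial_learning_rate * decay_factor ^ (step / decay_steps), and that the exponent is floored (staircase mode). Staircase mode is an assumption on my part, not something the config above states. The function name is hypothetical:

```python
def exponential_decay_lr(step, initial_lr=0.004, decay_steps=800720,
                         decay_factor=0.95, staircase=True):
    """Learning rate at `step` under a standard exponential decay schedule.

    Defaults are the values from the train_config above. `staircase=True`
    (integer division in the exponent) is an assumption.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_factor ** exponent


# Step of the crash, the configured num_steps, and the first decay boundary.
for step in (1476, 200000, 800720):
    print(step, exponential_decay_lr(step))
```

Under these assumptions the learning rate is still the initial 0.004 at step 1476 where the crash happens, and it would stay there for all 200000 steps, because the first decay only happens at step 800720.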

TNemes-3141 • May 18 '20 11:05