tensorflow-deeplab-v3-plus
I can't run train.py
The following errors occur when I run train.py:
File "train.py", line 285, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train.py", line 267, in main
hooks=train_hooks,
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/algolab/HDD/LHHS/DeepLab_v3_plus/tensorflow-deeplab-v3-plus-master/deeplab_model.py", line 172, in deeplabv3_plus_model_fn
logits = network(features, mode == tf.estimator.ModeKeys.TRAIN)
File "/home/algolab/HDD/LHHS/DeepLab_v3_plus/tensorflow-deeplab-v3-plus-master/deeplab_model.py", line 129, in model
{v.name.split(':')[0]: v for v in variables_to_restore})
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/training/checkpoint_utils.py", line 187, in init_from_checkpoint
_init_from_checkpoint, ckpt_dir_or_file, assignment_map)
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/training/distribute.py", line 1053, in merge_call
return self._merge_call(merge_fn, *args, **kwargs)
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/training/distribute.py", line 1061, in _merge_call
return merge_fn(self._distribution_strategy, *args, **kwargs)
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/training/checkpoint_utils.py", line 194, in _init_from_checkpoint
reader = load_checkpoint(ckpt_dir_or_file)
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/training/checkpoint_utils.py", line 64, in load_checkpoint
return pywrap_tensorflow.NewCheckpointReader(filename)
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 326, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern), status)
File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for PRE_TRAINED_MODEL
I have resnet_v2_101.ckpt at the following path, so why does this happen? (See the note after the traceback below.)
parser.add_argument('--pre_trained_model', type=str, default='./ini_checkpoints/resnet_v2_101/resnet_v2_101.ckpt', help='Path to the pre-trained model checkpoint.')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 285, in
[[{{node gradients/DynamicPartition_grad/range/_7773}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_35384_gradients/DynamicPartition_grad/range", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'resnet_v2_101/block3/unit_1/bottleneck_v2/conv3/Conv2D', defined at:
File "train.py", line 285, in
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10,1024,33,33] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node resnet_v2_101/block3/unit_1/bottleneck_v2/conv3/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](resnet_v2_101/block3/unit_1/bottleneck_v2/conv2/Relu, resnet_v2_101/block3/unit_1/bottleneck_v2/conv3/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node gradients/DynamicPartition_grad/range/_7773}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_35384_gradients/DynamicPartition_grad/range", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
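A note on the first error: the NotFoundError names the literal string PRE_TRAINED_MODEL, which suggests that placeholder was passed on the command line verbatim instead of a real path (an assumption; the exact invocation isn't shown in the issue). A minimal repro of how argparse would then bypass the default:

```python
# Hypothetical repro: passing the placeholder PRE_TRAINED_MODEL on the
# command line overrides the argparse default, so TF later looks for a
# checkpoint literally named "PRE_TRAINED_MODEL" and raises NotFoundError.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--pre_trained_model', type=str,
                    default='./ini_checkpoints/resnet_v2_101/resnet_v2_101.ckpt',
                    help='Path to the pre-trained model checkpoint.')

args = parser.parse_args(['--pre_trained_model', 'PRE_TRAINED_MODEL'])
print(args.pre_trained_model)  # prints PRE_TRAINED_MODEL, not the default
```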
Decrease the batch size; try 1. @juanmanuelrq
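For context, the OOM tensor shape[10,1024,33,33] matches a batch of 10, so lowering the batch size (assuming train.py exposes a --batch_size flag) shrinks the largest activations. The error's own hint can also be enabled; a minimal sketch with a plain tf.Session (the repo uses tf.estimator, where the RunOptions would instead be threaded through a session hook):

```python
# Minimal sketch (TF 1.x) of the hint in the OOM message: passing
# report_tensor_allocations_upon_oom so TF lists live tensors on OOM.
import tensorflow as tf

x = tf.random_normal([10, 1024, 33, 33])  # the shape from the error above
total = tf.reduce_sum(x)

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
with tf.Session() as sess:
    # If this run ever hits an OOM, TF now reports current allocations.
    print(sess.run(total, options=run_options))
```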
@sori0528 Specify pre_trained_model explicitly: python train.py --pre_trained_model ./ini_checkpoints/resnet_v2_101/resnet_v2_101.ckpt. Also add tf.logging.info(pre_trained_model) before tf.train.init_from_checkpoint to confirm which path is actually being used.
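A sketch of that suggestion (the path is an example, and since variables_to_restore lives inside deeplab_model.py, this only inspects the checkpoint rather than calling init_from_checkpoint):

```python
# Log and verify the checkpoint path before tf.train.init_from_checkpoint
# ever sees it (TF 1.x). Adjust the path to wherever the .ckpt lives.
import os
import tensorflow as tf

pre_trained_model = './ini_checkpoints/resnet_v2_101/resnet_v2_101.ckpt'
tf.logging.set_verbosity(tf.logging.INFO)
tf.logging.info('Restoring from: %s', pre_trained_model)

# Slim's resnet_v2_101.ckpt is a single V1 checkpoint file; V2 checkpoints
# are a prefix with .index/.data-* files, so accept either layout here.
if not (os.path.exists(pre_trained_model) or
        os.path.exists(pre_trained_model + '.index')):
    raise ValueError('Checkpoint not found: %s' % pre_trained_model)

# Listing the stored variables confirms the checkpoint is readable.
for name, shape in tf.train.list_variables(pre_trained_model):
    print(name, shape)
```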
@northeastsquare Hi, I am trying to run this code and I get a SystemExit error while running create_pascal_tf_record.py:
SystemExit Traceback (most recent call last)
I guess that while running train.py the error would occur at that same line in the main function. Could you please suggest what can be done?
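One common cause (an assumption, since the full trace isn't shown above): both argparse and tf.app.run() call sys.exit(), which surfaces as SystemExit when a script runs inside IPython/Jupyter, especially if the notebook passes extra flags the script doesn't recognize. train.py sidesteps the unknown-flag case with parse_known_args, as in the line tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) above; a minimal sketch of that pattern:

```python
# Sketch of the parse_known_args pattern train.py already uses: unknown
# flags (e.g. Jupyter's -f kernel-file flag) are collected in `unparsed`
# instead of making argparse call sys.exit(), which would raise
# SystemExit inside a notebook.
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--data_dir', type=str, default='./dataset',
                    help='Example flag; stands in for the real ones.')
FLAGS, unparsed = parser.parse_known_args()
print(FLAGS, unparsed)  # unparsed holds any flags the parser did not know
```

Note that even with this pattern, tf.app.run() still calls sys.exit() when main returns, which is harmless from a shell but shows up as SystemExit in a notebook.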
Decrease the batch size; try 1. @juanmanuelrq
Great! I had the same problem. Solved!