triplet-reid

training error: OutOfRangeError: End of sequence

Open muxizju opened this issue 6 years ago • 4 comments

I used the code to train on my own dataset, but it raised this error at sess.run(). The full log is printed below; I changed some arguments, such as net_input_height and batch_p. My TensorFlow version is 1.7. I don't know what's wrong here.

Instructions for updating:
Use the retry module or similar alternatives.
2018-09-27 11:12:06,474 [INFO] train: Training using the following parameters:
2018-09-27 11:12:06,474 [INFO] train: batch_k: 4
2018-09-27 11:12:06,474 [INFO] train: batch_p: 8
2018-09-27 11:12:06,474 [INFO] train: checkpoint_frequency: 1000
2018-09-27 11:12:06,474 [INFO] train: crop_augment: False
2018-09-27 11:12:06,474 [INFO] train: decay_start_iteration: 100000
2018-09-27 11:12:06,474 [INFO] train: detailed_logs: False
2018-09-27 11:12:06,474 [INFO] train: embedding_dim: 128
2018-09-27 11:12:06,475 [INFO] train: experiment_root: F:/projector/GestureClassification/TripletBasedGestureRecognition/experiment_root/20180926/
2018-09-27 11:12:06,475 [INFO] train: flip_augment: False
2018-09-27 11:12:06,475 [INFO] train: head_name: fc1024
2018-09-27 11:12:06,475 [INFO] train: image_root: F:/projector/GestureClassification/data/img/20180919/triplet_data/img/
2018-09-27 11:12:06,475 [INFO] train: initial_checkpoint: None
2018-09-27 11:12:06,475 [INFO] train: learning_rate: 0.0003
2018-09-27 11:12:06,475 [INFO] train: loading_threads: 4
2018-09-27 11:12:06,475 [INFO] train: loss: batch_hard
2018-09-27 11:12:06,476 [INFO] train: margin: soft
2018-09-27 11:12:06,476 [INFO] train: metric: euclidean
2018-09-27 11:12:06,476 [INFO] train: model_name: resnet_v1_50
2018-09-27 11:12:06,476 [INFO] train: net_input_height: 64
2018-09-27 11:12:06,476 [INFO] train: net_input_width: 64
2018-09-27 11:12:06,476 [INFO] train: pre_crop_height: 64
2018-09-27 11:12:06,476 [INFO] train: pre_crop_width: 64
2018-09-27 11:12:06,476 [INFO] train: resume: False
2018-09-27 11:12:06,476 [INFO] train: train_iterations: 250000
2018-09-27 11:12:06,476 [INFO] train: train_set: F:/projector/GestureClassification/data/img/20180919/triplet_data/gesture_train.csv
2018-09-27 11:12:07,403 [INFO] tensorflow: Scale of 0 disables regularizer.
2018-09-27 11:12:07,403 [INFO] tensorflow: Scale of 0 disables regularizer.
2018-09-27 11:12:08,569 [WARNING] tensorflow: From F:\projector\GestureClassification\TripletBasedGestureRecognition\triplet-reid\nets\resnet_v1.py:219: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2018-09-27 11:12:08,569 [WARNING] tensorflow: From F:\projector\GestureClassification\TripletBasedGestureRecognition\triplet-reid\nets\resnet_v1.py:219: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\ops\gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-09-27 11:12:11.533610: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-09-27 11:12:11.936193: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1060 5GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:01:00.0
totalMemory: 5.00GiB freeMemory: 4.12GiB
2018-09-27 11:12:11.936710: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0
2018-09-27 11:12:14.388590: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-27 11:12:14.388811: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0
2018-09-27 11:12:14.388948: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N
2018-09-27 11:12:14.415769: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3871 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 5GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-09-27 11:12:16.275624: I T:\src\github\tensorflow\tensorflow\core\kernels\cuda_solvers.cc:159] Creating CudaSolver handles for stream 000001A50E54E080
2018-09-27 11:12:20,572 [INFO] tensorflow: F:/projector/GestureClassification/TripletBasedGestureRecognition/experiment_root/20180926/checkpoint-0 is not in all_model_checkpoint_paths. Manually adding it.
2018-09-27 11:12:20,572 [INFO] tensorflow: F:/projector/GestureClassification/TripletBasedGestureRecognition/experiment_root/20180926/checkpoint-0 is not in all_model_checkpoint_paths. Manually adding it.
2018-09-27 11:12:23,207 [INFO] train: Starting training from iteration 0.

Traceback (most recent call last):
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 1327, in _do_call
    return fn(*args)
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
  [[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,64,64,3], [?], [?]], output_types=[DT_FLOAT, DT_STRING, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "F:/projector/GestureClassification/TripletBasedGestureRecognition/triplet-reid/train.py", line 439, in <module>
    main()
  File "F:/projector/GestureClassification/TripletBasedGestureRecognition/triplet-reid/train.py", line 393, in main
    prec_at_k, endpoints['emb'], losses, fids])
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 905, in run
    run_metadata_ptr)
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 1321, in _do_run
    run_metadata)
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
  [[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,64,64,3], [?], [?]], output_types=[DT_FLOAT, DT_STRING, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'IteratorGetNext', defined at:
  File "F:/projector/GestureClassification/TripletBasedGestureRecognition/triplet-reid/train.py", line 439, in <module>
    main()
  File "F:/projector/GestureClassification/TripletBasedGestureRecognition/triplet-reid/train.py", line 280, in main
    images, fids, pids = dataset.make_one_shot_iterator().get_next()
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 366, in get_next
    name=name)), self._output_types,
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 1484, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\framework\ops.py", line 3290, in create_op
    op_def=op_def)
  File "D:\Program Files\Python3.5\lib\site-packages\tensorflow\python\framework\ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

OutOfRangeError (see above for traceback): End of sequence
  [[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,64,64,3], [?], [?]], output_types=[DT_FLOAT, DT_STRING, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Process finished with exit code 1

muxizju, Sep 27 '18 03:09

I just found the reason. I have only 7 classes (persons) in my dataset, but I set batch_p to 8.

# Constrain the dataset size to a multiple of the batch-size, so that
# we don't get overlap at the end of each epoch.
dataset = dataset.take((len(unique_pids) // args.batch_p) * args.batch_p)

With 7 unique PIDs and batch_p set to 8, (7 // 8) * 8 is 0, so this step becomes take(0). The dataset is then empty, iteration ends on the very first sess.run(), and that raises the error mentioned above.

It's a silly mistake, but I suggest adding an if-else check that reports this condition.

muxizju, Sep 27 '18 04:09
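A minimal sketch of the kind of check suggested above, written in plain Python so it does not depend on TensorFlow. The helper name `check_batch_p` and the wording of the error message are hypothetical and not part of triplet-reid; only the truncation arithmetic is taken from the `dataset.take(...)` line quoted in this thread.

```python
def check_batch_p(num_unique_pids, batch_p):
    """Return how many PIDs survive the truncation; raise if none do.

    Hypothetical helper: mirrors (len(unique_pids) // batch_p) * batch_p
    from the quoted train.py snippet.
    """
    usable = (num_unique_pids // batch_p) * batch_p
    if usable == 0:
        # Without this check the dataset becomes take(0) and the first
        # sess.run() fails with "OutOfRangeError: End of sequence".
        raise ValueError(
            "batch_p ({}) is larger than the number of unique PIDs ({}); "
            "lower batch_p or add more classes.".format(
                batch_p, num_unique_pids))
    return usable
```

Such a check could plausibly be called as `check_batch_p(len(unique_pids), args.batch_p)` just before the `dataset.take(...)` line, so that the misconfiguration is reported with a readable message rather than surfacing from inside the iterator.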

Thanks for updating with the reason. Indeed we could add code catching this mistake, I'd happily accept a PR doing so!

lucasb-eyer, Oct 28 '18 16:10

you do a good job

duyanfang123, Apr 09 '19 08:04

@muxizju I just came across this, as I also have few classes. My question: what happens to the remaining classes if, say, I have 7 classes and batch_p is 4? Do the other 3 classes get carried over into future batches, or are they just ignored?

mazatov, Mar 12 '20 15:03
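On the question of the 3 leftover classes: the quoted `take(...)` line keeps (7 // 4) * 4 = 4 PIDs per pass and leaves 3 out of that pass. Whether the same 3 are left out every time depends on whether the PID list is reshuffled before the truncation, which is not shown in this thread and should be checked in train.py. The sketch below is a plain-Python simulation under that assumption (a fresh shuffle before each truncation); the function name `pids_for_epoch` and the example PID labels are hypothetical.

```python
import random


def pids_for_epoch(unique_pids, batch_p, rng):
    """Shuffle the PIDs, then keep only a multiple of batch_p of them.

    Assumption: the shuffle happens before the truncation on every pass.
    """
    shuffled = list(unique_pids)
    rng.shuffle(shuffled)
    usable = (len(shuffled) // batch_p) * batch_p
    return shuffled[:usable]


rng = random.Random(0)
pids = ["p{}".format(i) for i in range(7)]  # 7 classes, as in the question

for epoch in range(3):
    kept = pids_for_epoch(pids, 4, rng)
    dropped = sorted(set(pids) - set(kept))
    print("epoch", epoch, "kept", kept, "dropped", dropped)
```

Under this assumption a different subset of 3 is typically skipped on each pass, so every class is eventually used. If the list were not reshuffled before truncation, the same 3 classes would be ignored on every pass.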