keras-yolo3

multi-GPU training fails

Open Borda opened this issue 6 years ago • 2 comments

Training crashes with a similar error even when training only the head...

INFO:root:Train on 14626 samples, val on 1625 samples, with batch size 16.
Epoch 1/150
2019-10-11 23:42:30.041545: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
913/914 [============================>.] - ETA: 1s - loss: 27.9974
2019-10-12 00:00:47.026746: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_5484: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-12 00:00:47.027158: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_1_5485: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-12 00:00:47.027194: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_2_5486: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
Traceback (most recent call last):
File "scripts/training.py", line 211, in <module>
_main(**arg_params)
File "scripts/training.py", line 182, in _main
callbacks=[tb_logging, checkpoint, reduce_lr, early_stopping])
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training_generator.py", line 234, in fit_generator
workers=0)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1472, in evaluate_generator
verbose=verbose)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training_generator.py", line 346, in evaluate_generator
outs = model.test_on_batch(x, y, sample_weight=sample_weight)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/engine/training.py", line 1256, in test_on_batch
outputs = self.test_function(ins)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/home/j.borovec/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/home/j.borovec/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
run_metadata_ptr)
File "/home/j.borovec/.local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: TensorArray replica_0/model_3/yolo_loss/TensorArray_5484: Could not read from TensorArray index 0. Furthermore, the element shape is not fully defined: [?,?,3]. It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written. If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
[[{{node replica_0/model_3/yolo_loss/TensorArrayStack/TensorArrayGatherV3}}]]
[[{{node replica_1/model_3/yolo_loss/ExpandDims_3}}]]

see https://github.com/qqwweee/keras-yolo3/issues/204, https://github.com/qqwweee/keras-yolo3/issues/497

Borda, Oct 21 '19 17:10

https://stackoverflow.com/questions/56813036/could-not-read-from-tensorarray-index-0-possible-you-are-working-with-resizeabl

Borda, Oct 21 '19 18:10

At the end of an epoch, training fails with:

2019-10-23 17:38:42.472726: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_1689: Could not read from TensorArray index 0.  Furthermore, the element shape is not fully defined: [?,?,3].  It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written.  If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-23 17:38:42.473311: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_1_1690: Could not read from TensorArray index 0.  Furthermore, the element shape is not fully defined: [?,?,3].  It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written.  If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.
2019-10-23 17:38:42.473343: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:661 : Invalid argument: TensorArray replica_0/model_3/yolo_loss/TensorArray_2_1691: Could not read from TensorArray index 0.  Furthermore, the element shape is not fully defined: [?,?,3].  It is possible you are working with a resizeable TensorArray and stop_gradients is not allowing the gradients to be written.  If you set the full element_shape property on the forward TensorArray, the proper all-zeros tensor will be returned instead of incurring this error.

This happens only with multi-GPU training.

Borda, Oct 23 '19 15:10
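
One possible explanation, not confirmed anywhere in this thread: the failing node is in `replica_0`, and the error says nothing was ever written to the TensorArray at index 0. That would be consistent with one replica receiving an empty slice of a batch. The pure-Python sketch below only illustrates the arithmetic of such a split. It assumes each replica gets `batch_size // n_replicas` samples and the last replica also takes the remainder; the helper names `split_batch` and `empty_replicas` and the replica count of 2 are hypothetical, chosen for illustration.

```python
def split_batch(batch_size, n_replicas):
    """Return how many samples each replica would receive.

    Assumed scheme (not taken from this issue): every replica gets
    batch_size // n_replicas samples, and the last one also takes
    whatever remainder is left.
    """
    step = batch_size // n_replicas
    sizes = [step] * (n_replicas - 1)
    sizes.append(batch_size - step * (n_replicas - 1))
    return sizes


def empty_replicas(batch_size, n_replicas):
    """Return the indices of replicas that would receive zero samples."""
    sizes = split_batch(batch_size, n_replicas)
    return [i for i, size in enumerate(sizes) if size == 0]


if __name__ == "__main__":
    # Hypothetical setup: 2 replicas, trying a full batch and a tiny one.
    for batch_size in (16, 9, 1):
        print(batch_size, split_batch(batch_size, 2), empty_replicas(batch_size, 2))
```

If this is the cause, any batch smaller than the number of GPUs would leave at least one replica with no samples, so checking the size of the last training and validation batches produced by the generator would be a cheap first test.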