FastMaskRCNN

python train/train.py error

Open · ranwek opened this issue 7 years ago · 11 comments

Restored 267(640) vars from ./data/pretrained_models/resnet_v1_50.ckpt
2017-10-19 21:09:57.871541: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function
2017-10-19 21:09:57.894972: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function
     [[Node: pyramid_1/AssignGTBoxes/Where_6 = Where_device="/job:localhost/replica:0/task:0/device:GPU:0"]]
2017-10-19 21:09:57.894972: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function
     [[Node: pyramid_1/AssignGTBoxes/Where_6 = Where_device="/job:localhost/replica:0/task:0/device:GPU:0"]]
2017-10-19 21:09:57.894972: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function
     [[Node: pyramid_1/AssignGTBoxes/Where_6 = Where_device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Traceback (most recent call last):
  File "train/train.py", line 339, in <module>
    train()
  File "train/train.py", line 271, in train
    [input_image] + [final_box] + [final_cls] + [final_prob] + [final_gt_cls] + [gt] + [tmp_0] + [tmp_1] + [tmp_2] + [tmp_3] + [tmp_4])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function
     [[Node: pyramid_1/AssignGTBoxes/Where_6 = Where_device="/job:localhost/replica:0/task:0/device:GPU:0"]]
     [[Node: pyramid_2/OneHotEncoding_4/one_hot/_1255 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9981_pyramid_2/OneHotEncoding_4/one_hot", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op u'pyramid_1/AssignGTBoxes/Where_6', defined at:
  File "train/train.py", line 339, in <module>
    train()
  File "train/train.py", line 193, in train
    loss_weights=[0.2, 0.2, 1.0, 0.2, 1.0])
  File "train/../libs/nets/pyramid_network.py", line 580, in build
    is_training=is_training, gt_boxes=gt_boxes)
  File "train/../libs/nets/pyramid_network.py", line 263, in build_heads
    assign_boxes(rois, [rois, batch_inds], [2, 3, 4, 5])
  File "train/../libs/layers/wrapper.py", line 172, in assign_boxes
    inds = tf.where(tf.equal(assigned_layers, l))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 2439, in where
    return gen_array_ops.where(input=condition, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 5930, in where
    "Where", input=input, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function
     [[Node: pyramid_1/AssignGTBoxes/Where_6 = Where_device="/job:localhost/replica:0/task:0/device:GPU:0"]]
     [[Node: pyramid_2/OneHotEncoding_4/one_hot/_1255 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9981_pyramid_2/OneHotEncoding_4/one_hot", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
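For context, the op that fails is the tf.where call in libs/layers/wrapper.py (assign_boxes), and "invalid device function" from a CUDA kernel launch usually means the installed TensorFlow binary was not built for this GPU's compute capability. A minimal workaround sketch, assuming TF 1.x (the tf.device wrapper is not in the original repo code), is to pin that op to the CPU:

```python
import tensorflow as tf

def assign_boxes_where(assigned_layers, layer_id):
    """Hypothetical variant of the failing line in libs/layers/wrapper.py.

    Pinning the Where op to the CPU avoids launching the
    cub::DeviceReduce::Sum kernel on a GPU this TF build does not support.
    """
    with tf.device('/cpu:0'):
        inds = tf.where(tf.equal(assigned_layers, layer_id))
    return inds
```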

ranwek avatar Oct 19 '17 13:10 ranwek

I can't figure out what is wrong with it; asking for help.

ranwek avatar Oct 19 '17 13:10 ranwek

I had the same issue. It seems that CUDA has some problems; the failing check in TensorFlow is:

if (first_success != cudaSuccess) {
  return errors::Internal(
      "WhereOp: Could not launch cub::DeviceReduce::Sum to calculate "
      "temp_storage_bytes, status: ",
      cudaGetErrorString(first_success));
}

To solve it, I downgraded the TensorFlow version from 1.3 to 1.1 (#88) and changed the cuDNN version to 5.1.
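A quick way to confirm the downgrade actually took effect (a sketch assuming a TF 1.x GPU build) is to print the version and the devices TensorFlow can see:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# Print the TensorFlow version that Python actually imports,
# then list the devices (CPU/GPU) this build can see.
print(tf.__version__)
for dev in device_lib.list_local_devices():
    print(dev.device_type, dev.name)
```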

ahmedtalbi avatar Oct 20 '17 11:10 ahmedtalbi

Have you solved the problem? I ran into it as well!

smileflank avatar Dec 14 '17 06:12 smileflank

I have the same problem... did anybody manage to fix it? I downgraded my TensorFlow to 1.1, but it didn't help.

eliethesaiyan avatar Jan 25 '18 07:01 eliethesaiyan

Did you change the cuDNN version?

ahmedtalbi avatar Jan 25 '18 08:01 ahmedtalbi

I am also getting the same error.

Caused by op u'pyramid_1/AssignGTBoxes/Where_7', defined at:
  File "train/train.py", line 339, in <module>
    train()
  File "train/train.py", line 193, in train
    loss_weights=[0.2, 0.2, 1.0, 0.2, 1.0])
  File "train/../libs/nets/pyramid_network.py", line 580, in build
    is_training=is_training, gt_boxes=gt_boxes)
  File "train/../libs/nets/pyramid_network.py", line 263, in build_heads
    assign_boxes(rois, [rois, batch_inds], [2, 3, 4, 5])
  File "train/../libs/layers/wrapper.py", line 172, in assign_boxes
    inds = tf.where(tf.equal(assigned_layers, l))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 2365, in where
    return gen_array_ops.where(input=condition, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 4053, in where
    result = _op_def_lib.apply_op("Where", input=input, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function
     [[Node: pyramid_1/AssignGTBoxes/Where_7 = Where_device="/job:localhost/replica:0/task:0/gpu:0"]]
     [[Node: pyramid_1/fully_connected_3/BiasAdd/_2901 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_27365_pyramid_1/fully_connected_3/BiasAdd", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

How do I solve this error?

satya2550 avatar Feb 07 '18 13:02 satya2550

@ranwek @ahmedtalbi @smileflank @eliethesaiyan @satya2550 I also ran into this problem. Did any of you fix it?

Superlee506 avatar May 13 '18 05:05 Superlee506

Is it necessary to downgrade the TensorFlow version to solve the problem? Is there any other way?

doublemanyu avatar Jul 10 '18 03:07 doublemanyu

To solve this issue, I trained Mask R-CNN with Python 3.6, CUDA 9.0, cuDNN 7.0, TensorFlow 1.8, and OpenCV 3.4.
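Before relaunching train/train.py in the rebuilt environment, a small sanity check (a sketch using the TF 1.x test API) can confirm the install is a CUDA build and the GPU is visible:

```python
import tensorflow as tf

# Sanity-check the rebuilt environment before retraining.
print('TF version:', tf.__version__)          # expecting 1.8.x per the comment above
print('Built with CUDA:', tf.test.is_built_with_cuda())
print('GPU available:', tf.test.is_gpu_available())
```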

ranwek avatar Jul 10 '18 06:07 ranwek

"mkdir /data/coco" and put all downloaded data into it,then "python download_and_convert_data.py",you should see this: ...

None Annotations data/coco/train2014/COCO_train2014_000000284128.jpg

Converting image 2151/82783 shard 0
Converting image 2201/82783 shard 0
Converting image 2251/82783 shard 0
None Annotations data/coco/train2014/COCO_train2014_000000167118.jpg
Converting image 2301/82783 shard 0
None Annotations data/coco/train2014/COCO_train2014_000000399262.jpg
None Annotations data/coco/train2014/COCO_train2014_000000239942.jpg
None Annotations data/coco/train2014/COCO_train2014_000000247177.jpg
Converting image 2351/82783 shard 0
Gray Image 434765
Converting image 2401/82783 shard 0
Converting image 2451/82783 shard 0
Converting image 2501/82783 shard 0
Converting image 2551/82783 shard 1
Converting image 2601/82783 shard 1
Converting image 2651/82783 shard 1
Converting image 2701/82783 shard 1
Converting image 2751/82783 shard 1
Converting image 2801/82783 shard 1
Converting image 2851/82783 shard 1
Converting image 2901/82783 shard 1
Converting image 2951/82783 shard 1
Converting image 3001/82783 shard 1
Converting image 3051/82783 shard 1
Converting image 3101/82783 shard 1
Converting image 3151/82783 shard 1
Converting image 3201/82783 shard 1
None Annotations data/coco/train2014/COCO_train2014_000000458540.jpg
Converting image 3251/82783 shard 1
...
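A small pre-flight check (a sketch; only the train2014 paths appear in the log above, so the annotations folder name is an assumption based on the usual COCO layout) can catch a misplaced download before running download_and_convert_data.py:

```python
import os

# Verify the expected COCO folders exist under data/coco before converting.
coco_root = 'data/coco'
for name in ('train2014', 'annotations'):  # 'annotations' is assumed, per the standard COCO layout
    path = os.path.join(coco_root, name)
    print(path, 'OK' if os.path.isdir(path) else 'MISSING')
```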

tankertyp avatar Aug 14 '18 08:08 tankertyp

Well, I had the same problem after I had trained for about 6000 steps; it did not occur at the beginning, while the graph and model weights were loading. To work around it, I reduced the batch size from 24 to 16. The training code is still running, though I don't know whether it will crash later.
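If the crash really is memory related, another knob worth trying alongside a smaller batch size (not mentioned in the thread, just a standard TF 1.x session option) is letting the session grow GPU memory on demand instead of pre-allocating it all:

```python
import tensorflow as tf

# Allocate GPU memory on demand rather than grabbing it all up front;
# this can reduce memory pressure when training dies after many steps.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```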

hongrui16 avatar May 20 '19 09:05 hongrui16