FastMaskRCNN
python train/train.py error
Restored 267(640) vars from ./data/pretrained_models/resnet_v1_50.ckpt
2017-10-19 21:09:57.871541: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function
2017-10-19 21:09:57.894972: W tensorflow/core/framework/op_kernel.cc:1192] Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function
[[Node: pyramid_1/AssignGTBoxes/Where_6 = Where_device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Traceback (most recent call last):
File "train/train.py", line 339, in
Caused by op u'pyramid_1/AssignGTBoxes/Where_6', defined at:
File "train/train.py", line 339, in
InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function [[Node: pyramid_1/AssignGTBoxes/Where_6 = Where_device="/job:localhost/replica:0/task:0/device:GPU:0"]] [[Node: pyramid_2/OneHotEncoding_4/one_hot/_1255 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9981_pyramid_2/OneHotEncoding_4/one_hot", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
I can't figure out what is wrong with it. Asking for help.
I had the same issue. It seems CUDA has some problems; the error comes from this check in TensorFlow's WhereOp:
if (first_success != cudaSuccess) { return errors::Internal( "WhereOp: Could not launch cub::DeviceReduce::Sum to calculate " "temp_storage_bytes, status: ", cudaGetErrorString(first_success)); }
To solve it, I downgraded TensorFlow from 1.3 to 1.1 (see #88) and changed the cuDNN version to 5.1.
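The status "invalid device function" usually points at CUDA kernels that were not built for the GPU's compute capability, which is why changing the TensorFlow/CUDA/cuDNN combination helps. A quick way to check whether the GPU WhereOp is broken in a given environment, independent of this repo, is a minimal tf.where run pinned to the GPU; a sketch assuming a TF 1.x build with GPU support:

```python
# Minimal check of the GPU WhereOp, independent of FastMaskRCNN.
# Assumes a TF 1.x install with GPU support (graph/session API).
import tensorflow as tf

with tf.device('/gpu:0'):
    mask = tf.greater(tf.random_uniform([1000]), 0.5)  # boolean tensor on the GPU
    true_idx = tf.where(mask)                          # same op that fails above

config = tf.ConfigProto(allow_soft_placement=False,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(true_idx).shape)
# If this raises the same "cub::DeviceReduce::Sum ... invalid device function"
# InternalError, the problem is the CUDA/cuDNN/TensorFlow combination,
# not the training code.
```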
Have you solved the problem? I am hitting it too!
I have the same problem... did anybody manage to fix it? I downgraded my TensorFlow to 1.1, but it didn't help.
Did you change the cuDNN version?
I am also getting the same error.
Caused by op u'pyramid_1/AssignGTBoxes/Where_7', defined at:
File "train/train.py", line 339, in
InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true indices. temp_storage_bytes: 1, status: invalid device function [[Node: pyramid_1/AssignGTBoxes/Where_7 = Where_device="/job:localhost/replica:0/task:0/gpu:0"]] [[Node: pyramid_1/fully_connected_3/BiasAdd/_2901 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_27365_pyramid_1/fully_connected_3/BiasAdd", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
How can this error be solved?
@ranwek @ahmedtalbi @smileflank @eliethesaiyan @satya2550 I am also hitting this problem, did you fix it?
Is it necessary to downgrade the TensorFlow version to solve the problem? Is there any other way?
To solve this issue, I trained Mask R-CNN with Python 3.6, CUDA 9.0, cuDNN 7.0, TensorFlow 1.8, and OpenCV 3.4.
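Before training, it can help to confirm the interpreter actually sees those versions; a small sanity-check sketch (it only prints whatever your machine reports, which may not match the list above):

```python
# Quick sanity check of the environment described above
# (Python 3.6, CUDA 9.0, cuDNN 7.0, TensorFlow 1.8, OpenCV 3.4).
import sys
import tensorflow as tf
import cv2

print('Python          :', sys.version.split()[0])
print('TensorFlow      :', tf.__version__)
print('Built with CUDA :', tf.test.is_built_with_cuda())
print('GPU available   :', tf.test.is_gpu_available())
print('OpenCV          :', cv2.__version__)
```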
"mkdir /data/coco" and put all downloaded data into it,then "python download_and_convert_data.py",you should see this: ...
None Annotations data/coco/train2014/COCO_train2014_000000284128.jpg
Converting image 2151/82783 shard 0
Converting image 2201/82783 shard 0
Converting image 2251/82783 shard 0
None Annotations data/coco/train2014/COCO_train2014_000000167118.jpg
Converting image 2301/82783 shard 0
None Annotations data/coco/train2014/COCO_train2014_000000399262.jpg
None Annotations data/coco/train2014/COCO_train2014_000000239942.jpg
None Annotations data/coco/train2014/COCO_train2014_000000247177.jpg
Converting image 2351/82783 shard 0
Gray Image 434765
Converting image 2401/82783 shard 0
Converting image 2451/82783 shard 0
Converting image 2501/82783 shard 0
Converting image 2551/82783 shard 1
Converting image 2601/82783 shard 1
Converting image 2651/82783 shard 1
Converting image 2701/82783 shard 1
Converting image 2751/82783 shard 1
Converting image 2801/82783 shard 1
Converting image 2851/82783 shard 1
Converting image 2901/82783 shard 1
Converting image 2951/82783 shard 1
Converting image 3001/82783 shard 1
Converting image 3051/82783 shard 1
Converting image 3101/82783 shard 1
Converting image 3151/82783 shard 1
Converting image 3201/82783 shard 1
None Annotations data/coco/train2014/COCO_train2014_000000458540.jpg
Converting image 3251/82783 shard 1
...
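The "None Annotations" lines appear to be COCO train2014 images that simply have no instance annotations, so they are skipped. A sketch of how to count them yourself with pycocotools, assuming the standard annotation file sits at data/coco/annotations/instances_train2014.json:

```python
# Count COCO train2014 images without instance annotations,
# which seems to be what the "None Annotations" lines refer to.
# Assumes pycocotools is installed and the annotation file path below exists.
from pycocotools.coco import COCO

coco = COCO('data/coco/annotations/instances_train2014.json')
img_ids = coco.getImgIds()
empty = [i for i in img_ids if not coco.getAnnIds(imgIds=[i])]

print('total images           :', len(img_ids))
print('images w/o annotations :', len(empty))
```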
Well, I had the same problem after training for about 6000 steps; it did not occur while the graph and model weights were being loaded. To solve this, I reduced the batch size from 24 to 16. The training code is still running, though I do not know whether it will crash later.
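Where exactly the batch size lives depends on the checkout, so the snippet below is only a generic TF 1.x illustration with a hypothetical batch_size flag, not this repo's actual config; adapt it to wherever your version defines the value:

```python
# Generic illustration of exposing a smaller batch size via a TF 1.x flag.
# The flag name "batch_size" and the input-pipeline helper are hypothetical;
# FastMaskRCNN keeps this value in its own config, so adjust accordingly.
import tensorflow as tf

tf.app.flags.DEFINE_integer('batch_size', 16,  # reduced from 24, as in the comment above
                            'Number of samples per training batch')
FLAGS = tf.app.flags.FLAGS

def batched_inputs(tensors):
    """Batch an input pipeline with the (smaller) configured batch size."""
    return tf.train.batch(tensors,
                          batch_size=FLAGS.batch_size,
                          capacity=4 * FLAGS.batch_size)
```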