deep_gcns
The environment errors out when running the script `sh +x train_job.sh`
**** EPOCH 001 ****
Current batch/total batch num: 0/1243
2019-09-18 09:37:40.102211: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-09-18 09:37:40.303894: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-09-18 09:37:42.000486: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-09-18 09:37:42.117158: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-09-18 09:37:44.299611: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemmBatched: CUBLAS_STATUS_EXECUTION_FAILED
2019-09-18 09:37:44.299648: E tensorflow/stream_executor/cuda/cuda_blas.cc:2574] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "/home/data/anaconda3/envs/lmt36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/data/anaconda3/envs/lmt36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/data/anaconda3/envs/lmt36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[8,4096,3], b.shape=[8,3,4096], m=4096, n=4096, k=3, batch_size=8
  [[{{node tower_0/MatMul}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/Squeeze, tower_0/transpose)]]
  [[{{node tower_1/adj_conv_27/bn/cond/add/_4375}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_25108_tower_1/adj_conv_27/bn/cond/add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "train.py", line 327, in <module>
Caused by op 'tower_0/MatMul', defined at:
  File "train.py", line 327, in <module>
InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[8,4096,3], b.shape=[8,3,4096], m=4096, n=4096, k=3, batch_size=8
  [[node tower_0/MatMul (defined at /disk/tia/tia/deep_gcns/utils/tf_util.py:655) = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/Squeeze, tower_0/transpose)]]
  [[{{node tower_1/adj_conv_27/bn/cond/add/_4375}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_25108_tower_1/adj_conv_27/bn/cond/add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
My environment is CUDA 9.0, cuDNN 7.4, and tensorflow-gpu 1.12.0 with two RTX 2080 Ti GPUs. Can you help me with this problem?
Can you try the solution from this post? https://stackoverflow.com/questions/37337728/tensorflow-internalerror-blas-sgemm-launch-failed?noredirect=1&lq=1 Try adding one of the following snippets to your code:
# Close any stale interactive session before creating a new one
if 'session' in locals() and session is not None:
    print('Close interactive session')
    session.close()
or
# Cap per-process GPU memory so cuBLAS has room to initialize;
# note that gpu_options must actually be passed into the ConfigProto
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3)
sess = tf.Session(config=tf.ConfigProto(
    gpu_options=gpu_options,
    allow_soft_placement=True,
    log_device_placement=True))
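If the fixed-fraction cap still fails, a common variant of the same idea (a sketch against the TF 1.x API that matches tensorflow-gpu 1.12.0; this option is not mentioned in the linked post) is to let the session grow GPU memory on demand rather than reserving it all at start-up, which is frequently reported to avoid CUBLAS_STATUS_EXECUTION_FAILED on RTX 20xx cards:

```python
import tensorflow as tf  # TF 1.x API (tensorflow-gpu 1.12.0)

config = tf.ConfigProto(allow_soft_placement=True)
# Allocate GPU memory incrementally as needed instead of
# claiming the whole device up front at session creation.
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```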
Thank you for your reply!
I have tried both of the solutions above, but neither of them works, unfortunately.
The error is still here:
InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[8,4096,3], b.shape=[8,3,4096], m=4096, n=4096, k=3, batch_size=8
  [[node tower_1/MatMul (defined at /disk/tia/tia/deep_gcns/utils/tf_util.py:655) = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:1"](tower_1/Squeeze, tower_1/transpose)]]
  [[{{node tower_0/cond_4/strided_slice_1/_1129}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3329_tower_0/cond_4/strided_slice_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]