MVSNet

An unexplained error during training

Open xiaohythu opened this issue 6 years ago • 4 comments

Environment: Ubuntu 16, GTX 2080, Python 2.7, cudatoolkit 9.0, cuDNN 7.1.2

The error is below. Where might the problem be?

2019-05-11 00:20:11.500263: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-11 00:20:11.873102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:88:00.0
totalMemory: 10.73GiB freeMemory: 10.57GiB
2019-05-11 00:20:11.873190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2019-05-11 00:20:12.375421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-11 00:20:12.375482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2019-05-11 00:20:12.375490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2019-05-11 00:20:12.376248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10213 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:88:00.0, compute capability: 7.5)
Forward pass: d_min = 425.000000, d_max = 931.150000.
2019-05-11 00:21:27.889785: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x562e4471b430
Forward pass: d_min = 425.000000, d_max = 931.150000.
2019-05-11 00:21:28.671645: E tensorflow/stream_executor/cuda/cuda_blas.cc:654] failed to run cuBLAS routine cublasSgemv_v2: CUBLAS_STATUS_EXECUTION_FAILED
2019-05-11 00:21:29.142631: I tensorflow/stream_executor/stream.cc:4737] stream 0x562e44930550 did not memcpy host-to-device; source: 0x7fb5ae822b00
2019-05-11 00:21:29.142743: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at matrix_inverse_op.cc:191 : Internal: MatInvBatched: failed to copy pointers to device
Traceback (most recent call last):
  File "train.py", line 352, in <module>
    tf.app.run()
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "train.py", line 347, in main
    train(sample_list)
  File "train.py", line 313, in train
    [summary_op, train_opt, loss, less_one_accuracy, less_three_accuracy])
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMV launch failed : a.shape=[1,3,3], b.shape=[1,3,1], m=3, n=1, k=3
  [[Node: Model_tower0/get_homographies/MatMul_1 = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Model_tower0/get_homographies/transpose_1, Model_tower0/get_homographies/Squeeze_5)]]
  [[Node: Model_tower0/gradients/AddN_515/_2989 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_94560_Model_tower0/gradients/AddN_515", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'Model_tower0/get_homographies/MatMul_1', defined at:
  File "train.py", line 352, in <module>
    tf.app.run()
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "train.py", line 347, in main
    train(sample_list)
  File "train.py", line 220, in train
    images, cams, FLAGS.max_d, depth_start, depth_interval, is_master_gpu)
  File "/home/xhy/depth/MVS/mvsnet/model.py", line 98, in inference
    depth_start=depth_start, depth_interval=depth_interval)
  File "/home/xhy/depth/MVS/mvsnet/homography_warping.py", line 32, in get_homographies
    c_right = -tf.matmul(R_right_trans, tf.squeeze(t_right, axis=1)) # (B, D, 3, 1)
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 2084, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1236, in batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/xhy/anaconda3/envs/py97/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): Blas xGEMV launch failed : a.shape=[1,3,3], b.shape=[1,3,1], m=3, n=1, k=3
  [[Node: Model_tower0/get_homographies/MatMul_1 = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Model_tower0/get_homographies/transpose_1, Model_tower0/get_homographies/Squeeze_5)]]
  [[Node: Model_tower0/gradients/AddN_515/_2989 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_94560_Model_tower0/gradients/AddN_515", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
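For readers trying to place the failing op: the traceback points at the line in `homography_warping.py` that computes `c_right = -tf.matmul(R_right_trans, tf.squeeze(t_right, axis=1))`, a batched product of a 3x3 matrix with a 3x1 vector (`a.shape=[1,3,3], b.shape=[1,3,1]`). Assuming `R_right_trans` is the transposed rotation and `t_right` the translation, this looks like the usual camera-center formula c = -Rᵀt. The sketch below reproduces that arithmetic in plain Python without TensorFlow; the helper names and the example values are hypothetical and not taken from the MVSNet code.

```python
def transpose3(m):
    """Return the transpose of a 3x3 matrix given as nested lists."""
    return [[m[r][c] for r in range(3)] for c in range(3)]


def matvec3(m, v):
    """Multiply a 3x3 matrix by a length-3 vector."""
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]


def camera_center(rotation, translation):
    """Compute c = -R^T t (assumed meaning of the failing matmul)."""
    r_t = transpose3(rotation)
    return [-x for x in matvec3(r_t, translation)]


# Made-up example: a 90-degree rotation about the z axis.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]
c = camera_center(R, t)

# Sanity check: the camera center maps to the origin, R c + t = 0.
residual = [a + b for a, b in zip(matvec3(R, c), t)]
```

The computation itself is tiny, which may suggest (without proving) that the failure comes from the GPU or cuBLAS setup rather than from the input data.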

xiaohythu • May 11 '19 01:05
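One thing that may be worth checking, although nobody in this thread confirms it as the cause: the log reports a GeForce RTX 2080 Ti with compute capability 7.5, while the environment lists cudatoolkit 9.0. Turing GPUs (compute capability 7.5) are supported from CUDA 10.0 onward, so a CUDA 9.0 stack could plausibly produce cuBLAS failures like the one above. The sketch below encodes that check; the table covers only the architectures mentioned in this thread, and the function names are hypothetical.

```python
# Minimum CUDA toolkit version for a given compute capability.
# Deliberately partial: only architectures relevant to this thread.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 1): (8, 0),   # Pascal, e.g. GTX 1060 / GTX 1080
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing, e.g. RTX 2080 Ti
}


def parse_version(text):
    """Turn a string such as '7.5' into a comparable tuple (7, 5)."""
    return tuple(int(part) for part in text.split("."))


def toolkit_supports(capability, cuda_version):
    """Return True/False if the capability is in the table, else None."""
    required = MIN_CUDA_FOR_CAPABILITY.get(parse_version(capability))
    if required is None:
        return None  # unknown architecture: the sketch makes no claim
    return parse_version(cuda_version) >= required


# Values taken from the log and the environment line of this issue.
reported_ok = toolkit_supports("7.5", "9.0")
```

If the check comes back false for a setup, moving to a TensorFlow build that targets CUDA 10.0 or newer would be the first thing to try.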

Are you training on the DTU dataset? Did you change the image size?

YoYo000 • May 24 '19 03:05

Can I run this code on my laptop? It has Ubuntu 18.04, a GTX 1060 (3 GB), an i7-6700HQ, and 16 GB of RAM, 64-bit.

x1597275 • Mar 26 '20 05:03

@x1597275 It should work. With 11 GB of GPU memory or more there should be no problem; our 1080 card (11 GB of GPU memory) can run it.

zjd1988 • May 14 '20 08:05

@x1597275 He has a 1060 with only 3 GB of GPU memory, so it won't work.

kwea123 • May 14 '20 08:05
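The memory question above can be checked before starting a run. The sketch below parses the output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which prints one MiB value per GPU, and compares it with a threshold. The threshold is an assumption: the thread only says that 11 GB cards work and a 3 GB card does not, and the log above shows a nominally 11 GB card reporting 10.73GiB, so the default here is set to 10 GiB. The real requirement likely depends on image size, number of views, and the depth sample count (`FLAGS.max_d` in the traceback). Function names are hypothetical.

```python
import subprocess


def parse_memory_mib(nvidia_smi_output):
    """Parse one integer MiB value per non-empty line."""
    return [int(line.strip())
            for line in nvidia_smi_output.splitlines()
            if line.strip()]


def query_total_memory_mib():
    """Ask nvidia-smi for the total memory of every visible GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"])
    return parse_memory_mib(out.decode("utf-8"))


def enough_memory(total_mib, required_mib=10 * 1024):
    """Compare against an assumed threshold (10 GiB by default)."""
    return total_mib >= required_mib


# Example with made-up output for a 3 GB card and a nominally 11 GB card.
sample = "3072\n10989\n"
verdicts = [enough_memory(mib) for mib in parse_memory_mib(sample)]
```

On a machine with an NVIDIA driver installed, `query_total_memory_mib()` returns the real values in place of the sample string.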