deepmind-research
[MeshGraphNets] cuda_blas.cc:428, failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Hi. I'm trying to run the MeshGraphNets model but encountered an error with this command:
python -m meshgraphnets.run_model --mode=train --model=cloth --checkpoint_dir=meshgraphnets/dataset/chk --dataset_dir=meshgraphnets/dataset/flag_simple
The error is:
2023-10-09 14:41:01.689006: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2023-10-09 14:41:01.689021: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2023-10-09 14:41:01.689035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2023-10-09 14:41:01.689048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2023-10-09 14:41:01.689061: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2023-10-09 14:41:01.689074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2023-10-09 14:41:01.689088: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
...
2023-10-09 14:41:14.669500: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2023-10-09 14:41:14.669603: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
[[{{node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul}}]]
[[Model/loss/Mean/_6711]]
(1) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
[[{{node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 130, in <module>
app.run(main)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 125, in main
learner(model, params)
File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 82, in learner
_, step, loss = sess.run([train_op, global_step, loss_op])
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/home/hyojeong/.local/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
[[node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul (defined at /miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[Model/loss/Mean/_6711]]
(1) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
[[node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul (defined at /miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
On Stack Overflow I found suggestions to upgrade TensorFlow to version 2.x, but meshgraphnets/requirements.txt specifies tensorflow-gpu>=1.15,<2.
Has anyone faced this issue? Should I upgrade TensorFlow? I did try once, but it caused another problem.
Please let me know if you need the full error details or package versions.
Hi, I ran into the same problem when running Learning to Simulate.
Solution:
- Check that the GPU driver, CUDA, cuDNN, and TensorFlow versions are compatible with each other.
- Enable memory growth on the physical GPU device in TensorFlow.
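For the memory-growth suggestion, here is a minimal sketch, assuming tensorflow-gpu 1.15 as pinned by requirements.txt. The helper name `growth_session_config` is made up for illustration, and the commented-out session call is only a guess at how it might be wired into run_model.py. The `TF_FORCE_GPU_ALLOW_GROWTH` environment variable is, as far as I know, honored by recent TF 1.x releases, but treat that as an assumption for your exact build.

```python
import os

# Set this before TensorFlow touches the GPU, i.e. before the first import.
# Assumption: the installed TF 1.x build honors this variable.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def growth_session_config():
    """Return a TF 1.x session config that allocates GPU memory on demand.

    Hypothetical helper; TensorFlow is imported lazily so that this file
    can be loaded even where TensorFlow is not installed.
    """
    import tensorflow as tf  # assumes tensorflow-gpu>=1.15,<2

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True
    return config


# Hypothetical use where run_model.py creates its session:
#   with tf.train.MonitoredTrainingSession(
#       checkpoint_dir=FLAGS.checkpoint_dir,
#       config=growth_session_config()) as sess:
#     ...
```

Memory growth only helps when the failure comes from cuBLAS being unable to allocate its workspace. It will not help if the GPU is simply too new for the CUDA version that TensorFlow was built against.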
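I ran into a similar problem. The explanation I found online is that tensorflow-gpu==1.15 is built for CUDA 10.0, but CUDA 10 only runs on RTX 20-series GPUs and older. I have a 40-series card, so I can only train on the CPU.
Of course, the cause could also be something else.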
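The version argument above can be written down as a small lookup. This is a rough sketch, not an official compatibility matrix. The table lists the first CUDA toolkit release that natively targets each compute capability, to the best of my knowledge, and the names `MIN_CUDA_FOR_CC` and `natively_supported` are invented for this example.

```python
# First CUDA toolkit release with native support for each compute capability.
# Assumption: values taken from NVIDIA release notes as I remember them.
MIN_CUDA_FOR_CC = {
    (6, 1): (8, 0),   # Pascal, e.g. GTX 10 series
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing, e.g. RTX 20 series
    (8, 6): (11, 1),  # Ampere, e.g. RTX 30 series
    (8, 9): (11, 8),  # Ada Lovelace, e.g. RTX 40 series
}


def natively_supported(compute_capability, cuda_version):
    """True if the CUDA toolkit version ships kernels for this GPU generation."""
    return cuda_version >= MIN_CUDA_FOR_CC[compute_capability]


# The log in this issue loads libcudart.so.10.0, i.e. CUDA 10.0.
TF_1_15_CUDA = (10, 0)

for cc in sorted(MIN_CUDA_FOR_CC):
    status = "ok" if natively_supported(cc, TF_1_15_CUDA) else "too new"
    print("compute capability %d.%d: %s" % (cc[0], cc[1], status))
```

If the card turns out to be too new, launching with `CUDA_VISIBLE_DEVICES=-1` hides the GPU, so TensorFlow falls back to the CPU instead of failing inside cuBLAS.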
Hi, I hit the same problem on a 40-series card. Is there any way to fix it? Can this be upgraded to TF 2.0? I'd really appreciate a reply.
I rented a 2080 Ti online, and it runs fine on that card.