deepmind-research [MeshGraphNets] cuda_blas.cc:428, failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

Hi. I'm trying to run the MeshGraphNets model but encountered an error with this command:

python -m meshgraphnets.run_model --mode=train --model=cloth --checkpoint_dir=meshgraphnets/dataset/chk --dataset_dir=meshgraphnets/dataset/flag_simple

The error is:

2023-10-09 14:41:01.689006: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2023-10-09 14:41:01.689021: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2023-10-09 14:41:01.689035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2023-10-09 14:41:01.689048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2023-10-09 14:41:01.689061: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2023-10-09 14:41:01.689074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2023-10-09 14:41:01.689088: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

...

2023-10-09 14:41:14.669500: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2023-10-09 14:41:14.669603: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

Traceback (most recent call last):
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[{{node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul}}]]
         [[Model/loss/Mean/_6711]]
  (1) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[{{node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 130, in <module>
    app.run(main)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 125, in main
    learner(model, params)
  File "/home/hyojeong/dev/download/deepmind-research/meshgraphnets/run_model.py", line 82, in learner
    _, step, loss = sess.run([train_op, global_step, loss_op])
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/hyojeong/.local/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/hyojeong/miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul (defined at /miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[Model/loss/Mean/_6711]]
  (1) Internal: Blas GEMM launch failed : a.shape=(9212, 7), b.shape=(7, 128), m=9212, n=128, k=7
         [[node Model/loss/EncodeProcessDecode/encoder/sequential_1/mlp_1/linear_0/MatMul (defined at /miniconda3/envs/MeshGraphNets/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

I found suggestions on Stack Overflow to upgrade TensorFlow to version 2.x, but meshgraphnets/requirements.txt specifies tensorflow-gpu>=1.15,<2.

Has anyone faced this issue? Should I upgrade TensorFlow? I did try once, but it caused another problem.

Please let me know if you need the full error details or package versions.

Oct 09 '23 05:10 hjyu94

Hi, I ran into this problem too when running Learning to Simulate.

*Solution

Check that the GPU driver, CUDA, cuDNN, and TensorFlow versions are compatible with each other.
Enable memory growth on the physical GPU device in TensorFlow.
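
The memory-growth suggestion can be sketched roughly as follows for TensorFlow 1.15. This is only an illustration, not the repository's code: the helper names `enable_gpu_memory_growth` and `make_session_config` are made up for this example. `TF_FORCE_GPU_ALLOW_GROWTH` and `gpu_options.allow_growth` are existing TensorFlow settings; whether they resolve this particular cuBLAS failure depends on the cause.

```python
import os


def enable_gpu_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand.

    Hypothetical helper. It must run before TensorFlow initializes the GPU,
    so call it before the first session is created.
    """
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def make_session_config():
    """Build a TF 1.x session config with allow_growth enabled.

    Hypothetical helper. TensorFlow is imported lazily so the rest of this
    snippet also runs on a machine without TensorFlow installed.
    """
    import tensorflow.compat.v1 as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    return config


enable_gpu_memory_growth()
print("TF_FORCE_GPU_ALLOW_GROWTH =", os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```

The traceback shows the training loop running inside a monitored session, so one option, assuming the session is created with `tf.train.MonitoredTrainingSession`, would be to pass the result of `make_session_config()` as its `config` argument. If the failure comes from a GPU that CUDA 10.0 does not support, memory growth will not help.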

Dec 06 '23 01:12 BoyuanTang331

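I ran into a similar problem. The explanation I found online is that tensorflow-gpu==1.15 corresponds to CUDA 10.0, but CUDA 10 only runs on RTX 20-series cards and older, and I have a 40-series card. I can only train on the CPU.

Of course, it could also be some other problem.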
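
The CUDA 10 explanation can be checked with a small sketch like the one below. It assumes that CUDA 10.x targets GPUs up to compute capability 7.5 (Turing) and uses a few example cards with their published compute capabilities; the names `CUDA_10_MAX_COMPUTE_CAPABILITY`, `EXAMPLE_GPUS`, and `supported_by_cuda_10` are made up for illustration, and the table is not exhaustive.

```python
# Highest compute capability that CUDA 10.x can target (Turing generation).
CUDA_10_MAX_COMPUTE_CAPABILITY = (7, 5)

# Illustrative entries only; look up your own card in NVIDIA's
# compute capability table.
EXAMPLE_GPUS = {
    "RTX 2080 Ti": (7, 5),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
}


def supported_by_cuda_10(compute_capability):
    """Return True if a (major, minor) compute capability is within CUDA 10's range."""
    return compute_capability <= CUDA_10_MAX_COMPUTE_CAPABILITY


for name, cc in EXAMPLE_GPUS.items():
    status = "ok" if supported_by_cuda_10(cc) else "too new for CUDA 10"
    print(f"{name} (compute capability {cc[0]}.{cc[1]}): {status}")
```

Under these assumptions, a 40-series card falls outside what the CUDA 10.0 libraries loaded in the log above can target, while a 2080 Ti falls inside it.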

Jan 29 '24 06:01 Xiaozl11

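Hello, I hit the same problem on a 40-series card. Do you have any ideas for solving it? Can the code be upgraded to TF 2.0? I'd be very grateful for a reply.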

Mar 12 '24 16:03 kikispy

I rented a 2080 Ti online, and it runs on that.

Mar 12 '24 16:03 Xiaozl11
