learn2branch

tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed

Open liebenxj opened this issue 2 years ago • 5 comments

Hello, thanks for your great work. I am trying to reproduce it and am stuck on a TensorFlow problem. I have successfully generated the training dataset for setcover, but I get the following error when I run python 03_train_gcnn.py setcover -m baseline:

[2023-02-04 16:21:35.606917] EPOCH 0...
2023-02-04 16:21:51.414814: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "03_train_gcnn.py", line 253, in <module>
    n = pretrain(model=model, dataloader=pretrain_data)
  File "03_train_gcnn.py", line 49, in pretrain
    if not model.pre_train(batched_states, tf.convert_to_tensor(True)):
  File "/scratch/anji/rush_exps/learn2branch/models/baseline/model.py", line 260, in pre_train
    self.call(*args, **kwargs)
  File "/scratch/anji/rush_exps/learn2branch/models/baseline/model.py", line 418, in call
    constraint_features = self.cons_embedding(constraint_features)
  File "/home/anji/softwares/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 757, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/home/anji/softwares/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/keras/engine/sequential.py", line 232, in call
    inputs, training=training, mask=mask)
  File "/home/anji/softwares/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/keras/engine/sequential.py", line 250, in _call_and_compute_mask
    x = layer.call(x, **kwargs)
  File "/home/anji/softwares/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/keras/layers/core.py", line 970, in call
    outputs = gen_math_ops.mat_mul(inputs, self.kernel)
  File "/home/anji/softwares/miniconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4586, in mat_mul
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(66090, 5), b.shape=(5, 64), m=66090, n=64, k=5 [Op:MatMul]

I wonder whether something is wrong with my environment, although I followed install.md and use Python 3.6 with tensorflow-gpu 1.12.0. I would appreciate any help in solving this problem.
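As a rough sanity check on the reported MatMul, the sketch below estimates how much memory its operands and output occupy. It parses m, n and k from the error line quoted above and assumes 4-byte float32 values, since the failing routine is cublasSgemm (single precision). The function name gemm_footprint_bytes is made up for this illustration.

```python
import re

# Error line copied from the traceback above.
ERROR = ("Blas GEMM launch failed : a.shape=(66090, 5), b.shape=(5, 64), "
         "m=66090, n=64, k=5 [Op:MatMul]")

BYTES_PER_FLOAT32 = 4  # assumption: float32, because the routine is cublasSgemm


def gemm_footprint_bytes(message):
    """Return the bytes needed for a, b and the output of the reported MatMul."""
    dims = {key: int(val) for key, val in re.findall(r"\b([mnk])=(\d+)", message)}
    m, n, k = dims["m"], dims["n"], dims["k"]
    elements = m * k + k * n + m * n  # a is m x k, b is k x n, output is m x n
    return elements * BYTES_PER_FLOAT32


total = gemm_footprint_bytes(ERROR)
print(f"{total} bytes (~{total / 2**20:.1f} MiB)")
```

The three buffers together come to well under 20 MiB, so the size of this particular multiplication is unlikely to exhaust GPU memory on its own. This does not identify the actual cause; it only narrows where to look.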

liebenxj, Feb 04 '23 08:02

Hi, I ran into the same issue. Have you solved it? If you have a solution, please let me know. Thanks!

Root970103, Mar 27 '23 06:03

@liebenxj Hello! I heard that you successfully obtained training samples by running 02_generate_dataset.py. Could you please share a setcover training sample with me? I could not get the script to run, but I would like to inspect the data in the samples. Thank you!

TengfeiHou-rgb, Apr 23 '24 09:04

@liebenxj Hello! I heard that you successfully obtained training samples by running 02_generate_dataset.py. Could you please help me solve these two problems?

  1. File "02_generate_dataset.py", line 37, in branchexeclp
       utilities.extract_khalil_variable_features(self.model, [], self.khalil_root_buffer)
     File "/home/fengruixiang/2024_04/learn2branch/utilities.py", line 305, in extract_khalil_variable_features
       scip_state = model.getKhalilState(root_buffer, candidates)
     AttributeError: 'pyscipopt.scip.Model' object has no attribute 'getKhalilState'
  2. File "02_generate_dataset.py", line 134, in make_samples
       m.setBoolParam('branching/vanillafullstrong/integralcands', True)
     File "src/pyscipopt/scip.pyx", line 2783, in pyscipopt.scip.Model.setBoolParam
     File "src/pyscipopt/scip.pyx", line 216, in pyscipopt.scip.PY_SCIP_CALL
     KeyError: 'SCIP: the parameter with the given name was not found!'
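Both tracebacks show the scripts asking the installed PySCIPOpt/SCIP for something it does not provide: a method in the first case, a parameter in the second. A quick way to confirm which methods are missing is sketched below. The helper missing_attributes and the StandInModel class are hypothetical names for this illustration, and the stand-in replaces a real model so the snippet runs without SCIP installed. Whether a different PySCIPOpt build would supply the method is an assumption this thread does not confirm.

```python
def missing_attributes(obj, names):
    """Return the names from `names` that `obj` does not expose."""
    return [name for name in names if not hasattr(obj, name)]


class StandInModel:
    """Hypothetical stand-in for pyscipopt.Model so the sketch runs without SCIP."""

    def setBoolParam(self, name, value):
        pass


# Method names taken from the two tracebacks above.
REQUIRED = ["getKhalilState", "setBoolParam"]

print(missing_attributes(StandInModel(), REQUIRED))
```

With PySCIPOpt installed, passing pyscipopt.Model() instead of StandInModel() would list the methods that build lacks. The missing parameter in the second traceback would need a separate check, for example wrapping the setBoolParam call in a try/except for KeyError.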

Frx12138, May 10 '24 08:05