ERROR: failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

Open smoreira00 opened this issue 2 years ago • 10 comments

After setting up the conda environment, activating it, and downloading the data:

    conda env create -f environment.yml
    conda activate nerf
    bash download_example_data.sh

I try to run python run_nerf.py --config config_lego.txt and I get the following error before the model starts training:

    tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
    Traceback: ...
    tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(65536, 63), b.shape=(63, 256), m=65536, n=256, k=63 [Op:MatMul]

I already googled it, but none of the solutions available worked in my case. Could someone help me, please?
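For a sense of scale, the shapes in the error message give a rough size for the failing MatMul. The sketch below assumes float32 operands (4 bytes per element) and counts only the two inputs and the output of this single call, not the rest of the network or any gradients. `gemm_bytes` is a helper name made up for illustration.

```python
def gemm_bytes(m, k, n, bytes_per_elem=4):
    """Bytes held by A (m x k), B (k x n) and the result (m x n).

    Assumes float32 (4 bytes per element); counts this one matmul only.
    """
    return (m * k + k * n + m * n) * bytes_per_elem


# Shapes copied from the error message: a=(65536, 63), b=(63, 256).
total = gemm_bytes(m=65536, k=63, n=256)
print(f"{total / 2**20:.1f} MiB")  # prints "79.8 MiB"
```

Under those assumptions this one call needs roughly 80 MiB, which is small next to the memory of a typical desktop GPU, so the size of this single operation is probably not the whole story.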

smoreira00, Jan 23 '23 16:01

I believe issue #127 is related to this.

ethanmeade, Jan 27 '23 11:01

I ran into the same problem, and I cannot access #127. Could someone help me?

WhenMelancholy, Apr 17 '23 00:04

> I ran into the same problem, and I cannot access #127. Could someone help me?

Hi guys, have you resolved the problem? I ran into the same one.

dongluyang, Apr 21 '23 11:04

Just ran into this one as well. I am on Win11 with an RTX 3070 GPU with 8 GB of VRAM. Maybe it's related to needing more VRAM to run this?

Update: by playing with the chunk and netchunk parameters I seem to get it to run, at least for a bit. The default values of 32K and 64K seem far too much for my home GPU to take at once. It now runs for longer before eventually hitting this CUBLAS error, and with smaller chunk values I can clearly see the VRAM filling up over time. I have lowered them to chunk=4 and netchunk=8 as a test.

Reading through the parameters, I will also try lowering the batch size, or even using the no_batching option, to see whether it then runs fully.
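The effect of chunk and netchunk can be pictured as splitting one large batch into slices that are evaluated one at a time, so that only one slice's intermediate results are live at once. Below is a minimal pure-Python sketch of that idea, not the repository's actual code; `apply_in_chunks` and `toy_network` are hypothetical names.

```python
def apply_in_chunks(fn, rows, chunk):
    """Apply fn to rows in slices of at most `chunk` items and join the results."""
    if chunk <= 0:
        raise ValueError("chunk must be a positive integer")
    out = []
    for start in range(0, len(rows), chunk):
        # Only this slice is handed to fn, which bounds the per-call input size.
        out.extend(fn(rows[start:start + chunk]))
    return out


def toy_network(batch):
    """Stand-in for a network call: doubles every input value."""
    return [2 * x for x in batch]


rays = list(range(10))
# Smaller chunks mean more calls on smaller inputs; the result is unchanged.
assert apply_in_chunks(toy_network, rays, chunk=4) == toy_network(rays)
```

A smaller chunk size trades speed for a lower peak footprint per call. It does not change the output, which is why it is a reasonable first knob to turn when memory is the suspect.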

ragotiteb, Apr 23 '23 18:04

Please migrate tensorflow-gpu to 2.8; that will resolve it.

dongluyang, Apr 26 '23 14:04

> Hi guys, have you resolved the problem? I ran into the same one.

I encountered this issue because my CUDA version was higher than what TensorFlow 1.15 supported. After using a version of TensorFlow modified by NVIDIA to support the higher CUDA version, the issue was resolved.
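This kind of mismatch can be checked on paper before installing anything. The sketch below compares the CUDA version a TensorFlow build targets against the minimum CUDA release a GPU architecture needs. The table values are assumptions based on NVIDIA's and TensorFlow's published compatibility notes (compute capability 8.6, used by RTX 30-series cards, needing CUDA 11.1 or newer; stock TensorFlow 1.15 targeting CUDA 10.0) and should be verified against the current documentation. The function and table names are hypothetical.

```python
# Assumed minimum CUDA toolkit per GPU compute capability (verify against NVIDIA docs).
MIN_CUDA_FOR_COMPUTE_CAPABILITY = {
    (7, 5): (10, 0),  # Turing, e.g. RTX 20-series
    (8, 0): (11, 0),  # Ampere, e.g. A100
    (8, 6): (11, 1),  # Ampere, e.g. RTX 3070 / RTX 3090
}

# Assumed CUDA version targeted by the stock TensorFlow builds discussed here.
TF_BUILD_CUDA = {
    "1.15": (10, 0),
    "2.8": (11, 2),
}


def tf_supports_gpu(tf_version, compute_capability):
    """True if the build's CUDA version is new enough for the GPU architecture."""
    build_cuda = TF_BUILD_CUDA[tf_version]
    needed = MIN_CUDA_FOR_COMPUTE_CAPABILITY[compute_capability]
    return build_cuda >= needed  # tuples compare as (major, minor)


print(tf_supports_gpu("1.15", (8, 6)))  # False
print(tf_supports_gpu("2.8", (8, 6)))   # True
```

If those table values hold, a stock TensorFlow 1.15 build cannot target an RTX 30-series card, which fits the reports in this thread that either an NVIDIA-modified build or a newer TensorFlow was needed.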

WhenMelancholy, Apr 26 '23 15:04

Please refer to this link: https://www.toutiao.com/article/7226347983518827011/. I have modified environment.yml for hardware with an RTX 3090. As mentioned above, the root cause is that the tensorflow, CUDA, and tensorflow-gpu versions do not match.

dongluyang, Apr 28 '23 00:04

> I encountered this issue because my CUDA version was higher than what TensorFlow 1.15 supported. After using a version of TensorFlow modified by NVIDIA to support the higher CUDA version, the issue was resolved.

Did you just run pip install --user nvidia-pyindex and then call python run_nerf.py --config config_lego.txt again? I'm a bit lost on how nvidia-pyindex comes into play and how to use it.

jexiaong, May 12 '23 20:05

> Please refer to this link: https://www.toutiao.com/article/7226347983518827011/. I have modified environment.yml for hardware with an RTX 3090. As mentioned above, the root cause is that the tensorflow, CUDA, and tensorflow-gpu versions do not match.

Thank you so much, this was very helpful, both the fork with TF2 support and the blog post. Now I have this running at the max capacity of my 3070 on Win11.

Two minor things I found that you might want to check:

  1. I had a rather minor dynamic library load error, which is easily fixed by using cudatoolkit 11.2 instead of 11 and cudnn 8.1 instead of 8.0. These are the exact versions recommended for TensorFlow 2.8 as per https://www.tensorflow.org/install/source#gpu
  2. I also had a problem with imageio, because apparently the latest versions no longer recognize the "ignoregamma" parameter used in this codebase. So it is better to pin a version of imageio (I tried 2.16 and it worked; it might not be the ideal one, and newer ones may work too).
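A quick way to confirm that an environment matches the versions mentioned in this list is to compare installed package versions against the expected prefixes. This is a sketch only: `check_pins` and the other helpers are hypothetical, the pins are the ones reported in this thread rather than tested requirements, and only packages visible to Python's package metadata can be looked up this way (conda-only packages such as cudatoolkit and cudnn are not covered).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(version, pin):
    """True if version equals pin or extends it, e.g. '2.16.1' matches '2.16'."""
    return version is not None and (version == pin or version.startswith(pin + "."))


def check_pins(pins, lookup=installed_version):
    """Map each package name to (found_version, ok) for the given pins."""
    report = {}
    for name, pin in pins.items():
        found = lookup(name)
        report[name] = (found, matches_pin(found, pin))
    return report


# Versions reported in this thread; adjust to your own environment.
PINS = {"imageio": "2.16", "tensorflow-gpu": "2.8"}

for name, (found, ok) in check_pins(PINS).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: found {found}, expected {PINS[name]}.x -> {status}")
```

The `lookup` argument exists so the check can be exercised without the packages installed, for example by passing a function backed by a plain dictionary.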

ragotiteb, May 15 '23 16:05