unsupervised-deep-homography
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
What is this?
I turned down batch_size, but the error still occurs.
Hi @fanzhen12, can you please send the full stack trace, along with your PyTorch and CUDA versions?
Thanks for your reply, and sorry for the late response. Let me elaborate on the problem I have encountered.
- I downloaded the repo and put the COCO dataset in it; the file structure looks like this:
- As you can see, test2014, train2014 and val2014 are the three parts of the COCO dataset. I didn't change any part of the code. I then ran "python train.py ./train2014/ ./val2014/" in the terminal, like this:
and then the error appears: "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`"
- I suspected that the graphics card memory was not enough, so I reduced the batch_size; the network model is shown in the figure:
At runtime, the output is as follows:
So I guess that even if the memory really is insufficient, it should not be because the network model is too large. The GPU situation is as follows:
These graphics cards should have enough memory. These are my thoughts on the error, but maybe I am going in the wrong direction and the cause is not insufficient memory at all but something else. I look forward to your early reply, thank you very much. By the way, my PyTorch and CUDA version is torch 1.8.0+cu111. @teddykoker
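For reference, the versions and per-GPU memory above can be confirmed with a short snippet like this (only standard torch calls, nothing specific to this repo):

```python
import torch

# Installed PyTorch build and the CUDA version it was compiled against
print("torch:", torch.__version__)             # e.g. 1.8.0+cu111
print("built with CUDA:", torch.version.cuda)  # e.g. 11.1
print("CUDA available:", torch.cuda.is_available())

# Name and total memory of each visible GPU, to rule out an out-of-memory cause
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```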
I suspect the CUBLAS_STATUS_EXECUTION_FAILED error is not an out-of-memory issue, but probably some sort of driver issue. Maybe try installing torch+cu113:

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

which would be closer to the version of CUDA you have on your machine.
Have you been able to run any other PyTorch code on that machine? This issue is likely not specific to this code base.
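For example, a tiny check like the one below (plain PyTorch, independent of this repo) exercises the same cuBLAS matmul path that is failing in train.py; if it also raises CUBLAS_STATUS_EXECUTION_FAILED, the problem is with the driver/CUDA setup rather than with this code:

```python
import torch

# Small batched matrix multiply on the GPU; float32 matmuls like this are
# dispatched to cuBLAS (cublasSgemm / cublasSgemmStridedBatched), the same
# routine reported in the error above.
a = torch.randn(8, 64, 64, device="cuda")
b = torch.randn(8, 64, 64, device="cuda")
c = torch.bmm(a, b)
torch.cuda.synchronize()  # make sure the kernel actually executed
print("batched matmul OK:", tuple(c.shape))
```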
Closing due to inactivity