Open3D-ML
Cannot run pipeline with GPU but works with CPU
Checklist
- [X] I have searched for similar issues.
- [X] I have tested with the latest development wheel.
- [X] I have checked the release documentation and the latest documentation (for master branch).
Describe the issue
I've been testing Open3D ML pretrained models before I set up a configuration for a custom data set.
I am trying to do this by running the predefined scripts.
The CPU works but GPU is giving me an error:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
Steps to reproduce the bug
$ python scripts/run_pipeline.py torch -c ml3d/configs/randlanet_semantickitti.yml --dataset.dataset_path /home/alex/Desktop/Datasets/SemanticKITTI --pipeline SemanticSegmentation --dataset.use_cache True --split test
Using external Open3D-ML in /home/alex/Desktop/NIST_T3/PROJECT/Open3D-ML
regular arguments
batch_size: null
cfg_dataset: null
cfg_file: ml3d/configs/randlanet_semantickitti.yml
cfg_model: null
cfg_pipeline: null
ckpt_path: null
dataset: null
dataset_path: null
device: gpu
framework: torch
main_log_dir: null
max_epochs: null
mode: null
model: null
pipeline: SemanticSegmentation
seed: 0
split: test
extra arguments
dataset.dataset_path: /home/alex/Desktop/Datasets/SemanticKITTI
dataset.use_cache: 'True'
pipeline.num_workers: '0'
INFO - 2022-04-26 18:17:20,184 - semantic_segmentation - DEVICE : cuda
INFO - 2022-04-26 18:17:20,184 - semantic_segmentation - Logging in file : ./logs/RandLANet_SemanticKITTI_torch/log_test_2022-04-26_18:17:20.txt
INFO - 2022-04-26 18:17:20,222 - semantickitti - Found 20351 pointclouds for test
INFO - 2022-04-26 18:18:40,823 - semantic_segmentation - Initializing from scratch.
INFO - 2022-04-26 18:18:40,825 - semantic_segmentation - Started testing
Error message
Traceback (most recent call last):
File "/home/alex/Desktop/NIST_T3/PROJECT/Open3D-ML/scripts/run_pipeline.py", line 163, in cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
Expected behavior
When --device is set to cpu, everything seems to work; however, when it is set to gpu or cuda, I get the error.
Open3D, Python and System information
- Operating system: Ubuntu 20.04
- Python version: 3.9
- Open3D version: (output from python: `print(open3d.__version__)`)
- System type: x84
- Is this remote workstation?: yes or no
- How did you install Open3D?: pip
- Compiler version (if built from source): gcc 7.5
Additional information
I've tried with CUDA 11.6 and 10.1 and got the same error. This error pops up with different datasets as well (SemanticKITTI and Stanford3D).
Any tips or ideas would be great, thank you!
Same here: CUDA 11.3 failed, but CUDA 10.2 is OK.
@conby and @andimo11 Any update on this?
The same problem here with Cuda Version: 11.0 and Torch Version: 1.8.2.
It works on CPU but not GPU.
I used requirements-torch-cuda.txt to install a compatible version of torch and cuda.
- Operating system: Ubuntu 20.04
- Python version: 3.9.5
- Open3D version: 0.15.2
- Is this remote workstation?: no
- How did you install Open3D?: pip
All,
I am also having the same problem. Checked with Cuda Version: 10.2, 11.1, 11.3, 11.7 and Torch Version: 1.8.2. Please note that it works perfectly on CPU but not on GPU (Tesla V100 - 32GB). I have four of them but am currently using one GPU by setting:
import os
# Set these before CUDA is initialized, otherwise they have no effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
Update:
Please note that, strangely, it works well on a single-GPU RTX 2080Ti machine with Cuda 10.2 and Ubuntu 22.04 (upgraded from 18.04 and 20.04). In addition, I tested on a machine with 4 Nvidia Titan X GPUs, Cuda 10.2, and an upgraded Ubuntu 22.04, and it works flawlessly.
I used requirements-torch-cuda.txt to install a compatible version of torch and cuda, as provided by the Open3D-ML team.
- Operating system: Ubuntu 22.04
- Python version: 3.8.13
- GCC version: 7.5.0
- Open3D version: 0.15.2
- Is this remote workstation?: Yes
- How did you install Open3D?: pip
I have been installing and uninstalling all possible drivers, CUDA toolkits, and cuDNN versions for the last 3 days, but I can't fix the issue. @andimo11, @conby and @shayannikoohemat, please let me know if you have any suggestions.
@yxlao please help us to fix this issue ASAP. I cannot use my Tesla V100 GPUs at all.
Looking forward to any swift help!
Hi there, I tried randlanet_semantickitti on several setups. Here are the results:
Machine | Video card | Nvidia driver | CUDA | CuDNN | Works? |
---|---|---|---|---|---|
local | TITAN X (Pascal) 12GB | 470.141.03 | 10.2, 11.1, 11.4 | 8 | Yes |
local | GeForce RTX 2080 Ti 12GB | 470.141.03 | 11.4 | 8 | Yes |
local | GeForce RTX 3080 Laptop 16GB | 510.85.02 | 11.1 | 8 | Yes |
AWS p3.2xlarge | Tesla V100-SXM2-16GB | 470.57.02 | 10.2, 11.1, 11.4 | 8 and w/o | No |
AWS p3.2xlarge | Tesla V100-SXM2-16GB | 510.73.08 | 11.1 | 8 | No |
AWS p2.xlarge | Tesla K80 12GB | 470.141.03 | 11.1 | 8 | Yes |
AWS g4dn.2xlarge | Tesla T4 16GB | 510.73.08 | 11.1 | 8 | Yes |
From this table, it looks like Tesla V100 never works, while everything else works. This also matches @preethamam's experience.
@andimo11, @conby, @shayannikoohemat, what are your video cards?
Although I am also running inference (--split test) on the semantic segmentation pipeline with the randlanet model and torch backend, the offending line is different for me (compare this to what @andimo11 got).
...
File "/deepmap_workspace/Open3D-ML/ml3d/torch/models/randlanet.py", line 636, in forward
scores = self.score_fn(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
...
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
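Since the traceback ends inside a plain `torch.nn.Linear` call, one way to check whether the failure depends on Open3D-ML at all is to run a single float32 linear layer on the GPU in isolation. The following is only a minimal sketch: the function name `linear_smoke_test` and the tensor shape are hypothetical and not taken from the RandLA-Net code, so a passing result here does not prove the real model will work.

```python
def linear_smoke_test(batch=2, points=1024, neighbors=16, channels=32):
    """Run one float32 nn.Linear forward pass on the GPU and report the outcome.

    The shape (batch, points, neighbors, channels) is a made-up stand-in for the
    4-D tensor that is permuted before score_fn in the traceback above.
    """
    try:
        import torch
    except ImportError:
        return "torch-not-installed"

    if not torch.cuda.is_available():
        return "no-cuda-device"

    layer = torch.nn.Linear(channels, channels).cuda()
    x = torch.randn(batch, points, neighbors, channels, device="cuda")
    try:
        y = layer(x)
        # CUDA kernels run asynchronously; synchronize so errors surface here.
        torch.cuda.synchronize()
    except RuntimeError as err:
        return f"failed: {err}"
    return f"ok: output shape {tuple(y.shape)}"


if __name__ == "__main__":
    print(linear_smoke_test())
```

If this reports a `failed:` status with the same cuBLAS message, the problem is probably in the PyTorch build and GPU combination rather than in the pipeline code.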
I finally solved the issue by replacing the torch / torchvision packages prebuilt by the Open3D team with the ones from the public repositories. That is, replace the content of requirements-torch-cuda.txt with:
torch==1.8.1
torchvision==0.9.1
tensorboard
Open3D will throw a warning "Using the Open3D PyTorch ops with CUDA 11 may have stability issues!", and ask to compile torch with certain flags. But I don't get any problems for my case, so I just ignore the warning.
Additionally Open3D says "Warning: Open3D was built with CUDA 11.0 butPyTorch was built with CUDA 10.2. Falling back to CPU for now.Otherwise, install PyTorch with CUDA 11.0." But I can see via nvidia-smi that GPU is actually used and it is considerably faster than if I disable GPU. So not sure if this warning really means anything.
The solution works for V100 with CUDA 11.1.1 and 11.7.1.
More curious developments. Open3D says it is compiled with CUDA 11.0, so I thought that Pytorch compiled with CUDA 11 should be better, but it is not.
Tesla V100 Ubuntu 18.04 CUDA 11.1 python 3.6
The following requirements-torch-cuda.txt file installs PyTorch compiled with CUDA 10.2. You can verify that by running `python3 -c "import torch; print(torch.version.cuda)"`. This setup works.
torch==1.8.1
torchvision==0.9.1
tensorboard
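When comparing setups like these, it may help to print a little more than `torch.version.cuda`. The sketch below gathers the PyTorch version, the CUDA version it was built with, and the visible GPU into one dictionary. The helper name `describe_torch_cuda` is hypothetical, and the snippet deliberately says nothing about which CUDA version Open3D itself was built with.

```python
def describe_torch_cuda():
    """Collect basic facts about the installed PyTorch build and the visible GPU."""
    info = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # CUDA version PyTorch was compiled against; None for CPU-only builds.
    info["built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        # (major, minor) compute capability of device 0.
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

Posting this output alongside the video card model would make reports in this thread easier to compare.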
The requirements-torch-cuda.txt file below installs PyTorch compiled with CUDA 11.1, which I expected to better match Open3D compiled with CUDA 11.0. However, this results in the CUBLAS_STATUS_EXECUTION_FAILED error. So PyTorch compiled with CUDA 11.1 is the problem, no matter whether it is built by the Open3D team or comes from the official repos.
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.8.1+cu111
torchvision==0.9.1+cu111
tensorboard
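The `+cu111` suffix in these pins encodes the CUDA version of the wheel. As a rough illustration, the hypothetical helper below (`cuda_from_wheel_version` is not part of any library) decodes that suffix. It assumes the tag always has the form `+cu<digits>` with the last digit being the minor version. An untagged pin such as `torch==1.8.1` returns `None`, which only means the tag does not say; in that case `torch.version.cuda` remains the authoritative check.

```python
import re


def cuda_from_wheel_version(version):
    """Decode the CUDA version from a wheel tag such as '1.8.1+cu111'.

    Returns a string like '11.1', or None if there is no '+cuNNN' suffix.
    Assumes the last digit of the tag is the minor version.
    """
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    for pin in ("1.8.1+cu111", "1.8.1+cu102", "1.8.1"):
        print(pin, "->", cuda_from_wheel_version(pin))
```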
@kukuruza The GPU that I tested was: NVIDIA GeForce GTX 1650