GPU illegal memory access

Open fhuzero opened this issue 5 years ago • 17 comments

Hi! I followed your instructions and successfully built redner from source on Ubuntu 16.0 LTS with Python 3.7 and CUDA 10.1, but I ran into problems with the GPU version.

When I run programs in "/redner/tutorials", such as "01_optimize_single_triangle.py" and "02_pose_estimation.py", they throw the error CUDA Runtime Error: an illegal memory access was encountered at /home/ubuntu/redner/buffer.h:86. When I change the line pyredner.set_use_gpu(torch.cuda.is_available()) to pyredner.set_use_gpu(False), that is, use the CPU version manually, they work fine. Is there anything wrong with my settings? What steps do I need to take to address this problem? Thanks a lot!

Here is the output of nvidia-smi and nvcc --version:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 10.0, V10.0.130

fhuzero avatar Dec 05 '19 22:12 fhuzero

Does the same issue happen if you pip install redner-gpu? Maybe also check out the Dockerfile. If you can create an environment (such as a Dockerfile) for me to reproduce this, it would be a lot easier for me to investigate.

BachiLi avatar Dec 05 '19 22:12 BachiLi

I've tried both pip install redner-gpu and pip install redner-gpu command at first, but they returned the same error:

ERROR: Could not find a version that satisfies the requirement redner-gpu (from versions: none)
ERROR: No matching distribution found for redner-gpu

So I installed it manually.

By the way, pip -V is pip 19.3.1 from /usr/local/lib/python3.5/dist-packages/pip (python 3.5) and python -V is Python 3.7.4

I'll build a Docker image to help you diagnose the problem, thank you!

fhuzero avatar Dec 05 '19 22:12 fhuzero

Seems that your pip is pointing to a different Python version compared to your main python. Try python -m pip install redner-gpu.
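The mismatch above (pip bound to Python 3.5, python resolving to 3.7.4) can be checked from inside the interpreter itself. The following is a minimal sketch, not part of redner; the helper name pip_install_command is hypothetical. It builds the install command from sys.executable, which is the same idea as running python -m pip.

```python
import sys


def pip_install_command(package):
    """Return a pip command tied to the interpreter running this script.

    Hypothetical helper for illustration only.
    """
    # sys.executable is the path of the current Python binary, so "-m pip"
    # installs into that interpreter's site-packages rather than whichever
    # "pip" happens to be first on PATH.
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Python version:", sys.version.split()[0])
    print("Install with:", " ".join(pip_install_command("redner-gpu")))
```

Running this with the same python used for the tutorials shows which interpreter the package would end up in.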

BachiLi avatar Dec 06 '19 00:12 BachiLi

Thanks for your suggestion! I've successfully installed it with python -m pip install redner-gpu.

But unfortunately, the same error remains (CUDA Runtime Error: an illegal memory access was encountered at /tmp/pip-req-build-m4k01szw/buffer.h:86).

fhuzero avatar Dec 06 '19 02:12 fhuzero

Hmm. I'll wait for your Docker image then. You can take a look at manylinux-gpu.Dockerfile to see how I set those up.

BachiLi avatar Dec 06 '19 04:12 BachiLi

By the way, your nvcc version (10.0) doesn't match the CUDA version (10.1) in your nvidia-smi output. Not sure if this matters though, since pip install didn't fix it.
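For context on the two numbers: nvidia-smi reports the highest CUDA version the installed driver supports, while nvcc reports the version of the installed toolkit, so a 10.0 toolkit under a 10.1-capable driver is normally a supported combination. A rough sketch of that comparison follows; the function names are hypothetical, and the parsing assumes output shaped like the text quoted earlier in this thread.

```python
import re


def nvcc_release(nvcc_output):
    """Extract (major, minor) from 'nvcc --version' text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_cuda_version(smi_output):
    """Extract (major, minor) from the 'CUDA Version:' field of nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_supported_by_driver(nvcc_output, smi_output):
    """True if the toolkit version does not exceed what the driver reports.

    Returns None when either version cannot be parsed.
    """
    toolkit = nvcc_release(nvcc_output)
    driver = driver_cuda_version(smi_output)
    if toolkit is None or driver is None:
        return None
    return toolkit <= driver
```

With the strings from this thread (release 10.0 and CUDA Version: 10.1), the check passes, which is consistent with the mismatch not being the cause here.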

BachiLi avatar Dec 06 '19 04:12 BachiLi

This could be related to https://github.com/BachiLi/redner/issues/38

I have only tested redner on GPUs with compute capability >= 6.0, and OptiX Prime could behave differently on an older card. I thought I had fixed it by adding optix_scene->finish();, but there could be other undocumented behaviors of OptiX Prime that cause a similar issue.

One way to check this is to add a cuda_synchronize() after optix_scene->finish();. Let me know if this fixes your issue or not.
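The reason an extra synchronization point can help with diagnosis is that CUDA kernel launches are asynchronous: an illegal access may only be reported at a later call, so the location in the message (buffer.h:86) is where the error was detected, not necessarily where it happened. A related aid on the Python side is the CUDA_LAUNCH_BLOCKING environment variable, which makes kernel launches synchronous. The sketch below is only a debugging suggestion under that assumption; the helper name is hypothetical, and whether it changes anything inside OptiX Prime is not established in this thread.

```python
import os


def enable_blocking_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in env (defaults to os.environ).

    Hypothetical helper for illustration only.
    """
    if env is None:
        env = os.environ
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this before importing torch or pyredner so the variable is already
# set when the CUDA context is created.
enable_blocking_launches()
```

The same effect can be had by setting the variable in the shell before launching the tutorial script.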

BachiLi avatar Dec 06 '19 05:12 BachiLi

Thanks for the information! I tried adding the line, but it still gives the same error :( I've built a Docker image. How would you prefer that I share it with you?

fhuzero avatar Dec 06 '19 06:12 fhuzero

Upload your Dockerfile. Or maybe upload the image to Google Drive or Dropbox.

BachiLi avatar Dec 06 '19 06:12 BachiLi

Hi, I've switched the GPU to a Tesla V100 with nvcc 10.0, CUDA 10.1, python 3.6, pip3 3.6, PyTorch 1.3.1, and used the command pip3 install redner-gpu, and the program ran successfully! It seems to be a version incompatibility issue. Thanks for your prompt help!

fhuzero avatar Dec 06 '19 20:12 fhuzero

Most likely there is something I don't know happening with older GPUs. I'll test it on my own if I manage to get a K80.

BachiLi avatar Dec 06 '19 21:12 BachiLi

I had the same issue with a K80 GPU. With the same environment, the same code worked without an issue on a P100.

Skorkmaz88 avatar Jun 02 '20 00:06 Skorkmaz88

Can confirm that I face the same issue on a Tesla K80 but not on Titan X GPUs. Any idea what's going wrong?

Error stack on the Tesla K80:

/home/smadan/.local/lib/python3.6/site-packages/pyredner/render_pytorch.py:214: UserWarning: Converting shape vertices from cpu to cuda:0, this can be inefficient.
  warnings.warn('Converting shape vertices from {} to {}, this can be inefficient.'.format(shape.vertices.device, device))
/home/smadan/.local/lib/python3.6/site-packages/pyredner/render_pytorch.py:216: UserWarning: Converting shape indices from cpu to cuda:0, this can be inefficient.
  warnings.warn('Converting shape indices from {} to {}, this can be inefficient.'.format(shape.indices.device, device))
/home/smadan/.local/lib/python3.6/site-packages/pyredner/render_pytorch.py:55: UserWarning: Converting texture from cpu to cuda:0, this can be inefficient.
  warnings.warn('Converting texture from {} to {}, this can be inefficient.'.format(mipmap.device, device))
CUDA Runtime Error: an illegal memory access was encountered at /tmp/pip-req-build-it6swr5w/src/buffer.h:86

I got the same warnings on Titan X, but did not run into the CUDA runtime error.

Spandan-Madan avatar Jul 01 '20 14:07 Spandan-Madan

This is hard for me to debug since I don't have a K80 (and time). I'll see what I can do. Most likely something is wrong with Optix prime. Maybe we should just get rid of it.

BachiLi avatar Jul 01 '20 14:07 BachiLi

Is there an easy way to disable OptiX to check if that solves the problem? If so, I can run tests on Tesla K80 GPUs without OptiX.

Spandan-Madan avatar Jul 02 '20 04:07 Spandan-Madan

Nope. You can try to modify the ray tracing procedure, but it requires a bit of programming effort.

BachiLi avatar Jul 02 '20 12:07 BachiLi

Sounds good, maybe it's easiest to stick to newer GPUs for now then!

Spandan-Madan avatar Jul 02 '20 15:07 Spandan-Madan
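As a stopgap in the spirit of that conclusion, a script can fall back to the CPU path when the GPU is older than what was tested. This is a minimal sketch, not a fix from the redner authors. The 6.0 threshold is taken from the comment above about tested compute capabilities, the helper names are hypothetical, and it assumes torch.cuda.get_device_capability and pyredner.set_use_gpu are available as used in the tutorials.

```python
# From the thread: redner was only tested on compute capability >= 6.0.
MIN_TESTED_CAPABILITY = (6, 0)


def should_use_gpu(capability, minimum=MIN_TESTED_CAPABILITY):
    """Decide whether to enable the GPU path.

    capability is a (major, minor) tuple, or None when no GPU is present.
    For reference, a Tesla K80 is (3, 7), a P100 is (6, 0), a V100 is (7, 0).
    """
    return capability is not None and tuple(capability) >= tuple(minimum)


def configure_pyredner():
    """Hypothetical helper: pick GPU or CPU before running a tutorial."""
    # Imported here so should_use_gpu stays usable without these packages.
    import torch
    import pyredner

    capability = (torch.cuda.get_device_capability(0)
                  if torch.cuda.is_available() else None)
    pyredner.set_use_gpu(should_use_gpu(capability))
```

Calling configure_pyredner() in place of pyredner.set_use_gpu(torch.cuda.is_available()) would keep K80 machines on the CPU version, which was reported to work in the first post.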