Question on the environment required to run sol-renderer

Open • heiwang1997 opened this issue 3 years ago • 8 comments

Hi @tovacinni, thanks for this great work and the code release. I am trying to run your C++ renderer and ran into the following crash. Could you guide me on how to solve this issue, at your convenience?

The system is Ubuntu 20.04. I've tried both an RTX 3090 and a 1080, and neither works. By the way, the Python part works well: I can run the training and generate the rendered armadillo. The libtorch build was downloaded from https://download.pytorch.org/libtorch/cu111/libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcu111.zip

Here is the error message:

    (nglod) my@ws:~/nglod/sol-renderer/build$ ./sdfRenderer ../../sdf-net/_results/armadillo.npz
    NLOD Demo starting...
    GPU Device 0: "Ampere" with compute capability 8.6
    
    terminate called after throwing an instance of 'c10::Error'
      what():  CUDA error: an illegal memory access was encountered
    Exception raised from nonzero_cuda_out_impl at /pytorch/aten/src/ATen/native/cuda/Indexing.cu:873 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f6705badb29 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7f6705baaab2 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libc10.so)
    frame #2: void at::native::nonzero_cuda_out_impl<bool>(at::Tensor const&, at::Tensor&) + 0xebe (0x7f66a6227c4e in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cuda_cu.so)
    frame #3: at::native::nonzero_out_cuda(at::Tensor&, at::Tensor const&) + 0x1eb (0x7f66a6199c5b in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cuda_cu.so)
    frame #4: at::native::nonzero_cuda(at::Tensor const&) + 0xea (0x7f66a619a09a in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cuda_cu.so)
    frame #5: <unknown function> + 0x2e6a80b (0x7f66a6fd180b in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cuda_cu.so)
    frame #6: <unknown function> + 0x2e6a890 (0x7f66a6fd1890 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cuda_cu.so)
    frame #7: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&)> const&, at::Tensor const&) const + 0xe7 (0x7f6692f17c57 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #8: at::nonzero(at::Tensor const&) + 0x5e (0x7f6692d5338e in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #9: <unknown function> + 0x2f15a3e (0x7f6694791a3e in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #10: <unknown function> + 0x2f15ac0 (0x7f6694791ac0 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #11: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&)> const&, at::Tensor const&) const + 0xe7 (0x7f6692f17c57 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #12: at::nonzero(at::Tensor const&) + 0x5e (0x7f6692d5338e in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #13: <unknown function> + 0x4222b (0x555f01cd522b in ./sdfRenderer)
    frame #14: <unknown function> + 0x27750 (0x555f01cba750 in ./sdfRenderer)
    frame #15: <unknown function> + 0x1819a (0x555f01cab19a in ./sdfRenderer)
    frame #16: <unknown function> + 0x20194 (0x7f67060ed194 in /lib/x86_64-linux-gnu/libglut.so.3)
    frame #17: fgEnumWindows + 0x39 (0x7f67060f0c39 in /lib/x86_64-linux-gnu/libglut.so.3)
    frame #18: glutMainLoopEvent + 0x1cd (0x7f67060ed7bd in /lib/x86_64-linux-gnu/libglut.so.3)
    frame #19: glutMainLoop + 0x65 (0x7f67060edff5 in /lib/x86_64-linux-gnu/libglut.so.3)
    frame #20: <unknown function> + 0x18edc (0x555f01cabedc in ./sdfRenderer)
    frame #21: __libc_start_main + 0xf3 (0x7f6617f1a0b3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #22: <unknown function> + 0x1639e (0x555f01ca939e in ./sdfRenderer)
    
    Aborted (core dumped)

heiwang1997 • May 21 '21 15:05

Thanks for your interest in our work!

What version of libtorch are you using? The code was tested on 1.7.1, and using a newer version may cause issues (but I haven't actually tried).

tovacinni • May 21 '21 15:05

I was using 1.8.1. I just tried 1.7.1 (which can be downloaded from here), but still no luck: the error is the same 🤔

I saw in requirements.txt that the Python renderer expects PyTorch 1.6. Do the versions of libtorch and PyTorch have to be the same?

heiwang1997 • May 21 '21 16:05

Thanks for trying that out. If you can share the NPZ file you generated (via Google Drive or something similar), I can run it on my side and try to reproduce the error.

The Python PyTorch version shouldn't matter in theory, since it uses NPZ to bridge between the two and the C++ version uses its own separate PyTorch (libtorch).

tovacinni • May 21 '21 16:05
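Since the NPZ file is the only thing passed from the Python side to the C++ renderer, it can help to look at what the archive actually contains before comparing it with a known-good one. The following is a minimal sketch, not part of the nglod code: `read_npy_header` and `summarize_npz` are hypothetical helper names, and it relies only on the fact that an `.npz` file is a zip archive of `.npy` members whose headers are Python dict literals. It does not assume anything about which array names the renderer expects.

```python
import ast
import struct
import zipfile


def read_npy_header(raw: bytes) -> dict:
    """Parse the header dict at the start of a .npy payload."""
    if raw[:6] != b"\x93NUMPY":
        raise ValueError("not a .npy payload")
    major = raw[6]
    if major == 1:
        # Format 1.x stores the header length as a little-endian uint16.
        (header_len,) = struct.unpack("<H", raw[8:10])
        start = 10
    else:
        # Formats 2.x and 3.x store it as a little-endian uint32.
        (header_len,) = struct.unpack("<I", raw[8:12])
        start = 12
    text = raw[start:start + header_len].decode("latin1")
    return ast.literal_eval(text.strip())


def summarize_npz(path: str) -> dict:
    """Map each array name in the archive to (shape, dtype descriptor)."""
    summary = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if not name.endswith(".npy"):
                continue
            with archive.open(name) as member:
                # Assumption: the header fits in the first 4096 bytes,
                # which holds for plain (non-structured) arrays.
                header = read_npy_header(member.read(4096))
            summary[name[:-4]] = (header["shape"], header["descr"])
    return summary
```

Running `summarize_npz("../../sdf-net/_results/armadillo.npz")` on a failing export and on a working one and diffing the two dictionaries would show whether an array is missing or has an unexpected shape or dtype. This only inspects metadata; it does not validate the array values.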

Thanks for the fast response! Here is the npz file: https://drive.google.com/file/d/1EcGrddM3kS_IbVVuS8_3zvja6PCswv1i/view?usp=sharing

heiwang1997 • May 21 '21 16:05

I just tried your NPZ and got the same error, but the renderer still works with the NPZs I have. There might be an issue with the NPZ export in the released code, so I'll take a deeper look at this later today.

tovacinni • May 21 '21 16:05

Cool! Thanks for your help. Looking forward to your reply.

heiwang1997 • May 21 '21 16:05

Hi @heiwang1997,

Did you try upgrading PyTorch? I was trying to run nglod on an A4000 GPU and found that PyTorch 1.6 does not support the Ampere architecture. Upgrading to the latest PyTorch worked.

sixftninja • Nov 22 '21 21:11
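One way to test the architecture-support suggestion is to compare the GPU's compute capability (8.6 in the log above) with the architectures a given PyTorch build was compiled for. The sketch below is an illustration under stated assumptions: `parse_arch` and `build_supports` are hypothetical helpers, and the compatibility rule encoded is the general CUDA one (an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, while `compute_XY` PTX can be JIT-compiled for any equal or later capability). `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are only called if PyTorch is installed and a GPU is visible. A passing check does not rule out other causes of the crash, such as the NPZ export issue mentioned earlier in the thread.

```python
import re


def parse_arch(tag: str):
    """Split a tag such as 'sm_86' or 'compute_75' into (kind, (major, minor))."""
    match = re.fullmatch(r"(sm|compute)_(\d+)[a-z]?", tag)
    if match is None:
        raise ValueError(f"unrecognized architecture tag: {tag!r}")
    number = int(match.group(2))
    return match.group(1), (number // 10, number % 10)


def build_supports(arch_list, capability) -> bool:
    """Return True if a build with these arch tags can run on the device."""
    major, minor = capability
    for tag in arch_list:
        kind, (arch_major, arch_minor) = parse_arch(tag)
        # Native binaries: same major version, equal or lower minor version.
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        # PTX: forward-compatible with any equal or newer capability.
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; nothing to check")
    else:
        if torch.cuda.is_available():
            capability = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print("device capability:", capability)
            print("build architectures:", archs)
            print("supported:", build_supports(archs, capability))
        else:
            print("no CUDA device visible to PyTorch")
```

The same reasoning applies to the libtorch build used by the C++ renderer, although this script only inspects the Python installation.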

Hi @heiwang1997, I ran into the same errors. How did you solve this in the end?

Sylva-Lin • Dec 20 '22 13:12