
Error converting a CUDA tensor to CPU

Open emauryg opened this issue 3 years ago • 11 comments

When I try to convert a GPU tensor to an R matrix using as_array(my_tensor$cpu()), I get the following error:

Not sure what the memory format error is caused by. Any suggestions?

`Traceback:

  1. init_from_mod(mod0, count_matrix)
  2. Categorical(torch_tensor(1), T0[t_dn, r_dn, , zd[n]])
  3. as_array(probs$cpu())
  4. probs$cpu()
  5. self$to(device = torch_device("cpu"), memory_format = memory_format)
  6. do.call(private$_to, args)
  7. (function (device, options = list(), other, dtype, non_blocking = FALSE, . copy = FALSE, memory_format = NULL) . { . args <- mget(x = c("device", "options", "other", "dtype", . "non_blocking", "copy", "memory_format")) . args <- append(list(self = self), args) . expected_types <- list(self = "Tensor", device = "Device", . options = "TensorOptions", other = "Tensor", dtype = "ScalarType", . non_blocking = "bool", copy = "bool", memory_format = "MemoryFormat") . nd_args <- c("self", "device", "other", "dtype") . return_types <- list(list("Tensor")) . call_c_function(fun_name = "to", args = args, expected_types = expected_types, . nd_args = nd_args, return_types = return_types, fun_type = "method") . })(dtype = <pointer: 0x557601e2a9a0>, device = <pointer: 0x557575a0a670>, . non_blocking = FALSE, copy = FALSE, memory_format = <pointer: 0x557575a0a6b0>)
  8. call_c_function(fun_name = "to", args = args, expected_types = expected_types, . nd_args = nd_args, return_types = return_types, fun_type = "method")
  9. do_call(f, args)
  10. do.call(fun, args)
  11. (function (self, device, dtype, non_blocking, copy, memory_format) . { . .Call("_torch_cpp_torch_method_to_self_Tensor_device_Device_dtype_ScalarType", . PACKAGE = "torchpkg", self, device, dtype, non_blocking, . copy, memory_format) . })(self = <pointer: 0x557575a0a460>, device = <pointer: 0x557575a0a670>, . dtype = <pointer: 0x557601e2a9a0>, non_blocking = FALSE, . copy = FALSE, memory_format = <pointer: 0x557575a0a6b0>)`

emauryg avatar May 17 '21 21:05 emauryg

Hi @emauryg ,

Can you also post the error message or a small reprex? I was unable to reproduce this behavior.

dfalbel avatar May 17 '21 22:05 dfalbel

Error in (function (self, device, dtype, non_blocking, copy, memory_format) : CUDA error: device-side assert triggered Exception raised from copy_kernel_cuda at /pytorch/aten/src/ATen/native/cuda/Copy.cu:200 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7fe2bb6c7b29 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7fe2bb6c4ab2 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libc10.so) frame #2: <unknown function> + 0x1ecfdbf (0x7fe268485dbf in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cuda.so) frame #3: <unknown function> + 0xf5df13 (0x7fe2a9de4f13 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #4: <unknown function> + 0xf5c2c3 (0x7fe2a9de32c3 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x53 (0x7fe2a9de4153 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #6: at::Tensor::copy_(at::Tensor const&, bool) const + 0x12d (0x7fe2aa8a665d in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #7: <unknown function> + 0x39caf87 (0x7fe2ac851f87 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #8: at::Tensor::copy_(at::Tensor const&, bool) const + 0x12d (0x7fe2aa8a665d in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #9: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) + 0x63b (0x7fe2aa07196b in 
/home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #10: <unknown function> + 0x1883eb4 (0x7fe2aa70aeb4 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #11: at::Tensor::to(c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x196 (0x7fe2aa8cb7d6 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #12: _lantern_Tensor_to_tensor_device_scalartype_bool_bool_memoryformat + 0x85 (0x7fe2bbc2b705 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/liblantern.so) frame #13: cpp_torch_method_to_self_Tensor_device_Device_dtype_ScalarType(XPtrTorchTensor, XPtrTorchDevice, XPtrTorchDtype, bool, bool, XPtrTorchMemoryFormat) + 0x40 (0x7fe2bc40df30 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/libs/torchpkg.so) frame #14: _torch_cpp_torch_method_to_self_Tensor_device_Device_dtype_ScalarType + 0x161 (0x7fe2bc32d121 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/libs/torchpkg.so) frame #15: <unknown function> + 0xfee56 (0x7fe2d9a37e56 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #16: <unknown function> + 0x13a4b1 (0x7fe2d9a734b1 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #17: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #18: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #19: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #20: Rf_eval + 0x35c (0x7fe2d9a8085c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #21: <unknown function> + 0xcbd1e (0x7fe2d9a04d1e in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #22: <unknown function> + 0x13a4b1 (0x7fe2d9a734b1 in 
/home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #23: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #24: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #25: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #26: <unknown function> + 0x13dd87 (0x7fe2d9a76d87 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #27: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #28: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #29: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #30: <unknown function> + 0x13dd87 (0x7fe2d9a76d87 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #31: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #32: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #33: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #34: Rf_eval + 0x35c (0x7fe2d9a8085c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #35: <unknown function> + 0x14ad57 (0x7fe2d9a83d57 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #36: Rf_eval + 0x5e4 (0x7fe2d9a80ae4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #37: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #38: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in 
/home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #39: Rf_eval + 0x35c (0x7fe2d9a8085c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #40: <unknown function> + 0xcbd1e (0x7fe2d9a04d1e in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #41: <unknown function> + 0x13a4b1 (0x7fe2d9a734b1 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #42: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #43: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #44: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #45: Rf_eval + 0x35c (0x7fe2d9a8085c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #46: <unknown function> + 0x14ad57 (0x7fe2d9a83d57 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #47: Rf_eval + 0x5e4 (0x7fe2d9a80ae4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #48: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #49: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #50: Rf_eval + 0x35c (0x7fe2d9a8085c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #51: <unknown function> + 0x14ad57 (0x7fe2d9a83d57 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #52: Rf_eval + 0x5e4 (0x7fe2d9a80ae4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #53: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #54: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in 
/home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #55: <unknown function> + 0x13dd87 (0x7fe2d9a76d87 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #56: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #57: <unknown function> + 0x14801f (0x7fe2d9a8101f in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #58: Rf_eval + 0x4a4 (0x7fe2d9a809a4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #59: <unknown function> + 0x18abb6 (0x7fe2d9ac3bb6 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #60: <unknown function> + 0x13b38b (0x7fe2d9a7438b in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #61: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #62: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #63: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so)

emauryg avatar May 17 '21 23:05 emauryg

This looks like a hard-to-reproduce error. Can you try setting the environment variable CUDA_LAUNCH_BLOCKING=1? The error may have been raised by an earlier CUDA operation, in which case copy_kernel_cuda is only where it surfaced, not where it occurred. With CUDA_LAUNCH_BLOCKING set, CUDA should report the error at the operation that actually failed.
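A minimal sketch of setting CUDA_LAUNCH_BLOCKING from within R. It assumes the variable has to be in place before torch (and therefore CUDA) is initialized, so it is set in a fresh session ahead of `library(torch)`; exporting it in the shell that launches R or Jupyter should work as well.

```r
# Assumption: this runs in a fresh R session, before library(torch).
Sys.setenv(CUDA_LAUNCH_BLOCKING = "1")

# Confirm the variable is visible to this process.
blocking <- Sys.getenv("CUDA_LAUNCH_BLOCKING")
print(blocking)

# library(torch)  # load torch only after the variable is set
```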

dfalbel avatar May 17 '21 23:05 dfalbel

I think the error looks the same

`Error in (function (self, device, dtype, non_blocking, copy, memory_format) : CUDA error: device-side assert triggered Exception raised from copy_kernel_cuda at /pytorch/aten/src/ATen/native/cuda/Copy.cu:200 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::_cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x69 (0x7ff69b92cb29 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xd2 (0x7ff69b929ab2 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libc10.so) frame #2: + 0x1ecfdbf (0x7ff6486eadbf in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cuda.so) frame #3: + 0xf5df13 (0x7ff68a049f13 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #4: + 0xf5c2c3 (0x7ff68a0482c3 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #5: at::native::copy(at::Tensor&, at::Tensor const&, bool) + 0x53 (0x7ff68a049153 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #6: at::Tensor::copy(at::Tensor const&, bool) const + 0x12d (0x7ff68ab0b65d in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #7: + 0x39caf87 (0x7ff68cab6f87 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #8: at::Tensor::copy(at::Tensor const&, bool) const + 0x12d (0x7ff68ab0b65d in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #9: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optionalc10::MemoryFormat) + 0x63b (0x7ff68a2d696b in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #10: + 0x1883eb4 (0x7ff68a96feb4 in 
/home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #11: at::Tensor::to(c10::Device, c10::ScalarType, bool, bool, c10::optionalc10::MemoryFormat) const + 0x196 (0x7ff68ab307d6 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #12: _lantern_Tensor_to_tensor_device_scalartype_bool_bool_memoryformat + 0x85 (0x7ff69be90705 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/liblantern.so) frame #13: cpp_torch_method_to_self_Tensor_device_Device_dtype_ScalarType(XPtrTorchTensor, XPtrTorchDevice, XPtrTorchDtype, bool, bool, XPtrTorchMemoryFormat) + 0x40 (0x7ff69c672f30 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/libs/torchpkg.so) frame #14: _torch_cpp_torch_method_to_self_Tensor_device_Device_dtype_ScalarType + 0x161 (0x7ff69c592121 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/libs/torchpkg.so) frame #15: + 0xfee56 (0x7ff6b9c9be56 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #16: + 0x13a4b1 (0x7ff6b9cd74b1 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #17: Rf_eval + 0x190 (0x7ff6b9ce4690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #18: + 0x1495b0 (0x7ff6b9ce65b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #19: Rf_applyClosure + 0x175 (0x7ff6b9ce7335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #20: Rf_eval + 0x35c (0x7ff6b9ce485c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #21: + 0xcbd1e (0x7ff6b9c68d1e in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #22: + 0x13a4b1 (0x7ff6b9cd74b1 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #23: Rf_eval + 0x190 (0x7ff6b9ce4690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #24: + 0x1495b0 (0x7ff6b9ce65b0 in 
/home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #25: Rf_applyClosure + 0x175 (0x7ff6b9ce7335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #26: + 0x13dd87 (0x7ff6b9cdad87 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #27: Rf_eval + 0x190 (0x7ff6b9ce4690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #28: + 0x1495b0 (0x7ff6b9ce65b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #29: Rf_applyClosure + 0x175 (0x7ff6b9ce7335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #30: + 0x13dd87 (0x7ff6b9cdad87 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #31: Rf_eval + 0x190 (0x7ff6b9ce4690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #32: + 0x1495b0 (0x7ff6b9ce65b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #33: Rf_applyClosure + 0x175 (0x7ff6b9ce7335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #34: Rf_eval + 0x35c (0x7ff6b9ce485c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #35: + 0x14ad57 (0x7ff6b9ce7d57 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #36: Rf_eval + 0x5e4 (0x7ff6b9ce4ae4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #37: + 0x1495b0 (0x7ff6b9ce65b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #38: Rf_applyClosure + 0x175 (0x7ff6b9ce7335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #39: Rf_eval + 0x35c (0x7ff6b9ce485c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #40: + 0xcbd1e (0x7ff6b9c68d1e in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #41: + 0x13a4b1 (0x7ff6b9cd74b1 in 
/home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #42: Rf_eval + 0x190 (0x7ff6b9ce4690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #43: + 0x1495b0 (0x7ff6b9ce65b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #44: Rf_applyClosure + 0x175 (0x7ff6b9ce7335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #45: Rf_eval + 0x35c (0x7ff6b9ce485c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #46: + 0x14ad57 (0x7ff6b9ce7d57 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #47: Rf_eval + 0x5e4 (0x7ff6b9ce4ae4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #48: + 0x1495b0 (0x7ff6b9ce65b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #49: Rf_applyClosure + 0x175 (0x7ff6b9ce7335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #50: Rf_eval + 0x35c (0x7ff6b9ce485c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #51: + 0x14ad57 (0x7ff6b9ce7d57 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #52: Rf_eval + 0x5e4 (0x7ff6b9ce4ae4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #53: + 0x1495b0 (0x7ff6b9ce65b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #54: Rf_applyClosure + 0x175 (0x7ff6b9ce7335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #55: + 0x13dd87 (0x7ff6b9cdad87 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #56: Rf_eval + 0x190 (0x7ff6b9ce4690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #57: + 0x14801f (0x7ff6b9ce501f in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #58: Rf_eval + 0x4a4 (0x7ff6b9ce49a4 in 
/home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #59: + 0x18abb6 (0x7ff6b9d27bb6 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #60: + 0x13b38b (0x7ff6b9cd838b in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #61: Rf_eval + 0x190 (0x7ff6b9ce4690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #62: + 0x1495b0 (0x7ff6b9ce65b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #63: Rf_applyClosure + 0x175 (0x7ff6b9ce7335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so)

Traceback:

  1. init_from_mod(mod0, count_matrix)
  2. Categorical(torch_tensor(1), T0[t_dn, r_dn, , zd[n]])
  3. as_array(probs$cpu())
  4. probs$cpu()
  5. self$to(device = torch_device("cpu"), memory_format = memory_format)
  6. do.call(private$_to, args)
  7. (function (device, options = list(), other, dtype, non_blocking = FALSE, . copy = FALSE, memory_format = NULL) . { . args <- mget(x = c("device", "options", "other", "dtype", . "non_blocking", "copy", "memory_format")) . args <- append(list(self = self), args) . expected_types <- list(self = "Tensor", device = "Device", . options = "TensorOptions", other = "Tensor", dtype = "ScalarType", . non_blocking = "bool", copy = "bool", memory_format = "MemoryFormat") . nd_args <- c("self", "device", "other", "dtype") . return_types <- list(list("Tensor")) . call_c_function(fun_name = "to", args = args, expected_types = expected_types, . nd_args = nd_args, return_types = return_types, fun_type = "method") . })(dtype = <pointer: 0x55c61ceb1850>, device = <pointer: 0x55c6a5f4afd0>, . non_blocking = FALSE, copy = FALSE, memory_format = <pointer: 0x55c6ac9912c0>)
  8. call_c_function(fun_name = "to", args = args, expected_types = expected_types, . nd_args = nd_args, return_types = return_types, fun_type = "method")
  9. do_call(f, args)
  10. do.call(fun, args)
  11. (function (self, device, dtype, non_blocking, copy, memory_format) . { . .Call("_torch_cpp_torch_method_to_self_Tensor_device_Device_dtype_ScalarType", . PACKAGE = "torchpkg", self, device, dtype, non_blocking, . copy, memory_format) . })(self = <pointer: 0x55c6a8056260>, device = <pointer: 0x55c6a5f4afd0>, . dtype = <pointer: 0x55c61ceb1850>, non_blocking = FALSE, . copy = FALSE, memory_format = <pointer: 0x55c6ac9912c0>)`

emauryg avatar May 17 '21 23:05 emauryg

I also get something like this in my shell when running the Jupyter notebook:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [95,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

emauryg avatar May 17 '21 23:05 emauryg

Ok, yes this last post looks like the real cause. Can you share your code or an example that I can run and reproduce the error? This looks like an indexing error, but could be happening from other operations, not necessarily when moving to cpu...

dfalbel avatar May 18 '21 00:05 dfalbel

Hmm I guess that could make sense, since the input to the function where I'm calling the conversion to cpu is a tensor which I have indexed. The indices are computed separately.

for (d in 1:length(cTrain)){
    zd = Categorical(cTrain[d], theta[,d])
    md  = Categorical(cTrain[d],m)

    for(n in 1:length(zd)){
      t_dn = Categorical(torch_tensor(1), factors$bt[,zd[n]])
      r_dn = Categorical(torch_tensor(1), factors$br[,zd[n]])
      n_dn = Categorical(torch_tensor(1), factors$nuc[,zd[n]])
      e_dn = Categorical(torch_tensor(1), factors$epi[,zd[n]])
      c_dn = Categorical(torch_tensor(1), factors$clu[,zd[n]])

      v_dn = Categorical(torch_tensor(1), T0[t_dn, r_dn,,zd[n]]) ## this is where the error gets called 

      if (md[n]$item() %in% c(3,4)){
        t_dn = 2
      }
      if (md[n]$item() %in% c(2,4)){
        r_dn = 2
      }

      Ytrain[t_dn, r_dn, e_dn, n_dn, c_dn, v_dn, d] = Ytrain[t_dn, r_dn, e_dn, n_dn, c_dn, v_dn, d] + 1
    }
}

Categorical <- function(n_samples, probs){
  n_samples = n_samples$item()
  probs = as_array(probs$cpu())
  K = length(probs)
  tmp = rmultinom(n_samples, 1, prob = probs)
  tmp = colSums(c(1:K)*tmp)
  tmp = torch_tensor(tmp, dtype = torch_long(), device=device)
  return(tmp)
}

emauryg avatar May 18 '21 01:05 emauryg

Hi @emauryg ,

Sorry, I can't run this chunk because I don't know what T0, cTrain and factors are. Yes, it's quite possible that T0[t_dn, r_dn,,zd[n]] is indexing out of bounds and causing the error you are seeing. You could put print statements right before that indexing and make sure the values of t_dn, r_dn and zd[n] are what you expect. Then also check the dimensions of T0.
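As a rough sketch of that kind of check, the helper below fails in R, with a readable message, before an out-of-range index can reach the GPU. `check_index` is a hypothetical name and not part of torch; it only handles positive 1-based indices, and the `sizes` vector is a stand-in because the real dimensions of T0 are not given in the thread. Indices held in tensors would need to be converted first, for example with `$item()`.

```r
# Hypothetical helper (not part of torch): fail early, in R, when a 1-based
# index falls outside a dimension of the given size.
check_index <- function(idx, size, name = "index") {
  idx <- as.integer(idx)
  bad <- is.na(idx) | idx < 1L | idx > size
  if (any(bad)) {
    stop(sprintf("%s out of bounds: got %s, valid range is 1..%d",
                 name, paste(idx[bad], collapse = ", "), size))
  }
  invisible(TRUE)
}

# Stand-in sizes for the dimensions of T0 (made up for illustration).
sizes <- c(2L, 2L, 96L, 5L)

check_index(2L, sizes[1], "t_dn")  # within range, passes silently

# An index of 3 on a dimension of size 2 is reported as an ordinary R error.
res <- tryCatch(check_index(3L, sizes[1], "t_dn"),
                error = function(e) conditionMessage(e))
print(res)
```

Calling such a check on t_dn, r_dn and zd[n] right before the indexing line would turn the device-side assert into a normal R error that does not require restarting the session.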

dfalbel avatar May 18 '21 11:05 dfalbel

So the error turned out to be an incorrect index on the first dimension. That seems like very intense error output (it requires restarting the kernel when it happens) for such a common bug. Fixing the indexing allowed the code to run to completion. Thank you for the help.

emauryg avatar May 18 '21 19:05 emauryg

I agree it's not practical. I'll try to figure out if we can do something around it. The problem is that CUDA operations can be asynchronous and there isn't an easy way to catch errors in the async ops from R.

Maybe a reasonable workaround is to debug on the CPU first?
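One way this could look, as a sketch only: pick the device name in one place so the same script can be run on the CPU while debugging. `pick_device` and the `DEBUG_ON_CPU` environment variable are hypothetical names introduced for illustration; the resulting string would then be passed to `torch_device()` to create the `device` object that the code above uses.

```r
# Hypothetical switch: set DEBUG_ON_CPU=1 (or pass debug = TRUE) to force
# the CPU, where indexing errors are raised synchronously.
pick_device <- function(debug = Sys.getenv("DEBUG_ON_CPU") == "1") {
  if (debug) "cpu" else "cuda"
}

device_name <- pick_device(debug = TRUE)
print(device_name)

# With torch loaded, the name can be turned into a device object:
# device <- torch_device(device_name)
```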

dfalbel avatar May 18 '21 20:05 dfalbel

I guess that might be a good temporary approach, but looking at the terminal output is also important, since in my case that is where the index-out-of-bounds error was shown. The main confusion came from the notebook reporting what looked like a CUDA memory location issue.

emauryg avatar May 18 '21 20:05 emauryg