torch
Error converting a CUDA tensor to CPU
When I try to convert a GPU tensor to an R matrix using `as_array(my_tensor$cpu())`, I get the error below. I'm not sure what causes the memory format error. Any suggestions?
`Traceback:
- init_from_mod(mod0, count_matrix)
- Categorical(torch_tensor(1), T0[t_dn, r_dn, , zd[n]])
- as_array(probs$cpu())
- probs$cpu()
- self$to(device = torch_device("cpu"), memory_format = memory_format)
- do.call(private$`_to`, args)
- (function (device, options = list(), other, dtype, non_blocking = FALSE, . copy = FALSE, memory_format = NULL) . { . args <- mget(x = c("device", "options", "other", "dtype", . "non_blocking", "copy", "memory_format")) . args <- append(list(self = self), args) . expected_types <- list(self = "Tensor", device = "Device", . options = "TensorOptions", other = "Tensor", dtype = "ScalarType", . non_blocking = "bool", copy = "bool", memory_format = "MemoryFormat") . nd_args <- c("self", "device", "other", "dtype") . return_types <- list(list("Tensor")) . call_c_function(fun_name = "to", args = args, expected_types = expected_types, . nd_args = nd_args, return_types = return_types, fun_type = "method") . })(dtype = <pointer: 0x557601e2a9a0>, device = <pointer: 0x557575a0a670>, . non_blocking = FALSE, copy = FALSE, memory_format = <pointer: 0x557575a0a6b0>)
- call_c_function(fun_name = "to", args = args, expected_types = expected_types, . nd_args = nd_args, return_types = return_types, fun_type = "method")
- do_call(f, args)
- do.call(fun, args)
- (function (self, device, dtype, non_blocking, copy, memory_format) . { . .Call("_torch_cpp_torch_method_to_self_Tensor_device_Device_dtype_ScalarType", . PACKAGE = "torchpkg", self, device, dtype, non_blocking, . copy, memory_format) . })(self = <pointer: 0x557575a0a460>, device = <pointer: 0x557575a0a670>, . dtype = <pointer: 0x557601e2a9a0>, non_blocking = FALSE, . copy = FALSE, memory_format = <pointer: 0x557575a0a6b0>)`
Hi @emauryg,
Can you also post the error message or a small reprex? I was unable to reproduce this behavior.
Error in (function (self, device, dtype, non_blocking, copy, memory_format) : CUDA error: device-side assert triggered Exception raised from copy_kernel_cuda at /pytorch/aten/src/ATen/native/cuda/Copy.cu:200 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7fe2bb6c7b29 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7fe2bb6c4ab2 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libc10.so) frame #2: <unknown function> + 0x1ecfdbf (0x7fe268485dbf in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cuda.so) frame #3: <unknown function> + 0xf5df13 (0x7fe2a9de4f13 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #4: <unknown function> + 0xf5c2c3 (0x7fe2a9de32c3 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x53 (0x7fe2a9de4153 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #6: at::Tensor::copy_(at::Tensor const&, bool) const + 0x12d (0x7fe2aa8a665d in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #7: <unknown function> + 0x39caf87 (0x7fe2ac851f87 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #8: at::Tensor::copy_(at::Tensor const&, bool) const + 0x12d (0x7fe2aa8a665d in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #9: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) + 0x63b (0x7fe2aa07196b in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #10: <unknown function> + 0x1883eb4 (0x7fe2aa70aeb4 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #11: at::Tensor::to(c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x196 (0x7fe2aa8cb7d6 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/./libtorch_cpu.so) frame #12: _lantern_Tensor_to_tensor_device_scalartype_bool_bool_memoryformat + 0x85 (0x7fe2bbc2b705 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/deps/liblantern.so) frame #13: cpp_torch_method_to_self_Tensor_device_Device_dtype_ScalarType(XPtrTorchTensor, XPtrTorchDevice, XPtrTorchDtype, bool, bool, XPtrTorchMemoryFormat) + 0x40 (0x7fe2bc40df30 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/libs/torchpkg.so) frame #14: _torch_cpp_torch_method_to_self_Tensor_device_Device_dtype_ScalarType + 0x161 (0x7fe2bc32d121 in /home/ch192804/anaconda3/envs/torch/lib/R/library/torch/libs/torchpkg.so) frame #15: <unknown function> + 0xfee56 (0x7fe2d9a37e56 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #16: <unknown function> + 0x13a4b1 (0x7fe2d9a734b1 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #17: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #18: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #19: Rf_applyClosure + 0x175 (0x7fe2d9a83335 
in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #20: Rf_eval + 0x35c (0x7fe2d9a8085c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #21: <unknown function> + 0xcbd1e (0x7fe2d9a04d1e in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #22: <unknown function> + 0x13a4b1 (0x7fe2d9a734b1 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #23: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #24: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #25: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #26: <unknown function> + 0x13dd87 (0x7fe2d9a76d87 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #27: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #28: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #29: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #30: <unknown function> + 0x13dd87 (0x7fe2d9a76d87 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #31: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #32: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #33: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #34: Rf_eval + 0x35c (0x7fe2d9a8085c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #35: <unknown function> + 0x14ad57 (0x7fe2d9a83d57 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #36: Rf_eval + 0x5e4 (0x7fe2d9a80ae4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #37: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #38: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #39: Rf_eval + 0x35c (0x7fe2d9a8085c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #40: <unknown function> + 0xcbd1e (0x7fe2d9a04d1e in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #41: <unknown function> + 0x13a4b1 (0x7fe2d9a734b1 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #42: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #43: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #44: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #45: Rf_eval + 0x35c (0x7fe2d9a8085c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #46: <unknown function> + 0x14ad57 (0x7fe2d9a83d57 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #47: Rf_eval + 0x5e4 (0x7fe2d9a80ae4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #48: 
<unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #49: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #50: Rf_eval + 0x35c (0x7fe2d9a8085c in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #51: <unknown function> + 0x14ad57 (0x7fe2d9a83d57 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #52: Rf_eval + 0x5e4 (0x7fe2d9a80ae4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #53: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #54: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #55: <unknown function> + 0x13dd87 (0x7fe2d9a76d87 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #56: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #57: <unknown function> + 0x14801f (0x7fe2d9a8101f in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #58: Rf_eval + 0x4a4 (0x7fe2d9a809a4 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #59: <unknown function> + 0x18abb6 (0x7fe2d9ac3bb6 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #60: <unknown function> + 0x13b38b (0x7fe2d9a7438b in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #61: Rf_eval + 0x190 (0x7fe2d9a80690 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #62: <unknown function> + 0x1495b0 (0x7fe2d9a825b0 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so) frame #63: Rf_applyClosure + 0x175 (0x7fe2d9a83335 in /home/ch192804/anaconda3/envs/torch/lib/R/bin/exec/../../lib/libR.so)
This looks like a hard-to-reproduce error. Can you try setting the `CUDA_LAUNCH_BLOCKING=1` environment variable?
The error may have originated in an earlier CUDA operation, in which case the `copy_kernel` location in the traceback is misleading. `CUDA_LAUNCH_BLOCKING` makes CUDA run kernels synchronously, so the error is reported at the operation that actually failed.
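As a minimal sketch (not from the original thread): one way to do this from R is to set the variable before the first CUDA operation, e.g. before loading the package; setting it after the CUDA context already exists may have no effect.

```r
# Hypothetical session setup: set CUDA_LAUNCH_BLOCKING before any CUDA work so
# kernels run synchronously and errors point at the operation that failed.
Sys.setenv(CUDA_LAUNCH_BLOCKING = "1")
library(torch)
```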
I think the error looks the same:
`Error in (function (self, device, dtype, non_blocking, copy, memory_format) : CUDA error: device-side assert triggered
Exception raised from copy_kernel_cuda at /pytorch/aten/src/ATen/native/cuda/Copy.cu:200 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
Traceback:
- init_from_mod(mod0, count_matrix)
- Categorical(torch_tensor(1), T0[t_dn, r_dn, , zd[n]])
- as_array(probs$cpu())
- probs$cpu()
- self$to(device = torch_device("cpu"), memory_format = memory_format)
- do.call(private$`_to`, args)
- (function (device, options = list(), other, dtype, non_blocking = FALSE, . copy = FALSE, memory_format = NULL) . { . args <- mget(x = c("device", "options", "other", "dtype", . "non_blocking", "copy", "memory_format")) . args <- append(list(self = self), args) . expected_types <- list(self = "Tensor", device = "Device", . options = "TensorOptions", other = "Tensor", dtype = "ScalarType", . non_blocking = "bool", copy = "bool", memory_format = "MemoryFormat") . nd_args <- c("self", "device", "other", "dtype") . return_types <- list(list("Tensor")) . call_c_function(fun_name = "to", args = args, expected_types = expected_types, . nd_args = nd_args, return_types = return_types, fun_type = "method") . })(dtype = <pointer: 0x55c61ceb1850>, device = <pointer: 0x55c6a5f4afd0>, . non_blocking = FALSE, copy = FALSE, memory_format = <pointer: 0x55c6ac9912c0>)
- call_c_function(fun_name = "to", args = args, expected_types = expected_types, . nd_args = nd_args, return_types = return_types, fun_type = "method")
- do_call(f, args)
- do.call(fun, args)
- (function (self, device, dtype, non_blocking, copy, memory_format) . { . .Call("_torch_cpp_torch_method_to_self_Tensor_device_Device_dtype_ScalarType", . PACKAGE = "torchpkg", self, device, dtype, non_blocking, . copy, memory_format) . })(self = <pointer: 0x55c6a8056260>, device = <pointer: 0x55c6a5f4afd0>, . dtype = <pointer: 0x55c61ceb1850>, non_blocking = FALSE, . copy = FALSE, memory_format = <pointer: 0x55c6ac9912c0>)`
I also get something like this in my shell when running the Jupyter notebook:
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [95,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
OK, yes, this last post looks like the real cause. Can you share your code or an example that I can run to reproduce the error? This looks like an indexing error, but it could come from other operations, not necessarily from moving to the CPU.
...
Hmm, I guess that could make sense, since the input to the function where I'm calling the conversion to CPU is a tensor that I have indexed. The indices are computed separately.
```r
for (d in 1:length(cTrain)) {
  zd = Categorical(cTrain[d], theta[, d])
  md = Categorical(cTrain[d], m)
  for (n in 1:length(zd)) {
    t_dn = Categorical(torch_tensor(1), factors$bt[, zd[n]])
    r_dn = Categorical(torch_tensor(1), factors$br[, zd[n]])
    n_dn = Categorical(torch_tensor(1), factors$nuc[, zd[n]])
    e_dn = Categorical(torch_tensor(1), factors$epi[, zd[n]])
    c_dn = Categorical(torch_tensor(1), factors$clu[, zd[n]])
    v_dn = Categorical(torch_tensor(1), T0[t_dn, r_dn, , zd[n]]) ## this is where the error gets called
    if (md[n]$item() %in% c(3, 4)) {
      t_dn = 2
    }
    if (md[n]$item() %in% c(2, 4)) {
      r_dn = 2
    }
    Ytrain[t_dn, r_dn, e_dn, n_dn, c_dn, v_dn, d] = Ytrain[t_dn, r_dn, e_dn, n_dn, c_dn, v_dn, d] + 1
  }
}
```
```r
Categorical <- function(n_samples, probs) {
  # Draw n_samples categorical samples (values 1..K) with the given
  # probabilities, returning them as a long tensor on the current device.
  n_samples = n_samples$item()
  probs = as_array(probs$cpu())
  K = length(probs)
  tmp = rmultinom(n_samples, 1, prob = probs)
  tmp = colSums(c(1:K) * tmp)
  tmp = torch_tensor(tmp, dtype = torch_long(), device = device)
  return(tmp)
}
```
Hi @emauryg,
Sorry, I can't run this chunk because I don't know what `T0`, `cTrain`, and `factors` are...
Yes, it sounds possible that `T0[t_dn, r_dn, , zd[n]]` is doing out-of-bounds indexing and causing the error you are seeing.
Perhaps you could put print statements right before that indexing and make sure the values of `t_dn`, `r_dn`, and `zd[n]` are what you expect. Then also check the dimensions of `T0`.
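A minimal sketch of that kind of check (assuming `T0`, `t_dn`, `r_dn`, `zd`, and `n` exist as in the snippet above; the helper `check_index` is hypothetical): validate each index against `dim(T0)` on the R side before subsetting, so an out-of-bounds value raises an ordinary R error instead of a device-side assert.

```r
check_index <- function(idx, dim_size, name) {
  # idx is a scalar torch tensor holding a 1-based index
  i <- as.integer(idx$item())
  if (i < 1 || i > dim_size) {
    stop(sprintf("%s = %d is out of bounds for a dimension of size %d", name, i, dim_size))
  }
  invisible(i)
}

dims <- dim(T0)  # shape of T0, e.g. c(n_t, n_r, n_v, n_z)
check_index(t_dn,  dims[1], "t_dn")
check_index(r_dn,  dims[2], "r_dn")
check_index(zd[n], dims[4], "zd[n]")
v_dn <- Categorical(torch_tensor(1), T0[t_dn, r_dn, , zd[n]])
```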
So the error ended up being wrong indexing on the first dimension. It seems like a very drastic failure mode (it requires restarting the kernel when it happens) for such a common bug. Fixing the indexing allowed the code to run to completion. Thank you for the help.
I agree it's not practical. I'll try to figure out whether we can do something about it. The problem is that CUDA operations can be asynchronous, and there isn't an easy way to catch errors from the async ops in R.
Maybe a reasonable workaround is to debug on the CPU first?
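For example (a sketch, not from the thread; the `debug_on_cpu` flag is a made-up name): keep the device behind a flag so the same code can first be run on the CPU, where an out-of-bounds index surfaces immediately as an ordinary R error, before switching back to CUDA.

```r
library(torch)

debug_on_cpu <- TRUE  # flip to FALSE once the code runs cleanly on the CPU
device <- if (!debug_on_cpu && cuda_is_available()) torch_device("cuda") else torch_device("cpu")

x <- torch_randn(3, 4, device = device)
x[10, 1]  # on the CPU this fails right away with a clear out-of-bounds message
```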
I guess that might be a good temporary approach, but looking at the terminal output is also important, since in my case that is where the index-out-of-bounds error showed up. The main confusion came from the notebook reporting what looked like a CUDA memory issue instead.