warp-transducer
CUDA error: an illegal memory access was encountered
Hello, I'm facing the following error when using your package. It appears randomly after some epochs. Do you have any idea where it could come from?
File "main_rnnt.py", line 86, in <module>
model.train()
File "/gpfs1/dlocal/run/7027505/pytorch/rnnt/RNNT.py", line 174, in train
batch_metrics = self.train_batch(x, y)
File "/gpfs1/dlocal/run/7027505/pytorch/rnnt/RNNT.py", line 286, in train_batch
loss = loss_func(pred, y.permute(1, 0).contiguous(), x_len, y_len)
File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/warprnnt_pytorch-0.1-py3.6-linux-x86_64.egg/warprnnt_pytorch/__init__.py", line 100, in forward
return self.loss(acts, labels, act_lens, label_lens, self.blank, self.reduction)
File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/warprnnt_pytorch-0.1-py3.6-linux-x86_64.egg/warprnnt_pytorch/__init__.py", line 40, in forward
grads /= minibatch_size
RuntimeError: CUDA error: an illegal memory access was encountered
CentOS 7, CUDA 10.0, Python 3.6.9, torch 1.2, gcc 7.3.0, GPU: Tesla P100-PCIE-12GB
Getting the same. Any fix? @FactoDeepLearning @HawkAaron
EDIT: This was due to me not moving acts, labels, input_len, and label_len to .cuda() in PyTorch. Fixed now.
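For reference, a minimal sketch of what that change looks like (the shapes and variable names are dummies for illustration, assuming the (B, T, U+1, V) activation layout this binding expects):

import torch
from warprnnt_pytorch import RNNTLoss

# Dummy shapes, just for illustration: batch=2, T=50, U=10, vocab=29.
B, T, U, V = 2, 50, 10, 29
device = torch.device('cuda')

# acts: (B, T, U+1, V) joint-network outputs; the other three are int32 tensors.
acts = torch.randn(B, T, U + 1, V, device=device, requires_grad=True)
labels = torch.randint(1, V, (B, U), dtype=torch.int32, device=device)
act_lens = torch.full((B,), T, dtype=torch.int32, device=device)
label_lens = torch.full((B,), U, dtype=torch.int32, device=device)

rnnt_loss = RNNTLoss()  # blank label defaults to 0
loss = rnnt_loss(acts, labels, act_lens, label_lens)  # all four tensors on the same CUDA device
loss.backward()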
EDIT 2: I'm still getting it now. It trains at first, then hits this error after X iterations.
After some debugging, I think there might be a bug in this library @HawkAaron. I am printing the cost at this line https://github.com/HawkAaron/warp-transducer/blob/master/pytorch_binding/warprnnt_pytorch/init.py#L37, and the RuntimeError: CUDA error: an illegal memory access was encountered only happens when the cost prints out as 0. I am assuming that the loss_fn https://github.com/HawkAaron/warp-transducer/blob/master/pytorch_binding/warprnnt_pytorch/init.py#L27 is not updating the costs or gradients, causing it to error out. Any ideas?
Also, this issue goes away when running on CPU, and there are no 0 costs there.
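The check I used while printing the costs looks roughly like this (a hypothetical wrapper around the loss call, not part of the library):

import torch

def guarded_rnnt_step(rnnt_loss, acts, labels, act_lens, label_lens):
    # Compute the RNNT loss, but log and skip backward when the cost looks suspicious;
    # every crash I saw was preceded by a cost of exactly 0.
    loss = rnnt_loss(acts, labels, act_lens, label_lens)
    if loss.item() == 0.0 or not torch.isfinite(loss):
        print("suspicious RNNT cost:", loss.item(),
              "act_lens:", act_lens.tolist(), "label_lens:", label_lens.tolist())
        return None
    loss.backward()
    return loss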
Same issues.
I think #64 will fix this issue.
My version is the latest. When using warp-transducer in ESPnet, the error still occurs: "CUDA error: an illegal memory access was encountered". I discussed it in the ESPnet project, but they think it is a problem with the transducer.
https://github.com/espnet/espnet/issues/1860#issuecomment-651040485
My warp-transducer version is as follows:

Merge: c1a265f 5098002
Author: Mingkun Huang [email protected]
Date: Mon Apr 27 23:07:35 2020 +0800

    Merge pull request #66 from kamo-naoyuki/pt1.5

    Support pytorch1.5
@housebaby which kind of GPU did you use?
Tesla V100
It will not always fail. In some cases, using either 4 or 8 cards, it works. But when I just change the batch size (or learning rate) of the successful case, it fails. It is confusing.
Same issue. When batch_size=3 it passes; when the batch size is set higher, it fails.
Oh, right, there's an overflow issue at compute_grad_kernel:

// 0 <= col < batch * T * U
int col = blockIdx.x;
// col * alphabet_size can be > 2**31 - 1 = INT_MAX, but its type is int
Tp logpk = denom[col] + acts[col * alphabet_size + idx];

cuda-memcheck seems to catch such a problem with batch=1, src=53688, tgt=1+1, vocab=20000 (53688 * 2 * 20000 > INT_MAX). I also suspect that there are similar overflow issues at ReduceHelper, but I haven't checked them properly.
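For anyone who wants to check whether their shapes hit this limit, here is a quick sketch (it assumes the flat offset is col * alphabet_size + idx with 0 <= col < batch * T * U, as in the excerpt above; the helper name is mine):

# Quick overflow check, assuming offset = col * alphabet_size + idx as above.
INT_MAX = 2**31 - 1

def max_rnnt_offset(batch, max_t, max_u_plus_1, vocab):
    # Largest flat index into acts that compute_grad_kernel would form.
    max_col = batch * max_t * max_u_plus_1 - 1
    return max_col * vocab + (vocab - 1)

# The failing case reported above: batch=1, src=53688, tgt=1+1, vocab=20000.
offset = max_rnnt_offset(1, 53688, 2, 20000)
print(offset, offset > INT_MAX)  # 2147519999 True -> the 32-bit int index overflows

This would also explain why only the larger batch sizes above fail: the largest offset grows with batch * T * U * vocab.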
Cool. Then how should we solve this overflow problem? And will a fix for it be merged into warp-transducer soon? @HawkAaron @jaesong
I don't know if this is related, but after upgrading to TensorFlow 2.5.0 (and therefore to CUDA 11.1) I am seeing this when training RNN-based transducer models. The loss either becomes nan or I see the following error:
2021-06-17 17:23:44.905116: E tensorflow/stream_executor/dnn.cc:729] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1990): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2021-06-17 17:23:44.905169: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cudnn_rnn_ops.cc:1560 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 768, 768, 1, 29, 41, 768]
2021-06-17 17:23:44.906664: I tensorflow/stream_executor/stream.cc:1404] [stream=0x55774c2eb680,impl=0x5577394acab0] did not wait for [stream=0x55774c2eb410,impl=0x5577266661f0]
2021-06-17 17:23:44.906810: E tensorflow/stream_executor/cuda/cuda_driver.cc:1085] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906826: E tensorflow/stream_executor/cuda/cuda_driver.cc:1085] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906841: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906859: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:721] failed to record completion event; therefore, failed to create inter-stream dependency
2021-06-17 17:23:44.906872: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906888: E tensorflow/stream_executor/stream.cc:334] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2021-06-17 17:23:44.906903: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
2021-06-17 17:23:44.906911: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906920: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fec7589a700; host src: 0x7fec55458200; size: 4=0x4
2021-06-17 17:23:44.906934: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
2021-06-17 17:23:44.906946: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fed1b838100; host src: 0x7fe28e26b040; size: 24531156=0x17650d4
2021-06-17 17:23:44.906960: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fecaa6b1b00; host src: 0x7fec55457a00; size: 164=0xa4
2021-06-17 17:23:44.906974: E tensorflow/stream_executor/cuda/cuda_driver.cc:1182] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x7fec5545af00; GPU src: 0x7fe75f100d00; size: 31980=0x7cec
2021-06-17 17:23:44.906987: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Fatal Python error: Aborted
Thread 0x00007fec57a63700 (most recent call first):
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/multiprocessing/pool.py"
Aborted (core dumped)
It's possible that this has nothing to do with https://github.com/HawkAaron/warp-transducer but it's the only external library I am using in combination with Tensorflow.
See also https://github.com/tensorflow/tensorflow/issues/50326
Hi @stefan-falk, did you resolve the issue? I have a similar problem with TF 2.8.2 + CUDA 11.2 + warp-rnnt. The issue occurs only on multi-GPU.