PL-BERT

RuntimeError: CUDA error: device-side assert triggered on criterion

Open junylee11 opened this issue 7 months ago • 0 comments

I saw an earlier issue about this error (#28), but I still don't know how to solve it.

I don't know how to write code that skips the error. Can you tell me the solution?

The error occurred in this code:

```python
accelerator.print('Start training...')

running_loss = 0

for _, batch in enumerate(train_loader):
    curr_steps += 1

    words, labels, phonemes, input_lengths, masked_indices = batch
    text_mask = length_to_mask(torch.Tensor(input_lengths))  # .to(device)

    tokens_pred, words_pred = bert(phonemes, attention_mask=(~text_mask).int())

    loss_vocab = 0
    for _s2s_pred, _text_input, _text_length, _masked_indices in zip(words_pred, words, input_lengths, masked_indices):
        loss_vocab += criterion(_s2s_pred[:_text_length], _text_input[:_text_length])  # Here!!
    loss_vocab /= words.size(0)
```
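A cross-entropy loss expects every target to be a class index in the range `[0, n_classes)`, apart from the `ignore_index` value (PyTorch's default is `-100`). One possible way to skip offending samples is to check the targets on the host before calling `criterion`. The following is only a minimal sketch: the helper name `out_of_range_targets` is hypothetical and not part of PL-BERT, and it works on plain Python integers, so a tensor would first need `.tolist()`.

```python
def out_of_range_targets(targets, n_classes, ignore_index=-100):
    """Return (position, value) pairs for targets outside [0, n_classes).

    `targets` is any iterable of ints (for a tensor, pass tensor.tolist()).
    Values equal to `ignore_index` are skipped, since the loss ignores them.
    """
    bad = []
    for pos, t in enumerate(targets):
        if t == ignore_index:
            continue
        if t < 0 or t >= n_classes:
            bad.append((pos, t))
    return bad


# Toy data: with 8 classes, valid indices are 0..7.
example_targets = [3, 0, 7, 12, -100, -1]
print(out_of_range_targets(example_targets, n_classes=8))
# → [(3, 12), (5, -1)]
```

In the loop above, this could be called with `_text_input[:_text_length].tolist()` and `_s2s_pred.size(-1)` (assuming the last dimension of `words_pred` is the number of word classes), followed by `continue` when the returned list is non-empty. Skipping only hides the symptom, though: out-of-range values usually point to a mismatch between the vocabulary used to build the dataset and the size of the model's output layer.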

C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [8,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [10,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [11,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [12,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [13,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
  File "C:\Users\user_\Desktop\PL-BERT-KO\train_infer.py", line 198, in <module>
    notebook_launcher(train, args=(), num_processes=1)
  File "C:\Users\user_\anaconda3\envs\PL-BERT-KO\lib\site-packages\accelerate\launchers.py", line 207, in notebook_launcher
    function(*args)
  File "C:\Users\user_\Desktop\PL-BERT-KO\train_infer.py", line 147, in train
    loss_vocab += criterion(_s2s_pred[:_text_length], _text_input[:_text_length])
  File "C:\Users\user\anaconda3\envs\PL-BERT-KO\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\PL-BERT-KO\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\PL-BERT-KO\lib\site-packages\torch\nn\modules\loss.py", line 1179, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "C:\Users\user\anaconda3\envs\PL-BERT-KO\lib\site-packages\torch\nn\functional.py", line 3053, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
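Because CUDA kernels are launched asynchronously, the Python line shown in a traceback like this one is not always the line that actually failed. A common debugging step is to set the `CUDA_LAUNCH_BLOCKING` environment variable to `1` before CUDA is initialized, so that each kernel launch is synchronous and the error surfaces at the real call site. The sketch below assumes it runs at the very top of the training script, before `torch` touches the GPU; the function name `enable_sync_cuda_errors` is hypothetical.

```python
import os


def enable_sync_cuda_errors():
    """Make CUDA kernel launches synchronous for easier debugging.

    Must run before CUDA is initialized (ideally before `import torch`),
    otherwise the setting may not take effect. This slows training down,
    so it is meant for debugging runs only.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_sync_cuda_errors()
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Running the same batch on the CPU is another option, since the CPU version of the loss typically raises a regular Python exception that describes the out-of-bounds target instead of a device-side assert.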

junylee11 · Jan 12 '24 10:01