PaLM-pytorch cuDNN error: CUDNN_STATUS_INTERNAL

cuDNN error: CUDNN_STATUS_INTERNAL_ERROR error

Open unwritten opened this issue 3 years ago • 2 comments

The code segment below reports the error in the title during multi-GPU training:

    # rotary embeddings
    positions = self.get_rotary_embedding(n, device)
    q, k = map(lambda t: apply_rotary_pos_emb(positions, t), (q, k))

Apr 15 '22 08:04 unwritten

hmm, are you sure you aren't OOM?

Apr 19 '22 20:04 lucidrains

Are you using a specific library for parallel training: Horovod, PyTorch Lightning, FairScale, DeepSpeed, or PyTorch's own model = nn.DataParallel(model)? So far I have tested multi-GPU use with both DeepSpeed and model = nn.DataParallel(model). cuDNN errors can be quite difficult to debug. Have you tried running on CPU, or using .detach()?
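
One way to act on the "try it on CPU" suggestion is sketched below; it is not taken from this repository. It reruns a training script with CUDA_LAUNCH_BLOCKING=1, so the traceback points at the operation that actually failed, and can optionally hide every GPU via CUDA_VISIBLE_DEVICES. The helper names debug_env and run_script, and the script path passed in, are hypothetical.

```python
import os
import subprocess
import sys


def debug_env(force_cpu=False, base=None):
    """Return a copy of the environment with CUDA debugging settings applied.

    Hypothetical helper, not part of PaLM-pytorch.
    """
    env = dict(os.environ if base is None else base)
    # Make CUDA kernel launches synchronous so the Python traceback points at
    # the operation that actually failed rather than a later one.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    if force_cpu:
        # An empty device list hides every GPU from the child process, so a
        # script that picks its device dynamically will run on CPU instead.
        env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def run_script(script, force_cpu=False):
    """Run `script` (a hypothetical training entry point) in a child process."""
    return subprocess.run(
        [sys.executable, script],
        env=debug_env(force_cpu=force_cpu),
        check=False,
    )
```

For example, run_script("train.py", force_cpu=True) would rerun a (hypothetical) train.py without GPUs. If the same step then fails with an ordinary Python exception such as a shape or index error, that usually says more than the cuDNN status code does.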

Apr 21 '22 15:04 conceptofmind
