Recent change to train_dreambooth causing CUDNN_STATUS_INTERNAL_ERROR (WSL install)

Open • astrobread opened this issue 3 years ago • 1 comment

Describe the bug

When running DreamBooth training in WSL (Ubuntu 20.04), a recent change causes training to fail. Reverting this single change fixes the issue on the most recent commit (f94be89).

input_ids = tokenizer.pad(
            {"input_ids": input_ids},
-            padding="max_length",
-            max_length=tokenizer.model_max_length,
+            padding=True,
            return_tensors="pt",
        ).input_ids

I can use this workaround, and I understand my particular setup may not be supported, but I wanted to share it in case it is affecting others or there is something I can do to fix it.
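To illustrate what the diff above changes, here is a minimal sketch of the two padding modes. `pad_batch` is a hypothetical stand-in written for this example, not the transformers implementation, and the pad id and the length of 77 are assumptions (77 is the usual `model_max_length` of the CLIP tokenizer used with Stable Diffusion). With `padding=True` the batch is padded only to its longest sequence, so the tensor shape can vary from batch to batch; with `padding="max_length"` every batch has the same fixed length. The thread does not establish whether that shape difference is what triggers the cuDNN error.

```python
# Illustration only: a hypothetical helper that mimics the two padding
# modes of tokenizer.pad. It is NOT the transformers implementation.

PAD_ID = 0             # assumed pad token id
MODEL_MAX_LENGTH = 77  # assumed value of tokenizer.model_max_length


def pad_batch(batch, padding, max_length=None, pad_id=PAD_ID):
    """Right-pad every sequence in `batch` to a common length."""
    if padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required for padding='max_length'")
        target = max_length          # fixed length, independent of the batch
    elif padding is True:
        target = max(len(ids) for ids in batch)  # longest sequence in the batch
    else:
        raise ValueError(f"unsupported padding mode: {padding!r}")
    # Sequences longer than `target` are left as-is (no truncation here).
    return [ids + [pad_id] * (target - len(ids)) for ids in batch]


# Arbitrary token ids, chosen only to give two different lengths.
batch = [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6]]

dynamic = pad_batch(batch, padding=True)
fixed = pad_batch(batch, padding="max_length", max_length=MODEL_MAX_LENGTH)

print([len(ids) for ids in dynamic])  # lengths follow the longest prompt
print([len(ids) for ids in fixed])    # lengths are always MODEL_MAX_LENGTH
```

In the real script the same comparison can be made by printing `input_ids.shape` after the `tokenizer.pad` call with each setting.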

Reproduction

Set up this repo in WSL using the instructions here: https://pastebin.com/uE1WcSxD
Tweak one step: pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117
Execute training using the script: mytraining.txt

Logs

Traceback (most recent call last):
  File "/home/username/github/diffusers/examples/dreambooth/train_dreambooth.py", line 824, in <module>
    main(args)
  File "/home/username/github/diffusers/examples/dreambooth/train_dreambooth.py", line 788, in main
    accelerator.backward(loss)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py", line 882, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

ConvolutionParams
    memory_format = ChannelsLast
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x7fce222a4920
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 2, 640, 16, 16,
    strideA = 163840, 1, 10240, 640,
output: TensorDescriptor 0x7fce300c0c30
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 2, 1280, 16, 16,
    strideA = 327680, 1, 20480, 1280,
weight: FilterDescriptor 0x7fce222857c0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NHWC
    nbDims = 4
    dimA = 1280, 640, 1, 1,
Pointer addresses:
    input: 0x85a600000
    output: 0x7814c0000
    weight: 0x856c48000

System Info

diffusers version: 0.8.0.dev0
Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python version: 3.9.13
PyTorch version (GPU?): 1.13.0+cu117 (True)
Huggingface_hub version: 0.10.1
Transformers version: 4.24.0
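The list above does not include the cuDNN build, which may be relevant for a cuDNN error. A small sketch for collecting it, assuming PyTorch is installed in the active environment; the helper name `collect_gpu_build_info` is made up for this example:

```python
# Sketch: report the CUDA and cuDNN builds that PyTorch was compiled against.
# `collect_gpu_build_info` is a hypothetical helper name, not a library API.

def collect_gpu_build_info():
    """Return CUDA/cuDNN build details, or {'torch': None} if torch is missing."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,               # None on CPU-only builds
        "cudnn_build": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }


info = collect_gpu_build_info()
for key, value in info.items():
    print(f"{key}: {value}")
```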

Nov 08 '22 13:11 astrobread

I'm having this exact same problem:

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 320, 64, 64], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(320, 4, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    memory_format = Contiguous
    data_type = CUDNN_DATA_HALF
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x5dce960
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 2, 320, 64, 64,
    strideA = 1310720, 4096, 64, 1,
output: TensorDescriptor 0xa7edd830
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 2, 4, 64, 64,
    strideA = 16384, 4096, 64, 1,
weight: FilterDescriptor 0x7f62bc02e570
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 4, 320, 3, 3,
Pointer addresses:
    input: 0x7f5b1a000000
    output: 0x7f5cbbdea000
    weight: 0x7f5e64dfa000

I fixed it by reverting the padding change in the input_ids call:

input_ids = tokenizer.pad(
    {"input_ids": input_ids},
    padding="max_length",
    max_length=tokenizer.model_max_length,
    # padding=True,
    return_tensors="pt",
).input_ids

Nov 09 '22 09:11 JantineD