Recent change to train_dreambooth causing CUDNN_STATUS_INTERNAL_ERROR (WSL install)
Describe the bug
When using DreamBooth in WSL (Ubuntu 20.04), a recent change causes training to fail. Undoing this single change fixes the issue on the most recent check-in (f94be89).
  input_ids = tokenizer.pad(
      {"input_ids": input_ids},
-     padding="max_length",
-     max_length=tokenizer.model_max_length,
+     padding=True,
      return_tensors="pt",
  ).input_ids
I can use this workaround, and I understand my particular setup may not be supported, but I wanted to share it in case it is affecting others or in case there is something I can do to fix it.
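For readers unfamiliar with the two padding modes in the diff, here is a minimal pure-Python sketch of how they differ in the shapes they produce. It does not call the real tokenizer: `pad_batch`, `PAD_ID`, and the example token IDs are hypothetical, and `MODEL_MAX_LENGTH = 77` is an assumption based on the CLIP tokenizer typically used with Stable Diffusion. Whether the shape difference is what triggers the cuDNN error is not established here.

```python
from typing import List, Optional

MODEL_MAX_LENGTH = 77  # assumption: model_max_length of the CLIP tokenizer
PAD_ID = 0             # hypothetical pad token id, for illustration only


def pad_batch(
    batch: List[List[int]], pad_id: int, max_length: Optional[int] = None
) -> List[List[int]]:
    """Right-pad every sequence in the batch to a common length.

    max_length=None mimics padding=True (pad to the longest sequence in
    the batch); an integer mimics padding="max_length".
    """
    target = max(len(ids) for ids in batch) if max_length is None else max_length
    if any(len(ids) > target for ids in batch):
        raise ValueError("a sequence is longer than the target length")
    return [ids + [pad_id] * (target - len(ids)) for ids in batch]


# Two hypothetical batches whose prompts tokenize to different lengths.
batch_a = [[1, 2, 3], [4, 5]]
batch_b = [[1, 2, 3, 4, 5, 6], [7]]

# Like padding=True: the padded width depends on each batch's contents.
dynamic_a = pad_batch(batch_a, PAD_ID)
dynamic_b = pad_batch(batch_b, PAD_ID)

# Like padding="max_length": every batch has the same fixed width.
fixed_a = pad_batch(batch_a, PAD_ID, MODEL_MAX_LENGTH)
fixed_b = pad_batch(batch_b, PAD_ID, MODEL_MAX_LENGTH)

print(len(dynamic_a[0]), len(dynamic_b[0]))  # 3 6
print(len(fixed_a[0]), len(fixed_b[0]))      # 77 77
```

With the dynamic mode, the sequence dimension can change from step to step; with the fixed mode, it stays constant for the whole run.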
Reproduction
Set up this repo in WSL using the instructions here: https://pastebin.com/uE1WcSxD
Tweak one step:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117
Execute training using the script: mytraining.txt
Logs
Traceback (most recent call last):
File "/home/username/github/diffusers/examples/dreambooth/train_dreambooth.py", line 824, in <module>
main(args)
File "/home/username/github/diffusers/examples/dreambooth/train_dreambooth.py", line 788, in main
accelerator.backward(loss)
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py", line 882, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
ConvolutionParams
memory_format = ChannelsLast
data_type = CUDNN_DATA_HALF
padding = [0, 0, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7fce222a4920
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 2, 640, 16, 16,
strideA = 163840, 1, 10240, 640,
output: TensorDescriptor 0x7fce300c0c30
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 2, 1280, 16, 16,
strideA = 327680, 1, 20480, 1280,
weight: FilterDescriptor 0x7fce222857c0
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NHWC
nbDims = 4
dimA = 1280, 640, 1, 1,
Pointer addresses:
input: 0x85a600000
output: 0x7814c0000
weight: 0x856c48000
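As an aid to reading the dump above, the `strideA` values can be checked against the reported `memory_format`. The sketch below is plain arithmetic with hypothetical helper names (`contiguous_strides`, `channels_last_strides`); it only confirms that the input and output tensors have channels-last (NHWC) strides for their `dimA`, and says nothing about the cause of the error.

```python
from typing import Tuple

Dims = Tuple[int, int, int, int]


def contiguous_strides(n: int, c: int, h: int, w: int) -> Dims:
    """Element strides of an (N, C, H, W) tensor stored in NCHW order."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n: int, c: int, h: int, w: int) -> Dims:
    """Element strides of an (N, C, H, W) tensor stored in NHWC order."""
    return (h * w * c, 1, w * c, c)


# dimA values copied from the dump above.
input_dims = (2, 640, 16, 16)
output_dims = (2, 1280, 16, 16)

print(channels_last_strides(*input_dims))   # (163840, 1, 10240, 640)
print(channels_last_strides(*output_dims))  # (327680, 1, 20480, 1280)
```

Both results match the `strideA` lines in the log, which is consistent with `memory_format = ChannelsLast`.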
System Info
- diffusers version: 0.8.0.dev0
- Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31
- Python version: 3.9.13
- PyTorch version (GPU?): 1.13.0+cu117 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.24.0
I'm having this exact same problem:
import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 320, 64, 64], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(320, 4, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
memory_format = Contiguous
data_type = CUDNN_DATA_HALF
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x5dce960
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 2, 320, 64, 64,
strideA = 1310720, 4096, 64, 1,
output: TensorDescriptor 0xa7edd830
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 2, 4, 64, 64,
strideA = 16384, 4096, 64, 1,
weight: FilterDescriptor 0x7f62bc02e570
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 4, 320, 3, 3,
Pointer addresses:
input: 0x7f5b1a000000
output: 0x7f5cbbdea000
weight: 0x7f5e64dfa000
I fixed it by updating input_ids:
input_ids = tokenizer.pad(
    {"input_ids": input_ids},
    padding="max_length",
    max_length=tokenizer.model_max_length,
    # padding=True,
    return_tensors="pt",
).input_ids