RuntimeError: CUDA error: an illegal memory access was encountered
I ran it successfully a couple of times yesterday and enjoyed the results. Today I wanted to use a different collection of photos and try a couple of settings changes, but I keep getting "Something went wrong" soon after training begins. Any suggestions? Thanks for the tool, and the support.
2022-10-25 00:49:24.870199: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
Steps: 0% 2/4000 [00:11<5:37:55, 5.07s/it, loss=0.0975, lr=1e-6]Traceback (most recent call last):
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 694, in
run !nvidia-smi and see what GPU you have
In the Terminal on the page, when I enter !nvidia-smi I get: -bash: !nvidia: event not found
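Note: the ! prefix is only needed inside a notebook cell; in a bash terminal it triggers history expansion, which is what produces the "event not found" message, so plain nvidia-smi works there. If you'd rather check from inside the notebook with Python, here is a minimal sketch (assumes PyTorch with CUDA support is installed, as it is on these colabs):

```python
import torch

# Print the name of the visible CUDA device, e.g. "A100-SXM4-40GB",
# or note that PyTorch cannot see a GPU at all.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible to PyTorch")
```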
Tue Oct 25 08:08:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    45W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Make sure you uncheck "fp16"
Today, after loading a fresh instance with fp16 unchecked, I tried both the new fast method (renaming files) and the older one. Both times I got the RuntimeError: CUDA error soon after training began:
2022-10-25 19:19:04.947926: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
Steps: 0% 3/600 [00:12<30:11, 3.03s/it, loss=0.0979, lr=1e-6] bev Traceback (most recent call last):
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 694, in
It's due to the xformers installation on the A100; I will fix it very soon. Try getting a different Colab GPU in the meantime.
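For context, and not necessarily the workaround that was applied to the colab itself: the crash comes from an xformers build without kernels for the A100, and in the diffusers training code memory-efficient attention is optional, so one way to sidestep this class of error is to only enable xformers when the detected GPU is one the build supports. A minimal sketch, assuming xformers is installed, a diffusers version that exposes enable_xformers_memory_efficient_attention(), and using "runwayml/stable-diffusion-v1-5" purely as an example model id:

```python
import torch
from diffusers import UNet2DConditionModel

# Example model id; the point is the conditional enabling of xformers attention.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else ""
if "A100" in gpu_name:
    # Assumption: this xformers build lacks A100 (sm_80) kernels, which is what
    # leads to the "illegal memory access" crash once training starts.
    print("Skipping xformers memory-efficient attention on this GPU")
else:
    unet.enable_xformers_memory_efficient_attention()
```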
OK, I will look forward to that! In the meantime, is there a "safest" setting for the Colab GPU and RAM that is most likely to succeed, even if slower? Thanks again.
@TheLastBen You can get an A100 every time on Colab+ by selecting this
[screenshot: Colab runtime settings with the GPU class set to "Premium"]
So if I do NOT choose the "Premium" GPU, is the process more likely to succeed?
try again now and see if it works, I implemented a workaround
Upon seeing your last comment I interrupted a training using standard settings for GPU & RAM. It was set to take about 50 minutes for training (1500 steps, 98 images). With the GPU set to Premium and max RAM it is training now and should take about 9 minutes. Looking good!
What would be the quickest procedure for coming back in a couple of days and making use of the same trained model, without going through the whole process again? Or perhaps with such quick training it makes sense to just run a fresh instance.
if you're training on a different GPU than the A100, check the "fp16" box; 1500 steps would take about 30 min
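For reference, the speedup comes from mixed-precision training. In the underlying diffusers script this is controlled by the --mixed_precision argument, and my assumption is that the colab's "fp16" checkbox simply toggles it. A minimal sketch of the same idea with accelerate:

```python
from accelerate import Accelerator

# fp16 mixed precision: master weights stay in fp32, but most compute runs in
# fp16, which roughly halves memory use and step time on recent NVIDIA GPUs.
accelerator = Accelerator(mixed_precision="fp16")
```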
in the main repo, use the A1111 colab, and simply put the path to the trained model in the third cell and it will use it
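If you'd rather load the trained weights from a plain Python cell instead of the A1111 colab, here is a minimal sketch with a hypothetical output path and prompt (assumes the training run saved a diffusers-format folder):

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical path to the folder the training run saved in diffusers format.
model_dir = "/content/gdrive/MyDrive/my_dreambooth_model"

pipe = StableDiffusionPipeline.from_pretrained(model_dir, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Hypothetical prompt using whatever instance token the model was trained on.
image = pipe("a photo of my subject hiking in a forest").images[0]
image.save("sample.png")
```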
Thanks again, it is working great. One more question...
I made a trained model using 2000 steps, saving every 500. Why is the final ckpt 4 GB while the others (including 2000.ckpt) are 2 GB? How should I expect that to affect results?
you unchecked the "fp16" box, so the final result is saved in fp32, but the savepoints are kept in fp16 to save space. You should always use fp16; it's double the speed with the same quality.
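The size gap is just bytes per parameter: fp32 stores 4 bytes per weight and fp16 stores 2, so a checkpoint with on the order of a billion parameters comes out around 4 GB in fp32 and around 2 GB in fp16. If you already have an fp32 .ckpt and want the smaller file, here is a rough sketch of the conversion (assumes the usual layout with a "state_dict" key; file names are hypothetical):

```python
import torch

# Load the fp32 checkpoint and cast every floating-point tensor to fp16,
# which roughly halves the file size; generation quality is essentially the same.
ckpt = torch.load("model_fp32.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

for key, value in state_dict.items():
    if isinstance(value, torch.Tensor) and value.is_floating_point():
        state_dict[key] = value.half()

torch.save(ckpt, "model_fp16.ckpt")
```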