RuntimeError: CUDA error: an illegal memory access was encountered
I ran it successfully a couple of times yesterday and enjoyed the results. Today I wanted to use a different collection of photos and try a couple of settings changes, but I keep getting "Something went wrong" soon after training begins. Any suggestions? Thanks for the tool, and the support.
2022-10-25 00:49:24.870199: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
Steps: 0% 2/4000 [00:11<5:37:55, 5.07s/it, loss=0.0975, lr=1e-6]Traceback (most recent call last):
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 694, in
run !nvidia-smi and see what GPU you have
In the Terminal on the page, when I enter !nvidia-smi I get: -bash: !nvidia: event not found
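Note: the ! prefix is only needed inside a notebook cell; in a bash terminal it triggers history expansion, which is what produces the "event not found" message, so plain nvidia-smi works there. If you'd rather check from inside the notebook with Python, here is a minimal sketch (assumes PyTorch with CUDA support is installed, as it is on these colabs):

```python
import torch

# Print the name of the visible CUDA device, e.g. "A100-SXM4-40GB",
# or note that PyTorch cannot see a GPU at all.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible to PyTorch")
```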
Tue Oct 25 08:08:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    45W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Make sure you uncheck "fp16"
Today, after loading a fresh instance with fp16 unchecked, I tried both the new fast method (renaming files) and the older one. Both times I got the RuntimeError: CUDA error soon after training began:
2022-10-25 19:19:04.947926: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
Steps: 0% 3/600 [00:12<30:11, 3.03s/it, loss=0.0979, lr=1e-6] bev Traceback (most recent call last):
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 694, in
It's due to the xformers installation on the A100; I will fix it very soon. Try getting a different Colab GPU in the meantime.
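For context, and not necessarily the workaround that was applied to the colab itself: the crash comes from an xformers build without kernels for the A100, and in the diffusers training code memory-efficient attention is optional, so one way to sidestep this class of error is to only enable xformers when the detected GPU is one the build supports. A minimal sketch, assuming xformers is installed, a diffusers version that exposes enable_xformers_memory_efficient_attention(), and using "runwayml/stable-diffusion-v1-5" purely as an example model id:

```python
import torch
from diffusers import UNet2DConditionModel

# Example model id; the point is the conditional enabling of xformers attention.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else ""
if "A100" in gpu_name:
    # Assumption: this xformers build lacks A100 (sm_80) kernels, which is what
    # leads to the "illegal memory access" crash once training starts.
    print("Skipping xformers memory-efficient attention on this GPU")
else:
    unet.enable_xformers_memory_efficient_attention()
```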
OK, I will look forward to that! In the meantime, is there a "safest" setting for the Colab GPU and RAM that is most likely to succeed, even if slower? Thanks again.
@TheLastBen You can get an A100 every time on Colab+ by selecting this
[screenshot: Colab runtime settings with the GPU class set to "Premium"]
So if I do NOT choose the "Premium" GPU, is the process more likely to succeed?
try again now and see if it works, I implemented a workaround
Upon seeing your last comment I interrupted a training using standard settings for GPU & RAM. It was set to take about 50 minutes for training (1500 steps, 98 images). With the GPU set to Premium and max RAM it is training now and should take about 9 minutes. Looking good!
What would be the quickest procedure for coming back in a couple of days and making use of the same trained model, without going through the whole process again? Or perhaps with such quick training it makes sense to just run a fresh instance.
if you're training on a different GPU than the A100, check the "fp16" box; 1500 steps would take about 30 min
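For reference, the speedup comes from mixed-precision training. In the underlying diffusers script this is controlled by the --mixed_precision argument, and my assumption is that the colab's "fp16" checkbox simply toggles it. A minimal sketch of the same idea with accelerate:

```python
from accelerate import Accelerator

# fp16 mixed precision: master weights stay in fp32, but most compute runs in
# fp16, which roughly halves memory use and step time on recent NVIDIA GPUs.
accelerator = Accelerator(mixed_precision="fp16")
```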
in the main repo, use the A1111 colab, and simply put the path to the trained model in the third cell and it will use it
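If you'd rather load the trained weights from a plain Python cell instead of the A1111 colab, here is a minimal sketch with a hypothetical output path and prompt (assumes the training run saved a diffusers-format folder):

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical path to the folder the training run saved in diffusers format.
model_dir = "/content/gdrive/MyDrive/my_dreambooth_model"

pipe = StableDiffusionPipeline.from_pretrained(model_dir, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Hypothetical prompt using whatever instance token the model was trained on.
image = pipe("a photo of my subject hiking in a forest").images[0]
image.save("sample.png")
```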
Thanks again, it is working great. One more question...
I made a trained model using 2000 steps, saving every 500. Why is the final ckpt 4 GB while the others (including 2000.ckpt) are 2 GB? How should I expect that to affect results?
you unchecked the "fp16" box, so the final result is saved in fp32, but the savepoints are kept in fp16 to save space. You should always use fp16; it's double the speed with the same quality.
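The size gap is just bytes per parameter: fp32 stores 4 bytes per weight and fp16 stores 2, so a checkpoint with on the order of a billion parameters comes out around 4 GB in fp32 and around 2 GB in fp16. If you already have an fp32 .ckpt and want the smaller file, here is a rough sketch of the conversion (assumes the usual layout with a "state_dict" key; file names are hypothetical):

```python
import torch

# Load the fp32 checkpoint and cast every floating-point tensor to fp16,
# which roughly halves the file size; generation quality is essentially the same.
ckpt = torch.load("model_fp32.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

for key, value in state_dict.items():
    if isinstance(value, torch.Tensor) and value.is_floating_point():
        state_dict[key] = value.half()

torch.save(ckpt, "model_fp16.ckpt")
```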