Dreambooth-Stable-Diffusion

Training gets "Killed"? Not sure why

Open drfleetles opened this issue 3 years ago • 22 comments

I followed the steps in the video Aitrepreneur made and I keep coming across this error: [screenshot]

Not sure what is happening

drfleetles avatar Sep 30 '22 00:09 drfleetles

Mine has also been doing this. Any fix you've found?

Puremask avatar Sep 30 '22 02:09 Puremask

same here!

thommorais avatar Sep 30 '22 03:09 thommorais

FIX FOUND: In Jupyter Lab, open a new terminal tab and type "ps aux" to find the PIDs for relauncher and webui, kill both processes, and restart the training.
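For reference, the terminal steps look roughly like this (the PIDs below are placeholders; use whatever your own ps aux output shows for relauncher and webui):

ps aux | grep 'relauncher\|webui'   # list the matching processes; the PID is the second column
kill -9 12345 12346                 # placeholder PIDs: replace with the relauncher and webui PIDs from above
ps aux | grep 'relauncher\|webui'   # check again; only the grep line itself should remain

After that, re-run the training command as usual.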

thommorais avatar Sep 30 '22 04:09 thommorais

There is a thread on YouTube discussing this: https://www.youtube.com/watch?v=7m__xadX0z0&lc=UgzHSfUz5CihSyO8nzN4AaABAg.9gVbwSnCOg19gW16WBw0RG

thommorais avatar Sep 30 '22 04:09 thommorais

[screenshot]

drfleetles avatar Sep 30 '22 04:09 drfleetles

[screenshot] After a couple of steps it still gets killed.

drfleetles avatar Sep 30 '22 04:09 drfleetles

Do you guys have an even number of pics? I had 64 on Vast yesterday and trained to 2000 steps with no issues.

1blackbar avatar Sep 30 '22 11:09 1blackbar

Same here... It worked before, then stopped working. Now I have started from the beginning and I get this:

Epoch 0: 0%| | 1/2020 [00:03<1:44:35, 3.11s/it, loss=0.0379, v_num=0, train/lHere comes the checkpoint... Killed

Yes: an even number of images (0 to 19 = 20 pictures)

Kallamamran avatar Sep 30 '22 11:09 Kallamamran

Hey! I had the same problem several times, and this could just be a coincidence, but on a 2x GPU machine, setting --gpus 1 made it run all the way to the end.
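For reference, this is roughly where that flag goes, assuming the usual main.py invocation from this repo; everything other than --gpus 1 below is a placeholder (config path, checkpoint, data directory, project name), so substitute whatever you normally pass:

python main.py \
    --base configs/stable-diffusion/v1-finetune_unfrozen.yaml -t \
    --actual_resume sd-v1-4.ckpt \
    --data_root /workspace/training_images \
    -n my_project \
    --gpus 1    # train on a single GPU even though the machine exposes two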

roar-emaus avatar Sep 30 '22 14:09 roar-emaus

> Same here... It worked before, then stopped working. Now I have started from the beginning and I get this:
>
> Epoch 0: 0%| | 1/2020 [00:03<1:44:35, 3.11s/it, loss=0.0379, v_num=0, train/lHere comes the checkpoint... Killed
>
> Yes: an even number of images (0 to 19 = 20 pictures)

Try the first solution that I posted, the screenshot of the YouTube comment.

drfleetles avatar Sep 30 '22 23:09 drfleetles

OK, someone commented on Aitrepreneur's latest video: [screenshot of the comment]

drfleetles avatar Oct 01 '22 00:10 drfleetles

> OK, someone commented on Aitrepreneur's latest video: [screenshot of the comment]

I did this and the issue persists with an A5000 on RunPod

vlameiras avatar Oct 01 '22 15:10 vlameiras

> OK, someone commented on Aitrepreneur's latest video: [screenshot of the comment]
>
> I did this and the issue persists with an A5000 on RunPod

Sorry to hear that, it worked for me. Have you tried the first solution where you kill the programs?

drfleetles avatar Oct 01 '22 22:10 drfleetles

I did try to kill the processes as well. It ended up working when I tried using a community pod with an RTX 3090 GPU 😃

vlameiras avatar Oct 02 '22 07:10 vlameiras

Some extra info on this, as this is happening for me too.

Before running, check nvidia-smi in the terminal; memory usage should be 1MiB / xxxxMiB. If it's higher, you have to kill something (ps aux and kill <pid>).
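Concretely, the check looks something like this (the process names are just the usual suspects from the SD template; yours may differ):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv   # used memory should be around 1 MiB before training
ps aux | grep 'webui\|relauncher' | grep -v grep               # anything listed here is still holding the GPU
kill -9 <pid>                                                  # replace <pid> with the second column of the line above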

But even then, I do get OOM kills. It is not after one iteration; it's sometimes after 100, sometimes after 200. This is very annoying, as checkpointing right now is every 500 steps. I am not sure what to do about this. It seems pretty random: on some runs I get no issues, on others I can't seem to get through. It seems related to the actual GPU you use, and then random luck.

EDIT: I seem to be having a lot more luck with the RTX 3090 and not so much with the A5000.

witzatom avatar Oct 02 '22 16:10 witzatom

If running via RunPod, use a PyTorch instance instead of a Stable Diffusion one. I stopped getting killed processes after that. Tested on a 3090 and an A5000.

ddevillier avatar Oct 03 '22 02:10 ddevillier

Yeah, I always run PyTorch instead of SD and I still have issues on the A5000; it seems it's just random.

witzatom avatar Oct 04 '22 09:10 witzatom

Same experience as witzatom on both community and secure A5000 pods in the last few days.

urbanyeti avatar Oct 05 '22 07:10 urbanyeti

Could someone please write out the exact command to kill the processes in the terminal? Thank you in advance.

Larcebeau avatar Oct 05 '22 09:10 Larcebeau

@Larcebeau, this command worked for me. You can run it in a Jupyter cell; no need to open a separate terminal:

!ps aux | grep 'webui\|relauncher' | head -3 | awk '{print $2}' | xargs kill -9

What it does:

ps aux                     # list processes
grep 'webui\|relauncher'   # find the processes corresponding to webui and relauncher
head -3                    # grab the first 3 lines (1 relauncher process and 2 webui processes); this may not be necessary
awk '{print $2}'           # read the second column, which has the process ID
xargs kill -9              # pass the resulting PIDs to kill, which terminates the processes

When you run ps aux (or !ps aux in Jupyter) afterwards, you should see no processes for webui or relauncher.
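If you want a check that prints nothing once they are gone (the extra grep -v grep just filters out the grep command itself):

!ps aux | grep 'webui\|relauncher' | grep -v grep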

jason-curtis avatar Oct 06 '22 00:10 jason-curtis

thank you mate ! :)

Larcebeau avatar Oct 06 '22 09:10 Larcebeau

For anyone running into this for whom the above doesn't help: this can be caused by a number of things, but the most likely is the machine running out of RAM -- not VRAM, but traditional main RAM. (Pretty sure the out-of-VRAM error looks quite different.) The above suggestions will indeed free up a good bit of RAM (and probably VRAM too), which can help. But if you still have the problem, try renting a machine with more RAM. (Or, if running locally, increase swap space or upgrade your hardware.)
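For the running-locally case, a rough sketch of checking memory and adding swap on Linux (sizes and the /swapfile path are just examples; most rented pods won't let you do this):

free -h                           # see how much RAM and swap is currently available
sudo fallocate -l 16G /swapfile   # reserve a 16 GB swap file; pick a size your disk can spare
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h                           # the swap line should now show the extra 16 GB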

jwatzman avatar Oct 18 '22 18:10 jwatzman

For me the issue on RunPod was that when you start a pod, by default, it's set to 0 GPUs. If you click through it mindlessly, you will encounter this issue. You must select 1x GPU when starting.
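A quick sanity check before starting training, to confirm the pod actually got a GPU (the second line assumes PyTorch is available on the pod, which it is on the PyTorch template):

nvidia-smi -L                                                # should list at least one GPU; fails if the pod has none
python -c "import torch; print(torch.cuda.device_count())"   # should print 1 or more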

hirad avatar Nov 14 '22 19:11 hirad