
Dreambooth: No checkpoints created and loop

Open elBlacksmith opened this issue 2 years ago • 15 comments

Hi,

I tried to train a model, and despite having enabled the checkpoints option, none were created in my Google Drive root folder. It also didn't create the final CKPT, and the notebook automatically started over and began creating a second one: https://imgur.com/a/0b8aWVG

My config was like this:

Start DreamBooth
Resume_Training:

If you're not satisfied with the result, check this box, run again the cell and it will continue training the current model.
Training_Steps:
7500
Total Steps = Number of Instance images * 200, if you use 30 images, use 6000 steps, if you're not satisfied with the result, resume training for another 500 steps, and so on ...
Seed:
Insert text here
Leave empty for a random seed.
Resolution:

512
Higher resolution = Higher quality, make sure the instance images are cropped to this selected size (or larger), if you're getting memory issues, check the box below (slower speed but memory efficient) :
Reduce_memory_usage:

fp16:

Enable/disable half-precision, disabling it will double the training time and produce 4.7Gb checkpoints.
Enable_text_encoder_training:

At least 10% of the total training steps are needed, it doesn't matter if they are at the beginning or in the middle or the end, in case you're training the model multiple times.
For example you can divide 5%, 5%, 5% on 3 training runs on the model, or 0%, 0%, 15%, given that 15% will cover the total training steps count (15% of 200 steps is not enough).
Enter the % of the total steps for which to train the text_encoder
Train_text_encoder_for:
100
Keep the % low for better style transfer, more training steps will be necessary for good results.
Higher % will give more weight to the instance, it gives stronger results at lower steps count, but harder to stylize,
Save_Checkpoint_Every_n_Steps:

Save_Checkpoint_Every:
500
Minimum 200 steps between each save.
Start_saving_from_the_step:
1500
Start saving intermediary checkpoints from this step.

Thanks!

elBlacksmith avatar Nov 13 '22 01:11 elBlacksmith
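
For reference, the step-count rule quoted from the notebook form above (Total Steps = Number of Instance images * 200) can be written as a small helper. This is only an illustrative sketch: the function name `suggested_training_steps` and its `steps_per_image` parameter are made up here and are not part of the notebook.

```python
def suggested_training_steps(num_instance_images: int, steps_per_image: int = 200) -> int:
    """Rule of thumb from the form text: total steps = instance images * 200.

    Hypothetical helper for illustration; the notebook itself just asks
    for the final number in the Training_Steps field.
    """
    if num_instance_images <= 0:
        raise ValueError("need at least one instance image")
    return num_instance_images * steps_per_image


# The form's own example: 30 images -> 6000 steps.
print(suggested_training_steps(30))
```

If the result is not satisfying, the form text suggests resuming training for another 500 steps at a time rather than starting over.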

the intermediary checkpoints are saved immediately in gdrive in the session's folder

TheLastBen avatar Nov 13 '22 12:11 TheLastBen

I have the same problem. I have tried running the Colab 4 different times now, and no ckpt files were created.

When doing training a week ago (November 5) checkpoints were created every 500 steps as requested. They did not end up in the session folder then, but in the root folder.

Anyway, thanks for an amazing colab! The results from my first training on pictures of myself are incredible.

Xtreamer avatar Nov 13 '22 14:11 Xtreamer

> the intermediary checkpoints are saved immediately in gdrive in the session's folder

Mmm. But there are no ckpt files there. I see this: Screenshot_20221113160106

I even sorted my Google Drive by creation date, but I didn't see anything like that.

elBlacksmith avatar Nov 13 '22 15:11 elBlacksmith

I'll check that out

TheLastBen avatar Nov 13 '22 15:11 TheLastBen

> I'll check that out

Thanks, I appreciate it.

elBlacksmith avatar Nov 13 '22 17:11 elBlacksmith

Same thing is happening to me. I just finished a 3800-step training run, set to save every 500 steps starting from step 500, and nothing is there. At the end of training, it just started again. Any update on how to fix it?

MancV21 avatar Nov 17 '22 23:11 MancV21

New or old method?

TheLastBen avatar Nov 18 '22 00:11 TheLastBen

Same here:
latest fast-DreamBooth.ipynb file
80GB free gdrive space
Save_Checkpoint_Every_n_Steps checked and set to 500

10 hours of training and not a single checkpoint generated. It used to work the last time I tried, a couple of weeks ago. Lots of paid compute units wasted this month.

ANTONIOPSD avatar Nov 18 '22 12:11 ANTONIOPSD

try again now with a test run, set the save to the step 205 and see if it works

TheLastBen avatar Nov 18 '22 13:11 TheLastBen

> try again now with a test run, set the save to the step 205 and see if it works

Same problem, by the way, my images are cropped to 640 and also enabled the Female faces option

ANTONIOPSD avatar Nov 18 '22 14:11 ANTONIOPSD

> I'll check that out

So I tried again now that I'm not busy, and I found that I was mistaken.

There is no loop. I hadn't used the fast method for a while and got confused by "Enable_text_encoder_training". Since it is set to 100 by default and runs before the training itself, I thought it was the training, because it took the same amount of time. That is also why there were no saved checkpoints in the "sessions" folder. I like that they are no longer saved in the root folder, but I think it can be a bit confusing. I don't know if the notebook has a note on where to look.

I tried again, this time changing the 100 to 20%, and it worked, including the saved checkpoints.

Now I don't know what the "Enable_text_encoder_training" option does. I'll do a bit more research.

Once again, thank you for your work.

elBlacksmith avatar Nov 18 '22 15:11 elBlacksmith

> Same problem, by the way, my images are cropped to 640 and also enabled the Female faces option

if you enable "contains_faces" it will not save the checkpoints until the training of the UNet starts, it will not save checkpoints during the training of the text_encoder

TheLastBen avatar Nov 18 '22 15:11 TheLastBen
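
To make the effect on the original report concrete, here is a rough sketch of when the first intermediate checkpoint would show up if the text_encoder phase runs first and saves nothing. It is based only on the descriptions in this thread, not on the notebook's code. The function name `checkpoint_timeline` is hypothetical, and it assumes checkpoints are written every `save_every` UNet steps starting at `start_from`, excluding the final step.

```python
def checkpoint_timeline(unet_steps, text_encoder_steps, save_every, start_from):
    """Return (unet_step, overall_step) pairs for each intermediate checkpoint.

    Assumption (from the thread): nothing is saved during the text_encoder
    phase, so every checkpoint is delayed by `text_encoder_steps`.
    """
    if save_every < 200:
        # The form text says: "Minimum 200 steps between each save."
        raise ValueError("save_every must be at least 200")
    return [
        (step, text_encoder_steps + step)
        for step in range(start_from, unet_steps, save_every)
    ]


# Settings from the first post: 7500 steps, text_encoder at 100%,
# save every 500 steps starting from step 1500.
timeline = checkpoint_timeline(
    unet_steps=7500, text_encoder_steps=7500, save_every=500, start_from=1500
)
print(timeline[0])
```

Under these assumptions the first checkpoint would only appear after 9000 steps overall, which is longer than the whole run was expected to take.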

Oh... This is the problem of course. When it worked before I believe the text_encoder field was prefilled with something like 35. The new default, 100 (100%?) means that no intermediate checkpoints will be saved. I just went with the default as I thought it was chosen deliberately.

I am using 158 images and am therefore running 15800 iterations. This is too high for the free version of Colab, but if I have intermediate checkpoints I can start testing the one with the highest number of iterations.

Xtreamer avatar Nov 18 '22 16:11 Xtreamer

@Xtreamer if you set it to 100, it will train the text encoder for 15800 steps and then the UNet for 15800 steps. Set it to something like 30%.

don't use the "contains_faces" option, it's experimental and doesn't improve the results much

TheLastBen avatar Nov 18 '22 17:11 TheLastBen
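
The arithmetic behind that advice, as a minimal sketch. It assumes, as described in the comment above, that the text_encoder is trained for the given percentage of Training_Steps and the UNet is then trained for the full Training_Steps. The helper name `phase_steps` is hypothetical and does not come from the notebook.

```python
def phase_steps(training_steps, text_encoder_pct):
    """Split a run into (text_encoder_steps, unet_steps).

    Assumption from the thread: the text_encoder phase is a percentage of
    Training_Steps and runs in addition to the full UNet training.
    """
    if not 0 <= text_encoder_pct <= 100:
        raise ValueError("text_encoder_pct must be between 0 and 100")
    text_encoder_steps = training_steps * text_encoder_pct // 100
    return text_encoder_steps, training_steps


for pct in (100, 30):
    te_steps, unet_steps = phase_steps(15800, pct)
    print(f"{pct}%: {te_steps} text_encoder + {unet_steps} UNet "
          f"= {te_steps + unet_steps} steps")
```

With 15800 training steps, 100% doubles the run to 31600 steps, while 30% gives 4740 text_encoder steps and 20540 steps in total.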

> > Same problem, by the way, my images are cropped to 640 and also enabled the Female faces option
>
> if you enable "contains_faces" it will not save the checkpoints until the training of the UNet starts, it will not save checkpoints during the training of the text_encoder

ohh, ok I understand now, thanks!

ANTONIOPSD avatar Nov 18 '22 22:11 ANTONIOPSD