
Fixes #4137 caused by race condition in training when VAE is unloaded

Open MarkovInequality opened this issue 3 years ago • 2 comments

Fix #4137

When training both TI and HN, the progress bar callback can call set_current_image from a separate thread. This is a problem when "Move VAE and CLIP to RAM when training if possible" is checked and "Show image creation progress every N sampling steps" is not 0, because set_current_image calls sd_samplers.sample_to_image, which expects the VAE to be in GPU memory. At that point, however, the VAE may be on the CPU, or in the middle of being copied from one device to the other. The result is a race condition that can raise errors during training.
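
To illustrate the failure mode, here is a minimal, self-contained sketch of the race. FakeVAE and the two threads are stand-ins made up for this example, not the webui's real objects; the real players are shared.sd_model.first_stage_model (moved by the training loop) and sd_samplers.sample_to_image (called by the preview callback):

import threading
import time

class FakeVAE:
    """Stand-in for shared.sd_model.first_stage_model."""
    def __init__(self):
        self.device = "cuda"

    def to(self, device):
        # Simulate a slow device-to-device copy.
        self.device = "moving"
        time.sleep(0.001)
        self.device = device
        return self

vae = FakeVAE()

def sample_to_image():
    # Stand-in for sd_samplers.sample_to_image: only valid while the VAE is on the GPU.
    if vae.device != "cuda":
        raise RuntimeError(f"VAE is on {vae.device!r}, expected 'cuda'")

def training_thread():
    # Stand-in for the training loop moving the VAE to RAM and back
    # ("Move VAE and CLIP to RAM when training if possible").
    for _ in range(200):
        vae.to("cpu")
        vae.to("cuda")

def progress_thread():
    # Stand-in for the progress-bar callback calling set_current_image every N sampling steps.
    for _ in range(200):
        try:
            sample_to_image()
        except RuntimeError as err:
            print("race hit:", err)

t1 = threading.Thread(target=training_thread)
t2 = threading.Thread(target=progress_thread)
t1.start(); t2.start()
t1.join(); t2.join()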

This fixes the race condition by setting parallel_processing_allowed to False for the duration of training, so that set_current_image does not call sd_samplers.sample_to_image.
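
Conceptually, the guard looks something like the sketch below (simplified and written for this description, not the webui's exact code); with the flag cleared, the preview path returns early instead of decoding with a VAE that may be on the CPU or mid-transfer:

# Simplified sketch of the idea; the real check lives in the webui's
# set_current_image / live-preview path, and its exact shape may differ.
def set_current_image(state, shared, sd_samplers):
    if not shared.parallel_processing_allowed:
        return  # training has the VAE on the CPU (or mid-copy); skip the preview decode
    state.current_image = sd_samplers.sample_to_image(state.current_latent)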

MarkovInequality avatar Nov 04 '22 09:11 MarkovInequality

This fix is important for training, and it works for me. I don't understand why @AUTOMATIC1111 hasn't merged it.

mykeehu avatar Nov 05 '22 18:11 mykeehu

Bumping this, as my TI training throws an error when the VAE is unloaded, and this seems to fix my issue. Training does continue without this fix, but the UI gets stuck and no more sample images are sent to it. Please review.

antis0007 avatar Nov 07 '22 02:11 antis0007

@AUTOMATIC1111 we would like you to merge this fix, because embedding training still throws an error. Thank you in advance!

Update 1: As far as I can see, the content of the file has changed since then, so I don't know where to insert the necessary lines :( @MarkovInequality please check the current state of the file and update the patch! Thank you!

Update 2: I'm not a programmer, but I put the lines into textual_inversion.py as shown here; I hope they are in the right place:

  • around line 279, right after the line

dl = modules.textual_inversion.dataset.PersonalizedDataLoader(ds, latent_sampling_method=latent_sampling_method, batch_size=ds.batch_size, pin_memory=pin_memory)

    I inserted:

old_parallel_processing_allowed = shared.parallel_processing_allowed

if unload:
    shared.parallel_processing_allowed = False
    shared.sd_model.first_stage_model.to(devices.cpu)

  • around line 452, I changed the finally block so that it now reads:

finally:
    pbar.leave = False
    pbar.close()
    shared.sd_model.first_stage_model.to(devices.device)
    shared.parallel_processing_allowed = old_parallel_processing_allowed

I'm going to start an Embedding training and see if it works...

Update 3: the training with the previous modification ran flawlessly. Tested. :)

Update 4: I ran one more training session and it also completed fine with the modification, so it should work with the new code.

mykeehu avatar Nov 30 '22 15:11 mykeehu