Fixes #4137, caused by a race condition in training when the VAE is unloaded
Fix #4137
When training either textual inversion (TI) embeddings or hypernetworks (HN), the progress bar callback can call set_current_image in a separate thread. This poses a problem when "Move VAE and CLIP to RAM when training if possible" is checked and "Show image creation progress every N sampling steps" is not 0, because set_current_image calls sd_samplers.sample_to_image, which expects the VAE to be in GPU memory. However, the VAE may be on the CPU, or in the middle of being copied from one device to the other, leading to a race condition in which errors can occur during training.
This fixes the race condition by setting parallel_processing_allowed to False for the duration of training, so set_current_image does not call sd_samplers.sample_to_image.
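For illustration, here is a minimal sketch of the guard this relies on (simplified, not the actual webui source; the function name set_current_preview is hypothetical): with the flag cleared, the preview path returns early instead of decoding the latent through the VAE.

```python
# Minimal sketch, assuming the webui modules are importable; set_current_preview
# is a hypothetical name used only to illustrate the guard described above.
from modules import shared, sd_samplers

def set_current_preview(current_latent):
    # While training has moved the VAE to RAM, parallel_processing_allowed is
    # False, so the progress-bar thread bails out before touching the VAE.
    if not shared.parallel_processing_allowed:
        return None
    # Otherwise decode the latent into a preview image as usual.
    return sd_samplers.sample_to_image(current_latent)
```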
This fix is important for training, and it works for me. I don't understand why @AUTOMATIC1111 hasn't merged it.
Bumping this, as my TI training errors when the VAE is unloaded, and this seems to fix my issue. Training does continue without the fix, but the UI gets stuck and no more sample images are sent to it. Please review.
@AUTOMATIC1111 we would like you to merge this fix, because embedding training is still giving an error. Thank you in advance!
Update 1: As far as I can see, the file has changed since then, so I don't know where to insert the necessary lines :( @MarkovInequality please check the current status and update it! Thank you!
Update 2: I'm not a programmer, but I have placed the lines in textual_inversion.py as follows; I hope they are in the right place:
- from line 279:

```python
dl = modules.textual_inversion.dataset.PersonalizedDataLoader(ds, latent_sampling_method=latent_sampling_method, batch_size=ds.batch_size, pin_memory=pin_memory)

old_parallel_processing_allowed = shared.parallel_processing_allowed
if unload:
    shared.parallel_processing_allowed = False
    shared.sd_model.first_stage_model.to(devices.cpu)
```
- from line 452:

```python
finally:
    pbar.leave = False
    pbar.close()
    shared.sd_model.first_stage_model.to(devices.device)
    shared.parallel_processing_allowed = old_parallel_processing_allowed
```
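For orientation, here is a rough structural sketch of how these two snippets relate inside train_embedding (simplified; line numbers and surrounding code may differ in the current file, and train_embedding_outline / run_training_loop are hypothetical placeholders): the flag is saved and cleared before the loop starts, and the finally block restores both the VAE and the flag.

```python
# Rough structural sketch only, assuming webui's module names; not the exact file.
from modules import shared, devices

def train_embedding_outline(dl, unload, run_training_loop):
    # First snippet: save the flag, then unload the VAE before training begins.
    old_parallel_processing_allowed = shared.parallel_processing_allowed
    if unload:
        shared.parallel_processing_allowed = False
        shared.sd_model.first_stage_model.to(devices.cpu)

    try:
        run_training_loop(dl)  # hypothetical placeholder for the training loop
    finally:
        # Second snippet: runs even on error, restoring the VAE and the flag.
        shared.sd_model.first_stage_model.to(devices.device)
        shared.parallel_processing_allowed = old_parallel_processing_allowed
```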
I'm going to start an Embedding training and see if it works...
Update 3: training with the above modification ran flawlessly. Tested. :)
Update 4: I ran one more training session with the modification and it also finished fine, so it should work with the current code.