SimpleTuner
vae_cache_clear_each_epoch causes random training crashes
When `vae_cache_clear_each_epoch: true` is set in the `multidatabackend.json` file, training randomly stops with the stack trace below. This has been reproduced on multiple machines, at different epochs, and with different images reported as the problem.
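For reference, a minimal sketch of a backend entry with this option enabled might look like the following. The `id` and the image directory are inferred from the log output below; the `type` and `cache_dir_vae` values are placeholders, and the key names other than `vae_cache_clear_each_epoch` are assumptions that should be checked against the SimpleTuner dataloader documentation. This is not the reporter's actual configuration.

```python
import json

# Hypothetical reproduction config, not the reporter's real file.
# "id" and the image directory are inferred from the log in this report;
# "type" and "cache_dir_vae" are assumed placeholder values.
backends = [
    {
        "id": "images-1024",
        "type": "local",  # assumed backend type
        "instance_data_dir": "/workspace/images",  # inferred from the logged image path
        "cache_dir_vae": "/workspace/cache/vae",  # assumed path
        "vae_cache_clear_each_epoch": True,  # the option this report is about
    }
]

# Serialize to the JSON text that would be saved as multidatabackend.json.
config_text = json.dumps(backends, indent=2)
print(config_text)
```

Setting the option to `false` in the same entry is the obvious comparison case when checking whether the crash depends on it.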
(id=images-1024) Some images were not correctly cached during the VAE Cache operations. Ensure --skip_file_discovery=vae is not set.
Problematic images: ['/workspace/images/8414196_0539_213156034.jpg']
Traceback (most recent call last):
  File "/code/SimpleTuner/train.py", line 2509, in <module>
    main()
  File "/code/SimpleTuner/train.py", line 1524, in main
    batch = iterator_fn(step, *iterator_args)
  File "/code/SimpleTuner/helpers/data_backend/factory.py", line 1252, in random_dataloader_iterator
    return next(chosen_iter)
  File "/code/SimpleTuner/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/code/SimpleTuner/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/code/SimpleTuner/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/code/SimpleTuner/helpers/data_backend/factory.py", line 875, in <lambda>
    collate_fn=lambda examples: collate_fn(examples),
  File "/code/SimpleTuner/helpers/training/collate.py", line 403, in collate_fn
    latent_batch = compute_latents(filepaths, data_backend_id)
  File "/code/SimpleTuner/helpers/training/collate.py", line 146, in compute_latents
    latents = list(
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/code/SimpleTuner/helpers/training/collate.py", line 105, in fetch_latent
    latent = StateTracker.get_vaecache(id=data_backend_id).retrieve_from_cache(fp)
  File "/code/SimpleTuner/helpers/caching/vae.py", line 220, in retrieve_from_cache
    return self.encode_images([None], [filepath])[0]
  File "/code/SimpleTuner/helpers/caching/vae.py", line 498, in encode_images
    raise Exception(
Exception: (id=images-1024) Some images were not correctly cached during the VAE Cache operations. Ensure --skip_file_discovery=vae is not set.
Problematic images: ['/workspace/images/8414196_0539_213156034.jpg']
Epoch 3/100, Steps: 3%|▉ | 3331/100600 [10:35:05<309:05:26, 11.44s/it, lr=9e-5, step_loss=0.389]