vae_cache_clear_each_epoch causes random training crashes

Open riffmaster-2001 opened this issue 6 months ago • 0 comments

When vae_cache_clear_each_epoch: true is set in the multidatabackend.json file, training randomly stops with the stack trace below. I have reproduced this on multiple machines; the crash occurs at different epochs, and a different image is reported as the problem each time.
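
For anyone triaging this, here is a rough sketch of how to check which backends in a config have the option turned on. It assumes the config file is a JSON list of backend objects that each carry an "id" key. The helper name and the sample config are hypothetical: only the id and the image directory come from the log below, and the rest is not my actual config.

```python
import json


def backends_clearing_vae_cache(config_text):
    """Return the ids of backends that set vae_cache_clear_each_epoch to true.

    Hypothetical helper: assumes the config is a JSON list of backend dicts.
    """
    backends = json.loads(config_text)
    return [
        backend.get("id")
        for backend in backends
        if backend.get("vae_cache_clear_each_epoch") is True
    ]


# Illustrative config only. The id and image directory are taken from the
# log output; every other value here is an assumption.
example_config = json.dumps(
    [
        {
            "id": "images-1024",
            "type": "local",
            "instance_data_dir": "/workspace/images",
            "vae_cache_clear_each_epoch": True,
        },
        {
            "id": "other-backend",
            "type": "local",
            "vae_cache_clear_each_epoch": False,
        },
    ]
)

print(backends_clearing_vae_cache(example_config))
```

To run it against a real file, pass the file's contents instead of example_config, for example open("multidatabackend.json").read().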

(id=images-1024) Some images were not correctly cached during the VAE Cache operations. Ensure --skip_file_discovery=vae is not set.
Problematic images: ['/workspace/images/8414196_0539_213156034.jpg']
Traceback (most recent call last):
  File "/code/SimpleTuner/train.py", line 2509, in <module>
    main()
  File "/code/SimpleTuner/train.py", line 1524, in main
    batch = iterator_fn(step, *iterator_args)
  File "/code/SimpleTuner/helpers/data_backend/factory.py", line 1252, in random_dataloader_iterator
    return next(chosen_iter)
  File "/code/SimpleTuner/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/code/SimpleTuner/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/code/SimpleTuner/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/code/SimpleTuner/helpers/data_backend/factory.py", line 875, in <lambda>
    collate_fn=lambda examples: collate_fn(examples),
  File "/code/SimpleTuner/helpers/training/collate.py", line 403, in collate_fn
    latent_batch = compute_latents(filepaths, data_backend_id)
  File "/code/SimpleTuner/helpers/training/collate.py", line 146, in compute_latents
    latents = list(
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/code/SimpleTuner/helpers/training/collate.py", line 105, in fetch_latent
    latent = StateTracker.get_vaecache(id=data_backend_id).retrieve_from_cache(fp)
  File "/code/SimpleTuner/helpers/caching/vae.py", line 220, in retrieve_from_cache
    return self.encode_images([None], [filepath])[0]
  File "/code/SimpleTuner/helpers/caching/vae.py", line 498, in encode_images
    raise Exception(
Exception: (id=images-1024) Some images were not correctly cached during the VAE Cache operations. Ensure --skip_file_discovery=vae is not set.
Problematic images: ['/workspace/images/8414196_0539_213156034.jpg']

Epoch 3/100, Steps:   3%|▉                            | 3331/100600 [10:35:05<309:05:26, 11.44s/it, lr=9e-5, step_loss=0.389]

riffmaster-2001, Aug 22 '24 14:08