
OSError raised unexpectedly mid-training when training on 4.88 million images

Open · 2575044704 opened this issue on Jun 24, 2024 · 4 comments

When I was training a model on a 2x A100 80G machine, the following error occurred about two hours after the start:

steps:   0%|                                                                                         | 373/381280 [1:58:19<2014:00:30, 19.03s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:25<2010:03:41, 19.00s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:25<2010:03:41, 19.00s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:30<2011:31:38, 19.01s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:35<2012:59:34, 19.03s/it, avr_loss=0.0848][rank1]: Traceback (most recent call last):
[rank1]:   File "/sd-scripts/sdxl_train_network.py", line 185, in <module>
[rank1]:     trainer.train(args)
[rank1]:   File "/sd-scripts/train_network.py", line 806, in train
[rank1]:     for step, batch in enumerate(train_dataloader):
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/data_loader.py", line 458, in __iter__
[rank1]:     next_batch = next(dataloader_iter)
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
[rank1]:     data = self._next_data()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
[rank1]:     return self._process_data(data)
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
[rank1]:     data.reraise()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
[rank1]:     raise exception
[rank1]: OSError: Caught OSError in DataLoader worker process 4.
[rank1]: Original Traceback (most recent call last):
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
[rank1]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
[rank1]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
[rank1]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 348, in __getitem__
[rank1]:     return self.datasets[dataset_idx][sample_idx]
[rank1]:   File "/sd-scripts/library/train_util.py", line 1207, in __getitem__
[rank1]:     img, face_cx, face_cy, face_w, face_h = self.load_image_with_face_info(subset, image_info.absolute_path)
[rank1]:   File "/sd-scripts/library/train_util.py", line 1092, in load_image_with_face_info
[rank1]:     img = load_image(image_path)
[rank1]:   File "/sd-scripts/library/train_util.py", line 2352, in load_image
[rank1]:     img = np.array(image, np.uint8)
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/Image.py", line 696, in __array_interface__
[rank1]:     new["data"] = self.tobytes()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/Image.py", line 755, in tobytes
[rank1]:     self.load()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/WebPImagePlugin.py", line 160, in load
[rank1]:     data, timestamp, duration = self._get_next()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/WebPImagePlugin.py", line 127, in _get_next
[rank1]:     ret = self._decoder.get_next()
[rank1]: OSError: failed to read next frame


steps:   0%|                                                                                         | 374/381280 [1:58:40<2014:26:18, 19.04s/it, avr_loss=0.0848]W0624 22:22:13.858000 140247365268672 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 75699 closing signal SIGTERM
E0624 22:22:14.275000 140247365268672 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 75700) of binary: /root/.conda/envs/lora/bin/python3
Traceback (most recent call last):
  File "/root/.conda/envs/lora/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.conda/envs/lora/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1027, in <module>
    main()
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in main
    launch_command(args)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_network.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-24_22:22:13
  host      : intern-studio-40021203
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 75700)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I hope the author can find the cause of this problem. Thanks!

2575044704 · Jun 24 '24 14:06