Training gets stuck after epoch 0
Everything in epoch 0 is okay, but at the beginning of epoch 1 training gets stuck.
I then restarted and tried to resume; this time it raised a segmentation fault.
Just now I ran a small test and it also got stuck at the start of a new epoch. Let me check.
Did you find out why this happens?
There are some problems with the data loader; still working on a solution.
@xilanhua12138 At this point, I would suggest:
- set `pin_memory_cache_pre_alloc_numels = None` in cfg or `train.py`.
- set `pin_memory = False` when initializing the dataloader in `train.py`.
Note: you might still find that training crashes (as you posted above) after a few epochs, but it shouldn't hang after you disable pin-mem.
The crash seems to come from the pin-mem-related modules, but it's still quite tricky to pin down the cause, as the behavior appears to be a bit random.
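For reference, a minimal, self-contained sketch of the second change, assuming a standard `torch.utils.data.DataLoader`; the dataset, batch size, and worker count below are dummy stand-ins, not Open-Sora's actual bucketed dataset or `train.py` arguments:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader():
    # Dummy stand-in dataset; in train.py this would be the bucketed video/image dataset.
    dataset = TensorDataset(torch.randn(8, 3, 16, 16))

    # The relevant change is pin_memory=False, so batches are no longer staged in
    # pinned (page-locked) host memory. The other change,
    # pin_memory_cache_pre_alloc_numels = None, goes in the config file, not here.
    return DataLoader(
        dataset,
        batch_size=2,
        num_workers=2,
        pin_memory=False,
    )


if __name__ == "__main__":
    for (batch,) in build_loader():
        pass  # training step would go here
```

Disabling pinned memory trades a bit of host-to-device copy speed for stability, which should be acceptable while we debug this.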
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Hello @botbw, is there any new advice?
Following the advice above, it now gets a segmentation fault after epoch 0:
[2025-06-17 03:17:07] Beginning epoch 0...
Epoch 0: 94%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 16/17 [00:39<00:02, 2.49s/it, loss=0.0797, global_grad_norm=0.219, step=9, global_step=9, lr=5e-5]
[2025-06-17 03:17:47] Building buckets using 64 workers...
[2025-06-17 03:17:48] Bucket Info:
[2025-06-17 03:17:48] Bucket [#sample, #batch] by aspect ratio:
[2025-06-17 03:17:48] (16:9): #sample: 68, #batch: 68
[2025-06-17 03:17:48] ===== Image Info =====
[2025-06-17 03:17:48] Image Bucket by HxWxT:
[2025-06-17 03:17:48] --------------------------------
[2025-06-17 03:17:48] #image sample: 0, #image batch: 0
[2025-06-17 03:17:48] ===== Video Info =====
[2025-06-17 03:17:48] Video Bucket by HxWxT:
[2025-06-17 03:17:48] ('256px', 97): #sample: 68, #batch: 68
[2025-06-17 03:17:48] --------------------------------
[2025-06-17 03:17:48] #video sample: 68, #video batch: 68
[2025-06-17 03:17:48] ===== Summary =====
[2025-06-17 03:17:48] #non-empty buckets: 1
[2025-06-17 03:17:48] Img/Vid sample ratio: 0.00
[2025-06-17 03:17:48] Img/Vid batch ratio: 0.00
[2025-06-17 03:17:48] vid batch 256: 68, vid batch 768: 0
[2025-06-17 03:17:48] Vid batch ratio (256px/768px): 0.00
[2025-06-17 03:17:48] #training sample: 68, #training batch: 68
[2025-06-17 03:17:48] Beginning epoch 1...
Epoch 1: 0%| | 0/17 [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Epoch 1: 0%| | 0/17 [00:01<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1131, in _try_get_data
[rank0]: data = self._data_queue.get(timeout=timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/multiprocessing/queues.py", line 113, in get
[rank0]: if not self._poll(timeout):
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 257, in poll
[rank0]: return self._poll(timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 440, in _poll
[rank0]: r = wait([self], timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 948, in wait
[rank0]: ready = selector.select(timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/selectors.py", line 415, in select
[rank0]: fd_event_list = self._selector.poll(timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/_utils/signal_handling.py", line 67, in handler
[rank0]: _error_if_any_worker_fails()
[rank0]: RuntimeError: DataLoader worker (pid 8864) is killed by signal: Segmentation fault.
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/Open-Sora/scripts/diffusion/train.py", line 654, in <module>
[rank0]: main()
[rank0]: File "/home/Open-Sora/scripts/diffusion/train.py", line 529, in main
[rank0]: batch_, step_, pinned_video_ = fetch_data()
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/home/Open-Sora/scripts/diffusion/train.py", line 523, in fetch_data
[rank0]: step, batch = next(pbar_iter)
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
[rank0]: for obj in iterable:
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]: data = self._next_data()
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1327, in _next_data
[rank0]: idx, data = self._get_data()
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1293, in _get_data
[rank0]: success, data = self._try_get_data()
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1144, in _try_get_data
[rank0]: raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
[rank0]: RuntimeError: DataLoader worker (pid(s) 8864) exited unexpectedly
[rank0]:[W617 03:17:50.058147529 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0617 03:17:51.687000 140319081027392 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 44 closing signal SIGTERM
W0617 03:17:51.689000 140319081027392 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 45 closing signal SIGTERM
W0617 03:17:51.696000 140319081027392 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 46 closing signal SIGTERM
E0617 03:17:51.725000 140319081027392 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 43) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/diffusion/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-06-17_03:17:51
host : app-trainingjob-ltm-lora5-gpajc-job-0-0.app-trainingjob-ltm-lora5-gpajc.1886048269027160066.svc.cluster.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 43)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html