Training gets stuck after epoch 0
Everything in epoch 0 is okay, but at the beginning of epoch 1 training gets stuck.
I then restarted and tried to resume; this time it raised a segmentation fault.
Just now I ran a small test and it also got stuck at the start of a new epoch. Let me check.
Did you find out why this happens?
There are some problems with the data loader; still working on a solution.
@xilanhua12138 At this point, I would suggest:
- set `pin_memory_cache_pre_alloc_numels = None` in cfg or `train.py`.
- set `pin_memory = False` when initializing the dataloader in `train.py`.
Note: you might still find that training crashes (as you posted above) after a few epochs, but it shouldn't hang after you disable pin-mem.
The crash seems to come from the pin-mem-related modules, but it's still quite tricky to pin down the cause, as the behavior appears to be a bit random.
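For reference, a minimal, self-contained sketch of the second change, assuming a standard `torch.utils.data.DataLoader`; the dataset, batch size, and worker count below are dummy stand-ins, not Open-Sora's actual bucketed dataset or `train.py` arguments:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader():
    # Dummy stand-in dataset; in train.py this would be the bucketed video/image dataset.
    dataset = TensorDataset(torch.randn(8, 3, 16, 16))

    # The relevant change is pin_memory=False, so batches are no longer staged in
    # pinned (page-locked) host memory. The other change,
    # pin_memory_cache_pre_alloc_numels = None, goes in the config file, not here.
    return DataLoader(
        dataset,
        batch_size=2,
        num_workers=2,
        pin_memory=False,
    )


if __name__ == "__main__":
    for (batch,) in build_loader():
        pass  # training step would go here
```

Disabling pinned memory trades a bit of host-to-device copy speed for stability, which should be acceptable while we debug this.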
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Hello @botbw, is there any new advice?
Following the advice above, it now gets a segmentation fault after epoch 0:
[2025-06-17 03:17:07] Beginning epoch 0...
Epoch 0: 94%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 16/17 [00:39<00:02, 2.49s/it, loss=0.0797, global_grad_norm=0.219, step=9, global_step=9, lr=5e-5]
[2025-06-17 03:17:47] Building buckets using 64 workers...
[2025-06-17 03:17:48] Bucket Info:
[2025-06-17 03:17:48] Bucket [#sample, #batch] by aspect ratio:
[2025-06-17 03:17:48] (16:9): #sample: 68, #batch: 68
[2025-06-17 03:17:48] ===== Image Info =====
[2025-06-17 03:17:48] Image Bucket by HxWxT:
[2025-06-17 03:17:48] --------------------------------
[2025-06-17 03:17:48] #image sample: 0, #image batch: 0
[2025-06-17 03:17:48] ===== Video Info =====
[2025-06-17 03:17:48] Video Bucket by HxWxT:
[2025-06-17 03:17:48] ('256px', 97): #sample: 68, #batch: 68
[2025-06-17 03:17:48] --------------------------------
[2025-06-17 03:17:48] #video sample: 68, #video batch: 68
[2025-06-17 03:17:48] ===== Summary =====
[2025-06-17 03:17:48] #non-empty buckets: 1
[2025-06-17 03:17:48] Img/Vid sample ratio: 0.00
[2025-06-17 03:17:48] Img/Vid batch ratio: 0.00
[2025-06-17 03:17:48] vid batch 256: 68, vid batch 768: 0
[2025-06-17 03:17:48] Vid batch ratio (256px/768px): 0.00
[2025-06-17 03:17:48] #training sample: 68, #training batch: 68
[2025-06-17 03:17:48] Beginning epoch 1...
Epoch 1: 0%| | 0/17 [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Epoch 1: 0%| | 0/17 [00:01<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1131, in _try_get_data
[rank0]: data = self._data_queue.get(timeout=timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/multiprocessing/queues.py", line 113, in get
[rank0]: if not self._poll(timeout):
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 257, in poll
[rank0]: return self._poll(timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 440, in _poll
[rank0]: r = wait([self], timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 948, in wait
[rank0]: ready = selector.select(timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/selectors.py", line 415, in select
[rank0]: fd_event_list = self._selector.poll(timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/_utils/signal_handling.py", line 67, in handler
[rank0]: _error_if_any_worker_fails()
[rank0]: RuntimeError: DataLoader worker (pid 8864) is killed by signal: Segmentation fault.
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/Open-Sora/scripts/diffusion/train.py", line 654, in <module>
[rank0]: main()
[rank0]: File "/home/Open-Sora/scripts/diffusion/train.py", line 529, in main
[rank0]: batch_, step_, pinned_video_ = fetch_data()
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/home/Open-Sora/scripts/diffusion/train.py", line 523, in fetch_data
[rank0]: step, batch = next(pbar_iter)
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
[rank0]: for obj in iterable:
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]: data = self._next_data()
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1327, in _next_data
[rank0]: idx, data = self._get_data()
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1293, in _get_data
[rank0]: success, data = self._try_get_data()
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1144, in _try_get_data
[rank0]: raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
[rank0]: RuntimeError: DataLoader worker (pid(s) 8864) exited unexpectedly
[rank0]:[W617 03:17:50.058147529 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0617 03:17:51.687000 140319081027392 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 44 closing signal SIGTERM
W0617 03:17:51.689000 140319081027392 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 45 closing signal SIGTERM
W0617 03:17:51.696000 140319081027392 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 46 closing signal SIGTERM
E0617 03:17:51.725000 140319081027392 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 43) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/diffusion/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-06-17_03:17:51
host : app-trainingjob-ltm-lora5-gpajc-job-0-0.app-trainingjob-ltm-lora5-gpajc.1886048269027160066.svc.cluster.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 43)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html