Open-Sora
Training not working on 3 or 4 GPUs
My training command: torchrun --standalone --nnodes=1 --nproc_per_node=4 Open-Sora/scripts/train.py Open-Sora/configs/opensora-v1-2/train/stage1.py --data-path test123.csv
Here is the full list of commands and the error:
(pytorch) root@fa05c50c4f4a:/workspace/Open-Sora# CUDA_VISIBLE_DEVICES=0,1,2,3
(pytorch) root@fa05c50c4f4a:/workspace/Open-Sora# torchrun --standalone --nnodes=1 --nproc_per_node=4 Open-Sora/scripts/train.py Open-Sora/configs/opensora-v1-2/train/stage1.py --data-path test123.csv
[2024-08-11 18:36:51,286] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2024-08-11 18:36:51,287] torch.distributed.run: [WARNING]
[2024-08-11 18:36:51,287] torch.distributed.run: [WARNING] *****************************************
[2024-08-11 18:36:51,287] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-11 18:36:51,287] torch.distributed.run: [WARNING] *****************************************
[2024-08-11 18:37:01,496] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 159) of binary: /opt/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
Open-Sora/scripts/train.py FAILED
---------------------------------------------------
Failures:
[1]:
time : 2024-08-11_18:37:01
host : fa05c50c4f4a
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 160)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 160
[2]:
time : 2024-08-11_18:37:01
host : fa05c50c4f4a
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 161)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 161
[3]:
time : 2024-08-11_18:37:01
host : fa05c50c4f4a
rank : 3 (local_rank: 3)
exitcode : -7 (pid: 162)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 162
---------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-11_18:37:01
host : fa05c50c4f4a
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 159)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 159
Does training only support 2 GPUs? I can't get it to work on 4 GPUs. I would greatly appreciate some help.
It seems like something is wrong with your GPUs. Try another CUDA program to check.
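A minimal sketch of such a check (the file name ddp_check.py is assumed here; it is not part of Open-Sora): each rank joins an NCCL process group and does a single all_reduce, so a failure in this tiny program points at the GPU/driver/IPC environment rather than at the Open-Sora training code.

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor of ones; after the all_reduce every rank
    # should print the world size (4.0 when launched with --nproc_per_node=4).
    x = torch.ones(4, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launch it the same way as the training run, e.g. torchrun --standalone --nnodes=1 --nproc_per_node=4 ddp_check.py. If even this small program dies with SIGBUS, the problem is environmental; inside a Docker container (which the hostname here suggests), one commonly reported cause of SIGBUS in multi-process PyTorch jobs is a too-small /dev/shm, which can be inspected with df -h /dev/shm and enlarged by starting the container with a larger --shm-size.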
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.