Open-Sora
Training not working on 3 or 4 GPUs
My training command: torchrun --standalone --nnodes=1 --nproc_per_node=4 Open-Sora/scripts/train.py Open-Sora/configs/opensora-v1-2/train/stage1.py --data-path test123.csv
Here is the full list of commands and the error:
(pytorch) root@fa05c50c4f4a:/workspace/Open-Sora# CUDA_VISIBLE_DEVICES=0,1,2,3
(pytorch) root@fa05c50c4f4a:/workspace/Open-Sora# torchrun --standalone --nnodes=1 --nproc_per_node=4 Open-Sora/scripts/train.py Open-Sora/configs/opensora-v1-2/train/stage1.py --data-path test123.csv
[2024-08-11 18:36:51,286] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2024-08-11 18:36:51,287] torch.distributed.run: [WARNING]
[2024-08-11 18:36:51,287] torch.distributed.run: [WARNING] *****************************************
[2024-08-11 18:36:51,287] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-11 18:36:51,287] torch.distributed.run: [WARNING] *****************************************
[2024-08-11 18:37:01,496] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 159) of binary: /opt/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
Open-Sora/scripts/train.py FAILED
---------------------------------------------------
Failures:
[1]:
time : 2024-08-11_18:37:01
host : fa05c50c4f4a
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 160)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 160
[2]:
time : 2024-08-11_18:37:01
host : fa05c50c4f4a
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 161)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 161
[3]:
time : 2024-08-11_18:37:01
host : fa05c50c4f4a
rank : 3 (local_rank: 3)
exitcode : -7 (pid: 162)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 162
---------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-11_18:37:01
host : fa05c50c4f4a
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 159)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 159
Does training only support 2 GPUs? I can't get it to work on 4 GPUs. I would greatly appreciate some help.
It seems like something is wrong with your GPUs. Try another CUDA program to check.
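A minimal sketch of such a check (the file name ddp_check.py is assumed here; it is not part of Open-Sora): each rank joins an NCCL process group and does a single all_reduce, so a failure in this tiny program points at the GPU/driver/IPC environment rather than at the Open-Sora training code.

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor of ones; after the all_reduce every rank
    # should print the world size (4.0 when launched with --nproc_per_node=4).
    x = torch.ones(4, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launch it the same way as the training run, e.g. torchrun --standalone --nnodes=1 --nproc_per_node=4 ddp_check.py. If even this small program dies with SIGBUS, the problem is environmental; inside a Docker container (which the hostname here suggests), one commonly reported cause of SIGBUS in multi-process PyTorch jobs is a too-small /dev/shm, which can be inspected with df -h /dev/shm and enlarged by starting the container with a larger --shm-size.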
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.