Open-Sora
Signal 8 (SIGFPE) received by PID
E0318 17:25:16.769000 139867688847168 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -8) local_rank: 0 (pid: 2967105) of binary: /apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/bin/python
Traceback (most recent call last):
File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/diffusion/inference.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-03-18_17:25:16
host : TENCENT64.site
rank : 0 (local_rank: 0)
exitcode : -8 (pid: 2967105)
error_file: <N/A>
traceback : Signal 8 (SIGFPE) received by PID 2967105
Try running the script without torchrun. Without the parallel launcher in the way, you should see the actual stack trace.
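As a rough sketch of that suggestion: the helper below runs a script directly with the current Python interpreter and Python's built-in `faulthandler` enabled (`-X faulthandler`), which dumps the Python traceback when the process receives a fatal signal such as SIGFPE. It also decodes a negative exit code like the `-8` in the log above into a signal name. The function names `describe_exit` and `run_with_faulthandler` are hypothetical and not part of Open-Sora or PyTorch, and the inference script's own arguments are not shown in this thread, so pass whatever you normally use.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


def run_with_faulthandler(script_args: list[str]) -> int:
    """Run a script in a single process, without torchrun.

    -X faulthandler makes the interpreter print the Python traceback
    when a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL) arrives.
    """
    cmd = [sys.executable, "-X", "faulthandler", *script_args]
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # Assumed usage: pass the script path followed by its usual arguments.
    code = run_with_faulthandler(sys.argv[1:])
    print(describe_exit(code), file=sys.stderr)
```

Saved under a name of your choice, this would be invoked with `scripts/diffusion/inference.py` followed by the same arguments that were given to torchrun. A crash inside a native extension will still only show the Python frame that made the call, but that is usually enough to tell which operation triggered the signal.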
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.