
Signal 8 (SIGFPE) received by PID

Open yangqy1110 opened this issue 9 months ago • 1 comments

E0318 17:25:16.769000 139867688847168 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -8) local_rank: 0 (pid: 2967105) of binary: /apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/bin/python
Traceback (most recent call last):
  File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/apdcephfs_cq8/share_1367250/qinyuqyyang/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/diffusion/inference.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2025-03-18_17:25:16
  host      : TENCENT64.site
  rank      : 0 (local_rank: 0)
  exitcode  : -8 (pid: 2967105)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 2967105

yangqy1110 · Mar 18 '25 09:03

Try running it without torchrun. Without parallelization, you'll get the actual stack trace.

alexandru-g · Mar 18 '25 10:03
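
The suggestion above could be sketched roughly as follows. This is a minimal illustration, not something taken from the Open-Sora repository: the helper names `describe_exit` and `run_single_process` are hypothetical, and it assumes the script can start in a single plain Python process without the environment that torchrun normally sets up. It launches the script with Python's built-in `faulthandler` enabled, which dumps the Python-level traceback when the process receives a fatal signal such as SIGFPE, and it decodes a negative exit code like the `-8` in the log into a signal name.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by
        # the signal with that number (e.g. -8 -> signal 8).
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-returncode} ({name})"
    return f"exited with status {returncode}"


def run_single_process(script_args: list[str]) -> subprocess.CompletedProcess:
    """Run a script in one plain Python process (no torchrun).

    `-X faulthandler` makes the interpreter print the Python traceback of
    the crashing thread to stderr on SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL.
    """
    cmd = [sys.executable, "-X", "faulthandler", *script_args]
    return subprocess.run(cmd, capture_output=True, text=True)


if __name__ == "__main__":
    # Pass the script path and its usual arguments on the command line;
    # they are forwarded unchanged to the child process.
    result = run_single_process(sys.argv[1:])
    print(result.stderr, file=sys.stderr)
    print(describe_exit(result.returncode))
```

Saved under a name of your choice, it would be invoked with `scripts/diffusion/inference.py` followed by whatever arguments were originally passed to it under torchrun. Note that `faulthandler` reports the Python frames that were active at the time of the signal, not the native stack, so it narrows down which call triggered the SIGFPE rather than pinpointing the faulting C/CUDA code.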

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] · Mar 26 '25 02:03

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Apr 02 '25 02:04