torch.distributed.elastic.multiprocessing.api

Open MrD005 opened this issue 1 year ago • 6 comments

Traceback (most recent call last):
  File "/root/anaconda3/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/inference.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-04-16_16:40:07
  host      : e2e-84-47.ssdcloudindia.net
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 28813)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 28813

MrD005 avatar Apr 16 '24 11:04 MrD005

It could be due to a mismatch between the installed CUDA toolkit and the CUDA version PyTorch was built against. Run nvcc --version and python -c 'import torch; print(torch.version.cuda)' to see if they match.
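The same comparison can be scripted. This is only a minimal sketch, assuming nvcc is on PATH and torch is importable in the active environment; the helper names parse_nvcc_release, versions_match, and main are hypothetical and exist only for this illustration.

```python
import re
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'release X.Y' version from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def versions_match(nvcc_release: Optional[str], torch_cuda: Optional[str]) -> bool:
    """Compare major.minor only; torch.version.cuda is None on CPU-only builds."""
    if nvcc_release is None or torch_cuda is None:
        return False
    return nvcc_release.split(".")[:2] == torch_cuda.split(".")[:2]


def main() -> None:
    # Assumption: torch is installed and nvcc is on PATH.
    import torch

    out = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    nvcc_release = parse_nvcc_release(out)
    print("nvcc:", nvcc_release, "| torch:", torch.version.cuda)
    print("match:", versions_match(nvcc_release, torch.version.cuda))
```

Call main() to run the check. A match only rules out this one cause; it does not guarantee the install is otherwise healthy.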

JThh avatar Apr 16 '24 19:04 JThh

Same error here, and nvcc --version matches python -c 'import torch; print(torch.version.cuda);'. (Screenshot, 2024-04-17 11:16:13)

erichtho avatar Apr 17 '24 03:04 erichtho

I also found this with dmesg: (screenshot, 2024-04-17 15:19:03)

erichtho avatar Apr 17 '24 07:04 erichtho

Same here: nvcc --version and python -c 'import torch; print(torch.version.cuda);' return the same CUDA version.

11.8

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

MrD005 avatar Apr 17 '24 08:04 MrD005

Hi, I downgraded torch to 2.1.2 and that resolved the problem (I also changed xformers to v0.0.23.post1). Here is how I located the problem:

1. Debugging with pdb showed that torch.nn.Conv3d raises the segmentation fault.
2. I searched and found a known issue that attributes it to a oneDNN upgrade; PyTorch 2.1.2 works. See: https://github.com/pytorch/pytorch/issues/120406

Hope this helps.
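To check for a crash like this without taking down the main process, the suspect call can be run in a child interpreter and its exit status inspected. The sketch below swaps pdb for a subprocess with faulthandler enabled and assumes a POSIX system. run_isolated and CONV3D_SNIPPET are hypothetical names, and the Conv3d shapes, dtype, and device are placeholders, since the thread does not say which inputs trigger the fault.

```python
import signal
import subprocess
import sys

# Hypothetical minimal probe: one Conv3d forward pass.
# Shapes, dtype, and device are placeholders, not taken from Open-Sora.
CONV3D_SNIPPET = """
import torch
conv = torch.nn.Conv3d(3, 8, kernel_size=3)
conv(torch.randn(1, 3, 8, 32, 32))
print("Conv3d OK, torch", torch.__version__)
"""


def run_isolated(snippet: str) -> str:
    """Run `snippet` in a child interpreter and describe how it exited."""
    proc = subprocess.run(
        # -X faulthandler makes the child dump a Python traceback on a fatal signal.
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        name = signal.Signals(-proc.returncode).name
        return f"killed by {name}\n{proc.stderr}"
    if proc.returncode != 0:
        return f"exited with code {proc.returncode}\n{proc.stderr}"
    return proc.stdout.strip()
```

Running print(run_isolated(CONV3D_SNIPPET)) before and after changing the torch version shows whether the probe still dies from a signal. A child killed by signal 11 is what torchrun reports as exitcode -11.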

erichtho avatar Apr 17 '24 11:04 erichtho

Thanks for sharing, @erichtho. @MrD005, would this solve your issue as well?

JThh avatar Apr 17 '24 13:04 JThh

Thanks, it solved the problem.

MrD005 avatar Apr 17 '24 20:04 MrD005