torch.distributed.elastic.multiprocessing.api
Traceback (most recent call last):
  File "/root/anaconda3/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/inference.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2024-04-16_16:40:07
  host      : e2e-84-47.ssdcloudindia.net
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 28813)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 28813
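A note on reading the exitcode field: a negative exit code conventionally means the child process was killed by the signal with that number, so -11 corresponds to signal 11. The mapping can be sketched in Python as follows; the helper name describe_exitcode is hypothetical and not part of torchrun.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Return a human-readable description of a child process exit code."""
    if exitcode < 0:
        # Negative codes follow the subprocess convention:
        # -N means "killed by signal N".
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            return f"killed by unknown signal {-exitcode}"
    return "success" if exitcode == 0 else f"exited with status {exitcode}"


print(describe_exitcode(-11))  # on Linux: killed by SIGSEGV
```

A SIGSEGV comes from native code rather than from a Python exception, which is why error_file is <N/A> and no Python traceback was recorded.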
It could be due to a mismatch between the CUDA and PyTorch versions. Run nvcc --version and python -c 'import torch; print(torch.version.cuda);' to see if they match.
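The comparison suggested above can also be scripted. The following is a minimal sketch, assuming nvcc is on PATH and that its output contains a "release X.Y" string; the function names are hypothetical helpers, not part of PyTorch or the CUDA toolkit.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'release X.Y' version from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def toolkit_cuda_version() -> Optional[str]:
    """Run nvcc if it is on PATH; return None otherwise."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version() -> Optional[str]:
    """Return torch.version.cuda, or None if torch is missing or CPU-only."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    toolkit, built_with = toolkit_cuda_version(), torch_cuda_version()
    print(f"nvcc: {toolkit}, torch built with: {built_with}")
    same = toolkit is not None and toolkit == built_with
    print("match" if same else "mismatch or unknown")
```

As the replies in this thread show, matching versions do not rule out a segmentation fault, so this check only eliminates one possible cause.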
Same error, and nvcc --version matches python -c 'import torch; print(torch.version.cuda);'.
also found with dmesg:
Same here. nvcc -V and python -c 'import torch; print(torch.version.cuda);' return the same CUDA version:
11.8
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Hi, I downgraded torch to 2.1.2 and that resolved the problem (I also changed xformers to v0.0.23.post1). Here is how I located the problem:
1. Debugged with pdb and found that torch.nn.Conv3d raises the segmentation fault.
2. Searched and found a known issue that attributes it to a oneDNN upgrade; PyTorch 2.1.2 works. See: https://github.com/pytorch/pytorch/issues/120406
Hope this helps.
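As a complement to the pdb approach described above, Python's built-in faulthandler module prints the Python traceback when the interpreter receives SIGSEGV, which can point at the crashing call without stepping through the code. This is a minimal sketch, assuming a POSIX system; run_with_faulthandler and CRASH_SNIPPET are hypothetical names, and the snippet raises SIGSEGV on purpose as a stand-in for the real crash.

```python
import subprocess
import sys


def run_with_faulthandler(snippet: str) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child process with faulthandler enabled."""
    # -X faulthandler makes the child dump its Python traceback
    # to stderr when it receives a fatal signal such as SIGSEGV.
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
    )


# Stand-in for the crashing call: the child sends SIGSEGV to itself.
CRASH_SNIPPET = "import os, signal\nos.kill(os.getpid(), signal.SIGSEGV)\n"

result = run_with_faulthandler(CRASH_SNIPPET)
print(result.returncode)  # negative signal number, as in the torchrun report
print(result.stderr)      # includes the Python frame that was executing
```

To investigate the actual failure, the stand-in snippet could be replaced with one that imports torch and runs a small torch.nn.Conv3d forward pass. Setting PYTHONFAULTHANDLER=1 in the environment before launching the script enables the same handler as -X faulthandler.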
Thanks for sharing, @erichtho. @MrD005, would this solve your issue as well?
Thanks, it solved the problem.
