Issues with distributed training environment

Open · V1oletM opened this issue 10 months ago · 1 comment

Hi FutureXiang, thanks for your code! When training on CIFAR-10, I encounter an error during distributed training:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 75972) of binary: /home/wangyiming/anaconda3/envs/diffusion/bin/python
Traceback (most recent call last):
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I'm not sure whether it's a version issue. Could you please provide the environment.yaml file? Thanks!
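One thing that may be worth ruling out, offered as a guess rather than a diagnosis: exit code 2 is the status Python's argparse uses when it rejects the command line, and some PyTorch releases have `torch.distributed.launch` pass the rank as `--local-rank` rather than `--local_rank`. If the training script only declares one spelling, the child process could exit this way. The sketch below is not taken from this repository; `parse_local_rank` is a hypothetical helper that accepts both spellings and falls back to the `LOCAL_RANK` environment variable that `torchrun` sets.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Hypothetical helper (not from this repo): read the local rank
    however the launcher happens to supply it."""
    parser = argparse.ArgumentParser()
    # Accept both spellings; either one is stored in args.local_rank.
    # If neither is given, fall back to the LOCAL_RANK environment variable.
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    # parse_known_args returns unrelated flags instead of exiting with code 2.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank
```

Calling `parse_local_rank()` with no argument reads `sys.argv`, so it can sit at the top of a training script. Whether this applies here depends on the installed PyTorch version, which the traceback alone does not show.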

V1oletM · Apr 11 '24 12:04