ColossalAI
[BUG]: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
🐛 Describe the bug
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 130182) of binary: /usr/bin/python3.8
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2023-04-15_19:02:54
  host       : I11f4bfe327002017cc
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 130182)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --do_train --cuda --do_valid --do_test --data_path data/wn18rr --model RotatE -n 256 -b 1024 -d 1000 -g 24.0 -a 1.0 -adv -lr 0.0001 --max_steps 150000 -save results/RotatE_wn18rr_0 --test_batch_size 16 -de on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
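Note that the summary above only says the worker exited with code 1; the real exception is hidden. The URL in the traceback field points to PyTorch's elastic error-handling docs, which describe wrapping the entry point with the record decorator so the worker's actual traceback gets captured. A minimal sketch, assuming train.py exposes a main() entry point (the function body here is only a placeholder):

# train.py (sketch) -- wrap the entry point with the elastic `record`
# decorator so the worker's real exception and traceback are recorded
# instead of the bare "exitcode: 1" shown in the summary above.
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    # ... the existing training logic of train.py would go here ...
    raise RuntimeError("placeholder to demonstrate the recorded traceback")


if __name__ == "__main__":
    main()

With that in place, re-running the same torchrun command should surface the underlying error rather than just ChildFailedError.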
Environment
No response
Hi, the error log is not very meaningful to me. Can you share your command and your full error log?
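For reference, one way to capture the full error log the launcher hides is to run the same entry point in a single local process, where the underlying Python exception prints directly. A rough sketch; the argument list below is abbreviated from the command in the log above and would need to match the real run, and a script that initializes a process group may additionally need the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables (or simply torchrun --nproc_per_node=1 with the @record decorator shown earlier):

# reproduce_single_process.py (sketch) -- run train.py outside the launcher so
# the underlying exception is printed instead of being wrapped in
# ChildFailedError by torchrun.
import runpy
import sys

# Abbreviated from the reported command; adjust to the actual flags used.
sys.argv = [
    "train.py",
    "--do_train", "--cuda",
    "--data_path", "data/wn18rr",
    "--model", "RotatE",
]
runpy.run_path("train.py", run_name="__main__")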