
Llama-3 encounters ncclSystemError while training with ZeRO-2 and transformers.Trainer.


After simply replacing 'llama-2 7b' with 'llama-3 8b' in a previously working repo, an NCCL error occurs. For context, the swap looks roughly like the sketch below; the traceback follows after it.
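A minimal sketch of the kind of swap described above, not the actual repo: only the model name changes, and ZeRO-2 comes from a DeepSpeed JSON config passed through TrainingArguments. The checkpoint name, dataset, and DeepSpeed config filename are placeholders.

```python
# Minimal sketch of the model swap; paths, dataset, and the DeepSpeed
# config filename are placeholders, not the values from the actual repo.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "meta-llama/Meta-Llama-3-8B"  # was "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

class DummyDataset(Dataset):
    """Stand-in for the repo's pre-training dataset."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        ids = tokenizer("hello world", return_tensors="pt").input_ids[0]
        return {"input_ids": ids, "labels": ids.clone()}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    deepspeed="ds_zero2_config.json",  # assumed name of the ZeRO-2 config file
)

Trainer(model=model, args=args, train_dataset=DummyDataset()).train()
```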

```
Traceback (most recent call last):
  File "pre-train.py", line 124, in
    trainer.train(resume_from_checkpoint = args.resume_from_checkpoint)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/trainer.py", line 2278, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/trainer.py", line 2644, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/trainer.py", line 3756, in _nested_gather
    tensors = distributed_concat(tensors)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 221, in distributed_concat
    dist.all_gather(output_tensors, tensor)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2275, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.17.1
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error: socketStartConnect: Connect to 10.19.35.240<58809> failed : Software caused connection abort
```
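The failure happens inside dist.all_gather when the Trainer gathers the loss across ranks. A minimal sketch that exercises the same collective in isolation may help confirm whether plain NCCL communication between these nodes works at all; the launcher command, node count, and port below are assumptions for illustration, not values from the actual setup.

```python
# Minimal sketch isolating the collective that fails in the traceback above
# (dist.all_gather over NCCL). Launch with torchrun on the same nodes, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=1 --rdzv_backend=c10d \
#            --rdzv_endpoint=<master-host>:29500 nccl_allgather_check.py
# The script name, node count, and port are placeholders.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    tensor = torch.tensor([float(rank)], device="cuda")
    gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)  # same call that raises ncclSystemError above

    print(f"rank {rank}: gathered {[t.item() for t in gathered]}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```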

By the way, I do not have permission to turn off the firewall on this compute resource.
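For reference, the kind of NCCL environment settings that can be tuned without touching the firewall look like the sketch below; the interface name is a placeholder for this cluster, not a confirmed value.

```python
# Sketch of NCCL-related environment variables that can be set without
# firewall access. Set these before torch.distributed initializes.
import os

os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging for diagnosis
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # placeholder: pin NCCL to a known-reachable NIC
os.environ["NCCL_IB_DISABLE"] = "1"        # fall back to TCP sockets if InfiniBand is unavailable
```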

I wonder why this issue happens and how to fix it.

Esperanto-mega · Apr 19 '24 08:04