[BUG]: llama2 training on multi-node Slurm reports error: errno: 98 - Address already in use
🐛 Describe the bug
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29601 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29601 (errno: 98 - Address already in use).
srun: error: HOST-10-140-60-1: tasks 0-5,7: Exited with exit code 1
Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=10.140.60.208 --master_port=29601 pretrain.py --config 70b --grad_checkpoint --batch_size 4 --num_epochs 30 --max_length 4096 --lr 3e-4 --weigth_decay 0.1 --warmup_steps 2000 --mixed_precision fp16 --save_interval 5000 --save_dir 70B_pretrain_ckpt on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
The start command is shown above; it reported an error when running multi-node model training on the cluster. I am sure the port was not in use. What can I do to troubleshoot the problem? How can I run the training on a cluster?
Environment
torch 1.13.1+cu117 cuda-11.7 Python 3.10.0
ps -ef | grep
then kill -9 the daemon that is bound to the port
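For reference, a minimal sketch of the same check in Python (this assumes the psutil package is installed; 29601 is taken from the --master_port of the failing command):

import psutil

PORT = 29601  # the --master_port from the failing torchrun command

# List sockets listening on the rendezvous port; pid may be None for
# sockets owned by other users when not running as root.
for conn in psutil.net_connections(kind="inet"):
    if conn.laddr and conn.laddr.port == PORT and conn.status == psutil.CONN_LISTEN:
        name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
        print(f"port {PORT} is held by pid={conn.pid} ({name})")

The printed pid can then be passed to kill -9.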
I have done this before, but it doesn't work. And I'm sure the master address and port are available. Still, thank you very much.
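For what it's worth, one way to double-check that on a given node is a plain socket bind test (just a sketch; run it on the node that will host the master before launching):

import socket

# Binding the rendezvous port succeeds only if no other process holds it;
# otherwise this raises OSError with errno 98 (Address already in use).
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", 29601))
    print("port 29601 is free on this host")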
Finally, I solved the problem as below:
First, using python xx.py instead of colossalai run --nproc_per_node 8 xx.py works well.
So the start command is
srun -p {PARTITION} -N 4 -n 32 --ntasks-per-node=8 --gpus-per-task=1 python benchmark.py -c 70b -g -x -b 2 --max_length 4096 --tp 1 --pp 1 > train_70B_benchmark.log 2>&1 &
Second, there are some code modifications (a sketch of the underlying mechanism follows after the helper function below):
In benchmark.py at line 79, use
colossalai.launch_from_slurm(config={}, host=get_master_node(), port=12346)
instead of
colossalai.launch_from_torch({})
And the function get_master_node is:
def get_master_node():
    import os
    import subprocess

    if os.getenv("SLURM_JOB_ID") is None:
        raise RuntimeError("get_master_node can only be used in a Slurm launch!")
    # The first hostname in the job's node list serves as the master node.
    result = subprocess.check_output('scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1', shell=True)
    result = result.decode("utf8").strip()
    return result
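For context, my understanding of why this works: with srun each task is already a separate Python process, so no torchrun launcher has to bind a master port on every node, and launch_from_slurm takes the rank and world size from the environment that srun sets, while the rendezvous host/port are passed explicitly. A rough sketch of the same idea with plain torch.distributed (an illustration only, not ColossalAI's actual implementation; SLURM_PROCID and SLURM_NPROCS are set by srun for every task):

import os
import torch.distributed as dist

def init_from_slurm(host: str, port: int = 12346) -> None:
    # srun sets SLURM_PROCID (global rank) and SLURM_NPROCS (world size)
    # for each task, so torchrun's RANK/WORLD_SIZE variables are not needed.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NPROCS"])
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{host}:{port}",
        rank=rank,
        world_size=world_size,
    )

# e.g. init_from_slurm(get_master_node(), port=12346), mirroring the call above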