[BUG]: llama2 training on multi-node Slurm reports error: errno: 98 - Address already in use
🐛 Describe the bug
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29601 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29601 (errno: 98 - Address already in use).
srun: error: HOST-10-140-60-1: tasks 0-5,7: Exited with exit code 1
Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=10.140.60.208 --master_port=29601 pretrain.py --config 70b --grad_checkpoint --batch_size 4 --num_epochs 30 --max_length 4096 --lr 3e-4 --weigth_decay 0.1 --warmup_steps 2000 --mixed_precision fp16 --save_interval 5000 --save_dir 70B_pretrain_ckpt on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
The start command is shown above; it reported an error when running multi-node model training on the cluster. I am sure the port was not in use. What can I do to troubleshoot the problem? How can I run the training on a cluster?
Environment
torch 1.13.1+cu117 cuda-11.7 Python 3.10.0
ps -ef | grep
then kill -9 the daemon that is bound to the port
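For reference, a minimal sketch of the same check in Python (this assumes the psutil package is installed; 29601 is taken from the --master_port of the failing command):

import psutil

PORT = 29601  # the --master_port from the failing torchrun command

# List sockets listening on the rendezvous port; pid may be None for
# sockets owned by other users when not running as root.
for conn in psutil.net_connections(kind="inet"):
    if conn.laddr and conn.laddr.port == PORT and conn.status == psutil.CONN_LISTEN:
        name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
        print(f"port {PORT} is held by pid={conn.pid} ({name})")

The printed pid can then be passed to kill -9.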
I have done this before, but it doesn't work. And I'm sure the master address and port are available. Still, thank you very much.
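For what it's worth, one way to double-check that on a given node is a plain socket bind test (just a sketch; run it on the node that will host the master before launching):

import socket

# Binding the rendezvous port succeeds only if no other process holds it;
# otherwise this raises OSError with errno 98 (Address already in use).
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", 29601))
    print("port 29601 is free on this host")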
Finally, I solved the problem as below:
First, using python xx.py instead of colossalai run --nproc_per_node 8 xx.py works well.
So the start command is
srun -p {PARTITION} -N 4 -n 32 --ntasks-per-node=8 --gpus-per-task=1 python benchmark.py -c 70b -g -x -b 2 --max_length 4096 --tp 1 --pp 1 > train_70B_benchmark.log 2>&1 &
Second, there are some code modifications (a sketch of the underlying mechanism follows after the helper function below):
In benchmark.py at line 79, use
colossalai.launch_from_slurm(config={}, host=get_master_node(), port=12346)
instead of
colossalai.launch_from_torch({})
And the function get_master_node is:
def get_master_node():
    import os
    import subprocess

    if os.getenv("SLURM_JOB_ID") is None:
        raise RuntimeError("get_master_node can only be used in a Slurm launch!")
    # The first hostname in the job's node list serves as the master node.
    result = subprocess.check_output('scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1', shell=True)
    result = result.decode("utf8").strip()
    return result
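For context, my understanding of why this works: with srun each task is already a separate Python process, so no torchrun launcher has to bind a master port on every node, and launch_from_slurm takes the rank and world size from the environment that srun sets, while the rendezvous host/port are passed explicitly. A rough sketch of the same idea with plain torch.distributed (an illustration only, not ColossalAI's actual implementation; SLURM_PROCID and SLURM_NPROCS are set by srun for every task):

import os
import torch.distributed as dist

def init_from_slurm(host: str, port: int = 12346) -> None:
    # srun sets SLURM_PROCID (global rank) and SLURM_NPROCS (world size)
    # for each task, so torchrun's RANK/WORLD_SIZE variables are not needed.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NPROCS"])
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{host}:{port}",
        rank=rank,
        world_size=world_size,
    )

# e.g. init_from_slurm(get_master_node(), port=12346), mirroring the call above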