
How to train on multi nodes

Open slchenchn opened this issue 1 year ago • 1 comment

Hi, I'm trying to use your project for multi-node training but it seems configured only for single-node. Can you provide guidance or an update to support multi-node setups?

slchenchn avatar May 09 '24 08:05 slchenchn

@slchenchn

Take training on 2 nodes × 8 GPUs as an example:

torch dist:

Note: $NODE_0_ADDR is the IP address of the node 0 machine.

# execute on node 0
NPROC_PER_NODE=8 NNODES=2 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=0 xtuner train $CONFIG --deepspeed $DS_CONFIG

# execute on node 1
NPROC_PER_NODE=8 NNODES=2 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=1 xtuner train $CONFIG --deepspeed $DS_CONFIG
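Since the two commands differ only in NODE_RANK, a small wrapper can avoid copy-paste mistakes. This is a hypothetical helper script, not part of xtuner; the function name and the example IP are assumptions:

```shell
# build_cmd: compose the xtuner launch command for a given node rank,
# so the same script can be used on both machines.
build_cmd() {
  node_rank=$1      # 0 on node 0, 1 on node 1
  node_0_addr=$2    # IP address of node 0
  echo "NPROC_PER_NODE=8 NNODES=2 PORT=29600 ADDR=$node_0_addr NODE_RANK=$node_rank xtuner train \$CONFIG --deepspeed \$DS_CONFIG"
}

# Print the command each node should run (10.0.0.1 is a placeholder IP):
build_cmd 0 10.0.0.1   # run the printed command on node 0
build_cmd 1 10.0.0.1   # run the printed command on node 1
```

To actually launch instead of printing, replace echo with eval, or pipe the output to sh on each node.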

slurm:

srun -p $PARTITION --nodes=2 --gres=gpu:8 --ntasks-per-node=8 xtuner train $CONFIG --deepspeed $DS_CONFIG --launcher slurm
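For batch submission rather than an interactive srun, the same launch can be wrapped in an sbatch job script. This is a sketch under the assumption of a standard Slurm setup; the file name and partition variable are placeholders:

```shell
#!/bin/bash
# train.sbatch -- submit with: sbatch train.sbatch
#SBATCH --partition=$PARTITION
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8

# srun distributes one task per GPU across both nodes; xtuner picks up
# rank/world-size information from the Slurm environment via --launcher slurm.
srun xtuner train $CONFIG --deepspeed $DS_CONFIG --launcher slurm
```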

LZHgrla avatar May 10 '24 02:05 LZHgrla