xtuner
How to train on multiple nodes
Hi, I'm trying to use your project for multi-node training but it seems configured only for single-node. Can you provide guidance or an update to support multi-node setups?
@slchenchn
Take training on 2 nodes with 8 GPUs each (2x8) as an example:
torch dist:
Note: $NODE_0_ADDR is the IP address of the node 0 machine.
# execute on node 0
NPROC_PER_NODE=8 NNODES=2 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=0 xtuner train $CONFIG --deepspeed $DS_CONFIG
# execute on node 1
NPROC_PER_NODE=8 NNODES=2 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=1 xtuner train $CONFIG --deepspeed $DS_CONFIG
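The two commands above differ only in NODE_RANK. A minimal sketch of a helper that prints the command to run on each node (the NODE_0_ADDR, CONFIG, and DS_CONFIG defaults here are placeholder values, not real paths):

```shell
# Sketch: print the per-node launch command for a given node rank.
# The fallback values (10.0.0.1, my_config.py, deepspeed_zero2) are
# placeholders for illustration only; set the real env vars before use.
print_launch_cmd() {
    rank=$1
    echo "NPROC_PER_NODE=8 NNODES=2 PORT=29600 ADDR=${NODE_0_ADDR:-10.0.0.1}" \
         "NODE_RANK=${rank} xtuner train ${CONFIG:-my_config.py}" \
         "--deepspeed ${DS_CONFIG:-deepspeed_zero2}"
}

print_launch_cmd 0   # copy-paste onto node 0
print_launch_cmd 1   # copy-paste onto node 1
```

All nodes must use the same ADDR (node 0's IP) and PORT so that the workers can rendezvous; only NODE_RANK changes per machine.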
slurm:
srun -p $PARTITION --nodes=2 --gres=gpu:8 --ntasks-per-node=8 xtuner train $CONFIG --deepspeed $DS_CONFIG --launcher slurm
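On clusters where jobs are submitted rather than run interactively, the srun line above can be wrapped in a batch script. This is a sketch under assumptions: the partition name is hypothetical, and CONFIG/DS_CONFIG are placeholders you must set yourself.

```shell
#!/bin/bash
#SBATCH --partition=my_partition   # hypothetical partition name
#SBATCH --nodes=2                  # 2 nodes, matching the 2x8 example
#SBATCH --gres=gpu:8               # 8 GPUs per node
#SBATCH --ntasks-per-node=8        # one task per GPU

# Same launch as the interactive srun example above;
# CONFIG and DS_CONFIG are placeholders to fill in.
srun xtuner train $CONFIG --deepspeed $DS_CONFIG --launcher slurm
```

Submit it with sbatch; the --launcher slurm flag tells xtuner to read rank and world-size information from the Slurm environment instead of the torch-dist variables.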