
How to do multi-machine SPMD training?

mars1248 opened this issue 1 year ago • 6 comments

❓ Questions and Help

At present, I have gotten single-machine SPMD training working, but I do not know how to run multi-machine SPMD training. Could you give me a running example? @vanbasten23

mars1248 avatar Jan 23 '24 03:01 mars1248

Do you want to run it on GPU or TPU?

JackCaoG avatar Jan 23 '24 18:01 JackCaoG

@mars1248 has been working on SPMD on GPU, so I assume it's GPU. https://github.com/pytorch/xla/issues/6256 gives a proposal for how to do multi-node SPMD training.

E.g. if you have 2 GPU VMs, you can run these 2 commands, one on each VM respectively:

PJRT_DEVICE=CUDA torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 --rdzv_endpoint="<ip>:12355" spmdTest.py
PJRT_DEVICE=CUDA torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 --rdzv_endpoint="<ip>:12355" spmdTest.py

where <ip> is the internal IP address of the first GPU VM.
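
For reference, here is a minimal sketch of what a spmdTest.py script might contain, assuming the torch_xla SPMD API (xr.use_spmd, xs.Mesh, xs.mark_sharding); the model and tensor shapes are illustrative placeholders, not from this thread:

import numpy as np
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

# SPMD mode must be enabled before any XLA tensors are created.
xr.use_spmd()

# Build one flat mesh over every device across all nodes.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ('data',))

device = xm.xla_device()
model = nn.Linear(128, 128).to(device)  # placeholder model

# Shard the batch dimension of the input across the 'data' axis.
x = torch.randn(16, 128).to(device)
xs.mark_sharding(x, mesh, ('data', None))

loss = model(x).sum()
loss.backward()
xm.mark_step()

Each process runs the same program; the XLA compiler then partitions the computation across all devices in the mesh.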

vanbasten23 avatar Jan 24 '24 03:01 vanbasten23

@vanbasten23 Thank you for your answer; I now have multi-machine distributed training running. I would like to ask how to apply AMP (automatic mixed precision) in SPMD training mode?

mars1248 avatar Jan 31 '24 04:01 mars1248

The AMP logic should be the same with or without SPMD; you can take a look at https://github.com/pytorch/xla/blob/master/docs/amp.md#amp-for-xlagpu
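
As a rough sketch of the XLA:GPU pattern that doc describes (the model, optimizer, and loss here are placeholders of mine, not from the doc):

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.amp import autocast, GradScaler

device = xm.xla_device()
model = nn.Linear(128, 128).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()  # used on XLA:GPU; TPU bf16 needs no scaler

def train_step(inputs, targets):
    optimizer.zero_grad()
    # Run the forward pass in mixed precision on the XLA device.
    with autocast(device):
        loss = nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss, backprop, and let the scaler do the inf checks.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    xm.mark_step()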

JackCaoG avatar Jan 31 '24 18:01 JackCaoG

@JackCaoG I followed the doc above, but I ran into this problem: AssertionError: No inf checks were recorded for this optimizer. Could you give me some ideas for troubleshooting it?


mars1248 avatar Feb 21 '24 09:02 mars1248


@JackCaoG I tracked it down: the problem is that during training, the grads of all the model parameters are None.
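
For anyone hitting the same assertion, a quick check along these lines (a sketch, not from the original report) surfaces the symptom; GradScaler.step() raises that error when none of the optimizer's parameters ever received a gradient to inf-check:

# After scaler.scale(loss).backward(), list the parameters whose
# .grad is still None; if all of them are, the scaler has nothing
# to inf-check and raises the assertion above.
for name, param in model.named_parameters():
    if param.grad is None:
        print('grad is None for:', name)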

mars1248 avatar Feb 21 '24 09:02 mars1248