
How to do multi-machine SPMD training?

mars1248 opened this issue 1 year ago • 6 comments

❓ Questions and Help

At present, I have gotten single-machine SPMD training working, but I do not know how to run multi-machine SPMD training. Could you give me a running example? @vanbasten23

mars1248 avatar Jan 23 '24 03:01 mars1248

Do you want to run it on GPU or TPU?

JackCaoG avatar Jan 23 '24 18:01 JackCaoG

@mars1248 has been working on SPMD on GPU, so I assume it's GPU. https://github.com/pytorch/xla/issues/6256 gives a proposal for how to do multi-node SPMD training.

E.g. if you have 2 GPU VMs, you can run these 2 commands, one on each VM respectively:

PJRT_DEVICE=CUDA torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 --rdzv_endpoint="<ip>:12355" spmdTest.py
PJRT_DEVICE=CUDA torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 --rdzv_endpoint="<ip>:12355" spmdTest.py

where <ip> is the internal IP address of the first GPU VM.
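
For reference, here is a minimal sketch of what a spmdTest.py script might contain, assuming the torch_xla SPMD API (xr.use_spmd, xs.Mesh, xs.mark_sharding); the model and tensor shapes are illustrative placeholders, not from this thread:

import numpy as np
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

# SPMD mode must be enabled before any XLA tensors are created.
xr.use_spmd()

# Build one flat mesh over every device across all nodes.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ('data',))

device = xm.xla_device()
model = nn.Linear(128, 128).to(device)  # placeholder model

# Shard the batch dimension of the input across the 'data' axis.
x = torch.randn(16, 128).to(device)
xs.mark_sharding(x, mesh, ('data', None))

loss = model(x).sum()
loss.backward()
xm.mark_step()

Each process runs the same program; the XLA compiler then partitions the computation across all devices in the mesh.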

vanbasten23 avatar Jan 24 '24 03:01 vanbasten23

@vanbasten23 Thank you for your answer; I now have multi-machine distributed training running. I would like to ask how to apply AMP (automatic mixed precision) in SPMD training mode?

mars1248 avatar Jan 31 '24 04:01 mars1248

The AMP logic should be the same with or without SPMD; you can take a look at https://github.com/pytorch/xla/blob/master/docs/amp.md#amp-for-xlagpu
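
As a rough sketch of the XLA:GPU pattern that doc describes (the model, optimizer, and loss here are placeholders of mine, not from the doc):

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.amp import autocast, GradScaler

device = xm.xla_device()
model = nn.Linear(128, 128).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()  # used on XLA:GPU; TPU bf16 needs no scaler

def train_step(inputs, targets):
    optimizer.zero_grad()
    # Run the forward pass in mixed precision on the XLA device.
    with autocast(device):
        loss = nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss, backprop, and let the scaler do the inf checks.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    xm.mark_step()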

JackCaoG avatar Jan 31 '24 18:01 JackCaoG

@JackCaoG I followed the doc above, but I ran into this problem: AssertionError: No inf checks were recorded for this optimizer. Could you give me some ideas for troubleshooting it?


mars1248 avatar Feb 21 '24 09:02 mars1248


@JackCaoG I tracked it down: the problem is that during training, the grads of all the model parameters are None.
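
For anyone hitting the same assertion, a quick check along these lines (a sketch, not from the original report) surfaces the symptom; GradScaler.step() raises that error when none of the optimizer's parameters ever received a gradient to inf-check:

# After scaler.scale(loss).backward(), list the parameters whose
# .grad is still None; if all of them are, the scaler has nothing
# to inf-check and raises the assertion above.
for name, param in model.named_parameters():
    if param.grad is None:
        print('grad is None for:', name)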

mars1248 avatar Feb 21 '24 09:02 mars1248