How to do multi-machine SPMD training?
❓ Questions and Help
I have gotten single-machine SPMD training to work, but I don't know how to run multi-machine SPMD training. Could you give me a running example? @vanbasten23
Do you want to run it on GPU or TPU?
@mars1248 has been working on SPMD on GPU, so I assume it's GPU. https://github.com/pytorch/xla/issues/6256 gives a proposal for how to do multi-node SPMD training.
E.g. if you have 2 GPU VMs, you can run these 2 commands, one on each VM respectively:
PJRT_DEVICE=CUDA torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 --rdzv_endpoint="<ip>:12355" spmdTest.py
PJRT_DEVICE=CUDA torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 --rdzv_endpoint="<ip>:12355" spmdTest.py
where <ip> is an IP address for the rendezvous endpoint (e.g. of the first VM) that both nodes can reach.
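For reference, here is a minimal sketch of what a multi-node spmdTest.py could contain; the model, mesh shape, and tensor sizes are illustrative assumptions rather than anything from this thread. The key parts are enabling SPMD mode and building a single mesh over all devices across both nodes:

```python
# Hypothetical minimal spmdTest.py sketch; model and mesh layout are illustrative.
import numpy as np
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

xr.use_spmd()  # switch the runtime into SPMD mode

# One logical mesh spanning every addressable device across all nodes.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices, 1), ('data', 'model'))

device = xm.xla_device()
model = nn.Linear(128, 128).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    optimizer.zero_grad()
    x = torch.randn(32, 128).to(device)
    xs.mark_sharding(x, mesh, ('data', None))  # shard the batch dimension
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    xm.mark_step()
```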
@vanbasten23 Thank you for your answer; I now have multi-machine distributed training running. I would like to ask how to apply AMP (automatic mixed precision) in SPMD training mode?
The AMP logic should be the same with or without SPMD; you can take a look at https://github.com/pytorch/xla/blob/master/docs/amp.md#amp-for-xlagpu
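As a reference, here is a minimal sketch along the lines of that doc's XLA:GPU recipe: the forward pass runs under autocast and the optimizer step goes through GradScaler. The model, data, and learning rate are placeholders, and how this interacts with SPMD is an assumption rather than something confirmed in this thread.

```python
# AMP sketch for XLA:GPU, loosely following the linked amp.md; model/data are placeholders.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.amp import autocast, GradScaler

device = xm.xla_device()
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()  # loss scaling is only needed on XLA:GPU, not TPU

for step in range(10):
    optimizer.zero_grad()
    data = torch.randn(32, 128).to(device)
    target = torch.randint(0, 10, (32,)).to(device)
    with autocast(device):                 # run the forward pass in mixed precision
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()          # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                 # unscales grads, skips the step on inf/NaN
    scaler.update()
    xm.mark_step()
```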
@JackCaoG I followed the linked documentation, but ran into this problem: AssertionError: No inf checks were recorded for this optimizer. Could you suggest some ways to debug this?
@JackCaoG I tracked it down: the problem is that during training, the grads of the model parameters are all None.
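That is consistent with the assertion, since GradScaler.step typically raises "No inf checks were recorded for this optimizer" when the unscale step finds no grads to inf-check. A small hypothetical helper (not from this thread) can confirm the diagnosis after the backward pass:

```python
def report_missing_grads(model):
    # Parameters whose .grad is still None after backward() give GradScaler
    # nothing to inf-check, which can trigger the assertion above.
    missing = [name for name, p in model.named_parameters()
               if p.requires_grad and p.grad is None]
    print(f'{len(missing)} parameters without grads:', missing[:5])
    return missing
```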