
`local_rank` or `rank` for multi-node FSDP

Open Emerald01 opened this issue 1 year ago • 0 comments

For multi-node FSDP, is there any meaningful difference between `local_rank` and `rank` here? As I understand it, `local_rank` is the rank of a process within its node, while `rank` is its global rank across all nodes.

I see a few places where `local_rank` is used specifically. For example:

https://github.com/pytorch/examples/blob/main/distributed/FSDP/T5_training.py#L111 `torch.cuda.set_device(local_rank)`

and

https://github.com/pytorch/examples/blob/main/distributed/FSDP/utils/train_utils.py#L48 `batch[key] = batch[key].to(local_rank)`

Would there be any problem with using `rank` instead?
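To make the question concrete, here is a minimal sketch of how I understand the two values to relate. It assumes a launcher such as `torchrun`, which sets the `RANK` and `LOCAL_RANK` environment variables, and it assumes global ranks are assigned contiguously per node with the same number of GPUs on every node. The helper names `device_index` and `env_ranks` and the 2-node, 4-GPU layout are hypothetical, not taken from the linked examples.

```python
import os


def device_index(rank, gpus_per_node):
    """Local GPU index for a global rank.

    Assumption: ranks are assigned contiguously per node and every node
    has the same number of GPUs, so rank = node * gpus_per_node + local_rank.
    """
    return rank % gpus_per_node


def env_ranks():
    """Read (rank, local_rank) from the environment variables set by torchrun.

    Falls back to 0 for both when run as a plain single process.
    """
    return int(os.environ.get("RANK", 0)), int(os.environ.get("LOCAL_RANK", 0))


# Hypothetical cluster: 2 nodes with 4 GPUs each, so 8 processes in total.
gpus_per_node = 4
for rank in range(2 * gpus_per_node):
    node = rank // gpus_per_node
    local_rank = device_index(rank, gpus_per_node)
    # Device indices on each node only run from 0 to gpus_per_node - 1,
    # so the global rank is a valid device index only on the first node.
    rank_is_valid_device = rank < gpus_per_node
    print(f"node={node} rank={rank} local_rank={local_rank} "
          f"rank usable as device index: {rank_is_valid_device}")
```

If this is right, then on a single node `rank` and `local_rank` coincide and either would work, but on the second node a global rank such as 5 would not correspond to any of the local device indices 0 to 3, which may be why the examples pass `local_rank` to `torch.cuda.set_device` and `.to(...)`.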

Emerald01, May 30 '24 19:05