torchft
torchft copied to clipboard
Unify replica terminology across the project
Currently in different places of code and docs these things are named differently:
- replica groups are sometimes called replicas (and replica_id sometimes means replica_group_id, replica_rank means replica_group_rank)
- replicas sometimes mean workers, but sometimes they mean replica groups
- etc.
Examples of the terminology problem: https://github.com/pytorch/torchft/blob/abc660b6a60b015dbff2a591508d1b4932ab20fa/torchft/torchx.py#L64 https://github.com/pytorch/torchft/blob/abc660b6a60b015dbff2a591508d1b4932ab20fa/torchft/torchx.py#L69 https://github.com/pytorch/torchft/blob/abc660b6a60b015dbff2a591508d1b4932ab20fa/src/manager.rs#L391
What can be done:
- define unified terminology for the key objects
- update docs and code to be aligned
Another related confusion that I had first encountering the code base is whether rank referred to the rank of the replicas or the rank of the devices within a replica group.
E.g. max_rank and primary_rank in manager.rs seems to refer to the replica_rank, whilst Manager._rank refers to the rank within a replica group.
I've tried to be consistent but yes, it is a bit confusing having two rank/replicas concepts.
Unfortunately for the code in torchx.py it's interacting with torchelastic which doesn't have the concept of replica groups so we don't have much choice other than do the translation there.
For the code in manager we should really distinguish everywhere between group_rank and replica_rank
Maybe torchft does not need a concept of replica - only replication groups consisting of nodes. Then "replica" term will be used only in torchx. This will be simpler if the same term like "replica" does not mean two different things in the same file.
As replica group is actually a collection of nodes that have the full model and process a shard of data, should it be named data group or ddp group or data shard or ddp shard?
@rualark I believe that replica_group is a commonly used term in distributed systems. It may be good to keep the terminology consistent with the literature.
@d4l3k I would also be happy to make amendments to make everything group_rank/replica_rank.
I searched through all the occurances of "rank" in the torchFT repo, and came up with the required changes. However, I am not sure whether the names in the exposed methods should change.
E.g.
manager.participating_rank() -> manager.participating_replica_rank()
or in manager's init function, we now have group_rank -> rank.
I am personally in favor of making this as explicit as possible, as a big hurdle for me when I first began understanding torchFT is to keep track of the two notions of rank.
Identified Changes
Global Changes
recover_src_rank -> recover_src_replica_rank recover_dst_ranks -> recover_src_dst_replica_rank recovering_rank -> recovering_replica_rank max_rank -> max_replica_rank
manager.py
self._rank -> self._group_rank rank -> group_rank participating_rank -> participating_replica_rank
The interface is kept the same, with Manager(..., rank=x, ...)
lib.rs:
rank should become group_rank in the comments documenting struct ManagerServer
All the rank under impl ManagerClient should become group_rank.
torchft.proto
rank -> replica_rank for ManagerQuorumRequest and CheckpointMetadataRequest rank -> group_rank for ShouldCommitRequest
CheckpointMetadataRequest { rank -> CheckpointMetadataRequest { group_rank
manager.rs
All rank (without replica_ in front) -> replica_rank, except for under should_commit, where we have group_rank.
_torchft.pyi
Change the api according to above.
data.py
rank -> group_rank replica_group -> replica_rank
Note: I did not change the interface for any class to not break backward compatibility.
Device-Mesh/Process Group/transport
There is no need to distinguish here.
FSDP Test
It is already implemented correctly.
@WarrenZhu050413 sure, replica group is also ok.
- replica_rank is the index of the replica group in the list of alive replica groups, right? then should it be "replica group live id"? (IIUC this is currently called replica_rank, but I am not sure if rank is good here as it usually refers to nodes? please correct me if I am wrong)
- recovery src rank - is this "recovery src replica group live id"?
- recovery_dst_rank - is this "recovery dst replica group live id"?
- max_rank - is this "live replica group count"?
I think as long as we clearly definite the terms within torchft + document them we should be fine even if replica_rank and group_rank aren't entirely self describing.
I think I might die of frustration if I had to type recovery_dst_replica_group_live_id repeatedly 😆
@rualark There's a separate concept of replica_id which is a unique string identifying each replica group used in quorum. The replica_rank is a mapped from the replica_ids into an ordered list of integers (ranks) + world_size so I do think the rank terminology is better than id in this case.
@WarrenZhu050413 those changes sound reasonable to me. In terms of backwards compatibility, it would be good to avoid breaking torchtitan. I would do all the changes that aren't breaking compatibility and then we can document in the doc string whether it's a replica or group rank.
Thanks @d4l3k Also true. Agree, documenting is another approach to solve this problem.