Currently in different places of code and docs these things are named differently:

replica groups are sometimes called replicas (and replica_id sometimes means replica_group_id, replica_rank means replica_group_rank)
replicas sometimes mean workers, but sometimes they mean replica groups
etc.

Examples of the terminology problem: https://github.com/pytorch/torchft/blob/abc660b6a60b015dbff2a591508d1b4932ab20fa/torchft/torchx.py#L64 https://github.com/pytorch/torchft/blob/abc660b6a60b015dbff2a591508d1b4932ab20fa/torchft/torchx.py#L69 https://github.com/pytorch/torchft/blob/abc660b6a60b015dbff2a591508d1b4932ab20fa/src/manager.rs#L391

What can be done:

define unified terminology for the key objects
update docs and code to be aligned

Apr 30 '25 11:04 rualark

Another related confusion that I had first encountering the code base is whether rank referred to the rank of the replicas or the rank of the devices within a replica group.

E.g. max_rank and primary_rank in manager.rs seems to refer to the replica_rank, whilst Manager._rank refers to the rank within a replica group.

Apr 30 '25 18:04 WarrenZhu050413

I've tried to be consistent but yes, it is a bit confusing having two rank/replicas concepts.

Unfortunately for the code in torchx.py it's interacting with torchelastic which doesn't have the concept of replica groups so we don't have much choice other than do the translation there.

For the code in manager we should really distinguish everywhere between group_rank and replica_rank

Apr 30 '25 21:04 d4l3k

Maybe torchft does not need a concept of replica - only replication groups consisting of nodes. Then "replica" term will be used only in torchx. This will be simpler if the same term like "replica" does not mean two different things in the same file.

May 01 '25 00:05 rualark

As replica group is actually a collection of nodes that have the full model and process a shard of data, should it be named data group or ddp group or data shard or ddp shard?

May 01 '25 00:05 rualark

@rualark I believe that replica_group is a commonly used term in distributed systems. It may be good to keep the terminology consistent with the literature.

May 01 '25 05:05 WarrenZhu050413

@d4l3k I would also be happy to make amendments to make everything group_rank/replica_rank.

I searched through all the occurances of "rank" in the torchFT repo, and came up with the required changes. However, I am not sure whether the names in the exposed methods should change.

E.g.

manager.participating_rank() -> manager.participating_replica_rank()

or in manager's init function, we now have group_rank -> rank.

I am personally in favor of making this as explicit as possible, as a big hurdle for me when I first began understanding torchFT is to keep track of the two notions of rank.

Identified Changes

Global Changes

recover_src_rank -> recover_src_replica_rank recover_dst_ranks -> recover_src_dst_replica_rank recovering_rank -> recovering_replica_rank max_rank -> max_replica_rank

manager.py

self._rank -> self._group_rank rank -> group_rank participating_rank -> participating_replica_rank

The interface is kept the same, with Manager(..., rank=x, ...)

lib.rs:

rank should become group_rank in the comments documenting struct ManagerServer All the rank under impl ManagerClient should become group_rank.

torchft.proto

rank -> replica_rank for ManagerQuorumRequest and CheckpointMetadataRequest rank -> group_rank for ShouldCommitRequest

CheckpointMetadataRequest { rank -> CheckpointMetadataRequest { group_rank

manager.rs

All rank (without replica_ in front) -> replica_rank, except for under should_commit, where we have group_rank.

_torchft.pyi

Change the api according to above.

data.py

rank -> group_rank replica_group -> replica_rank

Note: I did not change the interface for any class to not break backward compatibility.

Device-Mesh/Process Group/transport

There is no need to distinguish here.

FSDP Test

It is already implemented correctly.

May 01 '25 05:05 WarrenZhu050413

@WarrenZhu050413 sure, replica group is also ok.

May 01 '25 12:05 rualark

replica_rank is the index of the replica group in the list of alive replica groups, right? then should it be "replica group live id"? (IIUC this is currently called replica_rank, but I am not sure if rank is good here as it usually refers to nodes? please correct me if I am wrong)
recovery src rank - is this "recovery src replica group live id"?
recovery_dst_rank - is this "recovery dst replica group live id"?
max_rank - is this "live replica group count"?

May 01 '25 12:05 rualark

I think as long as we clearly definite the terms within torchft + document them we should be fine even if replica_rank and group_rank aren't entirely self describing.

I think I might die of frustration if I had to type recovery_dst_replica_group_live_id repeatedly 😆

@rualark There's a separate concept of replica_id which is a unique string identifying each replica group used in quorum. The replica_rank is a mapped from the replica_ids into an ordered list of integers (ranks) + world_size so I do think the rank terminology is better than id in this case.

@WarrenZhu050413 those changes sound reasonable to me. In terms of backwards compatibility, it would be good to avoid breaking torchtitan. I would do all the changes that aren't breaking compatibility and then we can document in the doc string whether it's a replica or group rank.

May 01 '25 17:05 d4l3k

Thanks @d4l3k Also true. Agree, documenting is another approach to solve this problem.

May 01 '25 20:05 rualark

torchft
torchft copied to clipboard

Unify replica terminology across the project

Identified Changes

Global Changes

manager.py

lib.rs:

torchft.proto

manager.rs

_torchft.pyi

data.py

Device-Mesh/Process Group/transport

FSDP Test

torchft torchft copied to clipboard

Unify replica terminology across the project

Identified Changes

Global Changes

manager.py

lib.rs:

torchft.proto

manager.rs

_torchft.pyi

data.py

Device-Mesh/Process Group/transport

FSDP Test

torchft
torchft copied to clipboard