[BUG] Collective group's rank is incorrect
Background
Alpa initializes a collective group for each cross-mesh communication pair. The call stack to initialize a collective group is:

1. `create_collective_group` or `init_collective_group` in `collective.py`, which calls
2. `GroupManager.create_collective_group` in `collective.py`, which calls
3. `NCCLGroup.__init__`, which has two different implementations: one based on cupy and the other based on xla.
An `NCCLGroup` creates and manages NCCL communicators for each GPU on its node. When we need to call an NCCL function, the call eventually goes through the `NCCLGroup`. However, the current implementation uses `node_rank * num_devices_per_node + local_offset` to compute the rank of a local GPU within the communication group. An example is here. This is correct in most cases, but it is incorrect when the send mesh has a different number of devices per node than the receive mesh.
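To make the failure mode concrete, here is a minimal sketch of the rank formula described above. The function names and the specific mesh shapes (a 1-node x 4-GPU send mesh followed by a 2-node x 2-GPU receive mesh) are illustrative assumptions, not Alpa's actual API:

```python
def compute_rank_buggy(node_rank, num_devices_per_node, local_offset):
    # Current formula: implicitly assumes every node in the group
    # hosts the same number of devices.
    return node_rank * num_devices_per_node + local_offset

def compute_rank_fixed(start_gpu_rank, local_offset):
    # Proposed fix: each node is told the group rank of its first
    # local GPU, so heterogeneous meshes cannot collide.
    return start_gpu_rank + local_offset

# Hypothetical group: send mesh = 1 node x 4 GPUs (group ranks 0-3),
# receive mesh = 2 nodes x 2 GPUs (group ranks 4-7).
# The receive mesh's first node is node_rank 1 in the group and has
# 2 devices per node, so the buggy formula yields rank 2, which
# collides with a sender GPU; the correct rank is 4.
assert compute_rank_buggy(1, 2, 0) == 2   # wrong: collides with the send mesh
assert compute_rank_fixed(4, 0) == 4      # correct under the proposed fix
```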
TODO
- [ ] Fix the bug above by adding a `start_gpu_rank` argument at the initialization of `NCCLGroup`.
- [ ] Add tests for collective communication among meshes. For a unit test on cross-mesh communication, please refer to this file.
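One possible shape for the first TODO item is sketched below. This is a simplified stand-in for the real `NCCLGroup` in `collective.py` (the actual class creates NCCL communicators, which this sketch omits); the method name `local_rank_to_group_rank` is an assumption for illustration:

```python
class NCCLGroup:
    """Simplified sketch; the real class manages NCCL communicators."""

    def __init__(self, world_size, device_list, start_gpu_rank=0):
        # start_gpu_rank: group rank of this node's first local GPU,
        # supplied by the caller instead of being derived from
        # node_rank * num_devices_per_node.
        self.world_size = world_size
        self.device_list = device_list
        self.start_gpu_rank = start_gpu_rank

    def local_rank_to_group_rank(self, local_offset):
        # local_offset indexes into this node's device_list.
        return self.start_gpu_rank + local_offset

# Example: a receive-mesh node whose two GPUs occupy group ranks 4 and 5
# in an 8-rank group, regardless of how many GPUs the send mesh has per node.
g = NCCLGroup(world_size=8, device_list=[0, 1], start_gpu_rank=4)
assert [g.local_rank_to_group_rank(i) for i in range(2)] == [4, 5]
```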
cc @jiaodong
Good day, I am currently working on this bug.
@AhmedMAlbreiki Please submit a PR so we can help review, thanks!
Hello, I'm helping out with this issue and have some questions about it.
Currently the rank is computed in `_get_nccl_collective_communicator`, where it is set like so: `actual_rank = self.rank * len(device_list) + i`. Is this the issue in question, where it needs to be replaced by `start_gpu_rank = something magical`?
Still trying to fully understand the issue, thanks.