
[BUG] Collective group's rank is incorrect

Open ZYHowell opened this issue 2 years ago • 4 comments

Background

Alpa initializes a collective group for each cross-mesh communication pair. The call stack for initializing a collective group is:

  • create_collective_group or init_collective_group in collective.py, which calls
  • GroupManager.create_collective_group in collective.py, which calls
  • NCCLGroup.__init__, which has two different implementations: one based on cupy, the other based on xla.

An NCCLGroup creates and manages NCCL communicators for each GPU on its node. When we need to call an NCCL function, the call eventually goes through the NCCLGroup. However, in the current implementation, we compute the rank of a local GPU w.r.t. the communication group as node_rank * num_devices_per_node + local_offset. An example is here. This is correct in most cases, but it is wrong when the send mesh has a different number of devices per node than the receive mesh.
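To illustrate the failure mode, here is a minimal sketch (mesh sizes and helper names are illustrative, not taken from Alpa's code): a collective group spanning a 4-GPU send node and a 2-GPU receive node has world size 6, but the per-node formula assigns the receive node's GPUs ranks that collide with the send node's.

```python
def buggy_rank(node_rank, num_devices_per_node, local_offset):
    # Current formula: assumes every node in the group has the same
    # number of devices, which breaks for heterogeneous meshes.
    return node_rank * num_devices_per_node + local_offset

def fixed_rank(start_gpu_rank, local_offset):
    # Proposed fix: each node records the first global rank it owns
    # (start_gpu_rank) and offsets its local devices from there.
    return start_gpu_rank + local_offset

# Send mesh: node 0 with 4 GPUs. Receive mesh: node 1 with 2 GPUs.
# The combined group has world size 6, so ranks should be 0..5.
send_ranks = [buggy_rank(0, 4, i) for i in range(4)]  # [0, 1, 2, 3]
recv_ranks = [buggy_rank(1, 2, i) for i in range(2)]  # [2, 3] -- collides with send_ranks!

recv_fixed = [fixed_rank(4, i) for i in range(2)]     # [4, 5] -- correct
```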

TODO

  • [ ] Fix the bug above by passing a start_gpu_rank to NCCLGroup at initialization.
  • [ ] Add tests for collective communication among meshes. For a unit test on cross-mesh communication, please refer to this file.
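A hedged sketch of what the first TODO item might look like: NCCLGroup accepts a start_gpu_rank so per-device ranks no longer depend on an assumed-uniform devices-per-node count. The class and argument names mirror the issue text, but the constructor signature and device_rank helper are simplified assumptions, not Alpa's actual API.

```python
class NCCLGroup:
    """Simplified stand-in for Alpa's NCCLGroup (cupy/xla details omitted)."""

    def __init__(self, world_size, device_list, start_gpu_rank):
        self.world_size = world_size
        self.device_list = device_list
        # First global rank owned by this node within the collective group.
        self.start_gpu_rank = start_gpu_rank

    def device_rank(self, i):
        # Rank of the i-th local GPU w.r.t. the whole communication group.
        return self.start_gpu_rank + i

# Example: a receive node holding 2 GPUs whose ranks start at 4 in a
# 6-rank group (send node owns ranks 0..3).
group = NCCLGroup(world_size=6, device_list=[0, 1], start_gpu_rank=4)
ranks = [group.device_rank(i) for i in range(len(group.device_list))]  # [4, 5]
```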

ZYHowell avatar Dec 01 '22 02:12 ZYHowell

cc @jiaodong

ZYHowell avatar Dec 01 '22 02:12 ZYHowell

Good day,

I am currently working on this bug.

AhmedMAlbreiki avatar Feb 07 '23 09:02 AhmedMAlbreiki

@AhmedMAlbreiki Please submit a PR so we can help review, thanks!

zhisbug avatar Feb 07 '23 19:02 zhisbug

Hello, I'm helping out with this issue and I have some questions.

Currently the rank is computed in _get_nccl_collective_communicator, where it is set as actual_rank = self.rank * len(device_list) + i. Is this the issue in question, where it needs to be replaced by something based on start_gpu_rank?

I'm still trying to fully understand the issue, thanks.

AhmedRAlmansoori avatar Feb 27 '23 07:02 AhmedRAlmansoori