
[NVIDIA GPU] Use the CUDA runtime API to determine if 2 ranks are on the same host

Open Tixxx opened this issue 1 year ago • 1 comments

The current logic in the NCCL clique sets is_local to true by comparing the number of local participants with the total number of devices in the clique. This is used to determine whether a replica group is a local group, but that doesn't always translate to a local communicator. E.g., on an 8-GPU machine, suppose we have a collective permute group with replica_group={{0,1},{1,0},{9,10},{10,9}}.

From XLA's perspective, this is not a local replica group, but the communicators we create are only for ranks (0,1) and (10,9), both of which are local communicators from NCCL's perspective. This PR uses the CUDA runtime API to get the number of devices on a host and uses the current rank ID to determine whether source and targets are located on the same host in the collective permute thunk.

Tixxx avatar Aug 01 '24 18:08 Tixxx