[GPU] Multinode support for multihost_hlo_runner

Open trevor-m opened this issue 1 year ago • 2 comments

This PR adds multinode support to the multihost_hlo_runner. To use it, pass the new command-line argument --address, which sets the address of the coordinator/root node.

Example usage with SLURM:

bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- \
  --task_id=${SLURM_PROCID} \
  --num_nodes=${SLURM_NTASKS} \
  --address="${SLURM_LAUNCH_NODE_IPADDR}:12345" \
  ...
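
As a rough sketch of how the three flags above could be assembled in a launch script, the helper below derives them from the same SLURM variables and falls back to a single local process when they are unset. The function name build_runner_args, the COORDINATOR_PORT variable, and the fallback values are assumptions made for illustration and are not part of this PR; only --task_id, --num_nodes, --address, and port 12345 come from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print the multinode flags,
# one per line, derived from SLURM environment variables.
build_runner_args() {
  # Assumed fallbacks for a non-SLURM, single-process run.
  local task_id="${SLURM_PROCID:-0}"
  local num_nodes="${SLURM_NTASKS:-1}"
  local host="${SLURM_LAUNCH_NODE_IPADDR:-127.0.0.1}"
  # COORDINATOR_PORT is an assumed override; 12345 matches the example above.
  local port="${COORDINATOR_PORT:-12345}"

  printf '%s\n' \
    "--task_id=${task_id}" \
    "--num_nodes=${num_nodes}" \
    "--address=${host}:${port}"
}
```

One way to use it is to read the output into an array with `mapfile -t args < <(build_runner_args)` and pass `"${args[@]}"` after the `--` in the bazel command, followed by the remaining runner arguments. Every task must receive the same --address value, which is why the launch node's IP is used rather than each task's own hostname.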

trevor-m, May 13 '24 23:05

@PatriosTheGreat

trevor-m, May 13 '24 23:05

Looks good to me. @ezhulenev @hawkinsp, any thoughts?

cheshire, May 15 '24 16:05