xla
[GPU] Multinode support for multihost_hlo_runner
This PR adds multinode support to the multihost_hlo_runner. To use it, pass the new command-line argument --address, which gives the address of the coordinator/root node.
Example usage with SLURM:
bazel run //xla/tools/multihost_hlo_runner:hlo_runner_main -- \
--task_id=${SLURM_PROCID} \
--num_nodes=${SLURM_NTASKS} \
--address="${SLURM_LAUNCH_NODE_IPADDR}:12345" \
...
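The mapping from SLURM variables to runner flags in the example above can be sketched as a small helper. This is only an illustration: the function name build_runner_flags and its optional port parameter are hypothetical and not part of this PR, the default port 12345 is simply the one used in the example, and only the three flag names are taken from it.

```shell
# Hypothetical helper (not part of this PR): prints one runner flag per line,
# derived from the SLURM environment variables used in the example above.
# Optional first argument: coordinator port (defaults to 12345, as in the example).
build_runner_flags() {
  # Abort with a message if a required variable is unset or empty,
  # which usually means the script was not launched under SLURM.
  : "${SLURM_PROCID:?SLURM_PROCID is not set}"
  : "${SLURM_NTASKS:?SLURM_NTASKS is not set}"
  : "${SLURM_LAUNCH_NODE_IPADDR:?SLURM_LAUNCH_NODE_IPADDR is not set}"

  printf '%s\n' \
    "--task_id=${SLURM_PROCID}" \
    "--num_nodes=${SLURM_NTASKS}" \
    "--address=${SLURM_LAUNCH_NODE_IPADDR}:${1:-12345}"
}
```

Every task computes the same --address value, since SLURM_LAUNCH_NODE_IPADDR is identical across tasks, while --task_id differs per task. The printed flags would go after the `--` separator of the bazel run command, alongside whatever other arguments the run needs.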
@PatriosTheGreat
Looks good to me. @ezhulenev @hawkinsp, any thoughts?