xgboost_ray
Specifying network interface
Hi xgboost_ray authors,
I wonder whether it is possible to specify the network interface used by xgboost_ray/xgboost. I am currently running xgboost_benchmark.py on a shared testbed (https://www.cloudlab.us/) where each machine has one public network interface and one internal network interface. However, xgboost_ray/xgboost automatically chooses the public interface, which has much lower bandwidth than the internal one.
This is what my machine reports from ifconfig; I would like to use interface ens1f1 instead of eno49. Is there any way to achieve that? Thanks in advance!
eno49: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 128.110.218.133 netmask 255.255.248.0 broadcast 128.110.223.255
inet6 fe80::9af2:b3ff:fecc:350 prefixlen 64 scopeid 0x20<link>
ether 98:f2:b3:cc:03:50 txqueuelen 1000 (Ethernet)
RX packets 466641549 bytes 669837883826 (669.8 GB)
RX errors 0 dropped 1 overruns 0 frame 0
TX packets 460255655 bytes 662032874073 (662.0 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.6.1 netmask 255.255.255.0 broadcast 192.168.6.255
inet6 fe80::9edc:71ff:fe49:a8c1 prefixlen 64 scopeid 0x20<link>
ether 9c:dc:71:49:a8:c1 txqueuelen 1000 (Ethernet)
RX packets 668893189 bytes 967152831349 (967.1 GB)
RX errors 0 dropped 8357 overruns 0 frame 0
TX packets 578906674 bytes 820686873410 (820.6 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 76349340 bytes 1636796444574 (1.6 TB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 76349340 bytes 1636796444574 (1.6 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Best, Yang
If I understand correctly, you are creating a cluster manually (by calling ray start on each node). Is that right?
Yes. I use ray start to spawn Ray processes on multiple machines:
# on head node (eg, node-0 on cloudlab)
ulimit -n 65536; ray start --head --port=6379 --node-ip-address=192.168.6.1
# on worker nodes (eg, node-1 on cloudlab)
ulimit -n 65536; ray start --address='192.168.6.1:6379' --node-ip-address=192.168.6.2
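As a side note, the internal IP passed to --node-ip-address could also be resolved programmatically instead of being hardcoded per node. A minimal sketch (the 192.168.6.255 probe address is just an example address on this cluster's internal subnet; no packets are actually sent):

```python
import socket

def internal_ip(probe_addr="192.168.6.255"):
    """Return the source IP the kernel would use to reach probe_addr."""
    # A UDP connect() sends no packets; it only asks the routing table
    # which local address would be used for probe_addr, i.e. the address
    # bound to the interface that serves the internal subnet.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((probe_addr, 9))
        return s.getsockname()[0]
```

On node-0 this would return 192.168.6.1, which can then be passed to ray start --node-ip-address.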
I also changed Ray's get_node_ip_address() function to always return the IP address of the internal network interface, so that Ray actors/tasks and the object store communicate through the internal network interface.
But it seems xgboost_ray/xgboost uses its own collective communication framework that would automatically choose the public network interface.
Best, Yang
Hmm, xgboost_ray should use the IP returned by get_node_ip_address. Can you run python -c "import ray; import ray.util; ray.init(); print(ray.util.get_node_ip_address())" and see what is returned on the nodes?
It returns "192.168.6.1" on node-0, and "192.168.6.2" on node-1, etc.
How do you detect that xgboost chooses the wrong interface? One place where extra logging can be added is xgboost_ray.main._start_rabit_tracker (print out the host IP).
I watch ifconfig to see the traffic statistics going through the different interfaces, and I see significant traffic (e.g., ~0.5 GB/s) going through eno49. I also got a warning from CloudLab that I was using too much public-facing network bandwidth.
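Rather than eyeballing repeated ifconfig output, the per-interface byte counters that Linux exposes under /sys/class/net can be sampled directly. A small sketch (interface names are examples from this setup):

```python
import time
from pathlib import Path

def iface_bytes(iface):
    """Read the kernel's cumulative (rx_bytes, tx_bytes) for an interface."""
    stats = Path("/sys/class/net") / iface / "statistics"
    return (int((stats / "rx_bytes").read_text()),
            int((stats / "tx_bytes").read_text()))

def throughput(iface, interval=1.0):
    """Sample the counters twice and return (rx, tx) in bytes per second."""
    rx0, tx0 = iface_bytes(iface)
    time.sleep(interval)
    rx1, tx1 = iface_bytes(iface)
    return (rx1 - rx0) / interval, (tx1 - tx0) / interval
```

Running throughput("eno49") while training would make the ~0.5 GB/s on the public interface directly measurable.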
I am not an expert in Rabit, but it looks like you may need to set up OS-level routing for it to work, or disable the other network interface. I'd also consider opening an issue in the xgboost repository.
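For diagnosing the OS-level routing mentioned above, ip route get shows which interface and source address the kernel would pick for a given peer (192.168.6.2 and ens1f1 are example values from this setup):

```shell
# Which interface/source IP would traffic to node-1's internal address use?
# (May report "Network is unreachable" on hosts without a matching route.)
ip route get 192.168.6.2 || true

# If the internal subnet is not already routed via ens1f1, a route like
# this (run as root) would pin it; normally, configuring the interface
# with an address on that subnet installs an equivalent route already:
#   ip route add 192.168.6.0/24 dev ens1f1
```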