xgboost_ray

Specifying network interface

Open YangZhou1997 opened this issue 2 years ago • 7 comments

Hi Xgboost_ray authors,

I am wondering whether it is possible to specify the network interface used by xgboost_ray/xgboost. I am currently running xgboost_benchmark.py on a shared testbed (https://www.cloudlab.us/) where each machine has one public network interface and one internal network interface. However, xgboost_ray/xgboost automatically chooses the public network interface, which has much lower bandwidth than the internal one.

This is what ifconfig reports on my machine. I would like to use interface ens1f1 instead of eno49. Is there any way to achieve that? Thanks in advance!

eno49: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 128.110.218.133  netmask 255.255.248.0  broadcast 128.110.223.255
        inet6 fe80::9af2:b3ff:fecc:350  prefixlen 64  scopeid 0x20<link>
        ether 98:f2:b3:cc:03:50  txqueuelen 1000  (Ethernet)
        RX packets 466641549  bytes 669837883826 (669.8 GB)
        RX errors 0  dropped 1  overruns 0  frame 0
        TX packets 460255655  bytes 662032874073 (662.0 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.6.1  netmask 255.255.255.0  broadcast 192.168.6.255
        inet6 fe80::9edc:71ff:fe49:a8c1  prefixlen 64  scopeid 0x20<link>
        ether 9c:dc:71:49:a8:c1  txqueuelen 1000  (Ethernet)
        RX packets 668893189  bytes 967152831349 (967.1 GB)
        RX errors 0  dropped 8357  overruns 0  frame 0
        TX packets 578906674  bytes 820686873410 (820.6 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 76349340  bytes 1636796444574 (1.6 TB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 76349340  bytes 1636796444574 (1.6 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Best, Yang

YangZhou1997, Jan 23 '23 18:01

If I understand correctly, you are creating a cluster manually (by calling ray start on each node). Is that right?

Yard1, Jan 24 '23 22:01

Yes. I use ray start to spawn Ray processes on multiple machines:

# on head node (e.g., node-0 on CloudLab)
ulimit -n 65536; ray start --head --port=6379 --node-ip-address=192.168.6.1

# on worker nodes (e.g., node-1 on CloudLab)
ulimit -n 65536; ray start --address='192.168.6.1:6379' --node-ip-address=192.168.6.2

I also changed Ray's get_node_ip_address() function to always return the IP address of the internal network interface, so that Ray actor/task/object store traffic goes through the internal network interface.

But it seems that xgboost_ray/xgboost uses its own collective communication framework, which automatically chooses the public network interface.

Best, Yang

YangZhou1997, Jan 24 '23 23:01

Hmm, xgboost_ray should use the IP returned by get_node_ip_address. Can you run python -c "import ray; import ray.util; ray.init(); print(ray.util.get_node_ip_address())" and see what it returns on each node?

Yard1, Jan 25 '23 04:01

It returns "192.168.6.1" on node-0, and "192.168.6.2" on node-1, etc.

YangZhou1997, Jan 25 '23 04:01

How do you detect that xgboost chooses the wrong interface? One place where extra logging can be added is xgboost_ray.main._start_rabit_tracker (print out the host IP).

Yard1, Jan 25 '23 04:01
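Editor's sketch: one quick way to see which local address the OS would pick for traffic toward a given peer, independent of Ray or Rabit. It uses only the Python standard library. The helper names (local_ip_toward, is_internal) and the subnet constant are hypothetical; the subnet is assumed from the ens1f1 netmask in the ifconfig output above. This only reports what the kernel's routing table selects for a destination, not which address the Rabit tracker actually binds or advertises.

```python
import ipaddress
import socket

# Assumption: the internal subnet, taken from the ens1f1 entry in the
# ifconfig output (192.168.6.1 with netmask 255.255.255.0).
INTERNAL_SUBNET = "192.168.6.0/24"


def local_ip_toward(peer_ip: str, port: int = 9) -> str:
    """Return the local source address the OS would use to reach peer_ip.

    Connecting a UDP socket sends no packets; it only asks the kernel to
    choose a route, and therefore a source address, for that destination.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((peer_ip, port))
        return sock.getsockname()[0]


def is_internal(ip: str, subnet: str = INTERNAL_SUBNET) -> bool:
    """Check whether ip belongs to the (assumed) internal subnet."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(subnet)
```

On node-0, for example, is_internal(local_ip_toward("192.168.6.2")) would be expected to be true if traffic to node-1's internal address leaves through ens1f1. If it is false, the routing table is a likely place to look.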

I watch ifconfig to see the traffic statistics for each interface, and I see significant traffic (e.g., ~0.5 GB/s) going through eno49. I also got a warning from CloudLab that I am using too much public-facing network bandwidth.

YangZhou1997, Jan 25 '23 04:01
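Editor's sketch: the manual ifconfig comparison described above can be scripted by reading the per-interface byte counters that Linux exposes in /proc/net/dev. The function names below are hypothetical, and the parsing assumes the usual Linux layout: two header lines, then one line per interface with 8 receive fields followed by 8 transmit fields.

```python
from typing import Dict, Tuple

Counters = Dict[str, Tuple[int, int]]  # interface -> (rx_bytes, tx_bytes)


def parse_net_dev(text: str) -> Counters:
    """Parse the contents of /proc/net/dev into per-interface byte counters."""
    counters: Counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def byte_deltas(before: Counters, after: Counters) -> Counters:
    """Bytes received and sent per interface between two snapshots."""
    return {
        name: (after[name][0] - before[name][0],
               after[name][1] - before[name][1])
        for name in after
        if name in before
    }


def read_net_dev(path: str = "/proc/net/dev") -> Counters:
    """Take a snapshot of the current counters (Linux only)."""
    with open(path) as f:
        return parse_net_dev(f.read())
```

Taking one read_net_dev() snapshot before a training run and one after, then passing both to byte_deltas, shows how many bytes went through eno49 versus ens1f1 during the run.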

I am not an expert in Rabit, but it looks like you may need to set up OS-level routing for it to work, or disable the other network interface. I'd also consider opening an issue in the xgboost repository.

Yard1, Jan 25 '23 04:01