determined
chore: FOUNDENG-55 For deepspeed, set the NCCL_SOCKET_IFNAME env variable based on dtrain_network_interface
Description
There are 3 issues being addressed in this PR.
The first issue is that the "LD_LIBRARY_PATH" variable was not in the INCLUDE list of environment variables passed to deepspeed. As a result, the CUDA libraries could not be found and the experiment failed with the following error:
[rank=0] RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
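To illustrate the include-list idea, here is a minimal sketch of filtering the launcher's environment down to an allowlist before handing it to the workers. The names `INCLUDE`, `select_env`, and `render_env_file` are hypothetical and are not taken from "deepspeed.py". The `KEY=VALUE` line format is an assumption modeled on the `.deepspeed_env` file that DeepSpeed's launcher can read, and the PR itself does not show how the variables are actually forwarded.

```python
import os
from typing import Dict, Iterable, Mapping, Optional

# Hypothetical include list; the real list lives in the launcher code.
INCLUDE = ("PATH", "LD_LIBRARY_PATH", "NCCL_SOCKET_IFNAME")


def select_env(
    include: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return only the variables named in `include` that are actually set."""
    source = os.environ if environ is None else environ
    return {name: source[name] for name in include if name in source}


def render_env_file(env: Mapping[str, str]) -> str:
    """Render the selected variables as one KEY=VALUE entry per line."""
    return "".join(f"{key}={value}\n" for key, value in env.items())
```

If "LD_LIBRARY_PATH" is missing from the include list, it is silently dropped by a filter like this, which matches the failure described above.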
The second issue is that the "/tmp/hostfile.txt" is being written to a shared "/tmp" directory, so all users write to the same file. If the file already exists and is owned by another user, it will fail to be created by the current user. It is also a security issue to write to well-known file names in a world-writable directory, such as "/tmp", because it's vulnerable to a symlink attack.
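One way to avoid the shared, well-known file name is to let Python's `tempfile.mkstemp` pick the path. The sketch below is illustrative only: `write_hostfile` is a hypothetical helper, not the function in this PR, and the `host slots=N` line format is the DeepSpeed hostfile convention.

```python
import os
import tempfile
from typing import Sequence


def write_hostfile(hosts: Sequence[str], slots_per_host: int) -> str:
    """Write a DeepSpeed-style hostfile to a private temporary file.

    mkstemp creates the file exclusively (it never follows a pre-existing
    symlink), with an unpredictable name and owner-only permissions.
    """
    fd, path = tempfile.mkstemp(prefix="hostfile-", suffix=".txt")
    with os.fdopen(fd, "w") as f:
        for host in hosts:
            f.write(f"{host} slots={slots_per_host}\n")
    return path
```

The caller is responsible for passing the returned path to the launcher and for deleting the file afterwards.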
The third issue is that the NCCL library was picking the wrong interface. The details are as follows:
On the "horizon" cluster, which does not have Infiniband and instead uses the Gemini Fabric (IPoGIF) module, the Nvidia NCCL code picks the wrong interface. It picks "rsip", which has a "172.x.x.x" address, instead of "ipogif0", which has a "10.x.x.x" address.
nid00153:~ # ifconfig -a
ipogif0: flags=193<UP,RUNNING,NOARP> mtu 65520
        inet 10.128.0.154 netmask 255.252.0.0
        ether 00:01:01:00:00:99 txqueuelen 1000 (Ethernet)
        RX packets 225854 bytes 1042470410 (994.1 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 231281 bytes 1038038963 (989.9 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        loop txqueuelen 1000 (Local Loopback)
        RX packets 539451 bytes 29386310 (28.0 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 539451 bytes 29386310 (28.0 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
rsip: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1520
        inet 172.30.48.181 netmask 255.255.255.255 destination 172.30.48.181
        tunnel txqueuelen 1000 (IPIP Tunnel)
        RX packets 18378 bytes 20327676 (19.3 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 16270 bytes 1448965 (1.3 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
tunl0: flags=128<NOARP> mtu 1480
        tunnel txqueuelen 1000 (IPIP Tunnel)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
As a result, the experiment fails because the workers cannot communicate with the chief over the "172.x.x.x" address.
corujor@horizon:~> grep 181 /home/users/launcher/.capsules/dispatcher/environments/corujor/f56e0a24780b4d3e-95d673d2c3e54aad/ai_exp-761-trial-738-output.log
[rank=1] nid00154:32841:32841 [0] NCCL INFO NET/Socket : Using [0]ipogif0:10.128.0.155<0> [1]rsip:172.30.48.181<0>
[rank=0] nid00153:3063:3063 [0] NCCL INFO NET/Socket : Using [0]ipogif0:10.128.0.154<0> [1]rsip:172.30.48.181<0>
[rank=1] nid00154:32841:32964 [0] include/socket.h:409 NCCL WARN Net : Connect to 172.30.48.181<46127> failed : Connection refused
[rank=0] nid00153:3063:3189 [0] include/socket.h:409 NCCL WARN Net : Connect to 172.30.48.181<37161> failed : Connection refused
The user does have the ability to set the distributed training network interface in the "master.yaml".
task_container_defaults:
  # shm_size_bytes: 4294967296
  # network_mode: bridge
  dtrain_network_interface: ipogif0
  # nccl_port_range: <MIN:MAX>
  # gloo_port_range: <MIN:MAX>
However, unlike Horovod, which gets passed the distributed training network interface as a parameter (see below), "deepspeed.py" does not.
hvd_cmd = horovod.create_run_command(
    num_proc_per_machine=len(info.slot_ids),
    ip_addresses=info.container_addrs,
    inter_node_network_interface=info.trial._inter_node_network_interface,
    optimizations=experiment_config["optimizations"],
    debug=debug,
    optional_args=hvd_optional_args,
)
We had originally modified "deepspeed.py" in the "dispatcher" branch to automatically pick the network interface used to reach the chief's IP address and to set NCCL_SOCKET_IFNAME to that interface (see Pull Request https://github.com/determined-ai/determined-ee/pull/227).
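For context, autodetection of this kind is commonly done by asking the kernel which local address it would use to reach a given peer. The sketch below shows that general technique only; it is not the code from the pull request linked above. `local_ip_toward` and `interface_for_ip` are hypothetical names, and the interface-to-address mapping is assumed to be supplied by the caller (for example, parsed from `ifconfig -a` output).

```python
import socket
from typing import Mapping, Optional


def local_ip_toward(peer_ip: str, port: int = 9) -> str:
    """Return the local address the kernel would route through to reach peer_ip.

    Connecting a UDP socket sends no packets; it only resolves the route.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((peer_ip, port))
        return s.getsockname()[0]


def interface_for_ip(
    local_ip: str, addrs_by_interface: Mapping[str, str]
) -> Optional[str]:
    """Look up which interface owns local_ip in a caller-supplied mapping."""
    for name, addr in addrs_by_interface.items():
        if addr == local_ip:
            return name
    return None
```

With the addresses shown earlier, a local address of 10.128.0.154 would map to "ipogif0".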
However, Bradley found that the Nvidia NCCL code favors Infiniband.
https://github.com/NVIDIA/nccl/blob/2dfd83752cc17e7962fb2842c44bfcc0216c8b40/src/misc/socket.cc#L289-L302
// Try to automatically pick the right one
// Start with IB
nIfs = findInterfaces("ib", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
// else see if we can get some hint from COMM ID
if (nIfs == 0) {
Therefore, we should probably not try to autodetect it ourselves, even though on the "horizon" cluster, which does not have Infiniband, the NCCL code picks the wrong interface.
Instead, let's address the issue where the "dtrain_network_interface", which can be specified by the user in the "master.yaml", is not being used by deepspeed to set the NCCL_SOCKET_IFNAME.
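A minimal sketch of that idea follows. It is not the actual change to "deepspeed.py": `with_nccl_interface` is a hypothetical helper, and letting an NCCL_SOCKET_IFNAME value that the user has already set take precedence over "dtrain_network_interface" is an assumption made here, not something this description specifies.

```python
from typing import Dict, Mapping, Optional


def with_nccl_interface(
    env: Mapping[str, str], dtrain_network_interface: Optional[str]
) -> Dict[str, str]:
    """Return a copy of env with NCCL_SOCKET_IFNAME derived from the master config.

    Assumption: an NCCL_SOCKET_IFNAME already present in env (for example,
    set in the experiment's YAML file) is left untouched.
    """
    result = dict(env)
    if dtrain_network_interface and "NCCL_SOCKET_IFNAME" not in result:
        result["NCCL_SOCKET_IFNAME"] = dtrain_network_interface
    return result
```

When neither value is set, the environment is returned unchanged and NCCL falls back to its own interface selection.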
Test Plan
Tested on the "horizon" cluster:
- Not setting dtrain_network_interface or the NCCL_SOCKET_IFNAME environment variable results in the NCCL code picking the wrong network interface, so the test fails.
- Setting dtrain_network_interface in "master.yaml" to the correct network interface results in the experiment running successfully.
- Not setting dtrain_network_interface in "master.yaml", but setting the NCCL_SOCKET_IFNAME environment variable in the experiment's YAML file to the correct network interface, results in the experiment running successfully.
Commentary (optional)
Checklist
- [ ] User-facing API changes need the "User-facing API Change" label.
- [ ] Release notes should be added as a separate file under docs/release-notes/. See Release Note for details.
- [ ] Licenses should be included for new code which was copied and/or modified from any external code.
@rcorujo can we close this one?
Closing. Was committed via a different PR https://github.com/determined-ai/determined/pull/4297