determined chore: FOUNDENG-55 For deepspeed, set the NCCL_SOCKET_IFNAME env variable based on dtrain_network_interface

Description

This PR addresses three issues.

The first issue is that "LD_LIBRARY_PATH" was not in the INCLUDE list of environment variables passed to deepspeed. Because the CUDA libraries could not be found, the experiment failed with the following error:

[rank=0] RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
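
A minimal sketch of the include-list idea, assuming the launcher forwards only listed variables to deepspeed by writing them to an environment file. The names `INCLUDE`, `select_env`, and `write_env_file` are hypothetical and not taken from "deepspeed.py", and the one-per-line KEY=VALUE layout is an assumption as well.

```python
from typing import Dict, Iterable, Mapping

# Hypothetical include list; the real list in "deepspeed.py" may differ.
INCLUDE = ("PATH", "LD_LIBRARY_PATH", "NCCL_SOCKET_IFNAME")


def select_env(environ: Mapping[str, str], include: Iterable[str] = INCLUDE) -> Dict[str, str]:
    """Return only the variables that are on the include list and actually set."""
    return {name: environ[name] for name in include if name in environ}


def write_env_file(path: str, environ: Mapping[str, str]) -> None:
    """Write the selected variables as KEY=VALUE lines, one per line."""
    with open(path, "w") as f:
        for name, value in select_env(environ).items():
            f.write(f"{name}={value}\n")
```

With a list like this, a variable that is missing from `INCLUDE` is silently dropped, which matches the symptom described: the CUDA libraries exist on the host but are not visible to the launched processes.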

The second issue is that "/tmp/hostfile.txt" is written to the shared "/tmp" directory, so all users write to the same file. If the file already exists and is owned by another user, the current user cannot create it. Writing to a well-known file name in a world-writable directory such as "/tmp" is also a security issue, because it is vulnerable to a symlink attack.
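
One way to avoid both problems is to let the standard library pick a unique, private file name. The sketch below is an illustration under that assumption, not necessarily what this PR does; `write_hostfile` and its parameters are hypothetical names. The "address slots=N" line layout is the DeepSpeed hostfile format.

```python
import os
import tempfile
from typing import Sequence


def write_hostfile(addrs: Sequence[str], slots_per_host: int) -> str:
    """Write a DeepSpeed-style hostfile to a unique, private temp file.

    Returns the path so the caller can pass it on and delete it afterwards.
    """
    # mkstemp creates the file exclusively with owner-only permissions, so it
    # does not follow a pre-planted symlink or collide with another user's file.
    fd, path = tempfile.mkstemp(prefix="hostfile-", suffix=".txt")
    with os.fdopen(fd, "w") as f:
        for addr in addrs:
            f.write(f"{addr} slots={slots_per_host}\n")
    return path
```

The caller is responsible for removing the file once the launcher no longer needs it.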

The third issue is that the NCCL library was picking the wrong interface. The details are as follows:

On the "horizon" cluster, which does not have InfiniBand (it has the Gemini Fabric (IPoGIF) module), the NVIDIA NCCL code picks the wrong interface: "rsip", which has a "172.x.x.x" address, instead of "ipogif0", which has a "10.x.x.x" address.

nid00153:~ # ifconfig -a
ipogif0: flags=193<UP,RUNNING,NOARP>  mtu 65520
        inet 10.128.0.154  netmask 255.252.0.0
        ether 00:01:01:00:00:99  txqueuelen 1000  (Ethernet)
        RX packets 225854  bytes 1042470410 (994.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 231281  bytes 1038038963 (989.9 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 539451  bytes 29386310 (28.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 539451  bytes 29386310 (28.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

rsip: flags=209<UP,POINTOPOINT,RUNNING,NOARP>  mtu 1520
        inet 172.30.48.181  netmask 255.255.255.255  destination 172.30.48.181
        tunnel   txqueuelen 1000  (IPIP Tunnel)
        RX packets 18378  bytes 20327676 (19.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 16270  bytes 1448965 (1.3 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

tunl0: flags=128<NOARP>  mtu 1480
        tunnel   txqueuelen 1000  (IPIP Tunnel)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

As a result, the experiment fails because the workers cannot communicate with the chief over the "172.x.x.x" address.

corujor@horizon:~> grep 181 /home/users/launcher/.capsules/dispatcher/environments/corujor/f56e0a24780b4d3e-95d673d2c3e54aad/ai_exp-761-trial-738-output.log
[rank=1] nid00154:32841:32841 [0] NCCL INFO NET/Socket : Using [0]ipogif0:10.128.0.155<0> [1]rsip:172.30.48.181<0>
[rank=0] nid00153:3063:3063 [0] NCCL INFO NET/Socket : Using [0]ipogif0:10.128.0.154<0> [1]rsip:172.30.48.181<0>
[rank=1] nid00154:32841:32964 [0] include/socket.h:409 NCCL WARN Net : Connect to 172.30.48.181<46127> failed : Connection refused
[rank=0] nid00153:3063:3189 [0] include/socket.h:409 NCCL WARN Net : Connect to 172.30.48.181<37161> failed : Connection refused

The user can set the distributed training network interface in "master.yaml":

        task_container_defaults:
          #  shm_size_bytes: 4294967296
          #  network_mode: bridge
          dtrain_network_interface: ipogif0
          #  nccl_port_range: <MIN:MAX>
          #  gloo_port_range: <MIN:MAX>

However, unlike Horovod, which gets passed the distributed training network interface as a parameter (see below), "deepspeed.py" does not.

hvd_cmd = horovod.create_run_command(
        num_proc_per_machine=len(info.slot_ids),
        ip_addresses=info.container_addrs,
        inter_node_network_interface=info.trial._inter_node_network_interface,
        optimizations=experiment_config["optimizations"],
        debug=debug,
        optional_args=hvd_optional_args,
    )

We had originally modified "deepspeed.py" in the "dispatcher" branch to automatically pick the network interface used to connect to the chief's IP address and to set NCCL_SOCKET_IFNAME to that interface (see Pull Request https://github.com/determined-ai/determined-ee/pull/227).
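
For illustration only, here is a rough sketch of what such an autodetection could look like. It is a simplification that matches the chief's IP against each interface's subnet rather than consulting the routing table, and it is not the code from the linked pull request; `pick_interface` and the `interfaces` mapping are hypothetical.

```python
import ipaddress
from typing import Mapping, Optional


def pick_interface(chief_ip: str, interfaces: Mapping[str, str]) -> Optional[str]:
    """Return the first interface whose subnet contains chief_ip, else None.

    `interfaces` maps an interface name to "address/netmask".
    """
    chief = ipaddress.ip_address(chief_ip)
    for name, cidr in interfaces.items():
        if chief in ipaddress.ip_interface(cidr).network:
            return name
    return None


# Addresses taken from the ifconfig output for nid00153 above.
interfaces = {
    "ipogif0": "10.128.0.154/255.252.0.0",
    "lo": "127.0.0.1/255.0.0.0",
    "rsip": "172.30.48.181/255.255.255.255",
}
```

Under this sketch, a chief at 10.128.0.154 or 10.128.0.155 falls inside the "ipogif0" subnet, so that interface would be selected rather than "rsip".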

However, Bradley found that the Nvidia NCCL code favors Infiniband.

https://github.com/NVIDIA/nccl/blob/2dfd83752cc17e7962fb2842c44bfcc0216c8b40/src/misc/socket.cc#L289-L302

    // Try to automatically pick the right one
    // Start with IB
    nIfs = findInterfaces("ib", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
    // else see if we can get some hint from COMM ID
    if (nIfs == 0) {

Therefore, we should probably not try to autodetect the interface ourselves, even though on the "horizon" cluster, which does not have InfiniBand, the NCCL code picks the wrong interface.

Instead, let's address the issue where the "dtrain_network_interface", which can be specified by the user in the "master.yaml", is not being used by deepspeed to set the NCCL_SOCKET_IFNAME.
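
A minimal sketch of that idea, assuming the configured interface is available to the launcher as a string or None. The function name `with_nccl_ifname` and its parameter are hypothetical, and the choice to let an already-set NCCL_SOCKET_IFNAME take precedence is an assumption, not something confirmed by this PR.

```python
from typing import Dict, Mapping, Optional


def with_nccl_ifname(
    environ: Mapping[str, str], dtrain_network_interface: Optional[str]
) -> Dict[str, str]:
    """Return a copy of environ with NCCL_SOCKET_IFNAME derived from the config."""
    env = dict(environ)
    # Assumption: a value the user already exported wins over the master config.
    if dtrain_network_interface and "NCCL_SOCKET_IFNAME" not in env:
        env["NCCL_SOCKET_IFNAME"] = dtrain_network_interface
    return env
```

When neither value is set, the environment is left untouched and NCCL falls back to its own interface selection.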

Test Plan

Tested on the "horizon" cluster:

Not setting dtrain_network_interface or the NCCL_SOCKET_IFNAME environment variable results in the NCCL code picking the wrong network interface, so the test fails.
Setting dtrain_network_interface in "master.yaml" to the correct network interface results in the experiment running successfully.
Not setting dtrain_network_interface in "master.yaml", but setting the NCCL_SOCKET_IFNAME environment variable in the experiment's YAML file to the correct network interface, results in the experiment running successfully.

Commentary (optional)

Checklist

[ ] User-facing API changes need the "User-facing API Change" label.
[ ] Release notes should be added as a separate file under docs/release-notes/. See Release Note for details.
[ ] Licenses should be included for new code which was copied and/or modified from any external code.

Jun 09 '22 20:06 rcorujo

Deploy Preview for determined-ui canceled.

Name	Link
Latest commit	479c21dab37f5a21fe3c0d8bf8fc057532972ac2
Latest deploy log	https://app.netlify.com/sites/determined-ui/deploys/62a258db3d7c8e0009af3b20

Jun 09 '22 20:06 netlify[bot]

@rcorujo can we close this one?

Dec 05 '22 18:12 stoksc

Closing. Was committed via a different PR https://github.com/determined-ai/determined/pull/4297

Jan 18 '23 20:01 rcorujo
