
Add support for IPv6

Open johnsushant opened this issue 7 months ago • 9 comments

Fixes #150

I've added IPv6 support to Hostengine and the dcgmi CLI without changing or breaking any existing functionality.

Hostengine now supports binding to an IPv6 address

Start hostengine:

$ nv-hostengine -b [::1] --log-level debug
Started host engine version 3.3.6 using port number: 5555

Confirm using lsof:

$ sudo lsof -i :5555
COMMAND       PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
nv-hosten 2004760 root   47u  IPv6 3165553609      0t0  TCP localhost:personal-agent (LISTEN)

Connect using dcgmi without port:

$ dcgmi discovery -l --host [::1]
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA PG509-210                                               |
|        | PCI Bus ID: 00000000:04:00.0                                         |
|        | Device UUID: GPU-de6a7a6a-776e-e6e2-bd3d-d8114ccf6db2                |

Connect using dcgmi with port:

$ dcgmi discovery -l --host [::1]:5555
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA PG509-210                                               |
|        | PCI Bus ID: 00000000:04:00.0                                         |
|        | Device UUID: GPU-de6a7a6a-776e-e6e2-bd3d-d8114ccf6db2                |
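The two `--host` forms above, `[::1]` and `[::1]:5555`, follow the usual convention of bracketing an IPv6 literal so that its colons are not mistaken for the port separator. The following is a minimal Python sketch of that parsing rule, not the code from this PR. The names `split_host_port` and `_parse_port` are hypothetical, and the fallback to port 5555 is an assumption based on the port that nv-hostengine reports in the output above.

```python
DEFAULT_PORT = 5555  # assumed default, taken from the nv-hostengine output above


def _parse_port(text):
    """Convert a port string to an int, rejecting anything outside 1..65535."""
    if not text.isdecimal():
        raise ValueError(f"invalid port: {text!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def split_host_port(value, default_port=DEFAULT_PORT):
    """Split '[v6addr]', '[v6addr]:port', 'host' or 'host:port' into (host, port)."""
    if value.startswith("["):
        # Bracketed IPv6 literal: the address ends at the closing ']'.
        end = value.find("]")
        if end == -1:
            raise ValueError(f"missing ']' in {value!r}")
        host, rest = value[1:end], value[end + 1:]
        if not rest:
            return host, default_port
        if not rest.startswith(":"):
            raise ValueError(f"unexpected text after ']' in {value!r}")
        return host, _parse_port(rest[1:])

    # Unbracketed form: a single ':' (if any) separates host and port.
    host, sep, port = value.partition(":")
    if ":" in port:
        # More than one ':' means an IPv6 literal without brackets.
        raise ValueError(f"IPv6 addresses must be bracketed: {value!r}")
    return host, (_parse_port(port) if sep else default_port)


if __name__ == "__main__":
    for example in ("[::1]", "[::1]:5555", "127.0.0.1", "localhost:6000"):
        print(example, "->", split_host_port(example))
```

Rejecting an unbracketed address such as `::1:5555` is a design choice in this sketch: without brackets there is no reliable way to tell where the address ends and the port begins.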

Hostengine now supports both IPv4 and IPv6 connections

Start hostengine:

$ nv-hostengine -b ALL --log-level debug
Started host engine version 3.3.6 using port number: 5555

Confirm using lsof:

$ sudo lsof -i :5555
COMMAND       PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
nv-hosten 2478925 root   47u  IPv6 3173352315      0t0  TCP *:personal-agent (LISTEN)

Connect using dcgmi on IPv4:

$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA PG509-210                                               |
|        | PCI Bus ID: 00000000:04:00.0                                         |
|        | Device UUID: GPU-de6a7a6a-776e-e6e2-bd3d-d8114ccf6db2                |

Connect using dcgmi on IPv6:

$ dcgmi discovery -l --host [::1]
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA PG509-210                                               |
|        | PCI Bus ID: 00000000:04:00.0                                         |
|        | Device UUID: GPU-de6a7a6a-776e-e6e2-bd3d-d8114ccf6db2                |
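The lsof output for `-b ALL` shows a single IPv6 socket listening on `*`, yet both IPv4 and IPv6 clients connect. That is consistent with a dual-stack listener, that is, an IPv6 socket bound to `::` with `IPV6_V6ONLY` cleared. This is an inference from the output, not something the PR states. The sketch below illustrates the general technique in Python and is not the Hostengine implementation. The function names `open_dual_stack_listener` and `can_connect` are hypothetical.

```python
import socket


def open_dual_stack_listener(port=0):
    """Return a socket listening on '::' for IPv4 and IPv6, or None if unsupported."""
    if not socket.has_dualstack_ipv6():
        return None
    try:
        # dualstack_ipv6=True makes create_server() clear IPV6_V6ONLY,
        # so IPv4 clients are accepted as IPv4-mapped IPv6 addresses.
        return socket.create_server(
            ("::", port), family=socket.AF_INET6, dualstack_ipv6=True
        )
    except OSError:
        # For example, IPv6 is disabled on this machine.
        return None


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    server = open_dual_stack_listener()  # port 0 lets the OS pick a free port
    if server is None:
        print("dual-stack sockets are not available on this system")
    else:
        port = server.getsockname()[1]
        print("IPv4 reachable:", can_connect("127.0.0.1", port))
        print("IPv6 reachable:", can_connect("::1", port))
        server.close()
```

Python's `socket.create_server` requires version 3.8 or later. On platforms where one socket cannot serve both families, the usual alternative is to open two separate listeners, one per address family.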

Hostengine default IPv4 functionality is not broken

Start hostengine:

$ nv-hostengine --log-level debug
Started host engine version 3.3.6 using port number: 5555

Confirm using lsof:

$ sudo lsof -i :5555
COMMAND       PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
nv-hosten 2476116 root   47u  IPv4 3173251779      0t0  TCP localhost:personal-agent (LISTEN)

Connect using dcgmi on IPv4:

$ dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA PG509-210                                               |
|        | PCI Bus ID: 00000000:04:00.0                                         |
|        | Device UUID: GPU-de6a7a6a-776e-e6e2-bd3d-d8114ccf6db2                |

Connect using dcgmi on IPv6 (expected failure):

$ dcgmi discovery -l --host [::1]
Error: unable to establish a connection to the specified host: [::1]
Error: Unable to connect to host engine. Host engine connection invalid/disconnected.

johnsushant, Jul 14 '24 20:07