Grace Hopper 200 (GH200) install recommendations?
I have a few Grace Hopper 200s that I am trying to cluster together with k8s.
On the host I have the 560 drivers running, installed from the repos:
cat /etc/apt/sources.list.d/nvidia-container-toolkit.list
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
nvidia-smi
Sat Mar 15 16:44:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GH200 480GB Off | 00000009:01:00.0 Off | 0 |
| N/A 27C P0 73W / 900W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I can also upgrade to 570 using the NVIDIA .run installer scripts, if anyone thinks that would work better.
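As a quick host-side sanity check (a sketch only; these CLIs ship with the container toolkit), I can confirm the toolkit sees the GH200 outside of Kubernetes:
nvidia-ctk --version
nvidia-container-cli info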
Currently I have the gpu-operator installed as follows:
helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.enabled=false \
--set driver.enabled=true \
--set nfd.enabled=true \
nvidia/gpu-operator
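Before looking at individual pod logs, a basic status check (standard kubectl, nothing operator-specific):
kubectl get pods -n gpu-operator -o wide
# every operator pod should end up Running or Completed on the GH200 node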
After a bit, the logs from the gpu-feature-discovery pod start like this:
kubectl logs -n gpu-operator gpu-feature-discovery-vnvvf
I0315 15:59:46.358279 1 main.go:163] Starting OS watcher.
I0315 15:59:46.358425 1 main.go:168] Loading configuration.
I0315 15:59:46.358648 1 main.go:180]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "single",
"failOnInitError": true,
"gdsEnabled": null,
"mofedEnabled": null,
"useNodeFeatureAPI": false,
"deviceDiscoveryStrategy": "auto",
"plugin": {
"passDeviceSpecs": null,
"deviceListStrategy": null,
"deviceIDStrategy": null,
"cdiAnnotationPrefix": null,
"nvidiaCTKPath": null,
"containerDriverRoot": "/driver-root"
},
"gfd": {
"oneshot": false,
"noTimestamp": false,
"sleepInterval": "1m0s",
"outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
"machineTypeFile": "/sys/class/dmi/id/product_name"
}
},
"resources": {
"gpus": null
},
"sharing": {
"timeSlicing": {}
},
"imex": {}
}
I0315 15:59:46.370695 1 factory.go:49] Using NVML manager
I0315 15:59:46.370708 1 main.go:210] Start running
I0315 15:59:46.397097 1 main.go:274] Creating Labels
I0315 15:59:46.397110 1 output.go:82] Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I0315 15:59:46.397596 1 main.go:283] Sleeping for 60000000000
Its the "gpus": null that concerns me, and sure enough I don't seem to be able to run gpu loads?
I can try with the toolkit enabled as well:
helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.enabled=true \
--set driver.enabled=true \
--set nfd.enabled=true \
nvidia/gpu-operator
With the toolkit enabled, I get logs like this from the toolkit container:
[nvidia-ctk]
path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2025-03-15T16:56:12Z" level=info msg="Starting 'setup' for nvidia-toolkit"
time="2025-03-15T16:56:12Z" level=info msg="Using config version 2"
time="2025-03-15T16:56:12Z" level=info msg="Using CRI runtime plugin name \"io.containerd.grpc.v1.cri\""
time="2025-03-15T16:56:12Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2025-03-15T16:56:12Z" level=info msg="Sending SIGHUP signal to containerd"
time="2025-03-15T16:56:12Z" level=warning msg="Error signaling containerd, attempt 1/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:17Z" level=warning msg="Error signaling containerd, attempt 2/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:22Z" level=warning msg="Error signaling containerd, attempt 3/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:27Z" level=warning msg="Error signaling containerd, attempt 4/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:32Z" level=warning msg="Error signaling containerd, attempt 5/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:37Z" level=warning msg="Max retries reached 6/6, aborting"
time="2025-03-15T16:56:37Z" level=info msg="Shutting Down"
time="2025-03-15T16:56:37Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: unable to restart containerd: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
But I think that is because the container toolkit is already installed on the host? Or should this /runtime/sock-dir be set somewhere? I'm uncertain and wondering if there is any advice out there.
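Since the toolkit is already installed on the host, one alternative I'm considering (a sketch of the documented host-side route, keeping toolkit.enabled=false in the chart) is to point containerd at the nvidia runtime myself:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd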
➜ ~ nvidia-smi
Wed Nov 5 15:53:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GH200 480GB On | 00000009:01:00.0 Off | 0 |
| N/A 30C P0 115W / 900W | 656MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 37486 C ...env/versions/comfy/bin/python 648MiB |
+-----------------------------------------------------------------------------------------+
I have since upgraded the host to the 570 driver (output above), but I still have issues with the gpu-operator.
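Concretely, the kind of workload I'm trying to schedule is a simple one-GPU pod like the sketch below (pod name and image tag are only examples, and the image needs an arm64 build for GH200):
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs pod/cuda-vectoradd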
@joshuacox which k8s version/variant is this? Also, please run the must-gather script and collect logs so we can debug this further.
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script to see which debug data is collected.
This bundle can be submitted to us via email: [email protected]
If this is RKE2, we need to set the following env vars for the container toolkit. Please refer to our docs for this.
- name: CONTAINERD_CONFIG
value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
- name: CONTAINERD_SOCKET
value: /run/k3s/containerd/containerd.sock
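For reference, a sketch of passing those through the helm chart using the chart's toolkit.env values (paths below are the RKE2 defaults shown above; adjust to your install):
helm upgrade --install nvidiagpu nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock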