
Grace Hopper 200 (GH200) install recommendations?

joshuacox opened this issue 9 months ago

I have a few Grace Hopper 200s that I am trying to cluster up using k8s.

On the host I have the 560 drivers running from the repos:

cat /etc/apt/sources.list.d/nvidia-container-toolkit.list 
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
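
For completeness, the host-side driver and container toolkit package versions can be checked with something like this (just a generic sanity check, nothing operator-specific):

dpkg -l | grep -E 'nvidia-driver|nvidia-container-toolkit'
nvidia-ctk --version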

nvidia-smi
Sat Mar 15 16:44:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             Off |   00000009:01:00.0 Off |                    0 |
| N/A   27C    P0             73W /  900W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I can also upgrade to 570 using the NVIDIA .run installer, if anyone thinks that would work better.

Currently I have the gpu-operator installed like this:

helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.enabled=false \
--set driver.enabled=true \
--set nfd.enabled=true \
nvidia/gpu-operator
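
After the install, a quick way to confirm the operator components actually rolled out (assuming kubectl is pointed at this cluster) is:

kubectl get pods -n gpu-operator
kubectl get daemonsets -n gpu-operator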

After a bit, the logs from the gpu-feature-discovery container start like this:

kubectl logs -n gpu-operator gpu-feature-discovery-vnvvf
I0315 15:59:46.358279       1 main.go:163] Starting OS watcher.
I0315 15:59:46.358425       1 main.go:168] Loading configuration.
I0315 15:59:46.358648       1 main.go:180] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "useNodeFeatureAPI": false,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": null,
      "deviceListStrategy": null,
      "deviceIDStrategy": null,
      "cdiAnnotationPrefix": null,
      "nvidiaCTKPath": null,
      "containerDriverRoot": "/driver-root"
    },
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0315 15:59:46.370695       1 factory.go:49] Using NVML manager
I0315 15:59:46.370708       1 main.go:210] Start running
I0315 15:59:46.397097       1 main.go:274] Creating Labels
I0315 15:59:46.397110       1 output.go:82] Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I0315 15:59:46.397596       1 main.go:283] Sleeping for 60000000000

It's the "gpus": null that concerns me, and sure enough I don't seem to be able to run GPU workloads.
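
To double-check whether the node is actually advertising GPUs to the scheduler, something like the following should work (with <node-name> being the GH200 node):

kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'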

I can try with the toolkit enabled as well:

helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.enabled=true \
--set driver.enabled=true \
--set nfd.enabled=true \
nvidia/gpu-operator

With that, I get logs like this from the toolkit container:

[nvidia-ctk]
  path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2025-03-15T16:56:12Z" level=info msg="Starting 'setup' for nvidia-toolkit"
time="2025-03-15T16:56:12Z" level=info msg="Using config version 2"
time="2025-03-15T16:56:12Z" level=info msg="Using CRI runtime plugin name \"io.containerd.grpc.v1.cri\""
time="2025-03-15T16:56:12Z" level=info msg="Flushing config to /runtime/config-dir/config.toml"
time="2025-03-15T16:56:12Z" level=info msg="Sending SIGHUP signal to containerd"
time="2025-03-15T16:56:12Z" level=warning msg="Error signaling containerd, attempt 1/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:17Z" level=warning msg="Error signaling containerd, attempt 2/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:22Z" level=warning msg="Error signaling containerd, attempt 3/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:27Z" level=warning msg="Error signaling containerd, attempt 4/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:32Z" level=warning msg="Error signaling containerd, attempt 5/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-03-15T16:56:37Z" level=warning msg="Max retries reached 6/6, aborting"
time="2025-03-15T16:56:37Z" level=info msg="Shutting Down"
time="2025-03-15T16:56:37Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: unable to restart containerd: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"

But I think that is because the container toolkit is already installed on the host? Or should this /runtime/sock-dir path be set somewhere? I'm uncertain and wondering if there is any advice out there.
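
One thing I can check is which containerd socket path actually exists on the host, since the toolkit container is clearly dialing one that isn't there (these are just the common defaults for stock containerd and for k3s/RKE2):

ls -l /run/containerd/containerd.sock
ls -l /run/k3s/containerd/containerd.sock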

joshuacox avatar Mar 15 '25 16:03 joshuacox

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 04 '25 22:11 github-actions[bot]

➜  ~ nvidia-smi
Wed Nov  5 15:53:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   30C    P0            115W /  900W |     656MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           37486      C   ...env/versions/comfy/bin/python        648MiB |
+-----------------------------------------------------------------------------------------+

But I still have issues with the gpu-operator.

joshuacox avatar Nov 05 '25 15:11 joshuacox

@joshuacox which k8s version/variant is this? Also, please run the must-gather script and collect logs so that we can debug this further.
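
For the version/variant, the output of something like the following is enough:

kubectl version
kubectl get nodes -o wide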

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script to see which debug data is collected.

This bundle can be submitted to us via email: [email protected]

If this is RKE2, we need to set the following env vars for the container toolkit. Please refer to our docs for this.

              - name: CONTAINERD_CONFIG
                value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
              - name: CONTAINERD_SOCKET
                value: /run/k3s/containerd/containerd.sock
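
For reference, when installing via helm these can also be passed as chart values, along the lines of (the paths here are the RKE2 defaults shown above):

--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock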

shivamerla avatar Nov 17 '25 18:11 shivamerla