carlwang87
I used the Dockerfile from https://gitlab.com/nvidia/container-images/driver/-/tree/master/rhel8 to build the driver image with the following command: `docker build -t hsc/driver:510.47.03-rhel8.4 --build-arg CUDA_VERSION=11.6.0 --build-arg TARGETARCH=x86_64 --build-arg DRIVER_VERSION=510.47.03 --no-cache .` The image...
> @KodieGlosserIBM @relyt0925 We will publish driver images with RHEL8 tags during our September release. When will driver images with RHEL8 tags be published? Thanks.
@shivamerla I built the RHEL8 driver image with the code from https://gitlab.com/nvidia/container-images/driver/-/tree/master/rhel8. Then I deployed the driver image in a k3s cluster; the environment is air-gapped. Pod logs: ...
Images list
```
"nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0",
"nvcr.io/nvidia/gpu-operator:v22.9.0",
"nvcr.io/nvidia/cuda:11.7.1-base-ubi8",
"nvcr.io/nvidia/driver:515.65.01-rhel8.4",
"nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.4.2",
"nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubi8",
"nvcr.io/nvidia/k8s-device-plugin:v0.12.3-ubi8",
"nvcr.io/nvidia/cloud-native/dcgm:3.0.4-1-ubi8",
"nvcr.io/nvidia/k8s/dcgm-exporter:3.0.4-3.0.0-ubi8",
"nvcr.io/nvidia/gpu-feature-discovery:v0.6.2-ubi8",
"nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.5.0-ubi8",
"nvcr.io/nvidia/cloud-native/vgpu-device-manager:v0.2.0",
"nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.1",
"k8s.gcr.io/nfd/node-feature-discovery:v0.10.1"
```
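For reference, this is roughly how the images get pre-loaded on the air-gapped k3s node (a sketch; the tarball name and the `gpu-operator` image are just examples, the same steps apply to every image in the list above):

```
# On a machine with internet access: pull each image and save it to a tarball
docker pull nvcr.io/nvidia/gpu-operator:v22.9.0
docker save nvcr.io/nvidia/gpu-operator:v22.9.0 -o gpu-operator-v22.9.0.tar

# Copy the tarball to the air-gapped node, then import it into the k3s containerd
k3s ctr images import gpu-operator-v22.9.0.tar
```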
Following the toolkit configuration from the documentation, it failed as well. `cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml`
```
[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
[plugins.cri.containerd.runtimes."nvidia-experimental"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia-experimental".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
```
The...
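For context, with the `nvidia` runtime registered in that config but not set as the default, a pod can only pick it up through a RuntimeClass along these lines (a minimal sketch; the object name is arbitrary):

```
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the runtime name registered in config.toml
```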
@shivamerla Without a pre-installed NVIDIA Container Toolkit or GPU driver, I followed the gpu-operator (v22.9.0) installation guide on k3s (v1.24.3+k3s1) and deployed the GPU Operator successfully, but when I ran the samples from https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#running-sample-gpu-applications, it...
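The sample in question is roughly the vectoradd pod from that page (quoted from memory, so take the exact image tag as approximate):

```
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
```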
@shivamerla I set it through values.yaml as below:
```
toolkit:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.11.0-ubi8
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - ...
```
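For completeness, that values file is applied with the usual helm install (the release name and namespace here are just the ones I happened to use, shown as an example):

```
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  -f values.yaml
```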
@shivamerla First of all, thank you for still helping me.

> With `CONTAINERD_SET_AS_DEFAULT` enabled, we set `default_runtime_name=nvidia` in `/var/lib/rancher/k3s/agent/etc/containerd/config.toml`

Is `default_runtime_name=nvidia` in `/var/lib/rancher/k3s/agent/etc/containerd/config.toml` set manually or by the GPU Operator? I...
@shivamerla Logs from pod `nvidia-container-toolkit-daemonset-cwp28`:

env:

logs:
```
time="2022-10-20T12:36:25Z" level=info msg="Starting nvidia-toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Parsing arguments"
time="2022-10-20T12:36:25Z" level=info msg="Verifying Flags"
time="2022-10-20T12:36:25Z" level=info msg=Initializing
time="2022-10-20T12:36:25Z" level=info msg="Installing toolkit"
time="2022-10-20T12:36:25Z" level=info...
```
@shivamerla I have set up a k3s cluster environment with a GPU. In this cluster I can reproduce the issue: there is no `default_runtime_name=nvidia` in `/var/lib/rancher/k3s/agent/etc/containerd/config.toml`.

K3s version: `v1.25.3+k3s1`
GPU Operator: `v22.9.0`

So,...
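For reference, what I would expect the toolkit container to write when `CONTAINERD_SET_AS_DEFAULT` takes effect is something along these lines in that config.toml (a sketch based on the plugin section names shown earlier, not something I observed on this cluster):

```
[plugins.cri.containerd]
  default_runtime_name = "nvidia"
```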