/sys/module/firmware_class/parameters/path: Read-only file system -- nvidia-vgpu-manager-daemonset

Open • davidhwua opened this issue 1 year ago • 0 comments

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: Linux 5.15.0-112-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): nvidia
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s 1.80
  • GPU Operator Version: v24.3.0
  • GPU: Tesla T4

2. Issue or feature description

When I install the vGPU manager, it fails with an error saying it cannot write the custom firmware path "/run/nvidia/driver/lib/firmware/module/firmware" to /sys/module/firmware_class/parameters/path on the GPU server. When I manually wrote this line to /run/nvidia/driver/lib/firmware/module/firmware, it passed. Can somebody help me get past this error?
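For reference, the failing step can be approximated from the xtrace output under section 4: the script reads the current value of the firmware_class `path` parameter and, if it is blank, writes the NVIDIA firmware directory into it. The sketch below mirrors that logic under stated assumptions; it is not the actual code in `/usr/local/bin/nvidia-driver`. The function name `set_fw_search_path`, its two arguments, and the behaviour when a path is already set are hypothetical. Taking the parameter file as an argument lets it run against an ordinary temporary file instead of sysfs.

```bash
# Hypothetical re-creation of the traced _set_fw_search_path step (not NVIDIA's code).
# Arguments (assumed for illustration):
#   $1  firmware directory to add, e.g. /run/nvidia/driver/lib/firmware
#   $2  parameter file, normally /sys/module/firmware_class/parameters/path
set_fw_search_path() {
  local nv_fw_search_path="$1"
  local fw_path_config_file="$2"
  local current
  # Read any non-blank content already in the parameter, as the traced grep does;
  # "|| true" keeps an empty file from counting as a failure.
  current=$(grep '[^[:space:]]' "$fw_path_config_file" || true)
  if [ -n "$current" ]; then
    # Assumption: this sketch leaves an existing search path untouched.
    echo "Firmware search path already set: $current"
    return 0
  fi
  echo "Configuring the following firmware search path in '$fw_path_config_file': $nv_fw_search_path"
  # This is the write that fails with "Read-only file system" when /sys is mounted read-only.
  # printf '%s' writes without a trailing newline, like the traced "echo -n".
  printf '%s' "$nv_fw_search_path" > "$fw_path_config_file"
}
```

For example, `set_fw_search_path /run/nvidia/driver/lib/firmware "$(mktemp)"` succeeds on a writable temporary file; pointed at `/sys/module/firmware_class/parameters/path` on a read-only `/sys`, the final write fails with a `Read-only file system` error like the one in the log.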

3. Steps to reproduce the issue

    helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator \
        --set sandboxWorkloads.enabled=true \
        --set vgpuManager.enabled=true \
        --set toolkit.enabled=true \
        --set vgpuManager.repository=registry.xxxx.com/ \
        --set vgpuManager.image=vgpu-manager \
        --set vgpuManager.version=550.54.10 \
        --set vgpuManager.imagePullSecrets={***} \
        --timeout 3600s

4. Information to attach (optional if deemed irrelevant)

Pods:

    root@k8s-gpu1:~# getpod
    NAME                                                              READY   STATUS     RESTARTS      AGE
    gpu-operator-1718359653-node-feature-discovery-gc-644fbf54zsjcm   1/1     Running    0             19m
    gpu-operator-1718359653-node-feature-discovery-master-7f5f4dg2q   1/1     Running    0             19m
    gpu-operator-1718359653-node-feature-discovery-worker-cnblz       1/1     Running    0             19m
    gpu-operator-1718359653-node-feature-discovery-worker-vhtxp       1/1     Running    0             19m
    gpu-operator-66567bfffb-k9ltp                                     1/1     Running    0             19m
    nvidia-sandbox-device-plugin-daemonset-48zr6                      0/1     Init:1/2   0             18m
    nvidia-sandbox-validator-5bbkv                                    0/1     Init:1/3   0             18m
    nvidia-vgpu-device-manager-zwjqh                                  0/1     Init:0/1   0             18m
    nvidia-vgpu-manager-daemonset-4fp89                               1/1     Running    3 (5m3s ago)  19m
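In this listing, three pods are stuck in `Init` and the vGPU manager pod has restarted three times. A small awk filter can pull those rows out of `kubectl get pods` output. The name `unhealthy_pods` is illustrative and not part of the GPU Operator; the sketch assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE) and prints the name, status, and restart count of every pod whose READY count is incomplete or whose restart count is non-zero.

```bash
# Hypothetical helper: list pods that are not fully ready or have restarted.
# Reads `kubectl get pods` style output on stdin; the first line is the header.
unhealthy_pods() {
  # $2 is READY (e.g. "0/1"), $3 is STATUS, $4 is the numeric part of RESTARTS
  # (a value like "3 (5m3s ago)" splits into "3", "(5m3s", "ago)").
  awk 'NR > 1 {
         split($2, r, "/")
         if (r[1] != r[2] || $4 + 0 > 0) print $1, $3, $4
       }'
}
```

Usage: `kubectl get pods -n gpu-operator | unhealthy_pods`.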

Error:

    + _set_fw_search_path
    + local nv_fw_search_path=/run/nvidia/driver/lib/firmware
    + local fw_path_config_file=/sys/module/firmware_class/parameters/path
    ++ grep '[^[:space:]]' /sys/module/firmware_class/parameters/path
    + [[ ! -z '' ]]
    + echo 'Configuring the following firmware search path in '\''/sys/module/firmware_class/parameters/path'\'': /run/nvidia/driver/lib/firmware'
    + echo -n /run/nvidia/driver/lib/firmware
    /usr/local/bin/nvidia-driver: line 132: /sys/module/firmware_class/parameters/path: Read-only file system
    + popd
    /usr/local/bin/nvidia-driver: line 1: popd: directory stack empty
    Configuring the following firmware search path in '/sys/module/firmware_class/parameters/path': /run/nvidia/driver/lib/firmware
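A `Read-only file system` message means the kernel rejected the write with EROFS, which most likely indicates that the `/sys` mount visible inside the vGPU manager container is read-only. Container runtimes typically mount sysfs read-only for containers that are not privileged, so one thing worth checking is how `/sys` is mounted in that container. The helper below is a hypothetical diagnostic (the name `sys_mount_mode` is made up): it reads a mounts table in `/proc/mounts` format and prints the first mount option of the last `/sys` entry, which the kernel reports as `ro` or `rw`.

```bash
# Hypothetical diagnostic: print "ro" or "rw" for the /sys mount, or "unknown".
# Defaults to /proc/mounts; pass a file to inspect a saved copy instead.
sys_mount_mode() {
  mounts_file="${1:-/proc/mounts}"
  # /proc/mounts fields: device mountpoint fstype options dump pass.
  # The last /sys entry wins, since a later mount hides earlier ones.
  awk '$2 == "/sys" { split($4, opts, ","); mode = opts[1] }
       END { print (mode == "" ? "unknown" : mode) }' "$mounts_file"
}
```

One way to use it, assuming `kubectl` access: `kubectl exec -n gpu-operator nvidia-vgpu-manager-daemonset-4fp89 -- cat /proc/mounts > mounts.txt`, then `sys_mount_mode mounts.txt`. If the pod runs more than one container, add `-c` with the vGPU manager container's name.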

Output of ./must-gather.sh:

    Saving nvidia-bug-report from k8s-gpu2 ...
    Failed to collect nvidia-bug-report from k8s-gpu2
    tar: Removing leading `/' from member names
    tar: /tmp/nvidia-bug-report.log.gz: Cannot stat: No such file or directory
    tar: Exiting with failure status due to previous errors

davidhwua • Jun 14 '24 10:06