intel-device-plugins-for-kubernetes

Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory

Open: moophlo opened this issue 1 year ago (1 comment)

Describe the bug

Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory

To Reproduce

Steps to reproduce the behavior: just start the pod.

Expected behavior

The file should be present.


System:

  • OS version: Linux Mint 22
  • Kernel version: 6.8.0-47-generic
  • Device plugins version: intel/intel-gpu-plugin:0.31.0

Additional context

I1015 17:50:35.074780       1 gpu_plugin_resource_manager.go:174] GPU device plugin resource manager enabled
W1015 17:50:40.075999       1 gpu_plugin_resource_manager.go:315] Failed to read pods from kubelet API: Get "https://192.168.10.15:10250/pods": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
W1015 17:50:40.082039       1 labeler.go:176] Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory
W1015 17:55:40.327845       1 labeler.go:176] Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory
W1015 17:56:17.135634       1 gpu_plugin_resource_manager.go:645] Pending POD annotations from scheduler not yet visible for pod "intel-gpu"
W1015 17:56:17.135662       1 gpu_plugin_resource_manager.go:461] retrying POD resolving after sleeping
W1015 17:56:19.431164       1 gpu_plugin_resource_manager.go:645] Pending POD annotations from scheduler not yet visible for pod "intel-gpu"
E1015 17:56:19.431252       1 gpu_plugin_resource_manager.go:469] allocation candidate not found, perhaps the GPU scheduler extender is not called, err:things didn't work out, but perhaps a retry will help
W1015 17:56:20.529522       1 gpu_plugin_resource_manager.go:645] Pending POD annotations from scheduler not yet visible for pod "intel-gpu"
W1015 17:56:20.529585       1 gpu_plugin_resource_manager.go:398] retrying POD resolving after sleeping
W1015 17:56:22.831799       1 gpu_plugin_resource_manager.go:645] Pending POD annotations from scheduler not yet visible for pod "intel-gpu"
E1015 17:56:22.831862       1 gpu_plugin_resource_manager.go:406] allocation candidate not found, perhaps the GPU scheduler extender is not called, err:things didn't work out, but perhaps a retry will help
W1015 18:00:40.329390       1 labeler.go:176] Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory
W1015 18:05:26.528397       1 gpu_plugin_resource_manager.go:645] Pending POD annotations from scheduler not yet visible for pod "intel-gpu"
W1015 18:05:26.528495       1 gpu_plugin_resource_manager.go:461] retrying POD resolving after sleeping
W1015 18:05:28.634033       1 gpu_plugin_resource_manager.go:645] Pending POD annotations from scheduler not yet visible for pod "intel-gpu"
E1015 18:05:28.724962       1 gpu_plugin_resource_manager.go:469] allocation candidate not found, perhaps the GPU scheduler extender is not called, err:things didn't work out, but perhaps a retry will help
W1015 18:05:29.229192       1 gpu_plugin_resource_manager.go:645] Pending POD annotations from scheduler not yet visible for pod "intel-gpu"
W1015 18:05:29.229208       1 gpu_plugin_resource_manager.go:398] retrying POD resolving after sleeping
W1015 18:05:30.901631       1 gpu_plugin_resource_manager.go:645] Pending POD annotations from scheduler not yet visible for pod "intel-gpu"
E1015 18:05:30.901664       1 gpu_plugin_resource_manager.go:406] allocation candidate not found, perhaps the GPU scheduler extender is not called, err:things didn't work out, but perhaps a retry will help
W1015 18:05:40.331411       1 labeler.go:176] Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory
W1015 18:10:40.332615       1 labeler.go:176] Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory
W1015 18:15:40.335025       1 labeler.go:176] Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory
W1015 18:20:40.337430       1 labeler.go:176] Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory
W1015 18:25:40.338627       1 labeler.go:176] Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory

moophlo, Oct 15 '24 18:10

Can't read file: open /sys/class/drm/card3/lmem_total_bytes: no such file or directory

Depending on which GPU hardware you have and which kernel driver it uses, this message can be expected:

  • Only discrete Intel GPUs have device-local memory.
  • That information is provided through sysfs only by the out-of-tree Intel DKMS driver (prelim uAPI), not by the GPU kernel driver in the upstream kernel. See:
    • KMD types: https://dgpu-docs.intel.com/driver/kernel-driver-types.html
    • Upstream uAPI: https://docs.kernel.org/gpu/driver-uapi.html#c.drm_i915_query_memory_regions

eero-t, Oct 16 '24 11:10

Memory information is relevant only with GAS (GPU Aware Scheduling) resource management, which is deprecated now that Kubernetes officially supports DRA (Dynamic Resource Allocation) drivers; see https://github.com/intel/intel-device-plugins-for-kubernetes/issues/1948.

eero-t, Jul 16 '25 09:07