intel-device-plugins-for-kubernetes
GPU resources aren't made available after updating to the newest intel-basekit packages
Describe the bug
After I updated to the newest intel-basekit packages available on Debian (2025.0.1-45), it is currently not possible for me to schedule pods to GPUs, because of: Allocate failed due to requested number of devices unavailable for gpu.intel.com/i915. Requested: 1, Available: 0, which is unexpected
System (please complete the following information):
- OS version: Debian 12.8
- Kernel version: Linux 6.8.12-5-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-5 (2024-12-03T10:26Z) x86_64 GNU/Linux
- Device plugins version: intel/intel-gpu-plugin:0.31.1
- Hardware info:
- CPU: Intel(R) Core(TM) i7-4790
- GPU: Intel Arc A770
Additional context: The node itself shows capacity and allocatable numbers for gpu.intel.com/i915 in its status, as I configured sharedDevNum (the CR settings are sketched after the log output below). The intel-gpu-plugin pod also sets them. Here is log output with log level 5:
I1223 03:27:06.949113 1 gpu_plugin.go:799] GPU device plugin started with none preferred allocation policy
I1223 03:27:06.949917 1 gpu_plugin_resource_manager.go:174] GPU device plugin resource manager enabled
I1223 03:27:06.950005 1 gpu_plugin_resource_manager.go:311] Requesting pods from kubelet (https://192.168.178.118:10250/pods)
W1223 03:27:06.959157 1 gpu_plugin_resource_manager.go:315] Failed to read pods from kubelet API: Get "https://192.168.178.118:10250/pods": tls: failed to verify certificate: x509: certificate signed by unknown authority
I1223 03:27:06.959191 1 gpu_plugin_resource_manager.go:180] Not using Kubelet API
I1223 03:27:06.959250 1 gpu_plugin.go:835] NFD feature file location: /etc/kubernetes/node-feature-discovery/features.d/intel-gpu-resources.txt
I1223 03:27:06.959282 1 gpu_plugin.go:518] GPU (i915/xe) resource share count = 120
I1223 03:27:06.959450 1 gpu_plugin.go:565] Not compatible device: card0-DP-1
I1223 03:27:06.959463 1 gpu_plugin.go:565] Not compatible device: card0-DP-2
I1223 03:27:06.959471 1 gpu_plugin.go:565] Not compatible device: card0-DP-3
I1223 03:27:06.959478 1 gpu_plugin.go:565] Not compatible device: card0-HDMI-A-2
I1223 03:27:06.959485 1 gpu_plugin.go:565] Not compatible device: card0-HDMI-A-3
I1223 03:27:06.959491 1 gpu_plugin.go:565] Not compatible device: card0-HDMI-A-4
I1223 03:27:06.959498 1 gpu_plugin.go:565] Not compatible device: card0-HDMI-A-5
I1223 03:27:06.959564 1 gpu_plugin.go:565] Not compatible device: card1-HDMI-A-1
I1223 03:27:06.959574 1 gpu_plugin.go:565] Not compatible device: card1-VGA-1
I1223 03:27:06.959579 1 gpu_plugin.go:565] Not compatible device: renderD128
I1223 03:27:06.959584 1 gpu_plugin.go:565] Not compatible device: renderD129
I1223 03:27:06.959583 1 labeler.go:480] Starting GPU labeler
I1223 03:27:06.959591 1 gpu_plugin.go:565] Not compatible device: version
I1223 03:27:06.959724 1 labeler.go:219] tile files found:[/sys/class/drm/card0/gt/gt0]
I1223 03:27:06.959792 1 gpu_plugin.go:636] Adding /dev/dri/card0 to GPU card0
I1223 03:27:06.959804 1 gpu_plugin.go:636] Adding /dev/dri/renderD129 to GPU card0
I1223 03:27:06.960346 1 gpu_plugin.go:726] For i915_monitoring/all, adding nodes: [{ContainerPath:/dev/dri/card0 HostPath:/dev/dri/card0 Permissions:rw XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} {ContainerPath:/dev/dri/renderD129 HostPath:/dev/dri/renderD129 Permissions:rw XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}]
I1223 03:27:06.960483 1 labeler.go:219] tile files found:[/sys/class/drm/card1/gt/gt0]
I1223 03:27:06.960532 1 gpu_plugin.go:636] Adding /dev/dri/card1 to GPU card1
I1223 03:27:06.960546 1 gpu_plugin.go:636] Adding /dev/dri/renderD128 to GPU card1
I1223 03:27:06.960977 1 gpu_plugin.go:726] For i915_monitoring/all, adding nodes: [{ContainerPath:/dev/dri/card1 HostPath:/dev/dri/card1 Permissions:rw XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} {ContainerPath:/dev/dri/renderD128 HostPath:/dev/dri/renderD128 Permissions:rw XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}]
I1223 03:27:07.047963 1 gpu_plugin.go:540] GPU scan update: 0->240 'i915' resources found
I1223 03:27:07.047976 1 gpu_plugin.go:540] GPU scan update: 0->1 'i915_monitoring' resources found
I1223 03:27:07.047993 1 labeler.go:495] Ext resources scanning
I1223 03:27:07.048130 1 labeler.go:122] Not compatible devicecard0-DP-1
I1223 03:27:07.048140 1 labeler.go:122] Not compatible devicecard0-DP-2
I1223 03:27:07.048146 1 labeler.go:122] Not compatible devicecard0-DP-3
I1223 03:27:07.048153 1 labeler.go:122] Not compatible devicecard0-HDMI-A-2
I1223 03:27:07.048159 1 labeler.go:122] Not compatible devicecard0-HDMI-A-3
I1223 03:27:07.048165 1 labeler.go:122] Not compatible devicecard0-HDMI-A-4
I1223 03:27:07.048171 1 labeler.go:122] Not compatible devicecard0-HDMI-A-5
I1223 03:27:07.048250 1 labeler.go:122] Not compatible devicecard1-HDMI-A-1
I1223 03:27:07.048259 1 labeler.go:122] Not compatible devicecard1-VGA-1
I1223 03:27:07.048264 1 labeler.go:122] Not compatible devicerenderD128
I1223 03:27:07.048270 1 labeler.go:122] Not compatible devicerenderD129
I1223 03:27:07.048276 1 labeler.go:122] Not compatible deviceversion
I1223 03:27:07.048442 1 labeler.go:219] tile files found:[/sys/class/drm/card0/gt/gt0]
W1223 03:27:07.048482 1 labeler.go:176] Can't read file: open /sys/class/drm/card0/lmem_total_bytes: no such file or directory
I1223 03:27:07.048693 1 labeler.go:219] tile files found:[/sys/class/drm/card1/gt/gt0]
W1223 03:27:07.048728 1 labeler.go:176] Can't read file: open /sys/class/drm/card1/lmem_total_bytes: no such file or directory
I1223 03:27:07.048797 1 labeler.go:505] Writing labels
I1223 03:27:07.048013 1 manager.go:115] Received dev updates:{map[i .. (shortened due to maximum characters) ... ]}
I1223 03:27:08.148626 1 server.go:285] Start server for i915 at: /var/lib/kubelet/device-plugins/gpu.intel.com-i915.sock
I1223 03:27:08.148633 1 server.go:285] Start server for i915_monitoring at: /var/lib/kubelet/device-plugins/gpu.intel.com-i915_monitoring.sock
I1223 03:27:08.159803 1 server.go:128] Started ListAndWatch fori915
I1223 03:27:08.159822 1 server.go:117] Sending to kubelet[]
I1223 03:27:08.159909 1 server.go:303] Device plugin for i915 registered
I1223 03:27:08.160059 1 server.go:117] Sending to kubelet[&Device{ID:card1-33,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-46,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-62,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-72,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-51,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-116,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-41,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-63,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-115,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-61,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-31,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-75,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-33,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-39,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-73,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-108,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-77,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-101,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-19,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-29,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-22,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-71,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-106,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-40,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-62,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-80,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-83,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-17,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-96,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-110,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-30,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-111,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-113,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-14,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-59,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-78,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-95,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-44,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-109,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} 
&Device{ID:card0-100,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-81,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-26,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-68,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-119,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-10,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-24,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-42,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-105,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-46,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-53,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-117,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-10,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-51,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-114,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-30,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-54,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-111,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-49,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-72,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-49,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-119,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-98,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-13,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-99,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-84,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-89,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-91,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-16,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-84,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-89,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-45,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-64,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-27,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-96,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-117,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-17,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-94,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-103,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-95,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-45,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} 
&Device{ID:card1-54,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-55,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-70,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-43,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-58,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-78,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-91,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-32,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-57,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-61,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-74,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-109,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-22,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-56,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-43,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-56,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-25,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-19,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-21,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-40,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-35,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-93,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-69,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-38,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-71,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-79,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-85,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-15,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-87,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-101,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-23,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-47,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-53,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-65,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-118,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-12,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-31,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-76,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-98,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-65,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-42,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} 
&Device{ID:card0-50,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-20,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-67,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-108,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-118,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-8,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-81,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-9,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-74,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-93,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-8,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-59,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-63,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-82,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-104,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-14,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-24,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-48,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-67,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-64,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-90,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-102,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-104,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-50,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-102,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-38,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-94,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-52,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-37,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-86,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-69,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-87,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-66,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-75,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-107,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-83,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-28,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-36,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-11,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} 
&Device{ID:card0-57,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-82,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-58,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-60,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-100,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-32,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-86,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-88,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-79,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-26,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-44,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-52,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-97,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-97,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-16,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-80,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-73,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-9,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-23,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-55,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-116,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-34,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-112,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-92,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-68,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-48,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-66,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-70,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-13,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-47,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-18,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-112,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-12,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-37,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-115,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-18,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-34,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-90,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-21,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-113,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} 
&Device{ID:card1-103,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-85,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-36,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-29,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-35,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-76,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-105,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-107,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-41,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-106,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-15,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-28,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-20,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-114,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-25,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-60,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-110,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-77,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-88,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-11,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card1-39,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-27,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-92,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},} &Device{ID:card0-99,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},}]
I1223 03:27:08.248821 1 server.go:303] Device plugin for i915_monitoring registered
I1223 03:27:08.248987 1 server.go:128] Started ListAndWatch fori915_monitoring
I1223 03:27:08.248997 1 server.go:117] Sending to kubelet[]
I1223 03:27:08.249032 1 server.go:117] Sending to kubelet[&Device{ID:all,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{},},}]
I1223 03:27:11.949496 1 gpu_plugin.go:565] Not compatible device: card0-DP-1
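For reference, sharedDevNum is set in my cluster through the operator's GpuDevicePlugin CR; below is a rough sketch of the relevant spec (illustrative values, not my exact manifest):

apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
  name: gpudeviceplugin
spec:
  image: intel/intel-gpu-plugin:0.31.1
  sharedDevNum: 120        # number of pods that may share one GPU device
  logLevel: 5
  resourceManager: true
  nodeSelector:
    intel.feature.node.kubernetes.io/gpu: 'true'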
Hi @Serverfrog and thanks for a detailed issue report! The logs seem to indicate the plugin is working correctly, with no errors that would cause things to fail.
Installing user-space libraries on the host shouldn't cause things to fail in containers, unless intel-basekit also installs some misbehaving kernel drivers. But in your case the plugin detects the GPUs correctly.
I assume things were working before you upgraded the packages? I would also make sure the Pod doesn't set nodeSelector etc. and force the Pod to a node without the resources.
@eero-t any ideas?
fyi, unless you also run GAS (GPU Aware Scheduling) in your cluster, there's not much benefit in enabling the resource manager in the GPU plugin.
@Serverfrog Could you paste here:
- pod spec, at least the following sections: nodeName/nodeSelector, securityContext (both for pod & container), resources
- node k8s GPU info: kubectl describe node YOUR_NODE_NAME | grep gpu
- node GPU files info: ls -l /dev/dri/ && head /sys/class/drm/card[0-9]/device/uevent
(Most people are on holidays this week, so definitive answer may go to next week.)
@tkatila I enabled GAS as a test after it was already not working, in case that might be it. For example, it worked briefly for one pod using i915_monitoring, but only until that pod was killed / the node was restarted.
@eero-t Sure! I'm assuming you mean the pod(s) that won't get the GPU, right?
...
  nodeName: proxfrog2
  securityContext: {}
  containers:
    - name: app
      ...
      resources:
        limits:
          gpu.intel.com/i915: '1'
          gpu.intel.com/millicores: '10'
          memory: 1536Mi
        requests:
          cpu: 100m
          gpu.intel.com/i915: '1'
          gpu.intel.com/millicores: '10'
          memory: 512Mi
      securityContext:
        privileged: true
privileged: true was added afterwards to test if that might be the cause, but that also didn't work
❯ kubectl describe node proxfrog2 | grep gpu
gas-prefer-gpu=card0
gpu.intel.com/device-id.0300-0412.count=2
gpu.intel.com/device-id.0300-0412.present=true
gpu.intel.com/device-id.0300-56a0.present=true
gpu.intel.com/family=A_Series
intel.feature.node.kubernetes.io/gpu=true
nfd.node.kubernetes.io/extended-resources: gpu.intel.com/memory.max,gpu.intel.com/millicores,gpu.intel.com/tiles
gpu.intel.com/device-id.0300-0412.count,gpu.intel.com/device-id.0300-0412.present,gpu.intel.com/device-id.0300-56a0.present,gpu.intel.com/...
gpu.intel.com/i915: 240
gpu.intel.com/i915_monitoring: 1
gpu.intel.com/memory.max: 0
gpu.intel.com/millicores: 2k
gpu.intel.com/tiles: 2
gpu.intel.com/i915: 240
gpu.intel.com/i915_monitoring: 1
gpu.intel.com/memory.max: 0
gpu.intel.com/millicores: 2k
gpu.intel.com/tiles: 2
kube-system intel-gpu-exporter-wwf72 100m (1%) 0 (0%) 100Mi (0%) 500Mi (1%) 6h31m
kube-system intel-gpu-plugin-intel-gpu-plugin-wmkkd 40m (0%) 100m (1%) 45Mi (0%) 90Mi (0%) 6h34m
gpu.intel.com/i915 3 3
gpu.intel.com/i915_monitoring 1 1
gpu.intel.com/memory.max 0 0
gpu.intel.com/millicores 40 40
gpu.intel.com/tiles 0 0
⚡ root@proxfrog2 ~ ls -l /dev/dri/ && head /sys/class/drm/card[0-9]/device/uevent
total 0
drwxr-xr-x 2 root root 120 Dec 22 17:34 by-path
crw-rw---- 1 root video 226, 0 Dec 22 17:34 card0
crw-rw---- 1 root video 226, 1 Dec 22 17:34 card1
crw-rw---- 1 root render 226, 128 Dec 22 17:34 renderD128
crw-rw---- 1 root render 226, 129 Dec 22 17:34 renderD129
==> /sys/class/drm/card0/device/uevent <==
DRIVER=i915
PCI_CLASS=30000
PCI_ID=8086:56A0
PCI_SUBSYS_ID=172F:3937
PCI_SLOT_NAME=0000:03:00.0
MODALIAS=pci:v00008086d000056A0sv0000172Fsd00003937bc03sc00i00
==> /sys/class/drm/card1/device/uevent <==
DRIVER=i915
PCI_CLASS=30000
PCI_ID=8086:0412
PCI_SUBSYS_ID=1043:8534
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d00000412sv00001043sd00008534bc03sc00i00
⚡ root@proxfrog2 ~ tree /dev/dri
/dev/dri
├── by-path
│ ├── pci-0000:00:02.0-card -> ../card1
│ ├── pci-0000:00:02.0-render -> ../renderD128
│ ├── pci-0000:03:00.0-card -> ../card0
│ └── pci-0000:03:00.0-render -> ../renderD129
├── card0
├── card1
├── renderD128
└── renderD129
@tkatila I enabled GAS as a test after it was already not working, in case that might be it. For example, it worked briefly for one pod using i915_monitoring, but only until that pod was killed / the node was restarted.
Monitoring resource bypasses other GPU related constraints. It's intended for monitoring all GPUs, not for using them.
privileged: true was added afterwards to test if that might be the cause, but that also didn't work
The whole point of device plugins is NOT needing this (as it basically breaks security and is therefore disallowed in many clusters). It has an impact only when the container is successfully scheduled and actually running on the node, i.e. it's not related to this problem.
gpu.intel.com/device-id.0300-0412.count=2 gpu.intel.com/device-id.0300-0412.present=true gpu.intel.com/device-id.0300-56a0.present=true
Note: neither GAS nor the GPU plugin supports heterogeneous GPU nodes [1], i.e. ones where there are multiple types of GPUs. That's why the GPU labeler has labeled the node as having 2x Haswell iGPUs, although it actually has an iGPU & a dGPU.
That does not explain this problem, but it would be better to disable the iGPU to avoid jobs intended for the dGPU ending up on the slow iGPU, which lacks a lot of dGPU features.
[1] The Intel DRA GPU driver supports such configs and does not need GAS, but you would need k8s v1.32 to use it, and its resource requests are a bit more complex to use: https://github.com/intel/intel-resource-drivers-for-kubernetes/
gpu.intel.com/i915: 240
gpu.intel.com/i915_monitoring: 1
gpu.intel.com/memory.max: 0
gpu.intel.com/millicores: 2k
gpu.intel.com/tiles: 2
gpu.intel.com/i915: 240
gpu.intel.com/i915_monitoring: 1
gpu.intel.com/memory.max: 0
gpu.intel.com/millicores: 2k
gpu.intel.com/tiles: 2
Ok, there should be enough GPU & millicore resources available, you're not requesting GPU memory, so it should be fine...
I guess that node is not e.g. tainted?
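Taints can be checked for example with:

kubectl describe node proxfrog2 | grep -i taints
# or the raw field:
kubectl get node proxfrog2 -o jsonpath='{.spec.taints}'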
kube-system intel-gpu-exporter-wwf72 100m (1%) 0 (0%) 100Mi (0%) 500Mi (1%) 6h31m
Tuomas, what's this?
@Serverfrog does GPU scheduling work if you:
- drop GAS and disable resource management support from plugin, and/or
- disable the iGPU ?
Weird... everything seems fine from a resource point of view.
Can you drop GAS so that it doesn't interfere with the scheduling decisions? Make sure to remove the scheduler config part in /etc/kubernetes/manifests/kube-scheduler.yaml. Before removing GAS, you could check if there's anything funny in GAS' logs.
edit: as noted by Eero, also remove resource management from GPU plugin.
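In practice that means something like the following, depending on how the plugin was deployed (object names below are assumptions, adjust to your setup):

# operator-managed deployment: turn resource management off in the CR
kubectl patch gpudeviceplugin gpudeviceplugin --type merge -p '{"spec":{"resourceManager":false}}'
# manually deployed DaemonSet: remove the "-resource-manager" argument
# from the intel-gpu-plugin container and let the pods restart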
kube-system intel-gpu-exporter-wwf72 100m (1%) 0 (0%) 100Mi (0%) 500Mi (1%) 6h31m
Tuomas, what's this?
Probably this: https://github.com/onedr0p/intel-gpu-exporter
Monitoring resource bypasses other GPU related constraints. It's intended for monitoring all GPUs, not for using them.
kube-system intel-gpu-exporter-wwf72 100m (1%) 0 (0%) 100Mi (0%) 500Mi (1%) 6h31m
Tuomas, what's this?
Probably this: https://github.com/onedr0p/intel-gpu-exporter
Exactly, and that's why I changed exactly that pod (it wasn't working before I tried i915_monitoring) to use the monitoring resource.
privileged: true was added afterwards to test if that might be the cause, but that also didn't work
The whole point of device plugins is NOT needing this (as it basically breaks security and is therefore disallowed in many clusters). It has an impact only when the container is successfully scheduled and actually running on the node, i.e. it's not related to this problem.
Yeah, I know. But I wanted to try whether I could get it to work with this as a kind of workaround.
I just disabled the iGPU (I hate old BIOSes, as you can only set the "Primary GPU", not "Disable iGPU"... if there is a primary, there can also be a secondary, and if my dGPU is the primary, then I would think the iGPU would not be disabled but just become the secondary one...)
I also removed the labels and annotations related to GAS, the resourceManager part, and all millicores. I also removed the privileged setting, as those tests did not work.
It seems that disabling the iGPU kind of worked. It makes sense, especially since the labels claim there are 2x the Haswell iGPU, which is blatantly wrong. That also explains why, in privileged mode, things always ran on the iGPU (though it could also be that I couldn't configure ffmpeg through that interface correctly to use card0 instead of card1... why it would ever prefer card1 over card0).
But i915_monitoring still throws the same error. I think I will revert back to the normal resource, but I later wanted to use xpumanager to export the stats that way, and as far as I've read, for that I should/could use the i915_monitoring resource.
Edit: I think it was most likely the iGPU, as it was card0 and renderD128 before the reboot and card0 and renderD128 afterwards, which most likely confused things.
As the GPU plugin officially only supports one type of GPU per node, the labeling rules do not work with multiple types of GPUs. For example in the count label:
gpu.intel.com/device-id.0300-0412.count=2
The 0300-0412 part is taken from the first (I think) PCI device it processes. The rules do not create multiple entries, as that would multiply the number of rules needed (or require a custom labeling binary). So even though the labels indicate that there were two 0412 devices, it's actually the 0412 + 56a0. The label name itself is just wrong.
@Serverfrog to summarize, your workload now works with the i915 resource, but a pod requesting i915_monitoring fails?
That also explains why, in privileged mode, things always ran on the iGPU (though it could also be that I couldn't configure ffmpeg through that interface correctly to use card0 instead of card1... why it would ever prefer card1 over card0)
Legacy media APIs are kind of stupid compared to compute & 3D APIs, see: https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/README.md#issues-with-media-workloads-on-multi-gpu-setups
But i915_monitoring still throws the same error. I think I will revert back to the normal resource, but I later wanted to use xpumanager to export the stats that way, and as far as I've read, for that I should/could use the i915_monitoring resource.
There's only a single monitoring resource per node. Make sure that no other pod is already consuming it.
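One way to check which pods request it (a sketch, assumes jq is installed):

kubectl get pods -A -o json | jq -r '
  .items[]
  | select(any(.spec.containers[]; .resources.requests["gpu.intel.com/i915_monitoring"] != null))
  | .metadata.namespace + "/" + .metadata.name'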
Edit: I think it was most likely the iGPU, as it was card0 and renderD128 before the reboot and card0 and renderD128 afterwards, which most likely confused things.
GPU plugin should not be confused by that, as it matches card & renderD device file nodes correctly based on info from sysfs.
But your media application could be confused, see above link for a helper script.
@Serverfrog to summarize, your workload now works with the i915 resource, but a pod requesting i915_monitoring fails?
exactly.
Only one would be used, for the GPU exporter.
GPU plugin should not be confused by that, as it matches card & renderD device file nodes correctly based on info from sysfs.
I can't really attest whether it was really the case (i.e. whether the application honors the configuration, but it should) that the card0 and card1 and renderD devices were all the iGPU, for whatever reason.
I was especially confused, as I was thinking that in the end the device plugin would just mount the devices directly from the host, i.e. if I only wanted to use the host's card1 device, I would also only have a card1 inside the container, without the card0, even if it exists.
@Serverfrog to summarize, your workload now works with the i915 resource, but a pod requesting i915_monitoring fails?
exactly.
Only one would be used, for the GPU exporter.
And to double check, if you enable the monitoring and deploy a Pod with the i915_monitoring resource, the Pod won't get scheduled due to missing resources (=i915_monitoring)?
GPU plugin should not be confused by that, as it matches card & renderD device file nodes correctly based on info from sysfs.
I can't really attest whether it was really the case (i.e. whether the application honors the configuration, but it should) that the card0 and card1 and renderD devices were all the iGPU, for whatever reason. I was especially confused, as I was thinking that in the end the device plugin would just mount the devices directly from the host, i.e. if I only wanted to use the host's card1 device, I would also only have a card1 inside the container, without the card0, even if it exists.
That's how the device plugin works. Cards on the host are mounted to the container without modifications. card1 -> card1, renderD128 -> renderD128 etc.
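You can verify this from inside a scheduled pod, e.g. (pod/container names below are placeholders):

kubectl exec YOUR_POD -c YOUR_CONTAINER -- ls -l /dev/dri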
Sorry for going back to this issue. It worked for me for a long time, then I updated again and it went back to the same state. Towards the end of this issue I had gone back to the very basics, but after a simple server restart, it seems like something (it feels like something the operator or the GPU plugin would do) refuses to give out the GPU.
Top of the Node YAML
apiVersion: v1
kind: Node
metadata:
  name: proxfrog2
  uid: ab68e1f7-7271-4ca5-b31f-e1d5ef10851f
  resourceVersion: '301640991'
  creationTimestamp: '2024-11-26T00:01:16Z'
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    feature.node.kubernetes.io/pci-0300_8086.present: 'true'
    feature.node.kubernetes.io/system-os_release.ID: debian
    feature.node.kubernetes.io/system-os_release.VERSION_ID: '12'
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: '12'
    gpu.intel.com/device-id.0300-56a0.count: '1'
    gpu.intel.com/device-id.0300-56a0.present: 'true'
    gpu.intel.com/family: A_Series
    intel.feature.node.kubernetes.io/gpu: 'true'
    k8slens-edit-resource-version: v1
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: proxfrog2
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ''
  annotations:
    csi.volume.kubernetes.io/nodeid: >-
      {"rook-ceph.cephfs.csi.ceph.com":"proxfrog2","rook-ceph.rbd.csi.ceph.com":"proxfrog2"}
    flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"ae:d3:08:09:ff:b7"}'
    flannel.alpha.coreos.com/backend-type: vxlan
    flannel.alpha.coreos.com/kube-subnet-manager: 'true'
    flannel.alpha.coreos.com/public-ip: 192.168.178.118
    nfd.node.kubernetes.io/feature-labels: >-
      gpu.intel.com/device-id.0300-56a0.count,gpu.intel.com/device-id.0300-56a0.present,gpu.intel.com/family,intel.feature.node.kubernetes.io/gpu,pci-0300_8086.present,system-os_release.ID,system-os_release.VERSION_ID,system-os_release.VERSION_ID.major
    node.alpha.kubernetes.io/ttl: '0'
    volumes.kubernetes.io/controller-managed-attach-detach: 'true'
  selfLink: /api/v1/nodes/proxfrog2
status:
  capacity:
    cpu: '8'
    ephemeral-storage: 397802508Ki
    gpu.intel.com/i915: '120'
    gpu.intel.com/i915_monitoring: '1'
    hugepages-1Gi: '0'
    hugepages-2Mi: '0'
    memory: 32772036Ki
    pods: '110'
  allocatable:
    cpu: 7950m
    ephemeral-storage: '366346355310'
    gpu.intel.com/i915: '120'
    gpu.intel.com/i915_monitoring: '1'
    hugepages-1Gi: '0'
    hugepages-2Mi: '0'
    memory: 32145348Ki
    pods: '110'
For one restart, i915_monitoring was working, but it won't work anymore either.
The thing is... my setup is a bit cursed, as I run 4 Talos nodes and one plain worker node, which is where the GPU is.
It is weird, as some changes are communicated to etcd not at all or only very slowly, so that, for example, when I delete a pod, it gets stuck in the Terminating state. Could it be that there is some residual information in etcd that leads to these kinds of problems?
(Allocate failed due to requested number of devices unavailable for gpu.intel.com/i915. Requested: 1, Available: 0, which is unexpected)
EDIT: ...
One more reason I could think of for residue: I increased sharedDevNum to 240 and now everything is starting? Is there some way I can debug things like that, e.g. viewing the assignments? Increase the log level to very verbose?
Hi @Serverfrog
You can increase the GPU plugin logs in the GPU CR by setting spec.logLevel to 4. Or add "-v=4" to the GPU plugin deployment's arguments. The operator just instantiates the plugin, so there shouldn't be any need to increase its log level.
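If you want to see which device IDs the kubelet has actually allocated to which pods, one place to look (assuming the default kubelet paths) is the device-plugin checkpoint file on the node:

jq . /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint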
Changing shared-dev-num from one to another shouldn't magically cause things to work again. Though, you could try to debug the state changes in the cluster by changing the shared-dev-num and then checking how long it takes for the node to get updated.
In general, when gpu.intel.com/i915 or i915_monitoring is greater than zero in the pod's spec, the GPU plugin has done its job. It's then the scheduler that decides where to put the Pod, or it will complain that there are no nodes available with such and such resources.
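The scheduler's (or kubelet's) complaints end up in the pod's events, e.g.:

kubectl describe pod YOUR_POD | sed -n '/^Events:/,$p'
# or
kubectl get events --field-selector involvedObject.name=YOUR_POD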