intel-device-plugins-for-kubernetes
GPU crashing on 1 node.
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
nuc1 Ready control-plane,etcd,master,worker 2y40d v1.26.9+k3s1 x.x.x.x <none> Fedora Linux 38 (Server Edition) 6.5.6-200.fc38.x86_64 containerd://1.7.6-k3s1.26
nuc2 Ready control-plane,coral.ai,etcd,master,worker 127m v1.26.9+k3s1 x.x.x.x <none> Fedora Linux 39 (Server Edition) 6.6.2-201.fc39.x86_64 containerd://1.7.6-k3s1.26
nuc3 Ready control-plane,etcd,master,worker 42d v1.26.9+k3s1 x.x.x.x <none> Fedora Linux 38 (Server Edition) 6.5.8-200.fc38.x86_64 containerd://1.7.6-k3s1.26
Running 3 master nodes using k3s. The plugin deploys fine on NUC 1 and NUC 3. On NUC 2 the container crashes with:
E1216 11:45:32.208374 1 manager.go:146] Failed to serve gpu.intel.com/i915: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"
Cannot register to kubelet service
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).registerWithKubelet
/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:352
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).setupAndServe
/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:280
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).Serve
/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:207
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*Manager).handleUpdate.func1
/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/manager.go:144
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1598
Command used to provision NUC2:
curl -sfL https://get.k3s.io | K3S_URL=https://cluster.domain:6443 K3S_TOKEN=1:server:1 INSTALL_K3S_VERSION=v1.26.9+k3s1 sh -s - server --flannel-backend=none --disable-network-policy --cluster-cidr=x.x.x.x/x --service-cidr=x.x.x.x/x --cluster-init --disable=servicelb --disable traefik --selinux
The only differences between NUC2 and NUC1/3 are:
- NUC2 is FC39 and the others are FC38
- When starting k3s on NUC2 it complained about SELinux and said to add '--selinux' to the startup command (the other two nodes don't have this flag)
Any advice appreciated.
I will test re-adding the node without --selinux, and if all else fails change it to FC38.
Hi @ryanm101
I found a somewhat similar error here: https://github.com/intel/intel-technology-enabling-for-openshift/issues/113. There are a couple of workarounds in that issue that could work. Could you try them out?
I reproduced the issue on a VM. The device plugin works with SELinux disabled but fails with SELinux enabled. The SELinux audit log contains this entry:
type=AVC msg=audit(1702889339.432:3913): avc: denied { connectto } for pid=16332 comm="intel_gpu_devic" path="/var/lib/kubelet/device-plugins/kubelet.sock" scontext=system_u:system_r:container_device_plugin_t:s0:c620,c968 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=0
I'll need to study whether this is similar to, or the same as, the issue linked above.
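To make the denial above easier to read, here is a minimal sketch that pulls the source type, target type, class, and permission out of that AVC record and prints them in allow-rule form. The record is copied from this thread; the helper name avc_field is made up for illustration, and the printed rule is only an approximation of what a tool like audit2allow would be expected to derive, not its actual output.

```shell
# The AVC record quoted above, copied verbatim from the thread.
avc_line='type=AVC msg=audit(1702889339.432:3913): avc: denied { connectto } for pid=16332 comm="intel_gpu_devic" path="/var/lib/kubelet/device-plugins/kubelet.sock" scontext=system_u:system_r:container_device_plugin_t:s0:c620,c968 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=0'

# avc_field KEY: print the value of the KEY=value token in $avc_line.
# (Hypothetical helper, not part of any SELinux tooling.)
avc_field() {
  printf '%s\n' "$avc_line" | tr ' ' '\n' | sed -n "s/^$1=//p"
}

# An SELinux context is user:role:type:level, so the type is field 3.
source_type=$(avc_field scontext | cut -d: -f3)
target_type=$(avc_field tcontext | cut -d: -f3)
target_class=$(avc_field tclass)

# The denied permission sits between "denied { " and " }".
permission=$(printf '%s\n' "$avc_line" | sed -n 's/.*denied { \([^}]*\) }.*/\1/p')

# Summarize the denial in allow-rule form.
echo "allow $source_type $target_type:$target_class $permission;"
```

Read this way, the record says that the plugin process (container_device_plugin_t) was refused a connectto on a unix_stream_socket owned by a container_runtime_t process, which matches the "permission denied" on kubelet.sock in the plugin log.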
EDIT: using setenforce 0 is a workaround, though not a viable one if SELinux has to stay enforcing.
setenforce 0 corrects it, but NUC1 and NUC3 are both enforcing and working fine.
I followed instructions from the audit entry:
sudo ausearch -c 'intel_gpu_devic' --raw | audit2allow -M intelgpudevice
sudo semodule -X 300 -i intelgpudevice.pp
That seems to allow the device plugin to access kubelet. I'm not sure where we should file a bug: FC, k3s, or somewhere else.
The plugins already run with the proper label to have access to kubelet. That policy went into the container-selinux package. Is that package installed on your node?
Those packages get installed alongside k3s, and they are installed.
Yes, installing the policy module generated from the audit entry (the ausearch | audit2allow and semodule commands above) seems to solve it.
@mregmi do you happen to know the container-selinux version?
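For comparing the working and failing nodes, a small sketch like the following could be run on each NUC to collect the SELinux mode and the relevant package versions. It assumes rpm-based hosts (as with the Fedora nodes here); the function name node_selinux_report is made up, and k3s-selinux is included only on the assumption that k3s pulled it in alongside container-selinux.

```shell
# Print SELinux mode and policy package versions for the current node.
# Every tool is checked first so the report still works where one is missing.
node_selinux_report() {
  echo "host: $(uname -n)"
  echo "kernel: $(uname -r)"

  if command -v getenforce >/dev/null 2>&1; then
    echo "selinux: $(getenforce)"
  else
    echo "selinux: getenforce not found"
  fi

  if command -v rpm >/dev/null 2>&1; then
    # rpm -q prints "package X is not installed" when a package is absent.
    # k3s-selinux is an assumption; drop it if it does not apply.
    for pkg in container-selinux k3s-selinux; do
      echo "package: $(rpm -q "$pkg" 2>&1)"
    done
  else
    echo "package: rpm not found"
  fi
}

node_selinux_report
```

Diffing the output from NUC2 against NUC1 or NUC3 should show whether the FC39 node carries a different container-selinux version than the FC38 nodes.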