
Kubernetes roles are continuously created

Open · lemaral opened this issue 2 years ago · 3 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [x] Are you running on an Ubuntu 18.04 node? Yes
  • [x] Are you running Kubernetes v1.13+? Yes
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? Docker
  • [x] Do you have i2c_core and ipmi_msghandler loaded on the nodes? Yes
  • [x] Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces) Yes

2. Issue or feature description

According to the audit log, Kubernetes roles seem to be continuously created: nvidia-driver, nvidia-mig-manager, nvidia-operator-validator, etc.
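
For reference, one way to see this pattern is to filter the API-server audit log for repeated writes to the nvidia-* roles. A rough sketch, assuming jq is available and the audit log is written as JSON lines to the path below (both the path and the audit policy are assumptions; adjust for your cluster):

    # List create/update events against Roles whose name starts with "nvidia-"
    jq -c 'select(.objectRef.resource == "roles"
                  and ((.objectRef.name // "") | startswith("nvidia-"))
                  and (.verb == "create" or .verb == "update"))
           | {ts: .requestReceivedTimestamp, verb: .verb, name: .objectRef.name}' \
      /var/log/kubernetes/kube-apiserver-audit.log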

3. Steps to reproduce the issue

Install gpu-operator with the Helm chart (in kube-system).
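
For example, a typical install along these lines (repo URL and chart name as published by NVIDIA; the release name is illustrative and the namespace matches this report):

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install gpu-operator nvidia/gpu-operator -n kube-system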

4. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: kubectl get pods --all-namespaces

  • [ ] kubernetes daemonset status: kubectl get ds --all-namespaces

  • [ ] If a pod/ds is in an error or pending state: kubectl describe pod -n NAMESPACE POD_NAME

  • [ ] If a pod/ds is in an error or pending state: kubectl logs -n NAMESPACE POD_NAME

  • [ ] Output of running a container on the GPU machine: docker run -it alpine echo foo

  • [ ] Docker configuration file: cat /etc/docker/daemon.json

  • [ ] Docker runtime configuration: docker info | grep runtime

  • [ ] NVIDIA shared directory: ls -la /run/nvidia

  • [ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • [ ] NVIDIA driver directory: ls -la /run/nvidia/driver

  • [ ] kubelet logs: journalctl -u kubelet > kubelet.logs

lemaral avatar May 30 '22 19:05 lemaral

@lemaral we had to update each resource to support upgrade use cases. This usually happens when reconciliation is triggered frequently, and it should settle once the drivers are loaded and all pods are running. Are you seeing this even after all pods are in a good state?
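
One quick way to check whether the roles are still being rewritten after the pods settle is to watch their resourceVersion, which only changes on writes. A sketch, assuming the nvidia-* roles live in the gpu-operator-resources namespace (adjust to wherever the operands run in your install):

    kubectl get roles -n gpu-operator-resources -w \
      -o custom-columns=NAME:.metadata.name,RESOURCEVERSION:.metadata.resourceVersion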

shivamerla avatar May 31 '22 20:05 shivamerla

@shivamerla thank you for your reply. Yes, it never stops, although the GPU Operator itself is working perfectly. I am seeing this through Falco and had to disable the relevant default rule to stop the flood. I believe it puts some load on etcd as well.

lemaral avatar May 31 '22 20:05 lemaral

Can you paste the output of kubectl get pods -n gpu-operator and the last logs of the operator pod?
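
Concretely, something along these lines (adjust the namespace to wherever the chart was installed, kube-system in this report; deploy/gpu-operator is an assumed name based on the default chart values):

    kubectl get pods -n kube-system
    kubectl logs -n kube-system deploy/gpu-operator --tail=200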

shivamerla avatar May 31 '22 20:05 shivamerla