node-problem-detector cannot run in non-privileged mode

Open ialidzhikov opened this issue 2 years ago • 15 comments

/kind bug

What happened?

Running containers in privileged mode is not recommended, because privileged containers run with all Linux capabilities enabled and can access the host's resources. Privileged mode opens up a number of security threats, such as a breakout to the underlying host OS.

Currently the node-problem-detector DaemonSet runs in privileged mode.

https://github.com/kubernetes/node-problem-detector/blob/d8b2940b3cac1d99c9072dd644c7dfb372672114/deployment/node-problem-detector.yaml#L41-L42

When node-problem-detector runs in non-privileged mode (even with all capabilities added), one of its monitors fails with:

E0808 06:25:33.740326       1 problem_detector.go:55] Failed to start problem daemon &{/config/kernel-monitor.json 0xc00035b7a0 0xc000443100 {{kmsg map[] /dev/kmsg 5m } 10 kernel-monitor [{KernelDeadlock  {0 0 <nil>} KernelHasNoDeadlock kernel has no deadlock} {ReadonlyFilesystem  {0 0 <nil>} FilesystemIsNotReadOnly Filesystem is not read-only}] [{temporary  OOMKilling Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*} {temporary  TaskHung task [\S ]+:\w+ blocked for more than \w+ seconds\.} {temporary  UnregisterNetDevice unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {temporary  KernelOops BUG: unable to handle kernel NULL pointer dereference at .*} {temporary  KernelOops divide error: 0000 \[#\d+\] SMP} {temporary  Ext4Error EXT4-fs error .*} {temporary  Ext4Warning EXT4-fs warning .*} {temporary  IOError Buffer I/O error .*} {temporary  MemoryReadError CE memory read error .*} {permanent KernelDeadlock DockerHung task docker:\w+ blocked for more than \w+ seconds\.} {permanent ReadonlyFilesystem FilesystemIsReadOnly Remounting filesystem read-only}] 0xc00043d21e} [] <nil> 0xc00045aea0 0xc00044bb80}: failed to create kmsg parser: open /dev/kmsg: operation not permitted

I don't fully understand what is required to read kernel logs from /dev/kmsg.
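
To narrow down a failure like the one above, a small diagnostic can try to open the device read-only and map the resulting errno to a likely cause. The following is a minimal sketch in Python and is not part of node-problem-detector. The function names and hint texts are illustrative assumptions. In particular, attributing EPERM to the container runtime's device restrictions rather than to a missing capability is a common explanation, not something confirmed in this thread.

```python
import errno
import os

# Hypothetical hints: educated guesses, not diagnostics produced by NPD.
HINTS = {
    errno.EPERM: "EPERM: open was denied by policy (possibly the device cgroup or a kernel restriction)",
    errno.EACCES: "EACCES: file permission bits deny read access for this user",
    errno.ENOENT: "ENOENT: the device node is not present in this container",
}


def classify_open_error(err: OSError) -> str:
    """Map an OSError raised by open() to a short, hedged explanation."""
    if err.errno in HINTS:
        return HINTS[err.errno]
    return f"errno {err.errno}: {os.strerror(err.errno)}"


def probe(path: str = "/dev/kmsg") -> str:
    """Try to open path read-only and report the outcome as a string."""
    try:
        # Nothing is read here; only the open() call itself is being tested.
        fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    except OSError as err:
        return classify_open_error(err)
    os.close(fd)
    return "ok: opened for reading"


if __name__ == "__main__":
    print(probe())
```

Running this inside the node-problem-detector container, with and without `privileged: true`, should show whether the open fails with the same "operation not permitted" error that the kmsg parser reports.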

What did you expect to happen?

I would expect to be able to run node-problem-detector in non-privileged mode.

ialidzhikov avatar Sep 01 '22 06:09 ialidzhikov

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 30 '22 07:11 k8s-triage-robot

/lifecycle rotten

k8s-triage-robot avatar Dec 30 '22 08:12 k8s-triage-robot

/remove-lifecycle rotten

ialidzhikov avatar Dec 30 '22 16:12 ialidzhikov

/lifecycle stale

k8s-triage-robot avatar Mar 30 '23 16:03 k8s-triage-robot

/remove-lifecycle stale

ialidzhikov avatar Mar 31 '23 07:03 ialidzhikov

Any update on this?

balu-ce avatar May 04 '23 11:05 balu-ce

Duplicate of https://github.com/kubernetes/node-problem-detector/issues/625

btiernay avatar May 24 '23 02:05 btiernay

Duplicate of #625

Neither issue has a solution for the problem that @ialidzhikov described and that I'm currently experiencing. The "duplicate" issue you (@btiernay) shared only contains comments from @k8s-triage-robot. No solution is provided 🤷

Any solution so far?

AlexzSouz avatar Dec 08 '23 11:12 AlexzSouz

How about trying the journald plugin instead? It works fine for me to detect "NodeOOM" and "PodOOM" with the patterns ".Out of memory." and ".Memory cgroup out of memory."

alazyer avatar Dec 29 '23 07:12 alazyer
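
The journald suggestion above could look roughly like the following sketch, which builds a log-monitor config as a Python dict, dumps it as JSON, and checks the rules against sample lines. The `plugin`, `pluginConfig`, `logPath`, `lookback`, `bufferSize`, `source`, `conditions`, and `rules` keys appear to mirror the sample configs in the node-problem-detector repository, but treat the exact values as assumptions. The source name `kernel-monitor-journald`, the `.*` wildcards in the patterns (the patterns quoted above may have lost characters in rendering), and the sample log lines are all hypothetical. NPD also has to be built with journald support for this plugin to load.

```python
import json
import re

# Assumed rules: the reasons come from the comment above, while the ".*"
# wildcards are a guess at what the quoted patterns originally looked like.
RULES = [
    {"type": "temporary", "reason": "NodeOOM", "pattern": ".*Out of memory.*"},
    {"type": "temporary", "reason": "PodOOM", "pattern": ".*Memory cgroup out of memory.*"},
]

# Assumed monitor config that reads kernel messages from the journal
# instead of opening /dev/kmsg directly.
CONFIG = {
    "plugin": "journald",
    "pluginConfig": {"source": "kernel"},
    "logPath": "/var/log/journal",
    "lookback": "5m",
    "bufferSize": 10,
    "source": "kernel-monitor-journald",  # hypothetical name
    "conditions": [],
    "rules": RULES,
}


def matching_reasons(line: str) -> list:
    """Return the reasons of all rules whose pattern matches the log line.

    Uses Python's re module; NPD itself uses Go regular expressions, so
    matching behaviour may differ for more complex patterns.
    """
    return [rule["reason"] for rule in RULES if re.search(rule["pattern"], line)]


if __name__ == "__main__":
    print(json.dumps(CONFIG, indent=2))
    print(matching_reasons("Out of memory: Killed process 1234 (stress)"))
```

The patterns are case-sensitive, so a line starting with "Memory cgroup out of memory" matches only the PodOOM rule and not the NodeOOM one.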

/lifecycle stale

k8s-triage-robot avatar Mar 28 '24 07:03 k8s-triage-robot

NPD's goal is to detect infrastructure-layer issues, so it needs to read logs from places that non-privileged containers do not have permission to access. Additionally, we use the health checker in production to repair kubelet and containerd by killing them, which also requires privilege.

Depending on how you would like to use NPD, you may be able to tune your DaemonSet YAML so that it runs without privileged access. @hakman, does kops run NPD in non-privileged mode?

wangzhen127 avatar Apr 05 '24 17:04 wangzhen127

/remove-kind bug

wangzhen127 avatar Apr 05 '24 17:04 wangzhen127

/remove-lifecycle stale

wangzhen127 avatar Apr 05 '24 17:04 wangzhen127

Hello, I am also facing a similar issue when NPD reads from /dev/kmsg in a container that is not running in privileged mode. Is there any workaround? We only need read access; there are no mutating actions on our side.

haardm avatar Jun 08 '24 00:06 haardm

/lifecycle stale

k8s-triage-robot avatar Sep 06 '24 01:09 k8s-triage-robot

/lifecycle rotten

k8s-triage-robot avatar Oct 06 '24 01:10 k8s-triage-robot