node-problem-detector cannot run in non-privileged mode

/kind bug

What happened?

Running containers in privileged mode is not recommended, as privileged containers run with all Linux capabilities enabled and can access the host's resources. Privileged mode opens up a number of security threats, such as breakout to the underlying host OS.

Currently the node-problem-detector DaemonSet runs in privileged mode.

https://github.com/kubernetes/node-problem-detector/blob/d8b2940b3cac1d99c9072dd644c7dfb372672114/deployment/node-problem-detector.yaml#L41-L42

When node-problem-detector runs in non-privileged mode (even with all capabilities added), one of its monitors fails with:

E0808 06:25:33.740326       1 problem_detector.go:55] Failed to start problem daemon &{/config/kernel-monitor.json 0xc00035b7a0 0xc000443100 {{kmsg map[] /dev/kmsg 5m } 10 kernel-monitor [{KernelDeadlock  {0 0 <nil>} KernelHasNoDeadlock kernel has no deadlock} {ReadonlyFilesystem  {0 0 <nil>} FilesystemIsNotReadOnly Filesystem is not read-only}] [{temporary  OOMKilling Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*} {temporary  TaskHung task [\S ]+:\w+ blocked for more than \w+ seconds\.} {temporary  UnregisterNetDevice unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {temporary  KernelOops BUG: unable to handle kernel NULL pointer dereference at .*} {temporary  KernelOops divide error: 0000 \[#\d+\] SMP} {temporary  Ext4Error EXT4-fs error .*} {temporary  Ext4Warning EXT4-fs warning .*} {temporary  IOError Buffer I/O error .*} {temporary  MemoryReadError CE memory read error .*} {permanent KernelDeadlock DockerHung task docker:\w+ blocked for more than \w+ seconds\.} {permanent ReadonlyFilesystem FilesystemIsReadOnly Remounting filesystem read-only}] 0xc00043d21e} [] <nil> 0xc00045aea0 0xc00044bb80}: failed to create kmsg parser: open /dev/kmsg: operation not permitted

I don't fully understand what it requires to read kernel logs from /dev/kmsg.
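
One way to narrow this down is to probe the device directly from inside the container. The sketch below is not part of node-problem-detector: `probe_kmsg` and `explain_open_error` are hypothetical helper names, and it assumes a Python interpreter is available in the container. It maps the errno returned by opening /dev/kmsg to a likely cause. The causes listed are common ones (CAP_SYSLOG when kernel.dmesg_restrict is set, the container runtime's device allowlist), not an exhaustive diagnosis. Since all capabilities were already added here, the device allowlist looks like the more plausible candidate, but that is an inference rather than something confirmed.

```python
import errno
import os

KMSG_PATH = "/dev/kmsg"


def explain_open_error(err: OSError) -> str:
    """Map an OSError from open() to a likely cause (heuristic, not exhaustive)."""
    if err.errno == errno.EPERM:
        return (
            "EPERM: blocked by the kernel or the runtime, e.g. missing CAP_SYSLOG "
            "with kernel.dmesg_restrict=1, or the device is not on the "
            "container's device allowlist"
        )
    if err.errno == errno.EACCES:
        return "EACCES: file mode or ownership denies read access"
    if err.errno == errno.ENOENT:
        return "ENOENT: the device node is not present in this container"
    return f"unexpected error: {err.strerror}"


def probe_kmsg(path: str = KMSG_PATH) -> str:
    """Try to open the kernel log device read-only and report what happened."""
    try:
        # O_NONBLOCK only keeps later reads from blocking; the open is what is tested.
        fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    except OSError as err:
        return explain_open_error(err)
    os.close(fd)
    return "ok: opened for reading"


if __name__ == "__main__":
    print(probe_kmsg())
```

An EPERM result here would match the "operation not permitted" in the log above; an "ok" result would suggest the failure depends on how the DaemonSet mounts or exposes the device rather than on the node itself.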

What did you expect to happen?

I would expect to be able to run node-problem-detector in non-privileged mode.

Sep 01 '22 06:09 ialidzhikov

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Nov 30 '22 07:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Dec 30 '22 08:12 k8s-triage-robot

/remove-lifecycle rotten

Dec 30 '22 16:12 ialidzhikov

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Mar 30 '23 16:03 k8s-triage-robot

/remove-lifecycle stale

Mar 31 '23 07:03 ialidzhikov

Any update on this?

May 04 '23 11:05 balu-ce

Duplicate of https://github.com/kubernetes/node-problem-detector/issues/625

May 24 '23 02:05 btiernay

Duplicate of #625

Both issues DO NOT have a solution for the problem @ialidzhikov mentioned and that I'm currently experiencing. The "duplicate" issue you (@btiernay) shared only contains comments from @k8s-triage-robot. No solution is provided 🤷

Any solution so far?

Dec 08 '23 11:12 AlexzSouz

How about trying the journald plugin instead? It works fine for me to detect "NodeOOM" and "PodOOM" with the patterns ".Out of memory." and ".Memory cgroup out of memory."
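
A rough sketch of how patterns like these behave against log messages, assuming the two patterns pair with the two names in the order given. The patterns are copied exactly as quoted above; taken literally, each leading and trailing `.` matches exactly one character. The sample log lines and the `match_reason` helper are hypothetical, and Python's `re` is used only for illustration (NPD itself is written in Go).

```python
import re

# Reason/pattern pairs as quoted above; the pairing order is an assumption.
RULES = [
    {"reason": "NodeOOM", "pattern": r".Out of memory."},
    {"reason": "PodOOM", "pattern": r".Memory cgroup out of memory."},
]


def match_reason(line):
    """Return the reason of the first rule whose pattern occurs in the line, else None."""
    for rule in RULES:
        if re.search(rule["pattern"], line):
            return rule["reason"]
    return None


if __name__ == "__main__":
    # Hypothetical sample messages, not taken from a real node.
    samples = [
        "kernel: Out of memory: Killed process 1234 (stress)",
        "kernel: Memory cgroup out of memory: Killed process 4321 (java)",
        "kernel: EXT4-fs error (device sda1)",
    ]
    for line in samples:
        print(match_reason(line), "<-", line)
```

Matching is case-sensitive, so the cgroup message (lowercase "out of memory") does not trigger the first rule.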

Dec 29 '23 07:12 alazyer

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Mar 28 '24 07:03 k8s-triage-robot

NPD's goal is to detect infra layer issues. So it needs to read logs in a place where non-privileged containers do not have permission. Additionally, we use health checker in production to repair kubelet and containerd by killing them. Those need privilege.

Depending on how you would like to use NPD, you may be able to tune your DaemonSet YAML so that it runs without privileged access. @hakman for kops, does it run NPD in non-privileged mode?

Apr 05 '24 17:04 wangzhen127

/remove-kind bug

Apr 05 '24 17:04 wangzhen127

/remove-lifecycle stale

Apr 05 '24 17:04 wangzhen127

Hello, I am also facing a similar issue when reading from /dev/kmsg with NPD while my container is not running in privileged mode. Is there any workaround? We only need to read; there are no mutating actions on our side.

Jun 08 '24 00:06 haardm

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Sep 06 '24 01:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Oct 06 '24 01:10 k8s-triage-robot
