Kured says reboot not required even though there is a reboot-required file present on the Kubernetes cluster Linux node

Open deepaknani007 opened this issue 1 year ago • 16 comments

I deployed the latest release of kured (1.13.1) to an Azure Kubernetes cluster running Kubernetes v1.26.3 almost a month ago. I have not seen any reboot-required file created on the nodes, so I created a dummy "reboot-required" file in "/var/run" on all nodes of the cluster. Unfortunately the nodes are not rebooting, and the kured pod logs say that a reboot is not required.

Create /var/run/reboot-required Dummy file: image

Kured pod logs: image

deepaknani007 avatar Jun 20 '23 20:06 deepaknani007

Do we need to have cluster auto-upgrade enabled with node-image for kured to work?

deepaknani007 avatar Jun 21 '23 15:06 deepaknani007

Hi @deepaknani007, thanks for the bug-report. Does this behaviour still exist? It's hard to evaluate why the file was not detected. There are no further configs required on the infrastructure to make kured work.

ckotzbauer avatar Aug 02 '23 17:08 ckotzbauer

Same problem in a kubeadm-deployed cluster with flatcar stable. The file is there on all nodes but there are no reboots. It is a vanilla installation of kured on Kubernetes v1.27.2.

time="2023-08-11T10:01:27Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2023-08-11T10:01:27Z" level=info msg="Kubernetes Reboot Daemon: 1.13.2"
time="2023-08-11T10:01:27Z" level=info msg="Node ID: XXX"
time="2023-08-11T10:01:27Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2023-08-11T10:01:27Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2023-08-11T10:01:27Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
time="2023-08-11T10:01:27Z" level=info msg="PreferNoSchedule taint: "
time="2023-08-11T10:01:27Z" level=info msg="Blocking Pod Selectors: []"
time="2023-08-11T10:01:27Z" level=info msg="Reboot schedule: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC"
time="2023-08-11T10:01:27Z" level=info msg="Reboot check command: [test -f /var/run/reboot-required] every 1h0m0s"
time="2023-08-11T10:01:27Z" level=info msg="Reboot command: [/bin/systemctl reboot]"

Later I can see "reboot not required".

jorgelon avatar Aug 11 '23 10:08 jorgelon

Okay, thanks for this information. We will do a release later this month when Kubernetes releases its next minor version. #806 will be included there, which adds a warn-log for non-1 exit-codes of the sentinel-check command. Maybe something is crashing here on your hosts. This would cause kured to avoid reboots.

ckotzbauer avatar Aug 11 '23 14:08 ckotzbauer

In my case, the reboot finally started without my changing anything. I do not know the reason. If I find something I will tell you.

jorgelon avatar Aug 14 '23 06:08 jorgelon

A new flatcar release and the same problem. The /var/run/reboot-required file exists but there are no reboots. I am working with the 1.14.0 release and I cannot find a way to debug this. How does kured check whether that file exists?

The grafana dashboard shows that the nodes need to be rebooted.

jorgelon avatar Sep 13 '23 08:09 jorgelon

@jorgelon Do you see the following warn-log in the kured-pod-logs?

sentinel command ended with unexpected exit code

This was added with 1.14.0. The problem with the host-commands is that we don't know what happens on the host. When the command crashes with an unexpected error or is blocked by some security-tools (e.g. aquasec, ...), this warn-log is the only indicator. Maybe you can analyze your host-logs for abnormalities around the check-executions.

ckotzbauer avatar Sep 13 '23 18:09 ckotzbauer

Nope @ckotzbauer, I do not see that log. I only see the same output as in my Aug 11 comment.

jorgelon avatar Sep 15 '23 14:09 jorgelon

Okay, that's sad. Then it will be very hard to figure out why the file is not detected. Kured logs the output of the "test -f" command and logs a warning when the exit-code is something unexpected. So it seems that the command either crashes silently (maybe something is logged in the syslog) or just exits with the exit-code which indicates that no reboot is required (even though the file exists).

We will land some bigger security-improvements in 1.15.0. From then on we will mount the directory of the reboot-file as a host-mount and do a "normal" existence check without "nsenter", which should work more smoothly. But 1.15.0 will be released after Kubernetes 1.29.0 (so in December).

ckotzbauer avatar Sep 15 '23 15:09 ckotzbauer
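
For anyone debugging this, the exit-code handling described above can be reproduced by hand. The following is a minimal sketch, not kured's actual code: the function name `classify_sentinel` is hypothetical, and it assumes (from the maintainer's comments in this thread) that exit code 0 means "reboot required", 1 means "reboot not required", and anything else is the "unexpected" case that triggers the warn-log.

```shell
#!/bin/sh
# Hypothetical helper, not part of kured: run a sentinel check command and
# report how its exit code would be read under the assumptions stated above.
classify_sentinel() {
    # Run the check command passed as arguments, discarding its output.
    "$@" >/dev/null 2>&1
    rc=$?
    case "$rc" in
        0) echo "reboot required" ;;           # check succeeded: sentinel present
        1) echo "reboot not required" ;;       # normal "file absent" result of test -f
        *) echo "unexpected exit code $rc" ;;  # crash, blocked command, missing binary, ...
    esac
}
```

Running `classify_sentinel test -f /var/run/reboot-required` on the host (or the same check wrapped in nsenter from the pod) shows which of the three cases applies on a given node.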

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

github-actions[bot] avatar Nov 15 '23 01:11 github-actions[bot]

Update with 1.15.0: no changes.

Inside a kured pod on a node with /var/run/reboot-required present:

/tmp # /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required
/tmp # echo $?
0
/tmp # wget -qO- 127.0.0.1:8080/metrics | grep ^kured
kured_reboot_required{node="XXXXXX"} 0

jorgelon avatar Jan 23 '24 09:01 jorgelon
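
The comparison made by hand above (check result versus the exported metric) can be scripted. This is only a sketch under assumptions: the function names `reboot_required_metric` and `compare_check_and_metric` are hypothetical, the metric line is assumed to have the form shown above with no trailing timestamp, and a check that exits 0 is treated as "reboot required".

```shell
#!/bin/sh
# Hypothetical helpers, not part of kured.

# Print the value of each kured_reboot_required sample read from stdin.
reboot_required_metric() {
    awk '/^kured_reboot_required[{ ]/ { print $NF }'
}

# Read metrics text from stdin, run the check command given as arguments,
# and report whether the two agree.
compare_check_and_metric() {
    metric=$(reboot_required_metric | head -n 1)
    metric=${metric:-missing}
    # stdin was consumed above, so give the check command an empty stdin.
    if "$@" </dev/null >/dev/null 2>&1; then check=1; else check=0; fi
    if [ "$check" = "$metric" ]; then
        echo "consistent: check=$check metric=$metric"
    else
        echo "mismatch: check=$check metric=$metric"
    fi
}
```

Usage inside the pod would look like `wget -qO- 127.0.0.1:8080/metrics | compare_check_and_metric test -f /sentinel/reboot-required`. A mismatch does not prove a bug on its own: the startup log earlier in the thread shows the check running every 1h0m0s, so the metric may simply lag behind a freshly created file.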

I have tried using
https://raw.githubusercontent.com/kubereboot/kured/main/kured-rbac.yaml
https://raw.githubusercontent.com/kubereboot/kured/main/kured-ds-signal.yaml

Now I get

wget -qO- 127.0.0.1:8080/metrics | grep ^kured
kured_reboot_required{node="XXX"} 1

/ # /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required
nsenter: can't open '/proc/1/ns/mnt': Permission denied

The kured-1.15.0-dockerhub.yaml does not mount anything from the host.

Still no reboots

jorgelon avatar Jan 23 '24 09:01 jorgelon

Thanks @jorgelon for coming back to this thread. With 1.15.0, the nsenter check of the host's /var/run/reboot-required is not required anymore.

You have two options now:

  1. Use the kured-ds.yaml with the host-mount (or the kured-1.15.0-dockerhub.yaml; I updated the file just now and added the missing host-mount, thanks for the hint). This should be more stable than the nsenter. kured then checks for /sentinel/reboot-required.
  2. Use the new signal method. This also works with the host-mount and does not use nsenter at all. Why did you try the nsenter command inside this pod? It is intended not to work there.

ckotzbauer avatar Jan 23 '24 18:01 ckotzbauer

Right now I am using the helm chart with the default values.yaml to see if I get different results. I have 4 flatcar nodes. Only 2 need a reboot, and the pods in the daemon set show the correct result with wget -qO- 127.0.0.1:8080/metrics | grep ^kured. On the nodes that need a reboot the response is 1, and test -f /sentinel/reboot-required returns 0.

But nothing happens: no reboot, no log, no annotations. I will keep investigating.

My question is how /bin/systemctl reboot is performed if that binary does not exist in the kured pods.

jorgelon avatar Jan 25 '24 10:01 jorgelon

The binary is not called inside the pod; it is called with nsenter on the host. Does the problem persist with the "signal" method and the helm-chart? Does the pod still write "Reboot not required"?

ckotzbauer avatar Jan 25 '24 16:01 ckotzbauer
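
To illustrate the point about the binary living on the host, here is a small dry-run sketch. It only prints a command line and never executes it. It assumes the reboot command is wrapped with the same nsenter prefix that appears in the transcript earlier in this thread; the function name `host_command_line` is hypothetical and this is not kured's code.

```shell
#!/bin/sh
# Hypothetical helper, not part of kured: print (do NOT run) the command line
# that would execute the given command in the host's mount namespace.
host_command_line() {
    # nsenter prefix as shown earlier in this thread (assumed to apply here too).
    printf '%s' "/usr/bin/nsenter -m/proc/1/ns/mnt --"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}
```

With this prefix, `host_command_line /bin/systemctl reboot` prints a command whose `/bin/systemctl` is looked up in the host's filesystem rather than the pod's, which is why the binary does not need to exist in the kured image.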

Same problem here with the kured-5.5.0 chart, but it's working on another cluster where I have ghcr.io/kubereboot/kured:1.14.0 installed from a manifest file.

lderugeriis avatar Jul 25 '24 19:07 lderugeriis

I'll point to my comment here, which seems to be closely related: Issue #952

Any solution?

urbaman avatar Oct 20 '24 07:10 urbaman