Kured says reboot not required even though there is a reboot-required file present on the Kubernetes cluster Linux node
I deployed the latest release of kured (1.13.1) onto an Azure Kubernetes cluster (Kubernetes v1.26.3) almost a month ago. I don't see any reboot-required file being created on the nodes, so I created a dummy "reboot-required" file in the "/var/run" path on all nodes of the cluster. Unfortunately the nodes are not rebooting, and the logs of the kured pods say reboot is not required.
Creating the /var/run/reboot-required dummy file:
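A minimal sketch of how such a dummy sentinel file can be created, assuming direct shell access to each node:

# run on every node of the cluster (sketch)
sudo touch /var/run/reboot-required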
Kured pod logs:
Do we need to have cluster auto-upgrade enabled with node-image to make kured work?
Hi @deepaknani007, thanks for the bug-report. Does this behaviour still exist? It's hard to evaluate why the file was not detected. There are no further configs required on the infrastructure to make kured work.
Same problem in a kubeadm-deployed cluster with Flatcar stable. The file is there on all nodes but no reboots. It is a vanilla installation of kured on Kubernetes v1.27.2.
time="2023-08-11T10:01:27Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID" time="2023-08-11T10:01:27Z" level=info msg="Kubernetes Reboot Daemon: 1.13.2" time="2023-08-11T10:01:27Z" level=info msg="Node ID: XXX" time="2023-08-11T10:01:27Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock" time="2023-08-11T10:01:27Z" level=info msg="Lock TTL not set, lock will remain until being released" time="2023-08-11T10:01:27Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting" time="2023-08-11T10:01:27Z" level=info msg="PreferNoSchedule taint: " time="2023-08-11T10:01:27Z" level=info msg="Blocking Pod Selectors: []" time="2023-08-11T10:01:27Z" level=info msg="Reboot schedule: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC" time="2023-08-11T10:01:27Z" level=info msg="Reboot check command: [test -f /var/run/reboot-required] every 1h0m0s" time="2023-08-11T10:01:27Z" level=info msg="Reboot command: [/bin/systemctl reboot]"
Later I can see "reboot not required".
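For reference, the "Reboot check command" in the log above amounts to kured running roughly the following on the host (via nsenter into PID 1's mount namespace); exit code 0 means the sentinel file exists and a reboot is required, 1 means it does not:

/usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required
echo $?   # 0 when /var/run/reboot-required exists on the host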
Okay, thanks for this information. We will do a release later this month when Kubernetes releases its next minor version. #806 will be included there, which adds a warn-log for unexpected exit codes of the sentinel-check command. Maybe something is crashing here on your hosts; that would cause kured to avoid reboots.
In my case, the reboot finally started without changing anything. I do not know the reason. If I find something, I will tell you.
A new Flatcar release and the same problem. The /var/run/reboot-required file exists but there are no reboots. I am working with the 1.14.0 release and I cannot find a way to debug this. How does kured check whether that file exists?
The Grafana dashboard shows the nodes need to be rebooted.
@jorgelon Do you see the following warn-log in the kured-pod-logs?
sentinel command ended with unexpected exit code
This was added with 1.14.0. The problem with host commands is that we don't know what happens on the host: when the command crashes with an unexpected error or is blocked by some security tool (e.g. aquasec, ...), this warn-log is the only indicator. Maybe you can analyze your host logs for abnormalities around the check executions.
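One way to look for such abnormalities on a systemd-based host such as Flatcar (a sketch; adjust the time window to the hourly check schedule shown in the kured logs):

# inspect the host journal around the last sentinel check (sketch)
journalctl --since "1 hour ago" | grep -iE 'nsenter|reboot-required|audit'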
Nope @ckotzbauer, I do not see that log. I only see the same as in my Aug 11 comment.
Okay, that's sad. Then it will be very hard to figure out why the file is not detected. Kured logs the output of the "test -f" command and logs a warning when the exit code is something unexpected. So it seems that the command either crashes silently (maybe something is logged in the syslog) or just exits with the exit code which indicates that no reboot is required (even when the file exists).
We will land some bigger security improvements in 1.15.0: we will then mount the directory of the reboot file as a host-mount and do a "normal" existence check without "nsenter", which should work more smoothly. But 1.15.0 will be released after Kubernetes 1.29.0 (so in December).
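Once the sentinel directory is mounted from the host into the pod, the check reduces to a plain file test inside the container, without nsenter. A sketch of the equivalent check, assuming the host path is mounted at /sentinel:

# run inside the kured container (sketch, assuming the host-mount at /sentinel)
test -f /sentinel/reboot-required && echo "reboot required" || echo "reboot not required"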
This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).
Update with 1.15.0: no changes.
Inside a kured pod on a node with /var/run/reboot-required present:
/tmp # /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required
/tmp # echo $?
0
/tmp # wget -qO- 127.0.0.1:8080/metrics | grep ^kured
kured_reboot_required{node="XXXXXX"} 0
I have tried using:
https://raw.githubusercontent.com/kubereboot/kured/main/kured-rbac.yaml
https://raw.githubusercontent.com/kubereboot/kured/main/kured-ds-signal.yaml
Now I get:
wget -qO- 127.0.0.1:8080/metrics | grep ^kured
kured_reboot_required{node="XXX"} 1
/ # /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required
nsenter: can't open '/proc/1/ns/mnt': Permission denied
The kured-1.15.0-dockerhub.yaml does not mount anything from the host.
Still no reboots
Thanks @jorgelon for coming back to this thread. With 1.15.0 the nsenter on the host's /var/run/reboot-required is not required anymore. You have two options now:
- Use the kured-ds.yaml with the host-mount (or the kured-1.15.0-dockerhub.yaml - I updated the file just now and added the missing host-mount - thanks for the hint); this should be more stable than the nsenter. kured then checks for /sentinel/reboot-required.
- Use the new signal method; this also works with the host-mount and does not use nsenter at all.
Why did you try the nsenter command inside this pod? It is intended not to work.
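A quick way to verify which variant is actually deployed is to check whether the daemonset mounts the sentinel host path at all (a sketch, assuming the default daemonset name "kured" in the kube-system namespace):

# show volumes and volumeMounts of the kured daemonset (sketch)
kubectl -n kube-system get ds kured -o jsonpath='{.spec.template.spec.volumes}{"\n"}'
kubectl -n kube-system get ds kured -o jsonpath='{.spec.template.spec.containers[0].volumeMounts}{"\n"}'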
Right now I am using the Helm chart with the default values.yaml to see if I get different results. I have 4 Flatcar nodes; only 2 need a reboot, and the pods in the daemonset show the correct result with:
wget -qO- 127.0.0.1:8080/metrics | grep ^kured
The response on the nodes that need a reboot is 1, and test -f /sentinel/reboot-required returns 0.
But nothing happens: no reboot, no log, no annotations. I keep investigating.
My question is how /bin/systemctl reboot is performed if that binary does not exist in the kured pods.
The binary is not called inside the pod; it's called with nsenter on the host.
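Roughly, this is equivalent to entering the host's mount namespace from the privileged kured pod and running systemctl there (a sketch; this is why the binary only needs to exist on the host, not in the container image):

# what the reboot command effectively amounts to on the host (sketch)
/usr/bin/nsenter -m/proc/1/ns/mnt -- /bin/systemctl reboot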
Does the problem persist with the "signal" method and the helm-chart?
Does the pod still write "Reboot not required"?
Same problem here with the kured-5.5.0 chart, but it's working on another cluster where I have ghcr.io/kubereboot/kured:1.14.0 installed via manifest file.