IPv6 Neighbour-Discovery messages cause all pods to restart

Open DanTup opened this issue 2 years ago • 6 comments

I was recently setting up a Thread Border Router device on my network. Whenever the device joined the network, it caused all of the pods on my single-machine cluster to be restarted.

I found these in the logs:

Nov 24 19:36:01 mario microk8s.daemon-apiserver-kicker[722]: CSR change detected. Restarting the cluster-agent
Nov 24 19:36:01 mario microk8s.daemon-apiserver-kicker[722]: CSR change detected. Reconfiguring the kube-apiserver

It appeared some kind of network change was triggering this. I initially thought maybe the BR was running DHCP or something, but after posting in the BR tracker (https://github.com/espressif/esp-thread-br/issues/45) it seems this is Neighbour Discovery which allocates IPv6 addresses on the network to allow bi-directional IPv6 communication across the BR.

I was able to avoid the issue by setting --bind-address=0.0.0.0 (which I presume means only IPv4 addresses are considered, so the IPv6 address won't trigger this). However, while searching I found a lot of people complaining about random reboots, so I thought it may be useful to a) verify whether this default behaviour is working as intended, and b) have some notes on this cause (and a workaround/fix) recorded here that may help others diagnose the same issue in future.

DanTup avatar Nov 26 '23 09:11 DanTup

Hi @DanTup, thank you for raising the issue.

An alternative approach could be to create the no-cert-reissue lock file like this:

sudo touch /var/snap/microk8s/current/var/lock/no-cert-reissue

This has the same effect: it disables the MicroK8s automatic check for changes in the IP addresses of the host.

neoaggelos avatar Nov 27 '23 09:11 neoaggelos

Ah, that sounds a little better, I'll change to that.

Out of interest - what's the reason for this behaviour? At first I thought it was because changes in IP addresses could mean the service needs to bind to new addresses, but it seems like it's related to a certificate being regenerated rather than the network itself.

(and also, is it intended that IPv6 router advertisements could trigger this? perhaps it's necessary, but it feels quite aggressive to me!)

DanTup avatar Nov 27 '23 09:11 DanTup

For context, MicroK8s originally started as a dev-focused distribution. Dev machines might have DHCP addresses, leases might expire, or the machine might move between networks/offices, etc. This meant that the certificates generated for the first IP address would become invalid, and the cluster would go into a broken state.

A long time ago, a simple approach to tackle this was to check for the host IP addresses, then refresh the certificates if they ever changed, which "always works perfectly most of the time" ( :) ).

Essentially, any change in the machine networking would trigger this (which indeed might not be what you want). We have also seen this being an issue when using things like kube-vip, or bridged networking for VMs/containers.

Whether the no-cert-reissue lock file should be present by default is up for debate. IMHO, not breaking backwards compatibility is a sensible approach.

neoaggelos avatar Nov 27 '23 10:11 neoaggelos

> This meant that the certificates generated for the first IP address would be invalid, therefore the cluster would go into a broken state

I'm not very familiar with this, so apologies if this is a silly question (I'm just curious to understand), but under what conditions would this cause problems? Would a new IP address being added to a machine ever cause problems, or is it only changes (or more specifically, removals)?

Because if it's only removals, I wonder if the reissue could be done only if:

  • an IP address was removed; and
  • it wasn't an IP address that was added since the cert was last issued.

E.g., if an IP address is added and later removed (without changing any IPs that existed when the cert was first issued) and the cert was not re-issued, would that cause any problems at all?
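
The rule proposed above can be sketched in a few lines of Python. This is only an illustration of the idea, not MicroK8s code: the function name `should_reissue` and the plain lists of address strings are hypothetical, and the IPv6 address uses the documentation prefix 2001:db8::/32.

```python
import ipaddress


def should_reissue(cert_ips, current_ips):
    """Reissue only if an address that is in the certificate has left the host.

    Addresses added after the certificate was issued are ignored, so adding
    one and later removing it never triggers a reissue.
    """
    cert = {ipaddress.ip_address(a) for a in cert_ips}
    current = {ipaddress.ip_address(a) for a in current_ips}
    # Non-empty difference: some certificate address is no longer present.
    return bool(cert - current)


cert_ips = ["192.168.1.42"]

# An IPv6 address is added, then removed again: no reissue at either step.
print(should_reissue(cert_ips, ["192.168.1.42", "2001:db8::1234"]))
print(should_reissue(cert_ips, ["192.168.1.42"]))

# The original address is replaced by a new one: reissue.
print(should_reissue(cert_ips, ["192.168.1.54"]))
```

The set difference captures both bullet points at once: an address only counts as "removed" if it was part of the set the certificate was issued for.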

DanTup avatar Nov 27 '23 11:11 DanTup

Hiya, apologies for missing this, dropped the notification somehow.

> I'm not very familiar with this, so apologies if this is a silly question (I'm just curious to understand), but under what conditions would this cause problems? Would a new IP address being added to a machine ever cause problems, or is it only changes (or more specifically, removals)?

The go-to example scenario: A developer has a laptop running on their home network (let's say with address 192.168.1.42/24) and installs MicroK8s. They shut down the laptop on Friday, then on Monday the DHCP lease is expired, and the router now gives a different IP, e.g. 192.168.1.54/24. This means that the certificate IP SANs are now all broken (as the IP address has changed).

To avoid this problem, MicroK8s by default performs a check to see if any of the IP addresses of the machine have changed (which includes lots of false positives, but works in a mostly frictionless manner).
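
To make the false positives concrete, here is a minimal Python sketch of an "any change" check. It is an assumption about the shape of the logic, not the actual MicroK8s implementation: the function name `addresses_changed` is hypothetical, and the IPv6 address is a made-up one from the documentation prefix standing in for an address gained through Neighbour Discovery.

```python
import ipaddress


def addresses_changed(previous, current):
    """Return True if the host's address set differs in any way."""
    prev = {ipaddress.ip_address(a) for a in previous}
    curr = {ipaddress.ip_address(a) for a in current}
    # Additions and removals are treated the same: any difference counts.
    return prev != curr


before = ["192.168.1.42"]
# Hypothetical: the host gains an IPv6 address, the IPv4 address is untouched.
after = ["192.168.1.42", "2001:db8::1234"]

print(addresses_changed(before, after))   # True
print(addresses_changed(before, before))  # False
```

Under this model, a newly added address is reported as a change even though every address that existed before is still present, which matches the restarts described at the top of this issue.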

> E.g., if an IP address is added and later removed (without changing any IPs that existed when the cert was first issued) and the cert was not re-issued, would that cause any problems at all?

Indeed that would not be an issue. As long as you are not using the Kubernetes services over an IP that is not included in the certificate IP SANs, there should not be any problems.
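
As a rough illustration of that condition, the check below tests whether the address you connect to appears in a list of certificate IP SANs. The helper name `covered_by_sans` and the SAN list are hypothetical (the list is not read from a real certificate); the two 192.168.1.x addresses are the ones from the laptop example.

```python
import ipaddress


def covered_by_sans(address, ip_sans):
    """Return True if `address` matches one of the certificate's IP SANs."""
    target = ipaddress.ip_address(address)
    return any(target == ipaddress.ip_address(san) for san in ip_sans)


# Hypothetical SAN list for a certificate issued while the host was .42.
ip_sans = ["127.0.0.1", "192.168.1.42"]

print(covered_by_sans("192.168.1.42", ip_sans))  # True
print(covered_by_sans("192.168.1.54", ip_sans))  # False
```

Only the second case, where the services are reached over an address missing from the SANs, would need a reissued certificate.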

neoaggelos avatar Dec 27 '23 16:12 neoaggelos

I've run into this exact issue. I think it's fair enough that this is an ingrained behaviour that would be difficult to change, although the method for disabling it could be more obvious, and the documentation should be updated to make it clear that running IPv6 on the nodes introduces major instability.

bendews avatar Jan 17 '24 10:01 bendews