kilo
Repeated attempts to reconcile mesh network
I have an issue where, when I connect an outside peer (e.g. my laptop) to the cluster, kilo sees that the configurations aren't the same and recreates the mesh to reconcile the differences. However, the config is never as expected, so kilo constantly attempts to reconcile, killing the network every ~30 seconds.
I'm going to keep debugging, but I created this issue just in case you know what's up before I spend time here.
I added some prints to see what was going on:
level.Info(logger).Log("reason", "peer endpoints", "c", c, "b", b)
Turns out my laptop peer, 10.5.0.1, has a configured endpoint in oldConf, b, but is null in the new conf, c, and that's what's causing kilo to reconcile the differences.
I think this is because your laptop's endpoint has been discovered since #146, and Kilo now wants to reapply your laptop's Peer spec, which has a nil endpoint: the actual endpoint has been added, so spec and reality have diverged. Let me check why I haven't noticed this with my laptop. Maybe this is wrong.
What is the Peer spec of your laptop? Did you set persistent-keepalive to 0? The endpoint is not updated if it is 0: https://github.com/squat/kilo/blob/05e8ded744207571389e208353209016c449ba79/pkg/mesh/topology.go#L275
Brilliant, that's exactly what's happening. I've added a persistentKeepalive and the network stays stable.
Defining a peer with a persistent keep alive of 0
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: laptop
spec:
  allowedIPs:
  - 10.5.0.1/32
  publicKey: SzhsHapvJy61urzHTAvx3Iu7ANlO+PGbsPy/mKY8U10=
  persistentKeepalive: 0
still sees kilo attempt to reconcile the mesh network (line 3 of the log below, ~30 seconds after apply):
{"caller":"mesh.go:344","component":"kilo","event":"add","level":"info","peer":{"PublicKey":[75,56,108,29,170,111,39,46,181,186,188,199,76,11,241,220,139,187,0,217,78,248,241,155,176,252,191,152,166,60,83,93],"Remove":false,"UpdateOnly":false,"PresharedKey":null,"PersistentKeepaliveInterval":0,"ReplaceAllowedIPs":false,"AllowedIPs":[{"IP":"10.5.0.1","Mask":"/////w=="}],"Endpoint":null,"Name":"laptop"},"ts":"2022-05-25T00:50:29.118108442Z"}
{"caller":"mesh.go:544","component":"kilo","diff":"number of peers: old=1, new=2","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-25T00:50:29.16908714Z"}
{"caller":"mesh.go:544","component":"kilo","diff":"peer endpoints: nil value","level":"info","msg":"WireGuard configurations are different","ts":"2022-05-25T00:50:59.040795773Z"}
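To match a log entry against a Peer spec, it helps to decode the fields by hand. In the log above the public key appears as a list of byte values and the mask as a base64 string, while the spec carries the key as base64 and the mask as a /32 suffix. The sketch below converts between the two using only the Go standard library; `decodeKey` and `maskBits` are hypothetical helper names, not part of Kilo.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net"
)

// decodeKey turns a base64-encoded WireGuard public key, as written in a
// Peer spec, into its raw bytes.
func decodeKey(s string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(s)
}

// maskBits decodes a base64-encoded netmask, as it appears in the JSON log,
// and returns its prefix length.
func maskBits(s string) (int, error) {
	raw, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return 0, err
	}
	ones, _ := net.IPMask(raw).Size()
	return ones, nil
}

func main() {
	key, err := decodeKey("SzhsHapvJy61urzHTAvx3Iu7ANlO+PGbsPy/mKY8U10=")
	if err != nil {
		panic(err)
	}
	// Compare these leading bytes with the "PublicKey" array in the log line.
	fmt.Println(len(key), key[:4])

	bits, err := maskBits("/////w==")
	if err != nil {
		panic(err)
	}
	fmt.Printf("10.5.0.1/%d\n", bits)
}
```

The leading bytes of the decoded laptop key line up with the start of the "PublicKey" array in the "add" event, which confirms that the event refers to the laptop peer.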
Is the intention of this code path to prevent mesh reconciliation if pka == nil || pka == 0? Or am I misunderstanding?
https://github.com/squat/kilo/blob/4be792ea543a9c2656574ec060b335c587244a3d/pkg/mesh/topology.go#L291
FWIW, I'm not bothered about keeping otherwise silent connections alive through NAT
Some mysterious behaviour I don't quite understand: I have a peer configuration called phone that is intended for my, well, uh, phone, which didn't cause mesh reconciliation (I'm tailing kilo's logs). My phone is connected to the same WiFi network; there's no cellular involved here.
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: laptop
spec:
  allowedIPs:
  - 10.5.0.1/32
  publicKey: SzhsHapvJy61urzHTAvx3Iu7ANlO+PGbsPy/mKY8U10=
  persistentKeepalive: 0
---
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: phone
spec:
  allowedIPs:
  - 10.5.0.2/32
  publicKey: urgVgSoHEwG5/7q0k5NpjWSBpAyxPfhvdT/v0zd561o=
  persistentKeepalive: 0
Taking a stab in the dark that something is up with the laptop peer, I created a third peer, dummy, and connected from my laptop. No good; there's mesh reconciliation there too.
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
  name: dummy
spec:
  allowedIPs:
  - 10.5.0.3/32
  publicKey: AzckRiPfM30PNbyX/kxCv59YlIfaoj/hVU7LPkxuuAw=
  persistentKeepalive: 0
Okay, so now, thinking something is up with the clients, I migrated the laptop peer config to my phone and connected from there. No good; reconciliation again. I tried dummy from my phone. Also reconciliation.
So now the reverse: export the phone peer and import it on my laptop. Strange: there's no reconciliation at all. For whatever reason the phone peer doesn't cause any undesired behaviour.
I moved the private key from dummy to phone, kept the rest the same; mesh reconciliation.
Reset phone back to the original keypair: no reconciliation.
🤯