Linkerd-destination: unable to connect to validator
What is the issue?
Hi
After installing linkerd-cni, the Linkerd pods are unable to start due to the following error:
Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Host is unreachable (os error 113)
How can it be reproduced?
Install linkerd-cni and Linkerd on a Flatcar Kubernetes 1.28.3 cluster with Cilium as the CNI.
Logs, error output, etc
2023-11-09T11:42:46.686000Z INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2023-11-09T11:42:46.686030Z DEBUG linkerd_network_validator: token="KXyajGp2VZRdLXMQEEAqBJoJUeNIUUUhajU7NmAqDTmCn9fcj9GyrFcDdlGURTo\n"
2023-11-09T11:42:46.686037Z INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2023-11-09T11:42:47.586457Z ERROR linkerd_network_validator: Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Host is unreachable (os error 113)
2023-11-09T11:42:47.586481Z ERROR linkerd_network_validator: error=Host is unreachable (os error 113)
output of linkerd check -o short
linkerd-existence
-----------------
- No running pods for "linkerd-destination" ^C
Environment
- Kubernetes-version: 1.28.3
- Cilium version: 1.14.3
- Linkerd-cni-version: stable-2.14.3
- Linkerd-version: stable-2.14.3
- OS: Flatcar Openstack 3510.2.8
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
maybe
@matthiasdeblock hi, it sounds like the validator is detecting an erroneous configuration in your network stack. The validator attempts to connect to a server it creates itself in order to test that iptables destination rewriting works as expected. I see that you're using Cilium. We have a cluster configuration section in our docs aimed at getting Linkerd to work with Cilium; their socket-level load balancing capability can sometimes mess up routing for other services. Can you check if that's affecting you here?
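For reference, that setting is usually applied through the Cilium Helm chart; a minimal sketch, assuming the socketLB.hostNamespaceOnly value of recent Cilium charts and a kube-system install (verify the key against your chart version):
# Keep Cilium's socket-level LB away from host-namespace sockets so it doesn't
# translate destinations before linkerd-cni's iptables rules can see them.
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set socketLB.hostNamespaceOnly=true
# The setting should then show up in the agent ConfigMap:
kubectl -n kube-system get configmap cilium-config -o yaml | grep bpf-lb-sock-hostns-only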
Hi @mateiidavid I did set 'bpf-lb-sock-hostns-only: "true"' but that did not fix the issue here. Without linkerd-cni everything is working fine.
If you think linkerd-cni is the culprit, I'd suggest having a look at some logs. Specifically:
- Does the installer (linkerd-cni daemonset pod) report anything?
- Can you get access to kubelet logs to verify whether plugin runs have been unsuccessful?
- Does your CNI host configuration file contain linkerd-cni's configuration?
I'd perhaps start with the last one if it's easy. It might be that the configuration wasn't appended properly for some reason.
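A rough sketch of those checks (namespace, resource names and paths are assumptions based on the chart defaults and the file names that appear later in this thread; kubelet is assumed to run as a systemd unit):
# 1. Installer logs from the linkerd-cni daemonset
kubectl -n linkerd-cni logs daemonset/linkerd-cni --tail=100
# 2. Kubelet logs on the affected node, looking for failed CNI plugin invocations
journalctl -u kubelet --no-pager | grep -i cni | tail -n 50
# 3. Check whether linkerd-cni was appended to the host CNI configuration
grep -A3 linkerd-cni /etc/cni/net.d/05-cilium.conflist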
I'll give it a retry next week. I did check all these but I'll give it another look:
- The cni pods did not report any issues
- The plugin installs correctly and is up and running within a couple of seconds
- The CNI config file mentioned the location of the Cilium CNI plugin conf.
I'll verify this by the beginning of next week.
Regards
@matthiasdeblock Any joy retrying this?
@matthiasdeblock Happy new year! Still curious if you got a chance to retry things? 🙂
Hi, sorry for the delay; we will be testing again in the upcoming days. Regards, Matthias
Hi,
As a colleague of @matthiasdeblock, I'd like to give some extra info about this issue. Logs of the CNI pod:
[2024-04-03 09:12:34] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2024-04-03 09:12:34] Installing CNI configuration for /host/etc/cni/net.d/05-cilium.conflist
[2024-04-03 09:12:34] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
"k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
"k8s_api_root": "https://10.12.0.1:__KUBERNETES_SERVICE_PORT__",
[2024-04-03 09:12:34] CNI config: {
"name": "linkerd-cni",
"type": "linkerd-cni",
"log_level": "info",
"policy": {
"type": "k8s",
"k8s_api_root": "https://10.12.0.1:443",
"k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
},
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
},
"linkerd": {
"incoming-proxy-port": 4143,
"outgoing-proxy-port": 4140,
"proxy-uid": 2102,
"ports-to-redirect": [],
"inbound-ports-to-ignore": ["4191","4190"],
"simulate": false,
"use-wait-flag": false
}
}
[2024-04-03 09:12:34] Created CNI config /host/etc/cni/net.d/05-cilium.conflist
Setting up watches.
Watches established.
Looks like the config doesn't get written to the file; contents of /etc/cni/net.d/05-cilium.conflist:
{
"cniVersion": "0.3.1",
"name": "cilium",
"plugins": [
{
"type": "cilium-cni",
"enable-debug": false,
"log-file": "/var/run/cilium/cilium-cni.log"
}
]
}
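For comparison, if the append had succeeded, the same file should contain linkerd-cni chained after cilium-cni, roughly like this (a sketch based on the config the installer printed above, with the policy/token fields omitted):
cat /etc/cni/net.d/05-cilium.conflist
{
  "cniVersion": "0.3.1",
  "name": "cilium",
  "plugins": [
    {
      "type": "cilium-cni",
      "enable-debug": false,
      "log-file": "/var/run/cilium/cilium-cni.log"
    },
    {
      "name": "linkerd-cni",
      "type": "linkerd-cni",
      "log_level": "info",
      "kubernetes": { "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig" },
      "linkerd": {
        "incoming-proxy-port": 4143,
        "outgoing-proxy-port": 4140,
        "proxy-uid": 2102,
        "ports-to-redirect": [],
        "inbound-ports-to-ignore": ["4191","4190"],
        "simulate": false,
        "use-wait-flag": false
      }
    }
  ]
}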
Hi
As our cluster is air-gapped, I noticed that 1.1.1.1 isn't a usable connect address for the validator. I've fixed this in our Helm chart and we are now getting a bit further, but still running into an error:
flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-56f777c8b6-8sw9c -c linkerd-network-validator
2024-04-05T05:33:28.979251Z INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-04-05T05:33:28.979293Z DEBUG linkerd_network_validator: token="y3SgDWabwG6jtxhXFrYYBB4cSHHiSKjbSsaDV29f89tkwrWjmJXtvMz9lmyWb5p\n"
2024-04-05T05:33:28.979308Z INFO linkerd_network_validator: Connecting to <kubernetes_api>:6443
2024-04-05T05:33:28.981087Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.70.197:57332
2024-04-05T05:33:38.980507Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s
(using the Kubernetes API IP as the connect address)
So it now connects but is still throwing an error.
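For reference, the change was roughly this override of the linkerd-control-plane chart values (key name assumed from the chart's networkValidator block, and release name/namespace assumed; double-check against your values.yaml):
# point the validator at an address that is actually routable from the cluster;
# iptables should intercept the connection before it ever reaches this address
helm upgrade linkerd-control-plane linkerd/linkerd-control-plane -n linkerd --reuse-values \
  --set networkValidator.connectAddr="<kubernetes_api>:6443"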
Regards Matthias
Thanks @matthiasdeblock for circling back. With version 2.14.9 we have added a cni-repair-controller component that should detect race conditions between the cluster's cni and linkerd-cni. You can enable it via the linkerd2-cni chart value repairController.enabled=true. If that doesn't do the trick, there's another fix in linkerd/linkerd2-proxy-init#360 that might work for you, so please let me know and I can provide an image to test that out.
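Enabling it looks roughly like this, assuming the linkerd2-cni chart was installed as a Helm release named linkerd2-cni in the linkerd-cni namespace:
# enable the cni-repair-controller that runs alongside the linkerd-cni daemonset
helm upgrade linkerd2-cni linkerd/linkerd2-cni -n linkerd-cni --reuse-values \
  --set repairController.enabled=true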
Hi
The cni-repair-controller just keeps restarting the linkerd control plane. This isn't fixing the issue.
You have linked https://github.com/linkerd/linkerd2-proxy-init/pull/362 as well; could this be the issue we are running into?
Regards Matthias
I linked linkerd/linkerd2-proxy-init#362 by mistake. That should be unrelated unless you're using native sidecars too.
I've published the image ghcr.io/alpeb/cni-plugin:modify
with the change from linkerd/linkerd2-proxy-init#360. It would be great if you could give that a try.
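Pointing the chart at that test image would look something like this (image value names are assumptions based on the linkerd2-cni chart; check your values.yaml, and the release name/namespace are assumed as well):
helm upgrade linkerd2-cni linkerd/linkerd2-cni -n linkerd-cni --reuse-values \
  --set image.name=ghcr.io/alpeb/cni-plugin \
  --set image.version=modify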
Hi, I have tested the image you provided, but it still throws the same error:
flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-6c49f479d8-946ww -c linkerd-network-validator
2024-04-19T09:51:52.016640Z INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-04-19T09:51:52.016683Z DEBUG linkerd_network_validator: token="CMiU50KsdnCBztqVH5xUXcVHqbfhqE960BJEpwoj5GTJLiftg9qQJ3JmT6KLssx\n"
2024-04-19T09:51:52.016689Z INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-04-19T09:51:52.018141Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.71.143:44382
2024-04-19T09:52:02.017854Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s
@mateiidavid , any news on this one?
@matthiasdeblock sorry, I think this was closed automatically when I hit the merge button on the PR above. Since it did not fix your issue, I'm going to re-open this.
Hi @mateiidavid Any news on this one?
I have changed the timeout from 10s to 60s and now I am getting a different error:
flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-f7b89b9db-qjxb7 -c linkerd-network-validator -f
2024-06-06T09:52:05.455607Z INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-06-06T09:52:05.455666Z DEBUG linkerd_network_validator: token="8NAaWTB0bQ7E5FcrUPpyWs8OOdpq1xnlMJElrWZ9RrN3ssRWdPSvVVBDwnykGOQ\n"
2024-06-06T09:52:05.455762Z INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-06-06T09:52:05.456775Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.69.107:47754
2024-06-06T09:52:37.458317Z DEBUG connect: linkerd_network_validator: Read message from server bytes=0
2024-06-06T09:52:37.458513Z DEBUG linkerd_network_validator: data="" size=0
2024-06-06T09:52:37.458543Z ERROR linkerd_network_validator: error=expected client to receive "8NAaWTB0bQ7E5FcrUPpyWs8OOdpq1xnlMJElrWZ9RrN3ssRWdPSvVVBDwnykGOQ\n"; got "" instead
So it is still connecting to the same address, 172.24.214.93:6443, which is our Kubernetes API, but it is now throwing a different error...
Thank you! Regards Matthias
Hi
I have upgraded Linkerd to the latest edge-24.5.5 and the CNI plugin to 1.5.0, and also set the timeout to 30s. Still the same issue:
flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-749d567f64-rnmhl -c linkerd-network-validator -f
2024-06-06T12:02:02.055672Z INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-06-06T12:02:02.055715Z DEBUG linkerd_network_validator: token="FxnawK939yIxs5SAvEnQ9ii4QLecvKoWZRgGRMgOcrzwwRaWCyIbaxzorU79K5G\n"
2024-06-06T12:02:02.055729Z INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-06-06T12:02:02.057521Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.69.211:47982
2024-06-06T12:02:32.057580Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=30s
Hi
I've been looking into this a bit more myself and found the issue. It seems like Cilium needed this config:
cni.exclusive=false (Helm value)
cni-exclusive: "false" (the resulting entry in the cilium-config ConfigMap)
What this setting controls (from the Cilium docs): when exclusive, Cilium takes ownership of the /etc/cni/net.d directory on the node and renames all non-Cilium CNI configurations to *.cilium_bak, which ensures no Pods can be scheduled using other CNI plugins during Cilium agent downtime. In our case that meant the appended linkerd-cni configuration kept being removed; setting it to false leaves it in place.
Source: https://docs.cilium.io/en/stable/helm-reference/
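Applying it through the Cilium Helm chart looks roughly like this (release name and namespace assumed):
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set cni.exclusive=false
# verify it landed in the agent config
kubectl -n kube-system get configmap cilium-config -o yaml | grep cni-exclusive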
Thanks for the feedback, @matthiasdeblock! I've confirmed the fix and pushed some updates to our docs.