
Linkerd-destination: unable to connect to validator

Open matthiasdeblock opened this issue 1 year ago • 19 comments

What is the issue?

Hi

After installing linkerd-cni, the Linkerd pods are unable to start due to the following error:

Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Host is unreachable (os error 113)

How can it be reproduced?

Install linkerd-cni and linkerd on a flatcar kubernetes 1.28.3 cluster with cilium as CNI.

Logs, error output, etc

2023-11-09T11:42:46.686000Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2023-11-09T11:42:46.686030Z DEBUG linkerd_network_validator: token="KXyajGp2VZRdLXMQEEAqBJoJUeNIUUUhajU7NmAqDTmCn9fcj9GyrFcDdlGURTo\n"
2023-11-09T11:42:46.686037Z  INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2023-11-09T11:42:47.586457Z ERROR linkerd_network_validator: Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Host is unreachable (os error 113)
2023-11-09T11:42:47.586481Z ERROR linkerd_network_validator: error=Host is unreachable (os error 113)

output of linkerd check -o short

linkerd-existence
-----------------
- No running pods for "linkerd-destination" ^C

Environment

  • Kubernetes-version: 1.28.3
  • Cilium version: 1.14.3
  • Linkerd-cni-version: stable-2.14.3
  • Linkerd-version: stable-2.14.3
  • OS: Flatcar Openstack 3510.2.8

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe

matthiasdeblock avatar Nov 09 '23 12:11 matthiasdeblock

@matthiasdeblock hi, it sounds like the validator is detecting an erroneous configuration in your network stack. The validator creates a server and then attempts to connect to it, in order to test that iptables destination rewriting works as expected. I see that you're using Cilium. We have a cluster configuration section in our docs aimed at getting Linkerd to work with Cilium. Cilium's socket-level load balancing can sometimes break routing for other services. Can you check whether that's affecting you here?
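For context, the check the validator performs can be sketched roughly like this (a simplified Python illustration, not Linkerd's actual Rust implementation; the loopback dial here stands in for the iptables redirect, and all names are illustrative):

```python
# Sketch of linkerd-network-validator's check: start a listener, dial an
# unrelated remote address, and expect iptables to redirect that dial back to
# the listener, which echoes a random token. If the client reads the token
# back, the redirect rules work. Here the redirect is simulated by dialing
# loopback directly, so the round trip always succeeds.
import secrets
import socket
import threading

def run_validator() -> bool:
    token = secrets.token_urlsafe(48) + "\n"
    # The real tool listens on 0.0.0.0:4140; an ephemeral loopback port is
    # used here so the sketch can run anywhere.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]

    def serve() -> None:
        conn, _ = listener.accept()
        conn.sendall(token.encode())  # server side writes the token
        conn.close()

    threading.Thread(target=serve, daemon=True).start()

    # In the real validator this connect targets e.g. 1.1.1.1:20001 and only
    # reaches the listener if iptables rewrote the destination address.
    client = socket.create_connection(("127.0.0.1", port), timeout=10)
    received = b""
    while not received.endswith(b"\n"):
        chunk = client.recv(1024)
        if not chunk:
            break  # empty read: the 'got "" instead' failure mode
        received += chunk
    client.close()
    return received.decode() == token

if __name__ == "__main__":
    print(run_validator())
```

A connection that succeeds but reads back nothing (or the wrong bytes) means traffic reached the real destination rather than being rewritten to the local listener.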

mateiidavid avatar Nov 09 '23 14:11 mateiidavid

Hi @mateiidavid I did set 'bpf-lb-sock-hostns-only: "true"' but that did not fix the issue here. Without linkerd-cni everything is working fine.

matthiasdeblock avatar Nov 09 '23 14:11 matthiasdeblock

If you think linkerd-cni is the culprit, I'd suggest having a look at some logs. Specifically:

  • Does the installer (linkerd-cni daemonset pod) report anything?
  • Can you get access to kubelet logs to verify whether any plugin runs were unsuccessful?
  • Does your CNI host configuration file contain linkerd-cni's configuration?

I'd perhaps start with the last one if it's easy. It might be that the configuration wasn't appended properly for some reason.

mateiidavid avatar Nov 10 '23 10:11 mateiidavid

I'll give it a retry next week. I did check all these but I'll give it another look:

  • The cni pods did not report any issues
  • The plugin installs correctly and is up and running within a couple of seconds
  • The CNI config file mentioned the location of the Cilium CNI plugin conf.

I'll verify this by the beginning of next week.

Regards

matthiasdeblock avatar Nov 20 '23 16:11 matthiasdeblock

@matthiasdeblock Any joy retrying this?

kflynn avatar Dec 18 '23 21:12 kflynn

@matthiasdeblock Happy new year! Still curious if you got a chance to retry things? 🙂

kflynn avatar Jan 04 '24 15:01 kflynn

Hi Sorry for the delay, we will be testing again in the upcoming days. Regards Matthias

matthiasdeblock avatar Mar 19 '24 14:03 matthiasdeblock

Hi,

As a colleague of @matthiasdeblock, I'd like to give some extra info about this issue. Logs of the CNI pod:

[2024-04-03 09:12:34] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2024-04-03 09:12:34] Installing CNI configuration for /host/etc/cni/net.d/05-cilium.conflist
[2024-04-03 09:12:34] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://10.12.0.1:__KUBERNETES_SERVICE_PORT__",
[2024-04-03 09:12:34] CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://10.12.0.1:443",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": 4143,
    "outgoing-proxy-port": 4140,
    "proxy-uid": 2102,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["4191","4190"],
    "simulate": false,
    "use-wait-flag": false
  }
}
[2024-04-03 09:12:34] Created CNI config /host/etc/cni/net.d/05-cilium.conflist
Setting up watches.
Watches established.

It looks like the config doesn't get written to the file. Contents of /etc/cni/net.d/05-cilium.conflist:

{
  "cniVersion": "0.3.1",
  "name": "cilium",
  "plugins": [
    {
       "type": "cilium-cni",
       "enable-debug": false,
       "log-file": "/var/run/cilium/cilium-cni.log"
    }
  ]
}
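For comparison, the merge that should have happened is roughly the following (a hedged Python sketch of the expected behaviour, not the actual linkerd-cni install script; function and variable names are illustrative):

```python
import json

def append_linkerd_cni(conflist: dict, linkerd_conf: dict) -> dict:
    """Sketch of the merge the installer is expected to perform: the
    linkerd-cni plugin entry is appended to an existing .conflist's
    "plugins" array so it runs after the primary CNI (Cilium here)."""
    merged = dict(conflist)
    plugins = list(merged.get("plugins", []))
    if not any(p.get("type") == "linkerd-cni" for p in plugins):
        plugins.append(linkerd_conf)
    merged["plugins"] = plugins
    return merged

# Minimal stand-ins for the two configs shown above in this thread.
cilium = {
    "cniVersion": "0.3.1",
    "name": "cilium",
    "plugins": [{"type": "cilium-cni", "enable-debug": False,
                 "log-file": "/var/run/cilium/cilium-cni.log"}],
}
linkerd = {"name": "linkerd-cni", "type": "linkerd-cni", "log_level": "info"}

merged = append_linkerd_cni(cilium, linkerd)
print(json.dumps([p["type"] for p in merged["plugins"]]))
# → ["cilium-cni", "linkerd-cni"]
```

In the conflist above, only the cilium-cni entry is present, so the linkerd-cni entry was either never appended or was removed again afterwards.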

Driesvanherpe avatar Apr 03 '24 09:04 Driesvanherpe

Hi

As our cluster is air-gapped, I noticed that 1.1.1.1 is not a valid connection address for us. I've fixed this in our Helm chart and we're now getting a bit further, but still running into an error:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-56f777c8b6-8sw9c -c linkerd-network-validator

2024-04-05T05:33:28.979251Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-04-05T05:33:28.979293Z DEBUG linkerd_network_validator: token="y3SgDWabwG6jtxhXFrYYBB4cSHHiSKjbSsaDV29f89tkwrWjmJXtvMz9lmyWb5p\n"
2024-04-05T05:33:28.979308Z  INFO linkerd_network_validator: Connecting to <kubernetes_api>:6443
2024-04-05T05:33:28.981087Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.70.197:57332
2024-04-05T05:33:38.980507Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

(using the kubernetes api IP to connect to)

So it now connects but is still throwing an error.

Regards Matthias

matthiasdeblock avatar Apr 05 '24 05:04 matthiasdeblock

Thanks @matthiasdeblock for circling back. With version 2.14.9 we have added a cni-repair-controller component that should detect race conditions between the cluster's CNI and linkerd-cni. You can enable it via the linkerd2-cni chart value repairController.enabled=true. If that doesn't do the trick, there's another fix in linkerd/linkerd2-proxy-init#360 that might work for you, so please let me know and I can provide an image to test that out.
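For reference, the chart value named above can be set in a values file along these lines (a sketch; only repairController.enabled is from this thread, everything else about your release is an assumption):

```yaml
# values for the linkerd2-cni chart
repairController:
  enabled: true
```

or equivalently with `--set repairController.enabled=true` when installing or upgrading the chart.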

alpeb avatar Apr 10 '24 16:04 alpeb

Hi

The cni-repair-controller just keeps restarting the linkerd control plane. This isn't fixing the issue.

You also linked https://github.com/linkerd/linkerd2-proxy-init/pull/362; could this be the issue we are running into?

Regards Matthias

matthiasdeblock avatar Apr 11 '24 11:04 matthiasdeblock

I linked linkerd/linkerd2-proxy-init#362 by mistake. That should be unrelated unless you're using native sidecars too. I've published the image ghcr.io/alpeb/cni-plugin:modify with the change from linkerd/linkerd2-proxy-init#360. It would be great if you could give that a try.

alpeb avatar Apr 11 '24 13:04 alpeb

Hi, I have tested the image you provided, but it still throws the same error:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-6c49f479d8-946ww -c linkerd-network-validator
2024-04-19T09:51:52.016640Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-04-19T09:51:52.016683Z DEBUG linkerd_network_validator: token="CMiU50KsdnCBztqVH5xUXcVHqbfhqE960BJEpwoj5GTJLiftg9qQJ3JmT6KLssx\n"
2024-04-19T09:51:52.016689Z  INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-04-19T09:51:52.018141Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.71.143:44382
2024-04-19T09:52:02.017854Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

matthiasdeblock avatar Apr 19 '24 09:04 matthiasdeblock

@mateiidavid , any news on this one?

matthiasdeblock avatar May 03 '24 10:05 matthiasdeblock

@matthiasdeblock sorry, I think this was closed automatically when I hit the merge button on the PR above. Since it did not fix your issue, I'm going to re-open this.

mateiidavid avatar May 03 '24 10:05 mateiidavid

Hi @mateiidavid Any news on this one?

I have changed the timeout from 10s to 60s and now I am getting a different error:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-f7b89b9db-qjxb7 -c linkerd-network-validator -f
2024-06-06T09:52:05.455607Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-06-06T09:52:05.455666Z DEBUG linkerd_network_validator: token="8NAaWTB0bQ7E5FcrUPpyWs8OOdpq1xnlMJElrWZ9RrN3ssRWdPSvVVBDwnykGOQ\n"
2024-06-06T09:52:05.455762Z  INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-06-06T09:52:05.456775Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.69.107:47754
2024-06-06T09:52:37.458317Z DEBUG connect: linkerd_network_validator: Read message from server bytes=0
2024-06-06T09:52:37.458513Z DEBUG linkerd_network_validator: data="" size=0
2024-06-06T09:52:37.458543Z ERROR linkerd_network_validator: error=expected client to receive "8NAaWTB0bQ7E5FcrUPpyWs8OOdpq1xnlMJElrWZ9RrN3ssRWdPSvVVBDwnykGOQ\n"; got "" instead

So it is still connecting to the same address, 172.24.214.93:6443 (our Kubernetes API), but it is now throwing a different error...

Thank you! Regards Matthias

matthiasdeblock avatar Jun 06 '24 09:06 matthiasdeblock

Hi

I have upgraded Linkerd to the latest edge-24.5.5 and the CNI plugin to 1.5.0, and also set the timeout to 30s. Still the same issue:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-749d567f64-rnmhl -c linkerd-network-validator -f
2024-06-06T12:02:02.055672Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-06-06T12:02:02.055715Z DEBUG linkerd_network_validator: token="FxnawK939yIxs5SAvEnQ9ii4QLecvKoWZRgGRMgOcrzwwRaWCyIbaxzorU79K5G\n"
2024-06-06T12:02:02.055729Z  INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-06-06T12:02:02.057521Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.69.211:47982
2024-06-06T12:02:32.057580Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=30s

matthiasdeblock avatar Jun 06 '24 12:06 matthiasdeblock

Hi

I've been looking into this myself a bit more and I found the issue. It seems Cilium needed this config:

cni.exclusive=false

cni-exclusive: "false"

What this setting means (quoting the Cilium docs on cni.exclusive): make Cilium take ownership over the /etc/cni/net.d directory on the node, renaming all non-Cilium CNI configurations to *.cilium_bak. This ensures no Pods can be scheduled using other CNI plugins during Cilium agent downtime. Setting it to false disables that behaviour, so the linkerd-cni configuration is left in place.

Source: https://docs.cilium.io/en/stable/helm-reference/
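In Helm-values form, the fix amounts to the following fragment (the cni.exclusive key and the rendered cni-exclusive ConfigMap entry are from this thread and the Cilium Helm reference; the surrounding layout is a sketch):

```yaml
# Cilium Helm values: stop Cilium from renaming other CNI configs
cni:
  exclusive: false
```

which renders into the cilium-config ConfigMap as the cni-exclusive: "false" entry shown above.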

matthiasdeblock avatar Jun 07 '24 12:06 matthiasdeblock

Thanks for the feedback @matthiasdeblock ! I've confirmed the fix and pushed some updates to our docs.

alpeb avatar Jun 25 '24 15:06 alpeb