Random webhook failures with EOF

Open ansarhun opened this issue 2 years ago • 18 comments

Environmental Info: K3s Version:

k3s version v1.23.8+k3s1 (53f2d4e7) go version go1.17.5

Node(s) CPU architecture, OS, and Version:

Linux k8s-controller-1 5.15.0-39-generic #42-Ubuntu SMP Thu Jun 9 23:42:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server, 3 agent

Describe the bug:

When FluxCD tries to do a dry-run on resources for which an admission webhook is configured (e.g. cert-manager), the dry-run fails with an EOF error. On the webhook side it looks like the incoming connection is not a proper HTTPS/TLS connection: first record does not look like a TLS handshake
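
For what it's worth, the same call path can be exercised without FluxCD by doing a server-side dry-run with kubectl against any resource backed by an admission webhook; the manifest name below is only a placeholder:

# A server-side dry-run makes the API server call the configured admission webhooks,
# so the intermittent EOF shows up here as well (certificate.yaml is just a placeholder).
kubectl apply --dry-run=server -f certificate.yaml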

Steps To Reproduce:

  • Installed K3s:
        k3sup install --ip {{ ansible_default_ipv4.address }}
        --ssh-key {{ lookup('env','HOME') }}/.ssh/id_ed25519
        --user {{ ansible_ssh_user }}
        --k3s-extra-args '
          --disable traefik 
          --disable servicelb
          --disable-cloud-controller
          --node-taint CriticalAddonsOnly=true:NoExecute
        '
        --k3s-channel stable

        k3sup join --ip {{ ansible_default_ipv4.address }}
        --server-ip {{  hostvars[groups['master'][0]].ansible_default_ipv4.address }}
        --ssh-key {{ lookup('env','HOME') }}/.ssh/id_ed25519
        --user {{ ansible_ssh_user }}
        --k3s-channel stable

Expected behavior:

Dry run succeeds

Actual behavior:

Dry run fails sometimes with the same configuration

Additional context / logs:

Dry run error:

dry-run failed, reason: InternalError, error: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://kube-prometheus-stack-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF

k3s logs from journalctl

k3s[682]: W0710 08:03:32.023261     682 dispatcher.go:195] Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.network.svc:443/mutate?timeout=10s": EOF

cert-manager webhook pod logs

I0710 08:03:32.022971       1 logs.go:59] http: TLS handshake error from 10.42.0.0:50900: tls: first record does not look like a TLS handshake

Backporting

  • [ ] Needs backporting to older releases

ansarhun avatar Jul 10 '22 15:07 ansarhun

Note that I found this issue by searching for first record does not look like a TLS handshake, which is what I observe when debugging #5633. So it may be worth trying the workarounds suggested there.

One person resolved their issue by passing --egress-selector-mode=agent, while I had success with --egress-selector-mode=disabled, and observed that setting it is needed when using a version newer than v1.22.10+k3s1, v1.23.7+k3s1, or v1.24.1+k3s1.
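
In case it is useful, this is roughly how the workaround can be applied; the commands are only a sketch, assuming the standard get.k3s.io install script and a systemd-managed server:

# Either pass the flag when (re)installing the server...
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='server --egress-selector-mode=disabled' sh -
# ...or add 'egress-selector-mode: disabled' to /etc/rancher/k3s/config.yaml and restart:
sudo systemctl restart k3s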

consideRatio avatar Jul 12 '22 10:07 consideRatio

@consideRatio thank you! I will try the egress option to see if that helps

ansarhun avatar Jul 12 '22 14:07 ansarhun

Try updating to a newer release. There were a few issues discovered with the egress-selector support which have since been addressed.
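
If you installed with the get.k3s.io script, re-running it pinned to a channel is the quickest way to pick up the latest patch release; this is only a sketch, using the v1.23 channel as an example:

# Re-running the installer upgrades k3s in place to the newest release in the given channel.
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=v1.23 sh -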

brandond avatar Jul 12 '22 17:07 brandond

@brandond I can reproduce the intermittent issues in the latest three minor releases.

Here are three intermittent failures including logs for the latest versions of k3s. The logs are available in a later "step" under journalctl -xe.

EDIT: the same issue is observed with the RC1 for 1.24.3, 1.23.9, and 1.22.12.

consideRatio avatar Jul 12 '22 19:07 consideRatio

Same result here. After upgrading to the latest stable version (v1.23.8+k3s2) the issue still occurs. As for the egress-selector-mode option, agent mode did not work for me either, but disabled seems to work, or at least I haven't noticed any new errors in the last 12 hours.

ansarhun avatar Jul 13 '22 04:07 ansarhun

We're about to release RCs for today's upstream patch releases, which will include some updates to the egress-selector code and remotedialer library. I'm curious if this will address the issues you're seeing.

brandond avatar Jul 13 '22 22:07 brandond

Looking forward to the RCs; I'm having the same issue as well on 1.24.2+k3s1 and 1.24.2+k3s2. I'm observing both the first record does not look like a TLS handshake message in the k3s logs and the error dialing backend: EOF message from kubectl. Just my luck that I've been testing a move from Calico to Cilium, alongside an nftables config change, around the time this issue occurred...

marshallford avatar Jul 14 '22 01:07 marshallford

If anyone can provide instructions to reproduce it on demand, that'd be great. I've tried Calico + Docker, which one report identified as causing issues, but I haven't seen any problems yet.

brandond avatar Jul 14 '22 01:07 brandond

Upgrading to 1.24.3+k3s1 didn't fix my issue; however, setting egress-selector-mode to disabled did seem to work.

marshallford avatar Jul 21 '22 01:07 marshallford

I suggest marking this as a duplicate of https://github.com/k3s-io/k3s/issues/5633, where there has been more discussion.

consideRatio avatar Aug 09 '22 12:08 consideRatio

With the current workaround I did not notice any more errors, and I was unable to create a small project where I can reproduce the issue reliably.

ansarhun avatar Aug 10 '22 05:08 ansarhun

Having exactly the same issue, but for me it is more constant than random. I was trying to upgrade k3s from my current version, v1.21.11+k3s1, where I don't have such issues, to v1.22.12+k3s1 and v1.23.9+k3s1; on both 1.22.x and 1.23.x I am experiencing issues with mutating/validating webhooks.

As an example, I am running Rancher on that k3s cluster, and when I try to create a new cluster with Rancher I constantly get:

Internal error occurred: failed calling webhook "rancher.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation?timeout=10s": EOF

I haven't tried setting egress-selector-mode yet.

UPDATE:

I've also tried upgrading to v1.24.3+k3s1 and hit the same issue with webhooks; however, setting egress-selector-mode: disabled seems to fix it. Once I set it to disabled I no longer see any failed requests to webhooks.

riuvshyn avatar Aug 22 '22 11:08 riuvshyn

@riuvshyn can you provide more information on your configuration? Are you using the default flannel CNI, or have you replaced it with something else?

brandond avatar Aug 22 '22 18:08 brandond

For me, I'm using Calico.

onedr0p avatar Aug 22 '22 18:08 onedr0p

@brandond sure, I am using the AWS ENI CNI. My k3s config file:

egress-selector-mode: disabled

node-name: $${LOCAL_HOSTNAME}
flannel-backend: none
disable-cloud-controller: true
node-ip: $${LOCAL_IPV4}
node-external-ip: $${LOCAL_IPV4}
kube-apiserver-arg:
- "service-account-key-file=/etc/rancher/service-account.pub"
- "service-account-issuer=${service_account_issuer}"
- "service-account-signing-key-file=/etc/rancher/service-account.key"
- "enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,NodeRestriction,Priority,TaintNodesByCondition,PersistentVolumeClaimResize,PodSecurityPolicy"
- "audit-log-path=/var/log/k8s-audit/audit.log"
- "audit-policy-file=/etc/rancher/k3s/audit_policy.yaml"
- "audit-log-maxage=90"
- "audit-log-maxbackup=10"
- "audit-log-maxsize=500"
kube-controller-arg:
- "service-account-private-key-file=/etc/rancher/service-account.key"
kubelet-arg:
- "cloud-provider=external"
- "provider-id=$${CLOUD_CONTROLLER_PROVIDER_ID}"

tls-san:
- "${k3s_tls_san}"
no-deploy:
- "traefik"
- "metrics-server"
- "servicelb"

riuvshyn avatar Aug 25 '22 07:08 riuvshyn

My understanding is that this kind of failure (assumed to be the same as #5633) has been seen on a wide variety of k3s configurations. My overview is that we have seen reports of this occurring with:

  • CNI choices: flannel, Calico, AWS ENI
  • K3s versions: >=1.22.10-rc1, >=1.23.7-rc1, >=1.24.0-rc2
  • With and without --docker (and cri-dockerd)
  • With and without NetworkPolicy resources in the k8s cluster

consideRatio avatar Aug 25 '22 08:08 consideRatio

Also impacted by this issue on K3s with Calico, Flannel and Wireguard. Version: v1.24.3+k3s1

$ kubectl exec -it argocd-repo-server-695669547d-n2scn -- bash
Defaulted container "argocd-repo-server" out of: argocd-repo-server, copyutil (init)
Error from server: error dialing backend: EOF

$ kubectl exec -it argocd-repo-server-695669547d-n2scn -- bash
Defaulted container "argocd-repo-server" out of: argocd-repo-server, copyutil (init)
Error from server: error dialing backend: EOF

$ kubectl exec -it argocd-repo-server-695669547d-n2scn -- bash
Defaulted container "argocd-repo-server" out of: argocd-repo-server, copyutil (init)
argocd@argocd-repo-server-695669547d-n2scn:~$

blackandred avatar Sep 19 '22 09:09 blackandred

@blackandred can you start k3s with the --debug flag, and see if you get anything interesting in the logs when those exec commands fail? I continue to be unable to reproduce this on demand on recent releases.
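
Something along these lines should do it on a default systemd install (paths are the defaults written by the install script; adjust if yours differ):

# Enable debug logging via the config file, restart the service, then follow the logs
# while reproducing the failing exec commands.
echo 'debug: true' | sudo tee -a /etc/rancher/k3s/config.yaml
sudo systemctl restart k3s
journalctl -u k3s -f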

brandond avatar Sep 19 '22 19:09 brandond

I can reproduce the issue on v1.25.0+k3s1 using Alpine Linux v3.16, containerd://1.6.8-k3s1 as the container runtime, and Calico as the CNI. I need to present two demos tomorrow so I don't want to mess with my cluster right now, but once those are done I'll try with --debug and do A/B tests with the different egress-selector-mode options.

kbpersonal avatar Sep 30 '22 00:09 kbpersonal

I have run into the same issue on K3S 1.23.10+k3s1 and 1.22.15+k3s1. We frequently run into it when utilizing port-forwards via the Kubernetes API, but we have also seen it when running execs. We are using a vanilla K3S instance with no custom configs, except for using the Docker container runtime.

Here is the command we use to start the K3S server via systemd:

k3s server --docker

We have no additional custom configurations. This is the command used to install K3S within the VM.

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.23.10+k3s1" sh -s - --docker

This is on an Ubuntu 20.04 LTS VM.

Here are the debug logs from the latest encounter of the error:

TP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35478"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35482"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:02 ubuntu k3s[723111]: I1004 11:13:02.644540  723111 log.go:184] http: TLS handshake error from 127.0.0.1:49908: tls: first record does not look like a TLS handshake
Oct 04 11:13:02 ubuntu k3s[723111]: E1004 11:13:02.644893  723111 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"error dialing backend: EOF"}: error dialing backend: EOF
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35492"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:03 ubuntu k3s[723111]: time="2022-10-04T11:13:03-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:55392"
Oct 04 11:13:03 ubuntu k3s[723111]: time="2022-10-04T11:13:03-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:04 ubuntu k3s[723111]: time="2022-10-04T11:13:04-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:55406"
Oct 04 11:13:04 ubuntu k3s[723111]: time="2022-10-04T11:13:04-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"

Setting egress-selector-mode: disabled in the k3s config seemed to resolve it.

I don't have a 100% guaranteed method of reproducing it, but for me, the easiest way is to execute 100 or so exec commands in a row. Usually, most of them will pass, but a few will produce the EOF errors.

for i in $(seq 1 100); do kubectl exec <pod-name> -- ls -la > /dev/null; done
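
A slightly longer variant of the same loop (same placeholder pod name) that simply counts the non-zero exits can help quantify the failure rate:

failures=0
for i in $(seq 1 100); do
  kubectl exec <pod-name> -- ls -la > /dev/null 2>&1 || failures=$((failures+1))
done
echo "failed: $failures/100"   # usually only a handful fail with the EOF error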

james-mchugh avatar Oct 04 '22 15:10 james-mchugh

Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35482"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:02 ubuntu k3s[723111]: I1004 11:13:02.644540  723111 log.go:184] http: TLS handshake error from 127.0.0.1:49908: tls: first record does not look like a TLS handshake
Oct 04 11:13:02 ubuntu k3s[723111]: E1004 11:13:02.644893  723111 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"error dialing backend: EOF"}: error dialing backend: EOF

Is this a single-node cluster, and 172.16.67.5 is the local node? Do you see the same behavior when using the embedded containerd instead of docker?

brandond avatar Oct 04 '22 17:10 brandond

This is a single node cluster, and yes, 172.16.67.5 is the local node. I do see the same issue when using the embedded containerd. I know of two other users between my team and teams we've interfaced with who have seen similar issues. Here are the details:

User 1:

  • Ubuntu 20.04 VM,
  • K3S 1.22.x and 1.23.x (I don't have more specific version details, but they attempted both minor versions and ran into the issues in both of them)
  • Docker Runtime
  • Single Node
  • No additional configs

User 2:

  • Ubuntu 22.04 VM,
  • K3S v1.23.10+k3s1,
  • Containerd Runtime
  • Single Node
  • No additional configs

james-mchugh avatar Oct 04 '22 21:10 james-mchugh

What kind of CPU/memory/disk resources do these nodes have? What pod are you execing into; is it one of the packaged ones or something that you've deployed yourself? Are you running the kubectl command locally, or on another host?

I've tried load-testing both logs and exec requests, and haven't been able to trigger it on any of my dev hosts, or on an EC2 instance - so anything that might help me hit it on demand would be great.

I'm testing by doing this, which I believe should be pretty similar to what you're doing:

while true; do kubectl exec -n kube-system local-path-provisioner-5b5579c644-sphmc -- ls -la >/dev/null; done

I can leave that running for hours and not see any issue :(

brandond avatar Oct 04 '22 21:10 brandond

I've run into the issue on a machine with the following specs:

My Machine:

  • Host OS: macOS Monterey 12.6
  • Host CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz 16 cores
  • Host Memory: 64 GB DDR4 2667 MHZ
  • Host Storage: 2TB SSD
  • Guest OS: Ubuntu 20.04
  • Guest CPU: 8 cores
  • Guest Memory: 32 GB
  • Guest Storage: 250 GB

I've run into the issue when running kubectl commands on the host machine and within the Guest VM where Kubernetes is running. I have run into it with pods I have deployed and pods in the kube-system namespace such as the local-path-provisioner as you used in your example.

I don't have all of the details on the other users, but here is what I have

User 1:

  • Host OS: Windows 10
  • Guest OS: Ubuntu 20.04
  • Guest CPU Cores: 8
  • Guest Memory: 32 GB

User 2:

  • Host OS: ESXi
  • Guest OS: Ubuntu 22.04
  • Guest CPU Cores: 8
  • Guest Memory: 32 GB

james-mchugh avatar Oct 04 '22 22:10 james-mchugh

Thanks for the info!

brandond avatar Oct 04 '22 22:10 brandond

I tested this on my other computer where the host OS runs Arch Linux. I have not seen the error on this machine. Here are the specs:

  • OS: Arch Linux
  • Linux Kernel: 5.18.12-arch-1
  • CPU: Intel Core i9-7900x (20 cores)
  • Memory: 32GB DDR4
  • Storage (/var partition): 50 GB SSD
  • K3S: 1.24.6+k3s1
  • Container Runtime: Embedded Containerd
  • K3S Configs: None

james-mchugh avatar Oct 05 '22 14:10 james-mchugh

We're also seeing this on Ubuntu 20.04 with kernel v5.13.0-37 from the HWE line. Could we potentially be looking at a K3s <> Ubuntu combination issue?

LarsBingBong avatar Oct 05 '22 14:10 LarsBingBong

We're also seeing this on Ubuntu 20.04 with kernel v5.13.0-37 from the HWE line. Could we potentially be looking at a K3s <> Ubuntu combination issue?

That is doubtful; I am seeing the issue on Fedora Server as well.

onedr0p avatar Oct 05 '22 14:10 onedr0p

I have been able to reproduce this on a node with more cores than my usual dev host and testing instances have, which suggests there may be some sort of race condition. It looks like something is occasionally consuming the first byte of the proxied TLS handshake, which of course breaks things. It feels a lot like https://github.com/golang/go/issues/27408 which had some fixes merged a while back but is still open?
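
As a sanity check on the symptom only (not the race itself): the first record does not look like a TLS handshake line is what any Go TLS listener logs when the bytes it receives are not a TLS ClientHello, which you can demonstrate by sending plain HTTP to a TLS port, e.g. the kubelet's (the node IP is a placeholder):

# Plain HTTP against a TLS listener makes the serving side log the same
# "tls: first record does not look like a TLS handshake" message.
curl -s http://<node-ip>:10250/healthz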

brandond avatar Oct 05 '22 16:10 brandond

Oh, that's some deep-level nerdy stuff. However, that issue is from the wayback machine, AKA the year 2018. Of course it's hard to say, but what are the chances of that getting solved? Still, yes, what go#27408 points to sounds quite interesting.

LarsBingBong avatar Oct 05 '22 17:10 LarsBingBong