Random webhook failures with EOF
Environmental Info: K3s Version:
k3s version v1.23.8+k3s1 (53f2d4e7) go version go1.17.5
Node(s) CPU architecture, OS, and Version:
Linux k8s-controller-1 5.15.0-39-generic #42-Ubuntu SMP Thu Jun 9 23:42:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
1 server, 3 agents
Describe the bug:
When FluxCD performs a dry-run on resources for which an admission webhook is configured (e.g. cert-manager), the dry-run intermittently fails with an EOF error.
On the webhook side it looks like the incoming connection is not a proper HTTPS/TLS connection: first record does not look like a TLS handshake
Steps To Reproduce:
- Installed K3s:
k3sup install --ip {{ ansible_default_ipv4.address }} \
  --ssh-key {{ lookup('env','HOME') }}/.ssh/id_ed25519 \
  --user {{ ansible_ssh_user }} \
  --k3s-extra-args '
    --disable traefik
    --disable servicelb
    --disable-cloud-controller
    --node-taint CriticalAddonsOnly=true:NoExecute
  ' \
  --k3s-channel stable

k3sup join --ip {{ ansible_default_ipv4.address }} \
  --server-ip {{ hostvars[groups['master'][0]].ansible_default_ipv4.address }} \
  --ssh-key {{ lookup('env','HOME') }}/.ssh/id_ed25519 \
  --user {{ ansible_ssh_user }} \
  --k3s-channel stable
Expected behavior:
Dry run succeeds
Actual behavior:
The dry run sometimes fails with the same configuration
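Flux's dry-run corresponds to a server-side dry-run, so the same webhook call can be triggered directly with kubectl. A minimal sketch, assuming a manifest file (here hypothetically named prometheusrule.yaml) for a resource covered by an admission webhook:

# Server-side dry-run submits the object to the API server, which calls the
# configured admission webhooks; this is the call that intermittently fails.
kubectl apply --dry-run=server -f prometheusrule.yaml

# Repeating it makes the intermittent EOF easier to catch:
for i in $(seq 1 50); do
  kubectl apply --dry-run=server -f prometheusrule.yaml >/dev/null || echo "attempt $i failed"
done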
Additional context / logs:
Dry run error:
dry-run failed, reason: InternalError, error: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://kube-prometheus-stack-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF
k3s logs from journalctl
k3s[682]: W0710 08:03:32.023261 682 dispatcher.go:195] Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.network.svc:443/mutate?timeout=10s": EOF
cert-manager webhook pod logs
I0710 08:03:32.022971 1 logs.go:59] http: TLS handshake error from 10.42.0.0:50900: tls: first record does not look like a TLS handshake
Backporting
- [ ] Needs backporting to older releases
Note that I found this issue by searching for first record does not look like a TLS handshake, which is what I observe when debugging #5633. So it may be worth trying the workarounds suggested there.
One person resolved their issue by passing --egress-selector-mode=agent, while I had success with --egress-selector-mode=disabled, and observed the need to set it when using a version newer than v1.22.10+k3s1, v1.23.7+k3s1, or v1.24.1+k3s1.
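For reference, a sketch of where that flag can go for the install methods mentioned in this thread; treat the value as the workaround under discussion rather than a recommended default:

# k3sup (mirroring the install command above, ssh/user flags omitted for brevity):
# append the flag to --k3s-extra-args
k3sup install --ip {{ ansible_default_ipv4.address }} \
  --k3s-channel stable \
  --k3s-extra-args '--disable traefik --disable servicelb --disable-cloud-controller --node-taint CriticalAddonsOnly=true:NoExecute --egress-selector-mode=disabled'

# plain install script: the same flag can be passed via INSTALL_K3S_EXEC
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='--egress-selector-mode=disabled' sh -s -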
@consideRatio thank you! I will try the egress option to see if that helps
Try updating to a newer release. There were a few issues discovered with the egress-selector support which have since been addressed.
@brandond I can reproduce the intermittent issues in the latest three minor releases.
Here are three intermittent failures, including logs, for the latest versions of k3s. The logs are available in a later "step" under journalctl -xe.
- v1.24.2+k3s2 - CI failure with the error reported by kubectl, and the same error reported by k3s in journalctl -xe
- v1.23.8+k3s2 - CI failure with the error reported by kubectl, and the same error reported by k3s in journalctl -xe
- v1.22.11+k3s2 - CI failure with the error reported by kubectl, and the same error reported by k3s in journalctl -xe
EDIT: the same issue is observed with the RC1 for 1.24.3, 1.23.9, and 1.22.12.
Same result here.
After upgrading to the latest stable (v1.23.8+k3s2) version the issue still occurs.
For the egress-selector-mode option, agent mode did not work for me either, but disabled seems to work, or at least I haven't noticed any new errors in the last 12 hours.
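A quick way to keep checking whether the errors have really stopped is to count webhook failures in the k3s journal over the same window (a rough sketch, assuming the systemd unit is named k3s):

# count webhook EOF failures logged by k3s in the last 12 hours
sudo journalctl -u k3s --since "12 hours ago" | grep -c 'failed calling webhook'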
We're about to release RCs for today's upstream patch releases, which will include some updates to the egress-selector code and remotedialer library. I'm curious if this will address the issues you're seeing.
Looking forward to the RCs; I'm having the same issue as well on 1.24.2+k3s1 and 1.24.2+k3s2. I'm observing both the first record does not look like a TLS handshake message in the k3s logs and the error dialing backend: EOF message from kubectl. Just my luck that I've been testing a move from Calico to Cilium alongside an nftables config change around the time this issue occurred...
If anyone can provide instructions to reproduce it on demand, that'd be great. I've tried with Calico+Docker, which was reported as causing issues, but haven't seen any problems yet.
Upgrading to 1.24.3+k3s1 didn't fix my issue; however, setting egress-selector-mode to disabled did seem to work.
I suggest marking this as a duplicate of https://github.com/k3s-io/k3s/issues/5633, where more discussion has taken place.
With the current workaround I have not noticed any more errors, but I was unable to create a small project that reproduces the issue reliably.
I'm having exactly the same issue, but for me it is more constant than random failures.
I was trying to upgrade k3s from the current version, v1.21.11+k3s1, where I don't have such issues, to v1.22.12+k3s1 and v1.23.9+k3s1; on both 1.22.x and 1.23.x I am experiencing issues with mutating/validating webhooks.
As an example, I am running Rancher on that k3s cluster, and when I try to create a new cluster with Rancher I constantly get:
Internal error occurred: failed calling webhook "rancher.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation?timeout=10s": EOF
I haven't tried setting egress-selector-mode yet.
UPDATE:
I've also tried upgrading to v1.24.3+k3s1 and hit the same issue with webhooks; however, setting egress-selector-mode: disabled seems to fix it. Once I set it to disabled, I no longer see any failed requests to webhooks.
@riuvshyn can you provide more information on your configuration? Are you using the default flannel CNI, or have you replaced it with something else?
For me, I'm using Calico.
@brandond sure, I am using the AWS ENI CNI. My k3s config file:
egress-selector-mode: disabled
node-name: $${LOCAL_HOSTNAME}
flannel-backend: none
disable-cloud-controller: true
node-ip: $${LOCAL_IPV4}
node-external-ip: $${LOCAL_IPV4}
kube-apiserver-arg:
- "service-account-key-file=/etc/rancher/service-account.pub"
- "service-account-issuer=${service_account_issuer}"
- "service-account-signing-key-file=/etc/rancher/service-account.key"
- "enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,NodeRestriction,Priority,TaintNodesByCondition,PersistentVolumeClaimResize,PodSecurityPolicy"
- "audit-log-path=/var/log/k8s-audit/audit.log"
- "audit-policy-file=/etc/rancher/k3s/audit_policy.yaml"
- "audit-log-maxage=90"
- "audit-log-maxbackup=10"
- "audit-log-maxsize=500"
kube-controller-arg:
- "service-account-private-key-file=/etc/rancher/service-account.key"
kubelet-arg:
- "cloud-provider=external"
- "provider-id=$${CLOUD_CONTROLLER_PROVIDER_ID}"
tls-san:
- "${k3s_tls_san}"
no-deploy:
- "traefik"
- "metrics-server"
- "servicelb"
My understanding is that this kind of failure (assumed to be the same as #5633) has been seen on a wide variety of k3s configurations. We have seen reports of it occurring with:
- CNI choices: flannel, Calico, AWS ENI
- K3s versions: >=1.22.10-rc1, >=1.23.7-rc1, >=1.24.0-rc2
- With and without use of --docker (and cri-dockerd)
- With and without NetworkPolicy resources in the k8s cluster
Also impacted by this issue on K3s with Calico, Flannel and Wireguard.
Version: v1.24.3+k3s1
$ kubectl exec -it argocd-repo-server-695669547d-n2scn -- bash
Defaulted container "argocd-repo-server" out of: argocd-repo-server, copyutil (init)
Error from server: error dialing backend: EOF
$ kubectl exec -it argocd-repo-server-695669547d-n2scn -- bash
Defaulted container "argocd-repo-server" out of: argocd-repo-server, copyutil (init)
Error from server: error dialing backend: EOF
$ kubectl exec -it argocd-repo-server-695669547d-n2scn -- bash
Defaulted container "argocd-repo-server" out of: argocd-repo-server, copyutil (init)
argocd@argocd-repo-server-695669547d-n2scn:~$
@blackandred can you start k3s with the --debug flag, and see if you get anything interesting in the logs when those exec commands fail? I continue to be unable to reproduce this on demand on recent releases.
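For a systemd-managed install, one way to turn that on without editing the unit file is via the config file, since k3s maps CLI flags to config keys (a sketch, assuming the default /etc/rancher/k3s/config.yaml path):

# enable debug logging and restart the server
echo 'debug: true' | sudo tee -a /etc/rancher/k3s/config.yaml
sudo systemctl restart k3s

# watch for tunnel/egress and TLS handshake messages while re-running the failing exec
sudo journalctl -u k3s -f | grep -Ei 'tunnel|egress|tls handshake'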
I can reproduce the issue on v1.25.0+k3s1 using Alpine Linux v3.16, containerd://1.6.8-k3s1 as the container runtime, and Calico as the CNI. I need to present two demos tomorrow so I don't want to mess with my cluster right now, but once those are done I'll try with --debug and do A/B tests with the different egress-selector-mode options.
I have run into the same issue on K3S 1.23.10+k3s1 and 1.22.15+k3s1. We frequently run into it when utilizing port-forwards via the Kubernetes API, but we have also seen it when running execs. We are using a vanilla K3S instance with no custom configs, except for using the Docker container runtime.
Here is the command we use to start the K3S server via systemd:
k3s server --docker
We have no additional custom configurations. This is the command used to install K3S within the VM.
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.23.10+k3s1" sh -s - --docker
This is on an Ubuntu 20.04 LTS VM.
Here are the debug logs from the latest encounter of the error:
TP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35478"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35482"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:02 ubuntu k3s[723111]: I1004 11:13:02.644540 723111 log.go:184] http: TLS handshake error from 127.0.0.1:49908: tls: first record does not look like a TLS handshake
Oct 04 11:13:02 ubuntu k3s[723111]: E1004 11:13:02.644893 723111 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"error dialing backend: EOF"}: error dialing backend: EOF
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:35492"
Oct 04 11:13:02 ubuntu k3s[723111]: time="2022-10-04T11:13:02-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:03 ubuntu k3s[723111]: time="2022-10-04T11:13:03-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:55392"
Oct 04 11:13:03 ubuntu k3s[723111]: time="2022-10-04T11:13:03-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Oct 04 11:13:04 ubuntu k3s[723111]: time="2022-10-04T11:13:04-04:00" level=debug msg="Tunnel server handing HTTP/1.1 CONNECT request for //172.16.67.5:10250 from 127.0.0.1:55406"
Oct 04 11:13:04 ubuntu k3s[723111]: time="2022-10-04T11:13:04-04:00" level=debug msg="Tunnel server egress proxy dialing 127.0.0.1:10250 directly"
Setting egress-selector-mode: disabled in the k3s config seemed to resolve it.
I don't have a 100% guaranteed method of reproducing it, but for me, the easiest way is to execute 100 or so exec commands in a row. Usually, most of them will pass, but a few will produce the EOF errors.
for i in $(seq 1 100); do kubectl exec <pod-name> -- ls -la > /dev/null; done
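A variant of that loop which tallies failures can make the hit rate easier to compare before and after changing egress-selector-mode (a sketch; substitute your own pod name):

fail=0
for i in $(seq 1 100); do
  # count every exec that exits non-zero, which includes the EOF case
  kubectl exec <pod-name> -- ls -la >/dev/null 2>&1 || fail=$((fail + 1))
done
echo "failed $fail/100 exec calls"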
Is this a single-node cluster, and 172.16.67.5 is the local node? Do you see the same behavior when using the embedded containerd instead of docker?
This is a single node cluster, and yes, 172.16.67.5 is the local node. I do see the same issue when using the embedded containerd. I know of two other users between my team and teams we've interfaced with who have seen similar issues. Here are the details:
User 1:
- Ubuntu 20.04 VM,
- K3S 1.22.x and 1.23.x (I don't have more specific version details, but they attempted both minor versions and ran into the issues in both of them)
- Docker Runtime
- Single Node
- No additional configs
User 2:
- Ubuntu 22.04 VM,
- K3S v1.23.10+k3s1,
- Containerd Runtime
- Single Node
- No additional configs
What kind of CPU/memory/disk resources do these nodes have? What pod are you execing into; is it one of the packaged ones or something that you've deployed yourself? Are you running the kubectl command locally, or on another host?
I've tried load-testing both logs and exec requests, and haven't been able to trigger it on any of my dev hosts, or on an EC2 instance - so anything that might help me hit it on demand would be great.
I'm testing by doing this, which I believe should be pretty similar to what you're doing:
while true; do kubectl exec -n kube-system local-path-provisioner-5b5579c644-sphmc -- ls -la >/dev/null; done
I can leave that running for hours and not see any issue :(
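In case timing is a factor, a parallel variant of the same test might hit it more often than the serial loop (a sketch, reusing the pod from the loop above):

# fire a batch of concurrent execs instead of one at a time
for i in $(seq 1 20); do
  kubectl exec -n kube-system local-path-provisioner-5b5579c644-sphmc -- ls -la >/dev/null 2>&1 &
done
wait
echo "batch done; check journalctl -u k3s for 'error dialing backend: EOF'"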
I've run into the issue on a machine with the following specs:
My Machine:
- Host OS: macOS Monterey 12.6
- Host CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz 16 cores
- Host Memory: 64 GB DDR4 2667 MHz
- Host Storage: 2TB SSD
- Guest OS: Ubuntu 20.04
- Guest CPU: 8 cores
- Guest Memory: 32 GB
- Guest Storage: 250 GB
I've run into the issue when running kubectl commands on the host machine and within the Guest VM where Kubernetes is running. I have run into it with pods I have deployed and pods in the kube-system namespace such as the local-path-provisioner as you used in your example.
I don't have all of the details on the other users, but here is what I have:
User 1:
- Host OS: Windows 10
- Guest OS: Ubuntu 20.04
- Guest CPU Cores: 8
- Guest Memory: 32 GB
User 2:
- Host OS: ESXi
- Guest OS: Ubuntu 22.04
- Guest CPU Cores: 8
- Guest Memory: 32 GB
Thanks for the info!
I tested this on my other computer where the host OS runs Arch Linux. I have not seen the error on this machine. Here are the specs:
- OS: Arch Linux
- Linux Kernel: 5.18.12-arch-1
- CPU: Intel Core i9-7900x (20 cores)
- Memory: 32GB DDR4
- Storage (/var partition): 50 GB SSD
- K3S: 1.24.6+k3s1
- Container Runtime: Embedded Containerd
- K3S Configs: None
We're also seeing this on Ubuntu 20.04 Kernel v5.13.0-37 from the HWE line. Are we potentially looking into this being a K3s<>Ubuntu combo issue?
That is doubtful, I am seeing the issue using Fedora Server as well.
I have been able to reproduce this on a node with more cores than my usual dev host and testing instances have, which suggests there may be some sort of race condition. It looks like something is occasionally consuming the first byte of the proxied TLS handshake, which of course breaks things. It feels a lot like https://github.com/golang/go/issues/27408 which had some fixes merged a while back but is still open?
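One rough way to sanity-check where the first byte goes missing is to handshake against the kubelet's TLS port directly, bypassing the tunnel, and compare that with requests routed through the apiserver (a sketch, assuming the node IP 172.16.67.5 from the logs above, the default kubelet port 10250, and a placeholder pod name):

# direct handshakes against the kubelet's TLS listener, bypassing the tunnel;
# these should never produce "first record does not look like a TLS handshake"
for i in $(seq 1 100); do
  echo | openssl s_client -connect 172.16.67.5:10250 >/dev/null 2>&1
done

# the same number of requests through the apiserver tunnel (kubectl exec)
# is what intermittently fails with "error dialing backend: EOF"
for i in $(seq 1 100); do
  kubectl exec <pod-name> -- ls >/dev/null 2>&1 || echo "tunneled request $i failed"
done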
Oh, that's some deep-level nerdy stuff. However, that issue is from the wayback machine, AKA the year 2018. Of course it's hard to say, but what are the chances of that getting solved? Still, yes, what go#27408 points to sounds quite interesting.