Pods not syncing quickly: pod-syncer Error updating pod: context deadline exceeded
What happened?
I'm experiencing a rather strange bug where some pods report being stuck in the "Init: 0/1" status, but only when connected to the vcluster context. When connected to the host cluster context, the pod status reports correctly; however, the pod status in the vcluster is occasionally stuck in either "PendingCreate" or "Init: 0/1", which causes downstream issues with my helm chart installation flow. The pod events while connected to the vcluster context report the following:
Warning SyncError 13m pod-syncer Error updating pod: context deadline exceeded
The image below shows the bug in action with two panes. The left pane is K9s connected to the vcluster context, and the right is connected to the host cluster context. As you can see, the pods in the host cluster are "Running", but the same pods in the vcluster context are stuck in the "Init:0/1" status.
Looking at the pod events, I see the following:
The pod-syncer error only appears when connected to the vcluster.
The only "error" that I noticed in the vcluster syncer logs is:
filters/wrap.go:54 timeout or abort while handling: method=GET URI="/api/v1/namespaces/xrdm/pods/xxxx-portal-worker-66bc6969b-r4qqr/log?container=xxxx-portal-worker&follow=true&tailLines=100&timestamps=true" audit-ID="1d310b7a-7010-4c0e-a116-7b4127d94193"
What did you expect to happen?
I expect the pod status while connected to the vcluster context to reflect the correct status.
How can we reproduce it (as minimally and precisely as possible)?
- Install vcluster version v0.20.0-beta.5 using the helm chart.
- Connect to the vcluster and create a bunch of containers (see the sketch after this list)
- Observe that the pod-syncer is not working as intended
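For reference, a minimal reproduction sketch along these lines might look as follows; the release name, namespace, and replica count are illustrative and not taken from the original report:

```bash
# Install vcluster v0.20.0-beta.5 from the official chart repo
# (release name "my-vcluster" and namespace "vcluster-test" are placeholders).
helm repo add loft https://charts.loft.sh
helm repo update
helm upgrade --install my-vcluster loft/vcluster \
  --namespace vcluster-test --create-namespace \
  --version 0.20.0-beta.5

# Create a burst of pods inside the virtual cluster.
vcluster connect my-vcluster --namespace vcluster-test -- \
  kubectl create deployment load-test --image=nginx --replicas=100

# Watch the pod status through the vcluster context and compare it with
# the status of the synced pods in the host cluster context.
vcluster connect my-vcluster --namespace vcluster-test -- \
  kubectl get pods -w
```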
Anything else we need to know?
The pods, while connected to the vcluster context, will eventually report the correct status, but sometimes it takes 5 or 10 minutes before they do.
Host cluster Kubernetes version
$ kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0
Host cluster Kubernetes distribution
GKE-1.29.0
vcluster version
v0.20.0-beta.5
vcluster Kubernetes distribution (k3s (default), k8s, k0s)
k8s
OS and Arch
OS: GKE containerd image
Arch:
Let me summarise: the pods get scheduled on the host, but the vcluster api server doesn’t reflect the status correctly?
Do you observe significant load (i.e. request latency, total requests increased) on the host api server when this happens?
Looks similar to https://github.com/loft-sh/vcluster/issues/1589 to me
@heiko-braun

> Let me summarise: the pods get scheduled on the host, but the vcluster api server doesn't reflect the status correctly?

That's correct!
> Do you observe significant load (i.e. request latency, total requests increased) on the host api server when this happens?

I do not. The vcluster pods on the host cluster are given a good amount of resources: 3 vCPUs and 4Gi of memory. The node this vcluster pod runs on is usually nowhere near these limits.
@everflux

> Looks similar to https://github.com/loft-sh/vcluster/issues/1589 to me

Yes, indeed! I suppose this issue can be marked as a duplicate. Thanks for catching that!
@alextricity25 would you mind trying virtual Kubernetes version v1.29.4? There was a pretty significant bug in v1.29.0 that caused issues (https://github.com/kubernetes/kubernetes/issues/123448), which could be the cause of this.
@FabianKramm The default is v1.29.0. I'll override this to 1.29.4 to see if that does anything. It's difficult to iterate and test whether this change fixes the issue because it doesn't happen all the time. I'll let you know!
@FabianKramm I tested 1.29.4 (Server Version: v1.29.4+k3s1, I think embedded db/sqlite) with helm chart v0.20.0-beta.5 and still observed the issue once the metrics server was present in the host cluster. (host is 1.18.2)
@FabianKramm I also observed the issue again on 1.29.5. Screenshot below. The pods in the vCluster context eventually did report the correct status, but it took about 1-2 minutes before they did.
Interesting side note: when this happens, restarting the vcluster pod running the syncer service gets things running again.
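For anyone hitting the same symptom, a hedged sketch of that workaround; the context, release name, and namespace are placeholders, and this assumes the default vcluster v0.20 deployment as a StatefulSet:

```bash
# Restart the vcluster control-plane/syncer pod on the host cluster.
kubectl --context my-host-context -n vcluster-test rollout restart statefulset/my-vcluster

# Or delete the pod directly and let the StatefulSet recreate it.
kubectl --context my-host-context -n vcluster-test delete pod my-vcluster-0
```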
@alextricity25 would you mind sharing the logs? Are you using an NFS backing store by any chance?
@alextricity25 is there a way for you to collect metrics from the vCluster itself? Basically what we are looking for is the range(apiserver_watch_cache_events_dispatched_total[1m]) metric, because that will tell us whether vCluster has received any events when this happens: if it is greater than zero, that would indicate a problem on our syncer side; if the metric is zero or close to zero, that would indicate a problem within Kubernetes or the backing store itself.
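One way to sample that counter, assuming you are connected to the vcluster context with permission to read the raw metrics endpoint (a simple snapshot sketch rather than a proper Prometheus scrape):

```bash
# Snapshot the watch-cache counter twice, one minute apart, while the
# stuck pods are visibly changing in the host cluster.
kubectl get --raw /metrics | grep apiserver_watch_cache_events_dispatched_total
sleep 60
kubectl get --raw /metrics | grep apiserver_watch_cache_events_dispatched_total
# If the counter barely moves, the watch path (Kubernetes or the backing
# store) is the likely culprit; if it keeps increasing, suspect the syncer.
```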
@alextricity25 I ran a test in GKE v1.30.1 with vCluster v0.20.0-beta.10 with the following vcluster.yaml:

```yaml
controlPlane:
  distro:
    k8s:
      apiServer:
        image:
          tag: v1.29.5
      resources:
        limits:
          cpu: 2
          memory: 4Gi
        requests:
          cpu: 2
          memory: 4Gi
      securityContext:
        seccompProfile:
          type: Unconfined
  hostPathMapper:
    enabled: true
  statefulSet:
    resources:
      limits:
        cpu: "3"
        ephemeral-storage: 8Gi
        memory: 10Gi
      requests:
        cpu: "3"
        ephemeral-storage: 8Gi
        memory: 10Gi
sync:
  fromHost:
    nodes:
      enabled: true
      syncBackChanges: true
  toHost:
    persistentVolumeClaims:
      enabled: true
    serviceAccounts:
      enabled: true
```
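A values file like the above would typically be passed to the chart on install; the release name and namespace below are illustrative:

```bash
# Install the vcluster chart with the values shown above.
helm upgrade --install test-vcluster loft/vcluster \
  --namespace vcluster-test --create-namespace \
  --version 0.20.0-beta.10 \
  -f vcluster.yaml
```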
And then deployed a deployment.yaml with:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 200
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      annotations:
        my-annotation: my-annotation
      labels:
        app: test
    spec:
      containers:
        - name: nginx
          image: nginx
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 64Mi
```
And then changed the deployment a couple of times, but it never took longer than a minute to get all of those 200 pods to Running. Do you have any special configuration for GKE, or are you using a special deployment? How many pods are we talking about, and how much load is the vCluster under?
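For anyone who wants to repeat that test, a rough sketch of the "changed the deployment a couple of times" step, run against the vcluster context; the annotation key comes from the manifest above, while the timing approach is just one possible way to measure it:

```bash
# Force a rollout by bumping the pod-template annotation, then time how
# long it takes until all 200 replicas report ready in the vcluster context.
kubectl patch deployment test --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"my-annotation":"'"$(date +%s)"'"}}}}}'
time kubectl rollout status deployment/test --timeout=10m
```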
@alextricity25 good news, I think we found the issue. It's related to us using an older Kine version; the linked PR should fix this and will be part of the next release.
@FabianKramm Thanks for all of your investigation on this issue. I really appreciate it, and it has not gone unnoticed.
About a week ago, we switched our distro and version from k8s v1.29.5 to k3s v1.30.2-rc3-k3s1. Ever since then, we haven't been observing these timeouts or latency issues between the host pods and the vcluster pods. Actually, things have been running pretty smoothly. Do you think the version bump to v1.30.2-rc3-k3s1 may be related to the fix in your findings, or could it be something completely different?
Regarding your requests: it's possible that I can switch back to the k8s distro with the previous version we had defined to see if I can reproduce the problem and gather the logs and metrics you are asking for. We are not running an NFS backing store. We are running a pretty standard Dataplane V2 GKE cluster and using the vcluster helm chart to install vcluster with the values I gave above. This will have to wait until next week for me, though.
Another interesting thing to note is that we only saw this problem once the node running the vcluster syncer service had been up for a few days. Restarting the node would temporarily fix the issue, but after a while any other vcluster syncer services that landed on that node would be affected by it again.
@alextricity25 thanks for the response! The problem we identified lies within Kine versions prior to v0.11.4, which had a bug where watch requests from the vCluster syncer were starved, so vCluster stopped receiving notifications, which led to the pod statuses not being updated in time. Restarting the vCluster worked because Kine was restarted and the watch request was re-established.
Upgrading to k3s version v1.30.2-rc3-k3s1 fixes the problem (as does using etcd as a backend instead of Kine) because they have upgraded the internally used Kine version there to v0.11.9. Older k3s versions that use a Kine version below v0.11.4 will still have that problem.
Since the k8s distro in vCluster v0.20 was using Kine version v0.11.1, this bug was also present there and caused the symptoms you saw. To fix this, we upgraded Kine to v0.11.10 in PR #1894, which is part of the new beta release v0.20.0-beta.11 and should not have this issue anymore. If you find the time, it would be nice if you could validate that, but we are very confident that this caused your issue and that it is now fixed.
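To validate the fix, upgrading an existing release to the new beta should be enough; a sketch, assuming the chart repo, release name, and values file from the earlier examples:

```bash
# Upgrade the existing vcluster release to the beta that bundles the newer Kine.
helm upgrade my-vcluster loft/vcluster \
  --namespace vcluster-test \
  --version 0.20.0-beta.11 \
  -f vcluster.yaml
```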
Thank you so much for your work, this hopefully solves a really annoying issue. I would love to help validate the fix, but I have no clue how to trigger it. I used shell scripts to create hundreds of pods, delete them, and create new ones, and everything worked like a charm. Then a handful of real users tried basically the same thing and the issue appeared. So, any hints on how to force it so I can give it a run?
@everflux I'm not sure how to trigger it either; in my testing I could only reproduce it once yesterday with Kine v0.11.1. It really occurs randomly, as I tried the exact same steps again with the same configuration and didn't hit the issue again.
But we have a customer who told us basically the same thing as @alextricity25: they tried many different k3s versions, the newer ones all worked, while the older ones all hit this problem sooner or later. So I compared the Kine versions and found that every k3s version that used Kine up to v0.11.3 had the issue, while all newer Kine versions seem to work fine.