[Windows] Issue with ReadinessProbe suddenly failing on all pods of one node

Open eyenx opened this issue 3 years ago • 13 comments

On our Windows Server 2019 Datacenter servers, several times a day all pods of a single node start failing their ReadinessProbe.

This fixes itself after 5-10 minutes, but continues to happen over and over.

All our Windows nodes on all our clusters have this problem. No Linux nodes show the same behaviour.

We see the pods suddenly becoming 0/1. The node is then unreachable via SSH, but it can be pinged, and a kubectl exec into a specific pod running on that node works without issues.

After looking for a few hours we could not find any logs that would help troubleshoot this further.

Expected Behavior

ReadinessProbes do not fail on Windows nodes without an apparent reason.

Current Behavior

ReadinessProbes fail several times a day on all pods of a single Windows node, but fix themselves after a few minutes.

Possible Solution

Steps to Reproduce (for bugs)

Cannot be actively reproduced, but it happens quite often.

Context

Running Windows Pods on Windows Nodes with ReadinessProbe

Your Environment

  • Calico version: 3.20.3 on Windows (same version on Linux nodes)
  • Orchestrator version (e.g. kubernetes, mesos, rkt): k8s v1.22.8
  • Operating System and version: Windows Server 2019 Datacenter 10.0.17763.2305 / ContainerD 1.5.8

Attached are several outputs to help out (kube-proxy and kubelet error logs from the respective node, plus some output from kubectl and directly from the node).

errors.txt kube-proxy.err.log kube-proxy.err-20220607T134535.340.log kubelet.err.log

eyenx avatar Jun 07 '22 14:06 eyenx

@song-jiang @lmm

caseydavenport avatar Jun 09 '22 10:06 caseydavenport

@eyenx According to errors.txt (see below), who owns the IP:Port of 10.147.49.49:80?

$ kubectl get events -n app-int-app5
LAST SEEN   TYPE      REASON      OBJECT                    MESSAGE
42m         Warning   Unhealthy   pod/app5-cc5b588cb-t67xw   Readiness probe failed: Get "http://10.147.49.49:80/server/status": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
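
For reference, something like this should list the pod currently holding that address (assuming the API server supports the status.podIP field selector; otherwise grep the plain -o wide output):

$ kubectl get pods -A -o wide --field-selector status.podIP=10.147.49.49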

song-jiang avatar Jun 09 '22 12:06 song-jiang

Hi @song-jiang

That is the pod IP, but it is reachable. Somehow Calico networking on the node is failing, and that of course makes the ReadinessProbe fail.

eyenx avatar Jun 09 '22 15:06 eyenx

But it is reachable

Do you mean when you try to access 10.147.49.49:80 from the Windows node which hosts the pod by yourself, it works fine?
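
For example, directly on the node, something like this should show whether the probe path responds (curl.exe ships with Windows Server 2019, and Test-NetConnection is built into PowerShell; the URL is taken from the probe above):

curl.exe -v --max-time 5 http://10.147.49.49:80/server/status
Test-NetConnection -ComputerName 10.147.49.49 -Port 80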

song-jiang avatar Jun 09 '22 15:06 song-jiang

This I cannot check, as SSH is not working on the node at that time. But kubectl exec works and the pod is reachable. As mentioned above, this happens for 5-10 minutes on all pods of a node at the same time. After that timeframe the pods go back to being healthy and the node is again reachable via SSH.

I assume something is not okay with Calico.

eyenx avatar Jun 09 '22 16:06 eyenx

It is rare for Calico to affect traffic in this pattern. Normally, if there is an issue with Calico, the traffic is either allowed or denied consistently.

song-jiang avatar Jun 10 '22 09:06 song-jiang

We managed to get a local admin account so we could access the node during that timeframe without SSH. curl to the pod returns a timeout.

So you suggest Calico can't be the culprit in this case?

eyenx avatar Jun 10 '22 13:06 eyenx

I'm not sure. I'm just a bit confused that local node-to-pod access (the readiness check) gets blocked but fixes itself after a few minutes. Normally, if it is a Calico issue, you would see the traffic blocked for good.

If it is easy to repro, can you collect the logs from the c:\CalicoWindows\logs directory? I'd just like to see whether the Calico services restarted or something similar.
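
For example, something along these lines on the node bundles them up (Compress-Archive is built into PowerShell; the destination path is just an example):

Compress-Archive -Path C:\CalicoWindows\logs\* -DestinationPath C:\calico-logs.zip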

song-jiang avatar Jun 10 '22 14:06 song-jiang

What's the interval for this happening? Sometimes that can hint at the cause.

Without knowing the interval, I'd be guessing at things like DHCP renewal. Check the Windows event logs to find out what's going on around the time it happens.
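
For example, something like this pulls System event log entries from a window around an incident (the 30-minute offset is just a placeholder; adjust it to the actual occurrence):

$from = (Get-Date).AddMinutes(-30)
Get-WinEvent -FilterHashtable @{ LogName = 'System'; StartTime = $from } | Format-Table TimeCreated, ProviderName, Id, Message -AutoSize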

lwr20 avatar Jun 14 '22 16:06 lwr20

Hi,

@song-jiang I am working with @eyenx on this issue. Attached you can find the logs from the c:\CalicoWindows\logs directory.

calico-felix-20220613T211108.090.log calico-node.err.log calico-node.log calico-felix.err.log calico-felix.log calico-felix-20220513T164524.844.log calico-felix-20220513T164524.846.log calico-felix-20220601T194316.181.log

The issue happened on this node today (16 June) around 12:55. The logs are from around 12:58.

@lwr20 - I can't tell the interval, but it's quite often (I would guess every 2-3 hours). DHCP is not the cause, as ping is possible during the time of the issue and the server has a fixed IP address, no DHCP.

sil-zh avatar Jun 16 '22 11:06 sil-zh

Common things that happen in intervals around that sort of length:

  • DHCP renewal
  • DNS timeout
  • Kubernetes API watch timeout

Not saying it's one of those for sure, just suggesting things to look for; a couple of quick checks are sketched below.
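
For example, on the node (standard Windows tooling; the DNS name below is just a placeholder):

ipconfig /all | findstr /i "Lease"
Measure-Command { Resolve-DnsName www.example.com }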

lwr20 avatar Jun 16 '22 13:06 lwr20

I don't see anything interesting in the logs that would explain the issue around 12:55.

song-jiang avatar Jun 16 '22 13:06 song-jiang

Some more information about this issue, as it is still open.

As mentioned before, no SSH connection is possible during the time of the issue, but port 22 is always open (tested with nc), and already-established SSH sessions keep working. So I was logged in on one server and had a loop running: the loop curled a website on the internet and did a ping.
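
A minimal sketch of such a loop in PowerShell (the URL, ping target, and interval are placeholders, not necessarily the exact ones used):

while ($true) {
    Get-Date
    curl.exe -s -o NUL --max-time 5 -w "HTTP %{http_code}\n" https://www.example.com
    Test-Connection 8.8.8.8 -Count 1 -Quiet
    Start-Sleep -Seconds 10
}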

When the issue came up, ping was working while curl failed (unable to connect to the remote server).

During the time of the issue, I could see that the node sends packets to the pods (I guess the readiness probe). In netstat I could see (for all the pods):

10.15.19.117:65103 10.147.127.34:80 SYN_SENT
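
A netstat filter along these lines shows all such half-open connections at once (flags assumed, not necessarily the exact command used):

netstat -ano -p TCP | findstr SYN_SENT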

I don't know if it helps, but at least it's some more information.

sil-zh avatar Jul 05 '22 11:07 sil-zh