[BUG] Requests from cluster workloads to the Kubernetes API server intermittently time out or take minutes to complete

Open arsnyder16 opened this issue 2 years ago • 100 comments

Describe the bug Requests from cluster workloads to the Kubernetes API server will intermittently time out or take minutes to complete, depending on the workload's request settings.

To Reproduce Steps to reproduce the behavior: Provision a new SLA-enabled cluster

rg=
acr=
aks=
workspaceId=
location=northcentralus
aksVersion=1.22.6
az aks create \
  --resource-group $rg \
  --name $aks \
  --vm-set-type VirtualMachineScaleSets \
  --node-count 2 \
  --node-vm-size Standard_DS2_v2 \
  --node-osdisk-size 64 \
  --node-osdisk-type Ephemeral \
  --generate-ssh-keys \
  --kubernetes-version $aksVersion \
  --attach-acr $acr \
  --load-balancer-sku standard \
  --location $location \
  --enable-managed-identity \
  --uptime-sla \
  --enable-encryption-at-host \
  --enable-addons monitoring \
  --workspace-resource-id $workspaceId

This step may be optional but seems to help reproduce the problem. Install the nginx ingress controller; you can leave the replicas at 0 to avoid adding any more noise to the cluster.

helm upgrade --install nginx ingress-nginx/ingress-nginx \
  --create-namespace \
  --namespace ingress \
  --set controller.replicaCount=0 \
  --set controller.service.externalTrafficPolicy=Local \
  --wait

Deploy a simple workload that just uses kubectl to list the pods in a namespace; these jobs will fail once they detect the issue.

kubectl create namespace api-issue
kubectl apply -f - << EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: slowrequest
  namespace: api-issue
spec:
  parallelism: 5
  completions: 5
  backoffLimit: 2
  template:  
    metadata:
      labels:
        app: slowrequest
    spec:
      restartPolicy: Never
      containers:
      - name: slowrequest
        image: bitnami/kubectl
        imagePullPolicy: IfNotPresent   
        command: 
          - /bin/sh 
        args: 
          - -c
          - set -e; while true; do kubectl get pods -n=default --selector=app=my-api --v=9 --output=json; sleep 260s; done
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: internal-kubectl-role
  namespace: api-issue
rules:
- apiGroups: [""]
  resources: ["pods", "pods/status"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: internal-kubectl-role-binding
subjects:
- kind: ServiceAccount
  name: default
  namespace: api-issue
roleRef:
  kind: ClusterRole
  name: internal-kubectl-role
  apiGroup: rbac.authorization.k8s.io
EOF
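
Once the Job is running, failed attempts can be spotted and their logs inspected with standard kubectl commands (listed here for convenience; the namespace and label match the manifest above):

kubectl get jobs,pods -n api-issue
kubectl logs -n api-issue -l app=slowrequest --tail=50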

With the kubectl example above, the issue manifests as a timeout while trying to do the TLS handshake. What is strange about the kubectl log output is that it does seem to print a response body, but it appears under the response headers.

Here is an example of a successful run

I0503 15:02:34.479797       7 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.23.6 (linux/amd64) kubernetes/ad33385" -H "Authorization: Bearer <masked>" 'https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500'
I0503 15:02:34.504364       7 round_trippers.go:570] HTTP Statistics: GetConnection 0 ms ServerProcessing 24 ms Duration 24 ms
I0503 15:02:34.504383       7 round_trippers.go:577] Response Headers:
I0503 15:02:34.504390       7 round_trippers.go:580]     Date: Tue, 03 May 2022 15:02:34 GMT
I0503 15:02:34.508215       7 round_trippers.go:580]     Audit-Id: 0a8166f8-d538-4b02-8e1b-4fd050760999
I0503 15:02:34.508239       7 round_trippers.go:580]     Cache-Control: no-cache, private
I0503 15:02:34.508298       7 round_trippers.go:580]     Content-Type: application/json
I0503 15:02:34.508314       7 round_trippers.go:580]     X-Kubernetes-Pf-Flowschema-Uid: fae6b7ca-e682-4669-94eb-85163e201928
I0503 15:02:34.508319       7 round_trippers.go:580]     X-Kubernetes-Pf-Prioritylevel-Uid: 2a307f0a-f367-4f1d-ba5c-1bc4c330d0f1
I0503 15:02:34.508325       7 round_trippers.go:580]     Content-Length: 91
I0503 15:02:34.508374       7 request.go:1181] Response Body: {"kind":"PodList","apiVersion":"v1","metadata":{"resourceVersion":"251974943"},"items":[]}
{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}

Here is an example of one that fails just a few minutes later

I0503 15:06:54.560888      14 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.23.6 (linux/amd64) kubernetes/ad33385" -H "Authorization: Bearer <masked>" 'https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500'
I0503 15:06:54.562933      14 round_trippers.go:510] HTTP Trace: Dial to tcp:10.0.0.1:443 succeed
I0503 15:07:04.564088      14 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 1 ms TLSHandshake 10000 ms Duration 10003 ms
I0503 15:07:04.564131      14 round_trippers.go:577] Response Headers:
{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}
I0503 15:07:04.565727      14 helpers.go:237] Connection error: Get https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500: net/http: TLS handshake timeout
F0503 15:07:04.566119      14 helpers.go:118] Unable to connect to the server: net/http: TLS handshake timeout
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0x1)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x3080040, 0x3, 0x0, 0xc000620000, 0x2, {0x25f2ec7, 0x10}, 0xc00005c400, 0x0)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printDepth(0xc0002e1a40, 0x40, 0x0, {0x0, 0x0}, 0x54, {0xc00032b470, 0x1, 0x1})
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:735 +0x1ae
k8s.io/kubernetes/vendor/k8s.io/klog/v2.FatalDepth(...)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1518
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util.fatal({0xc0002e1a40, 0x40}, 0xc0007664e0)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:96 +0xc5
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util.checkErr({0x1fed760, 0xc0007664e0}, 0x1e797d0)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:191 +0x7d7
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util.CheckErr(...)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:118
k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/get.NewCmdGet.func2(0xc0003e8780, {0xc000494460, 0x1, 0x5})
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/get/get.go:181 +0xc8
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc0003e8780, {0xc000494410, 0x5, 0x5})
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:860 +0x5f8
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00056cf00)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:974 +0x3bc
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/vendor/k8s.io/component-base/cli.run(0xc00056cf00)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/cli/run.go:146 +0x325
k8s.io/kubernetes/vendor/k8s.io/component-base/cli.RunNoErrOutput(...)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/component-base/cli/run.go:84
main.main()
	_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubectl/kubectl.go:30 +0x1e

I experience different behaviors with different clients. For example, I have a simple Node.js app that does the same thing by just listing the pods through the k8s SDK. In this environment I will get situations where the requests take upwards of 5 minutes to complete.

'use strict';
const process = require('process');
const k8s = require('@kubernetes/client-node');
const kc = new k8s.KubeConfig();
kc.loadFromCluster();
const k8Core = kc.makeApiClient(k8s.CoreV1Api);
let startTime; // custom log offset to help correlate with tcp dump
const log = msg => console.log(`${(new Date() - startTime) / 1000.0} ${msg}`);
let running;
let interval;
const listPods = async ()=>{
  if (running) {
    return;
  }
  running = true;
  log('Listing pods...');
  const listStart = new Date();
  const { body: { items } } = await k8Core.listNamespacedPod('default');
  const seconds = (new Date() - listStart) / 1000.0;
  log(`Found ${items.length} pods in ${seconds} seconds`);
  if(seconds > 60) {
    log(`Closing because this seems excessive`);
    process.exitCode = -1;
    clearInterval(interval);
    return;
  }
  running = false;
};
setTimeout(()=>{
  startTime = new Date();
  listPods();
  interval = setInterval(listPods, 215 * 1000);
}, 1000)

2022-06-28T02:04:30.490394199Z 36564.257 Listing pods...
2022-06-28T02:04:30.525501046Z 36564.292 Found 0 pods in 0.035 seconds
2022-06-28T02:08:05.550325353Z 36779.317 Listing pods...
2022-06-28T02:13:37.847998556Z 37111.614 Found 0 pods in 332.297 seconds  <--- Not great
2022-06-28T02:13:37.848028156Z 37111.614 Closing because this seems excessive
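
For reference, one way to run this script in-cluster is a small Deployment in the same namespace, reusing the default ServiceAccount bound earlier. This is only a sketch: it assumes the script has been built into a container image, and the image name and script filename below are placeholders.

kubectl apply -f - << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slowrequest-node
  namespace: api-issue
spec:
  replicas: 1
  selector:
    matchLabels:
      app: slowrequest-node
  template:
    metadata:
      labels:
        app: slowrequest-node
    spec:
      containers:
      - name: slowrequest-node
        # Placeholder image built from the script above plus @kubernetes/client-node
        image: $acr.azurecr.io/slowrequest-node:latest
        command: ["node", "list-pods.js"]  # script filename is illustrative
EOF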

Expected behavior Requests should be able to complete in a reasonable amount of time. I see this across many clusters, sometimes every few minutes. To eliminate all cluster-specific variables, this is a bare-bones reproduction of the issue, so it should not suffer from user workloads affecting performance.

Environment (please complete the following information):

  • CLI Version [e.g. 3.22]
  • Kubernetes version >1.21.x

Additional context This seemed to start once we upgraded clusters from 1.20 to 1.21. I first opened a ticket with support in January, but it has since been stuck in the support death spiral, gotten nowhere, and has yet to reach a team able to diagnose or even attempt to reproduce the issue with the simple steps above. I have sent tcpdumps, kubelet logs, etc.

This is not specific to any particular request; we see it across many different requests. We have various workloads that may monitor the cluster using the API or dynamically create or modify workloads through the API.

I have yet to be able to reproduce this outside of the cluster; it seems to be very specific to cluster-to-control-plane communication.

This only seems to be a problem on SLA-enabled clusters. An OpenVPN / aks-link issue? I don't see any recycling of aks-link or anything useful in the logs.

I am really curious whether Konnectivity resolves the problem, but I have yet to see it make it to any of my various clusters, which are spread across many different data centers.
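
A quick way to see which tunnel component a given cluster is actually running (component names as they typically appear in kube-system; this is just a convenience check, not part of the repro):

kubectl get deploy,daemonset,pods -n kube-system | grep -Ei 'tunnelfront|aks-link|konnectivity' || echo "no tunnel components matched"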

arsnyder16 avatar Jun 30 '22 13:06 arsnyder16

I am able to reproduce this issue with 1.23.8, which uses Konnectivity, as well.

arsnyder16 avatar Jul 24 '22 17:07 arsnyder16

After some more investigation, nginx ingress is not actually required.

I have added workloads and a repro.sh script to the following repo https://github.com/arsnyder16/aks-api-issue

In summary,

  • Create a fresh cluster
  • Add workloads that simply list pods in the cluster
    • One is just using kubectl
    • One is using the node k8s client sdk

After this, there seems to be no issue even after running these workloads for a day or more.

To start having issues with the API REST calls, all that needs to be added to the cluster is a Service of type LoadBalancer; it doesn't need to point to any actual backend pods. After adding the service, the original workloads will start having intermittent failures on their calls to the API server.
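
For anyone else trying to reproduce this: a minimal Service of this kind is enough. Something like the following should do (the selector and port are arbitrary, since no backend pods are needed):

kubectl apply -f - << EOF
apiVersion: v1
kind: Service
metadata:
  name: trigger-lb
  namespace: api-issue
spec:
  type: LoadBalancer
  selector:
    app: does-not-exist   # intentionally matches nothing; no backend pods required
  ports:
  - port: 80
    targetPort: 80
EOF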

arsnyder16 avatar Jul 27 '22 13:07 arsnyder16

Action required from @Azure/aks-pm

ghost avatar Aug 27 '22 01:08 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar Sep 11 '22 06:09 ghost

@arsnyder16 It would be useful if node-side issues could be eliminated as a possible cause. I notice that your test cluster has only two nodes, and they have only 2 cores each. That makes me wonder if those nodes are getting overloaded (presumably with kube-system stuff) and if that node-side overload is triggering the problems.

Would you be able to do the same test, but with

  • a system node pool containing 3 nodes with at least 4 cores each
  • your test workload running in a separate node pool?

Docs on the above can be found here, in case you want to refer to them: https://learn.microsoft.com/en-us/azure/aks/use-system-pools

JohnRusk avatar Sep 21 '22 19:09 JohnRusk

@JohnRusk Hey John, my repro steps are deliberately bare minimum, and that's why there are only 2 nodes, but we see this issue on many different production clusters as well that have more nodes. So unfortunately node count does not seem related.

arsnyder16 avatar Sep 21 '22 19:09 arsnyder16

@JohnRusk I should also mention that I worked with Microsoft support at one point and we tried with 4-core machines as well. No luck.

I have yet to try running my workloads in a non-system pool though, so I can test that for you.

arsnyder16 avatar Sep 21 '22 19:09 arsnyder16

Thanks @arsnyder16 . That test will help us be sure that we're looking at the right things.

JohnRusk avatar Sep 21 '22 19:09 JohnRusk

@arsnyder16 If that test also reproduces the issue, could you please mention that fact here and also email me directly. My GitHub profile says how to construct my email address.

JohnRusk avatar Sep 21 '22 20:09 JohnRusk

@JohnRusk Yea no problem, just setting up everything you mentioned to cover all bases

az aks nodepool add --resource-group $rg --cluster-name $aks \
  --name nodepoola \
  --node-count 3 \
  --mode System \
  --node-osdisk-size 128 \
  --node-osdisk-type Ephemeral \
  --node-vm-size Standard_DS3_v2  \
  --enable-encryption-at-host 

az aks nodepool add --resource-group $rg --cluster-name $aks \
  --name nodepoolb \
  --node-count 3 \
  --mode User \
  --node-osdisk-size 128 \
  --node-osdisk-type Ephemeral \
  --node-vm-size Standard_DS3_v2  \
  --enable-encryption-at-host 
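
To make sure the test workloads actually land on the user pool, their pod specs can be pinned with a nodeSelector. AKS labels each node with its pool name (verify with kubectl get nodes --show-labels); the sketch below shows the lines added to the Job manifest:

      # added under the Job's pod template spec, alongside restartPolicy and containers
      nodeSelector:
        agentpool: nodepoolb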

arsnyder16 avatar Sep 21 '22 20:09 arsnyder16

I am still able to reproduce the issue in both node pools as described in the previous post. One thing to note: my cluster didn't have the nginx ingress installed originally, and the workload ran fine for 15 hours; within an hour of adding the ingress (with zero replicas), the issue started. This seems to me to point to the public IP being provisioned, maybe related to the load balancer for the cluster?

helm upgrade --install nginx ingress-nginx/ingress-nginx \
  --create-namespace \
  --namespace ingress \
  --set controller.replicaCount=0 \
  --set controller.service.externalTrafficPolicy=Local \
  --wait
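
Even with controller.replicaCount=0, the chart still creates the controller Service of type LoadBalancer, which is what provisions the public IP on the cluster's load balancer. That can be confirmed with standard commands (the node resource group name below assumes the AKS default MC_ naming and the shell variables from the cluster creation step):

kubectl get svc -n ingress
az network public-ip list -g MC_${rg}_${aks}_${location} -o table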

arsnyder16 avatar Sep 22 '22 12:09 arsnyder16

Interesting... Thanks for your email. I just sent a reply.

JohnRusk avatar Sep 22 '22 20:09 JohnRusk

Progress update: we've identified that the delays are happening during the TLS Handshake.

JohnRusk avatar Sep 24 '22 01:09 JohnRusk

Progress update: we've identified that the delays are happening during the TLS Handshake.

@JohnRusk Could you please share some more information on the issue you have identified? We could be bumping into the same issue in one of our own clusters, where several requests towards the AKS API server fail with a TLS handshake timeout, causing frequent failures in services running in the cluster:

F0929 08:31:27.272615       1 config.go:37] Get "https://10.0.0.1:443/api?timeout=32s": net/http: TLS handshake timeout

Update: We were also able to reproduce the error using the Job manifest provided above, so it seems likely that it is indeed the same issue:

I0929 15:40:10.099624     323 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.25.2 (linux/amd64) kubernetes/5835544" -H "Authorization: Bearer <masked>" 'https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500'
I0929 15:40:10.101313     323 round_trippers.go:510] HTTP Trace: Dial to tcp:10.0.0.1:443 succeed
I0929 15:40:20.102170     323 round_trippers.go:553] GET https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500  in 10002 milliseconds
I0929 15:40:20.102204     323 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 1 ms TLSHandshake 10000 ms Duration 10002 ms
I0929 15:40:20.102212     323 round_trippers.go:577] Response Headers:
{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": ""
    }
}
I0929 15:40:20.102511     323 helpers.go:264] Connection error: Get https://10.0.0.1:443/api/v1/namespaces/default/pods?labelSelector=app%3Dmy-api&limit=500: net/http: TLS handshake timeout
Unable to connect to the server: net/http: TLS handshake timeout

klolos avatar Sep 29 '22 14:09 klolos

@klolos We have not found the root cause yet.

FYI, in the examples I've looked at, the issue is transient. E.g. if the same operation is retried immediately after the failure, it will succeed, i.e. it will successfully create a new connection and do a successful handshake. Is that what you see? Or do you see the problem happening over and over again in a short period of time?
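
One rough way to check this from inside a pod is to time just the TLS handshake on repeated connections to the API server; curl's time_appconnect write-out variable reports when the TLS handshake completed. The IP below is the kubernetes Service ClusterIP from the logs above, and the request itself may be rejected without a token, which doesn't matter for the handshake timing:

while true; do
  curl -sk -o /dev/null --max-time 20 \
    -w 'connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s http=%{http_code}\n' \
    https://10.0.0.1:443/version
  sleep 30
done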

JohnRusk avatar Sep 29 '22 20:09 JohnRusk

@JohnRusk Yes, the error seems transient. We are running recurring Kubeflow pipelines (which use Argo workflows), and once every few runs a pipeline step will fail because of this. Subsequent runs may succeed, so the error seems to be happening randomly.

klolos avatar Sep 30 '22 09:09 klolos

Thanks @klolos . Your issue sounds like it could indeed be the same as what we are looking at in this GitHub issue. We have gathered a lot of information in discussion with @arsnyder16. The root cause does not seem to be in the API Server itself (my area of expertise) so I'm reaching out to some colleagues here at AKS. It's a tricky issue, so the work may take some time.

JohnRusk avatar Oct 03 '22 22:10 JohnRusk

Action required from @Azure/aks-pm

ghost avatar Nov 03 '22 01:11 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar Nov 18 '22 06:11 ghost

We seem to be facing the same issue in one of our AKS clusters in the westeurope location.

@JohnRusk Did you get any more information on this issue?

rbnis avatar Nov 18 '22 07:11 rbnis

@arsnyder16 Thank you for such a nice bug report.

@JohnRusk We are experiencing the same issue: running kubectl from within the cluster sometimes bails out with "Unable to connect to the server: net/http: TLS handshake timeout" after 10 seconds. API server availability is set to 99.5%. There are no logs from the API server that indicate any kind of restart; in fact, the same API server pods have been running for 15 days and have produced more than 30 of these errors.

The connection is made to 10.0.0.1, which has a single endpoint: the public IP of the API server. It would be very strange if anything in the cluster had a decisive effect on establishing a TLS session with an external public IP. Thus it looks like a problem in the part of the AKS API service that connects the API server pods (run by Azure) with the client (kubectl).
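
That routing can be confirmed from inside the cluster; the kubernetes Service and its endpoints show where 10.0.0.1 actually points:

kubectl get svc kubernetes -n default
kubectl get endpoints kubernetes -n default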

AKS: 1.24.6
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6", GitCommit:"6c23b67c202a4cfa7c76c3e1b370bd5f0e654f30", GitTreeState:"clean", BuildDate:"2022-11-09T17:13:23Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}

mllab-nl avatar Nov 29 '22 12:11 mllab-nl

Our networking experts are looking into this. Due to the nature of the issue, it will take a bit more time, I'm sorry to say.

JohnRusk avatar Dec 04 '22 23:12 JohnRusk

I'm experiencing the same issue. A support ticket has been created.

magnus-longva-bouvet avatar Jan 02 '23 09:01 magnus-longva-bouvet

Action required from @Azure/aks-pm

ghost avatar Jan 07 '23 16:01 ghost

Issue needing attention of @Azure/aks-leads

ghost avatar Jan 22 '23 18:01 ghost

@JohnRusk any update on the issue? We are also experiencing it on a regular basis.

Alexander-Bartosh avatar Jan 30 '23 12:01 Alexander-Bartosh

@Alexander-Bartosh Are your timeouts specifically related to TLS handshake failures, and are they rare? (E.g. less than 1% of all handshakes fail.) If so, then yes, they sound a lot like this issue. If not, they may be something different.

In terms of progress, one of my colleagues has made good progress recently on what has turned out to be an unexpectedly complex issue. I don't have any dates to share at this stage.

JohnRusk avatar Jan 30 '23 20:01 JohnRusk

cc @arsnyder16 ^

JohnRusk avatar Jan 30 '23 20:01 JohnRusk

@JohnRusk Thanks for the update. Just for clarity: it seems as though your colleague has been able to replicate this reliably (maybe not frequently) in their environment to work on a potential resolution? If so, that is certainly more progress than previously reported and is a good sign.

arsnyder16 avatar Jan 30 '23 20:01 arsnyder16

They have actually set up a repro which is reliable and frequent, by changing a few things, including some internal parameters (basically, forcing the problem to happen). As far as we can tell, your version of the problem appears to be simply a lower frequency version of what my colleague is now able to reproduce. And he has a solution. It needs internal review, implementation and rollout - which will take some time. I don't have any dates to share, I'm sorry.

JohnRusk avatar Jan 30 '23 22:01 JohnRusk