metrics-server

"Failed to scrape node" err="Get \"https://10.100.93.58:10250/metrics/resource\": context deadline exceeded"

Open · MahiraTechnology opened this issue 2 years ago · 14 comments

What happened: Pods are not scaling based on load, which is causing them to restart.

What you expected to happen: The HPA should scale the pods based on load.

Anything else we need to know?:

Environment: Running on EKS 1.27 with metrics-server 0.6.3

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.):

  • Container Network Setup (flannel, calico, etc.):

  • Kubernetes version (use kubectl version):

  • Metrics Server manifest

spoiler for Metrics Server manifest:

Using the Helm chart

  • Kubelet config:
spoiler for Kubelet config:
  • Metrics server logs:
spoiler for Metrics Server logs:

I1025 19:47:59.616036 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
E1025 19:48:28.004348 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.55.152:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-55-152.ca-central-1.compute.internal"
E1025 19:48:58.004680 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": dial tcp 10.100.48.155:10250: i/o timeout" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:49:13.005190 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:49:28.003975 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:53:29.599618 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.78.163:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-78-163.ca-central-1.compute.internal"
E1025 19:54:44.588439 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.50.210:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-50-210.ca-central-1.compute.internal"
E1025 19:55:28.004773 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.73.41:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-73-41.ca-central-1.compute.internal"

  • Status of Metrics API:
spoiler for Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io

Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=metrics-server
              app.kubernetes.io/version=0.6.3
              helm.sh/chart=metrics-server-3.10.0
Annotations:  meta.helm.sh/release-name: metrics-server
              meta.helm.sh/release-namespace: kube-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2023-07-03T08:59:27Z
  Resource Version:    82108513
  UID:                 de273b86-9ba6-4d8d-929c-b972d87717e1
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:       metrics-server
    Namespace:  kube-system
    Port:       443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2023-10-25T19:48:23Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:

/kind bug

MahiraTechnology avatar Oct 25 '23 21:10 MahiraTechnology

Are you using any CNI plugin (calico, weave, vpc-cni, etc.)? If so, setting hostNetwork: true in your deployment might help.
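For reference, a minimal sketch of that override for the kubernetes-sigs metrics-server Helm chart (value names taken from the chart's values.yaml as I understand them, so verify against your chart version):

# values.yaml override (sketch)
hostNetwork:
  enabled: true   # run metrics-server in the node's network namespace

# apply it, assuming the chart repo is registered under the name "metrics-server"
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server \
  --namespace kube-system --values values.yaml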

brosef avatar Oct 25 '23 22:10 brosef

Yes, we are using the AWS vpc-cni. Let me try setting hostNetwork: true in the metrics-server deployment and I'll post my observations here.

MahiraTechnology avatar Oct 25 '23 23:10 MahiraTechnology

@MahiraTechnology Let me know how it goes. We're on EKS and noticed that with newer versions of the vpc-cni plugin there was a communication breakdown somewhere between the pod network (where metrics-server runs), the node network, and the control plane. After setting hostNetwork: true, everything worked A-OK.

brosef avatar Oct 25 '23 23:10 brosef

@brosef I tried deploying the metrics-server with hostNetwork: true and started seeing the issue below.

panic: failed to create listener: failed to listen on 0.0.0.0:10250: listen tcp 0.0.0.0:10250: bind: address already in use

goroutine 1 [running]:
main.main()
        /go/src/sigs.k8s.io/metrics-server/cmd/metrics-server/metrics-server.go:37 +0xa5

MahiraTechnology avatar Oct 26 '23 17:10 MahiraTechnology

You probably have to change the port to something else; with hostNetwork, 10250 will clash with the kubelet API port on the node. Try setting containerPort: 4443.
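In chart values that would look roughly like this (containerPort is a top-level value in the metrics-server chart as far as I can tell, so treat this as a sketch and check your chart's values.yaml):

# values.yaml override (sketch)
hostNetwork:
  enabled: true
containerPort: 4443   # with hostNetwork the port is bound on the node itself, so 10250 collides with the kubelet

With hostNetwork enabled the container port is opened in the node's network namespace, which is why the default clashes with the kubelet already listening on 10250.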

brosef avatar Oct 26 '23 18:10 brosef

@brosef I deployed with port 4443, but I am still seeing the same issue in the metrics-server pod.

I1026 18:09:39.580732 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I1026 18:09:40.087766 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I1026 18:09:40.087788 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I1026 18:09:40.087790 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I1026 18:09:40.087800 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1026 18:09:40.087777 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I1026 18:09:40.087969 1 secure_serving.go:267] Serving securely on [::]:4443
I1026 18:09:40.088007 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I1026 18:09:40.087979 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I1026 18:09:40.087993 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
W1026 18:09:40.088063 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I1026 18:09:40.188199 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I1026 18:09:40.188217 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1026 18:09:40.188231 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
E1026 18:23:38.585351 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.69.44:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-69-44.ca-central-1.compute.internal"

MahiraTechnology avatar Oct 26 '23 18:10 MahiraTechnology

Check your security group and firewall rules. Ensure TCP 10250 is open between nodes.
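One quick way to test reachability is from a throwaway pod (node IP and port taken from the errors above; with hostNetwork the real traffic is node-to-node, so this only rules out a blanket block):

# A 401/403 response means the kubelet port is reachable; a timeout means it is still blocked.
kubectl run netcheck --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk --max-time 5 https://10.100.48.155:10250/metrics/resource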

brosef avatar Oct 26 '23 18:10 brosef

@brosef I see the error below in the events:

network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

MahiraTechnology avatar Oct 26 '23 18:10 MahiraTechnology

That could mean one of many things. Try running through this article: https://repost.aws/knowledge-center/eks-cni-plugin-troubleshooting
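A couple of quick checks on the vpc-cni side (plain kubectl; aws-node is the DaemonSet the EKS CNI runs as):

# Is the CNI DaemonSet ready on every node?
kubectl -n kube-system get daemonset aws-node

# Any errors from the CNI pods themselves?
kubectl -n kube-system logs -l k8s-app=aws-node --tail=50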

brosef avatar Oct 26 '23 18:10 brosef

@brosef I went through the link shared above and everything looks OK. I now see the error messages below on the HPA:

- failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
- invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
- failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
- failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

The metrics server continues to print the same logs as before.
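For anyone debugging the same symptoms, querying the Metrics API directly helps separate 'metrics-server cannot reach the kubelets' from 'the HPA cannot reach metrics-server' (plain kubectl; replace <namespace> with the HPA's target namespace):

# Is the aggregated API registered and Available?
kubectl get apiservice v1beta1.metrics.k8s.io

# Raw node metrics straight from the Metrics API
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"

# Pod metrics for the HPA's target namespace
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods"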

MahiraTechnology avatar Oct 26 '23 20:10 MahiraTechnology

/assign @CatherineF-dev @dgrisonnet
/triage accepted

dashpole avatar Nov 02 '23 16:11 dashpole

Connecting to the node hostname or IP from within the metrics-server Pod is the problem for me as well. I'm facing the same issue when using flannel.

tengqm avatar Nov 05 '23 11:11 tengqm

@MahiraTechnology After opening ports 10250 and 443 on the node security group with the VPC CIDR as the source range, the issue was fixed.
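In AWS CLI terms that is roughly the following (the security group ID is a placeholder, and the 10.100.0.0/16 CIDR is only assumed from the node IPs earlier in this thread):

# Allow kubelet (10250) and HTTPS (443) from the VPC CIDR on the node security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 10250 \
  --cidr 10.100.0.0/16
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 443 \
  --cidr 10.100.0.0/16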

vgokul984 avatar Nov 09 '23 10:11 vgokul984

Same as @vgokul984: I have containerPort: 4443 and hostNetwork enabled in the values.yaml, and opening 10250 on the node security group (both inbound and outbound) solved the issue:

kubectl top nodes

NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-144-0-146.eu-west-1.compute.internal   45m          2%     2256Mi          31%
ip-10-144-0-17.eu-west-1.compute.internal    108m         5%     3244Mi          45%
ip-10-144-0-97.eu-west-1.compute.internal    78m          4%     3510Mi          49%

brankodjurkic avatar Mar 02 '24 11:03 brankodjurkic