metrics-server
"Failed to scrape node" err="Get \"https://10.100.93.58:10250/metrics/resource\": context deadline exceeded"
What happened: Pods are not scaling based on the load, which is causing them to restart.
What you expected to happen: HPA should scale based on the load.
Anything else we need to know?:
Environment: Running on EKS 1.27, metrics-server 0.6.3
- Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): EKS
- Container Network Setup (flannel, calico, etc.):
- Kubernetes version (use kubectl version): 1.27
- Metrics Server manifest:
spoiler for Metrics Server manifest:
Using the metrics-server Helm chart (chart version 3.10.0)
- Kubelet config:
spoiler for Kubelet config:
- Metrics server logs:
spoiler for Metrics Server logs:
I1025 19:47:59.616036 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
E1025 19:48:28.004348 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.55.152:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-55-152.ca-central-1.compute.internal"
E1025 19:48:58.004680 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": dial tcp 10.100.48.155:10250: i/o timeout" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:49:13.005190 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:49:28.003975 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.48.155:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-48-155.ca-central-1.compute.internal"
E1025 19:53:29.599618 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.78.163:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-78-163.ca-central-1.compute.internal"
E1025 19:54:44.588439 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.50.210:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-50-210.ca-central-1.compute.internal"
E1025 19:55:28.004773 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.73.41:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-73-41.ca-central-1.compute.internal"
- Status of Metrics API:
spoiler for Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io
Name: v1beta1.metrics.k8s.io
Namespace:
Labels: app.kubernetes.io/instance=metrics-server
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=metrics-server
app.kubernetes.io/version=0.6.3
helm.sh/chart=metrics-server-3.10.0
Annotations: meta.helm.sh/release-name: metrics-server
meta.helm.sh/release-namespace: kube-system
API Version: apiregistration.k8s.io/v1
Kind: APIService
Metadata:
Creation Timestamp: 2023-07-03T08:59:27Z
Resource Version: 82108513
UID: de273b86-9ba6-4d8d-929c-b972d87717e1
Spec:
Group: metrics.k8s.io
Group Priority Minimum: 100
Insecure Skip TLS Verify: true
Service:
Name: metrics-server
Namespace: kube-system
Port: 443
Version: v1beta1
Version Priority: 100
Status:
Conditions:
Last Transition Time: 2023-10-25T19:48:23Z
Message: all checks passed
Reason: Passed
Status: True
Type: Available
Events:
/kind bug
Are you using any CNI plugin (calico, weave, vpc-cni, etc.)? If so, setting hostNetwork: true in your deployment might help.
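For reference, the metrics-server Helm chart exposes this through its hostNetwork value. A minimal values.yaml sketch, assuming the chart's other defaults are kept:

```yaml
# values.yaml sketch for the metrics-server Helm chart.
# Only the hostNetwork toggle is shown; everything else keeps chart defaults.
hostNetwork:
  # Run the metrics-server pod in the node's network namespace so scrapes
  # go out over the node's interface instead of the pod network.
  enabled: true
```

Applied with something like `helm upgrade metrics-server metrics-server/metrics-server -n kube-system -f values.yaml` (release name and repo alias here are assumptions, not values from this thread).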
Yes, we are using the AWS VPC CNI. Let me try setting hostNetwork: true in the metrics-server deployment and report the observations here.
@MahiraTechnology Let me know how it goes. We're on EKS too and noticed that with newer versions of the vpc-cni plugin there was a communication breakdown somewhere between the pod VPC (where metrics-server runs), the node VPC, and the control plane. After setting hostNetwork: true, everything worked A-OK.
@brosef I tried deploying metrics-server with hostNetwork: true and started seeing the issue below:
panic: failed to create listener: failed to listen on 0.0.0.0:10250: listen tcp 0.0.0.0:10250: bind: address already in use

goroutine 1 [running]:
main.main()
        /go/src/sigs.k8s.io/metrics-server/cmd/metrics-server/metrics-server.go:37 +0xa5
You probably have to change the port to something else; with hostNetwork enabled, 10250 will clash with the kubelet API port on the node. Try setting containerPort: 4443.
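A hedged sketch of that change via Helm; the release name, chart repo alias, and namespace below are assumptions, not values from this thread:

```sh
# Keep hostNetwork enabled but move metrics-server off the kubelet's port.
helm upgrade metrics-server metrics-server/metrics-server \
  --namespace kube-system \
  --reuse-values \
  --set hostNetwork.enabled=true \
  --set containerPort=4443
```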
@brosef I deployed with port 4443, but I am still seeing the same issue in the metrics-server pod.
I1026 18:09:39.580732 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I1026 18:09:40.087766 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I1026 18:09:40.087788 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I1026 18:09:40.087790 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I1026 18:09:40.087800 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1026 18:09:40.087777 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I1026 18:09:40.087969 1 secure_serving.go:267] Serving securely on [::]:4443
I1026 18:09:40.088007 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I1026 18:09:40.087979 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I1026 18:09:40.087993 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
W1026 18:09:40.088063 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I1026 18:09:40.188199 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I1026 18:09:40.188217 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1026 18:09:40.188231 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
E1026 18:23:38.585351 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.69.44:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-69-44.ca-central-1.compute.internal"
Check your security group and firewall rules; ensure TCP 10250 is open between the nodes.
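If you want to do that check from the AWS CLI, something like the following works; the security group ID and CIDR here are placeholders, not values taken from this issue:

```sh
# List the inbound rules on the node security group
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[0].IpPermissions'

# Allow kubelet traffic (TCP 10250) from the VPC CIDR
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 10250 \
  --cidr 10.100.0.0/16
```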
@brosef I see the error below in the events:
network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
That could mean one of many things. Try running through this article: https://repost.aws/knowledge-center/eks-cni-plugin-troubleshooting
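Two quick checks from that guide that can be run with plain kubectl (assuming the default aws-node DaemonSet and container names used by the VPC CNI):

```sh
# Every node should show a ready aws-node pod
kubectl -n kube-system get daemonset aws-node

# Look for initialization errors from the CNI pods
kubectl -n kube-system logs -l k8s-app=aws-node -c aws-node --tail=50
```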
@brosef I went through the link shared above and everything looks OK. I see the error messages below on the HPA:
- failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
- invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready)
- failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
- failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
The metrics-server continues to print the same logs as before.
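When the HPA reports that it cannot fetch from the resource metrics API, it can help to query the Metrics API directly and confirm the aggregated APIService is Available (standard kubectl commands, nothing specific to this cluster):

```sh
# Aggregated API should report Available=True
kubectl get apiservice v1beta1.metrics.k8s.io

# Ask the API server for node and pod metrics directly
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
kubectl top pods -n kube-system
```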
/assign @CatherineF-dev @dgrisonnet
/triage accepted
Connecting to the node hostname or IP from within the metrics-server pod is the problem for me as well. I'm facing the same issue when using flannel.
@MahiraTechnology After opening ports 10250 and 443 on the node security group with the VPC CIDR as the source range, the issue was fixed.
Same as @vgokul984
I have containerPort: 4443 and hostNetwork enabled in the values.yaml.
Opening 10250 on the node security group, both inbound and outbound, solved the issue:
kubectl top nodes
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-144-0-146.eu-west-1.compute.internal   45m          2%     2256Mi          31%
ip-10-144-0-17.eu-west-1.compute.internal    108m         5%     3244Mi          45%
ip-10-144-0-97.eu-west-1.compute.internal    78m          4%     3510Mi          49%