
Metrics server not running

Open nabad600 opened this issue 2 years ago • 17 comments

Hi,

I have deployed the metrics server but it's not running (screenshot attached).

I have added the lines below after "args" in components.yaml:

command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP

Then it worked. A screenshot is attached for reference.

Can you please add those lines, or put the instructions in the README.md file?
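
For anyone applying the same workaround: with the stock components.yaml it is usually enough to append the flag to the existing args list of the metrics-server container rather than overriding command. A minimal sketch of the relevant fragment (the image tag may differ in your copy):

containers:
  - name: metrics-server
    image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
    args:
      - --cert-dir=/tmp
      - --secure-port=4443
      - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
      - --kubelet-use-node-status-port
      - --metric-resolution=15s
      - --kubelet-insecure-tls   # skips kubelet serving-cert validation; intended for test clusters only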

nabad600 avatar Aug 23 '22 15:08 nabad600

Maybe this is what you need

Kubelet certificate needs to be signed by cluster Certificate Authority (or disable certificate validation by passing --kubelet-insecure-tls to Metrics Server)

https://github.com/kubernetes-sigs/metrics-server/blob/master/README.md#requirements
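
If you would rather not edit the manifest by hand, the same flag can be appended with a JSON patch (a sketch, assuming the default Deployment name metrics-server in kube-system with the metrics-server container at index 0):

kubectl -n kube-system patch deployment metrics-server --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'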

yangjunmyfm192085 avatar Aug 24 '22 00:08 yangjunmyfm192085

I'm having the same problem. I tried passing --kubelet-insecure-tls in the args and also the suggested fix, but neither worked. Output of kubectl describe for the metrics-server pod (note that --kubelet-insecure-tls was passed):

Name:                 metrics-server-658867cdb7-9g2px
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 docker-desktop/192.168.65.4
Start Time:           Thu, 25 Aug 2022 21:47:40 -0300
Labels:               k8s-app=metrics-server
                      pod-template-hash=658867cdb7
Annotations:          <none>
Status:               Running
IP:                   10.1.0.78
IPs:
  IP:           10.1.0.78
Controlled By:  ReplicaSet/metrics-server-658867cdb7
Containers:
  metrics-server:
    Container ID:  docker://4c1d8989b9aa61a4d924f8cb6102084e5faadd7d93c661f186dfa3b99721f22e
    Image:         k8s.gcr.io/metrics-server/metrics-server:v0.6.1
    Image ID:      docker-pullable://k8s.gcr.io/metrics-server/metrics-server@sha256:5ddc6458eb95f5c70bd13fdab90cbd7d6ad1066e5b528ad1dcb28b76c5fb2f00
    Port:          4443/TCP
    Host Port:     0/TCP
    Args:
      --cert-dir=/tmp
      --secure-port=4443
      --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
      --kubelet-use-node-status-port
      --metric-resolution=15s
      --kubelet-insecure-tls
    State:          Running
      Started:      Thu, 25 Aug 2022 21:47:41 -0300
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        100m
      memory:     200Mi
    Liveness:     http-get https://:https/livez delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:https/readyz delay=20s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /tmp from tmp-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v5mcn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmp-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-v5mcn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  30s   default-scheduler  Successfully assigned kube-system/metrics-server-658867cdb7-9g2px to docker-desktop
  Normal   Pulled     29s   kubelet            Container image "k8s.gcr.io/metrics-server/metrics-server:v0.6.1" already present on machine
  Normal   Created    29s   kubelet            Created container metrics-server
  Normal   Started    29s   kubelet            Started container metrics-server
  Warning  Unhealthy  0s    kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500

Gab-Menezes avatar Aug 26 '22 00:08 Gab-Menezes

Can you provide the logs of metrics-server?

yangjunmyfm192085 avatar Aug 26 '22 00:08 yangjunmyfm192085

kubectl logs -n kube-system pods/metrics-server-658867cdb7-9g2px

I0826 00:47:41.539717       1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0826 00:47:41.898143       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0826 00:47:41.898160       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0826 00:47:41.898160       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0826 00:47:41.898168       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0826 00:47:41.898173       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0826 00:47:41.898181       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0826 00:47:41.898381       1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0826 00:47:41.898427       1 secure_serving.go:266] Serving securely on [::]:4443
I0826 00:47:41.898483       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
W0826 00:47:41.898544       1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0826 00:47:41.998317       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0826 00:47:41.998329       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0826 00:47:41.998364       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
E0826 00:47:55.396469       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
E0826 00:48:10.396404       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:48:10.428713       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:48:20.429191       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:48:25.396717       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:48:30.430000       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:48:40.396273       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:48:40.429237       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:48:50.428677       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:48:55.396038       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:49:00.429074       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:49:02.239233       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:49:10.397128       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:49:10.429253       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:49:20.429121       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:49:25.396276       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:49:30.428659       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:49:40.396563       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:49:40.429080       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:49:50.428520       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:49:55.396179       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:50:00.429383       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:50:10.395473       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:50:10.429282       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:50:20.429489       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:50:25.396494       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:50:30.428938       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:50:32.239825       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:50:40.396391       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:50:40.428628       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:50:50.430082       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:50:55.396338       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:51:00.427547       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:51:10.395791       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:51:10.428750       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:51:20.428033       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:51:25.395724       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:51:30.428932       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:51:38.240021       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:51:40.396141       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:51:40.429124       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:51:50.428346       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:51:55.396187       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:52:00.429772       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:52:10.396327       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:52:10.428712       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:52:20.428466       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:52:25.397218       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:52:30.429568       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:52:40.396537       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:52:40.429133       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:52:50.429106       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0826 00:52:55.396498       1 scraper.go:140] "Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"
I0826 00:53:00.239568       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 00:53:00.429061       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"

Gab-Menezes avatar Aug 26 '22 00:08 Gab-Menezes

Maybe the problem is that the IP 192.168.65.4 is being used, but on Windows the internal IP is not used; localhost is used instead.

kubectl get nodes -o wide

NAME             STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                       CONTAINER-RUNTIME
docker-desktop   Ready    control-plane   23h   v1.24.0   192.168.65.4   <none>        Docker Desktop   5.10.102.1-microsoft-standard-WSL2   docker://20.10.14

Gab-Menezes avatar Aug 26 '22 00:08 Gab-Menezes

"Failed to scrape node" err="Get \"https://192.168.65.4:10250/metrics/resource\": context deadline exceeded" node="docker-desktop"

Accessing the kubelet's /metrics/resource endpoint timed out.

There is a similar issue here that may help: https://github.com/kubernetes-sigs/metrics-server/issues/907
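
One quick way to separate a network problem from a TLS/auth problem is to curl the kubelet from inside the cluster (a sketch; the curlimages/curl image is just an example, the node IP is taken from the log above, and an Unauthorized/Forbidden response is expected without a token -- what matters is whether the kubelet answers at all instead of hanging):

kubectl run kubelet-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -k -m 5 https://192.168.65.4:10250/metrics/resource

If this times out the same way, the issue is connectivity between the pod network and the kubelet, not metrics-server configuration.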

yangjunmyfm192085 avatar Aug 26 '22 02:08 yangjunmyfm192085

Increasing the metric-resolution made the pod ready, but after that, executing kubectl get --raw /api/v1/nodes/docker-desktop/proxy/metrics/resource returned: Error from server (NotFound): the server could not find the requested resource. The problem is that before changing the value I was able to hit the endpoint (with a response time of ~20s, as described in the issue). The other problem is that the HPA is complaining.

kubectl get all -n kube-system

NAME                                         READY   STATUS    RESTARTS        AGE
pod/coredns-6d4b75cb6d-8qzvf                 1/1     Running   1 (7h10m ago)   24h
pod/coredns-6d4b75cb6d-km2lg                 1/1     Running   1 (7h10m ago)   24h
pod/etcd-docker-desktop                      1/1     Running   1 (7h10m ago)   24h
pod/kube-apiserver-docker-desktop            1/1     Running   1 (7h10m ago)   24h
pod/kube-controller-manager-docker-desktop   1/1     Running   1 (7h10m ago)   24h
pod/kube-proxy-mw4cv                         1/1     Running   1 (7h10m ago)   24h
pod/kube-scheduler-docker-desktop            1/1     Running   1 (7h10m ago)   24h
pod/metrics-server-769cd769-j4xck            1/1     Running   0               15m
pod/storage-provisioner                      1/1     Running   1 (7h10m ago)   24h
pod/vpnkit-controller                        1/1     Running   45 (9m7s ago)   24h

NAME                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
service/kube-dns         ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   24h
service/metrics-server   ClusterIP   10.110.212.220   <none>        443/TCP                  15m

NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/kube-proxy   1         1         1       1            1           kubernetes.io/os=linux   24h

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/coredns          2/2     2            2           24h
deployment.apps/metrics-server   1/1     1            1           15m

NAME                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/coredns-6d4b75cb6d        2         2         2       24h
replicaset.apps/metrics-server-769cd769   1         1         1       15m

kubectl describe pod -n kube-system metrics-server-769cd769-j4xck

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  12m                default-scheduler  Successfully assigned kube-system/metrics-server-769cd769-j4xck to docker-desktop
  Normal   Pulled     12m                kubelet            Container image "k8s.gcr.io/metrics-server/metrics-server:v0.6.1" already present on machine
  Normal   Created    12m                kubelet            Created container metrics-server
  Normal   Started    12m                kubelet            Started container metrics-server
  Warning  Unhealthy  11m (x6 over 12m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500

kubectl logs -n kube-system pods/metrics-server-769cd769-j4xck

I0826 02:25:55.294316       1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0826 02:25:55.636095       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0826 02:25:55.636127       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0826 02:25:55.636138       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0826 02:25:55.636145       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0826 02:25:55.636154       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0826 02:25:55.636146       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0826 02:25:55.636233       1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0826 02:25:55.636274       1 secure_serving.go:266] Serving securely on [::]:4443
W0826 02:25:55.636337       1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0826 02:25:55.636337       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0826 02:25:55.736986       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0826 02:25:55.737003       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0826 02:25:55.737291       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0826 02:26:24.131175       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 02:26:34.132103       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 02:26:44.131387       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 02:26:54.131096       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 02:27:00.239368       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0826 02:27:04.131346       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"

kubectl describe hpa hpa-portal

Name:                                                  hpa-portal
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Thu, 25 Aug 2022 23:29:18 -0300
Reference:                                             Deployment/deployment-portal
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 50%
Min replicas:                                          1
Max replicas:                                          10
Deployment pods:                                       3 current / 0 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Events:
  Type     Reason                        Age                 From                       Message
  ----     ------                        ----                ----                       -------
  Warning  FailedGetResourceMetric       54s (x11 over 10m)  horizontal-pod-autoscaler  failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  54s (x11 over 10m)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
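
When the HPA reports no metrics returned from resource metrics API, it can help to confirm whether the metrics API itself is registered and serving, since that is the exact path the HPA consumes (a sketch; jq is only for readability):

kubectl get apiservice v1beta1.metrics.k8s.io
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods" | jq .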

Gab-Menezes avatar Aug 26 '22 02:08 Gab-Menezes

If executing kubectl get --raw /api/v1/nodes/docker-desktop/proxy/metrics/resource returns Error from server (NotFound): the server could not find the requested resource, please check whether the kubelet is running normally and whether it is restarting repeatedly.
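
A few hedged ways to check the kubelet itself (Docker Desktop runs the kubelet inside its own VM, so from kubectl the closest checks are the node conditions and events; the systemctl/journalctl commands only apply on a regular Linux node you can reach):

kubectl get node docker-desktop -o wide
kubectl describe node docker-desktop        # look at Conditions and recent Events
# on a regular Linux node, via SSH:
systemctl status kubelet
journalctl -u kubelet --since "10 min ago"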

yangjunmyfm192085 avatar Aug 26 '22 05:08 yangjunmyfm192085

/kind support

yangjunmyfm192085 avatar Aug 27 '22 01:08 yangjunmyfm192085

Hi, @Gab-Menezes, Has the issue been solved?

yangjunmyfm192085 avatar Aug 29 '22 04:08 yangjunmyfm192085

Sorry, I forgot about this issue, but no. I can't figure out how to check the kubelet status on Windows, but I would say it is running normally since I don't have any problems with other pods/applications.

Gab-Menezes avatar Aug 29 '22 17:08 Gab-Menezes

Hey @yangjunmyfm192085, I have the same issue. I'm getting failed to get cpu utilization: missing request for cpu for the HPA.

sk@mac ~ % kubectl version --output=yaml
clientVersion:
  buildDate: "2022-01-19T17:51:12Z"
  compiler: gc
  gitCommit: b631974d68ac5045e076c86a5c66fba6f128dc72
  gitTreeState: clean
  gitVersion: v1.21.9
  goVersion: go1.16.12
  major: "1"
  minor: "21"
  platform: darwin/arm64
serverVersion:
  buildDate: "2022-07-06T18:06:50Z"
  compiler: gc
  gitCommit: ac73613dfd25370c18cbbbc6bfc65449397b35c7
  gitTreeState: clean
  gitVersion: v1.21.14-eks-18ef993
  goVersion: go1.16.15
  major: "1"
  minor: 21+
  platform: linux/amd64

hpa debug

sk@mac ~ % kubectl describe hpa test-hpa -n bart                            
Name:                                                  test-hpa
Namespace:                                             bart
Labels:                                                app.kubernetes.io/managed-by=Helm
Annotations:                                           meta.helm.sh/release-name: test-bart
                                                       meta.helm.sh/release-namespace: bart
CreationTimestamp:                                     Wed, 07 Sep 2022 18:17:15 +0300
Reference:                                             Deployment/test-88
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 60%
Min replicas:                                          2
Max replicas:                                          20
Deployment pods:                                       2 current / 0 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: failed to get cpu utilization: missing request for cpu
Events:
  Type     Reason                   Age                     From                       Message
  ----     ------                   ----                    ----                       -------
  Warning  FailedGetResourceMetric  2m45s (x4359 over 18h)  horizontal-pod-autoscaler  failed to get cpu utilization: missing request for cpu

metrics-server logs

I0907 15:32:51.346245       1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0907 15:32:51.742115       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0907 15:32:51.742137       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0907 15:32:51.742142       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0907 15:32:51.742151       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0907 15:32:51.742172       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0907 15:32:51.742179       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0907 15:32:51.742876       1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0907 15:32:51.743033       1 secure_serving.go:266] Serving securely on [::]:4443
I0907 15:32:51.743072       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
W0907 15:32:51.743156       1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0907 15:32:51.842769       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 
I0907 15:32:51.842791       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 
I0907 15:32:51.842810       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController 
I0907 15:33:19.126706       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0907 15:33:29.125385       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0907 15:33:39.126147       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0907 15:33:49.125540       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0907 15:33:59.125736       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0907 15:34:09.125897       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0907 15:34:19.127252       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0907 15:34:29.125685       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"

Any ideas on how to debug or fix this?
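
For the missing request for cpu message specifically (as opposed to no metrics to serve): the HPA computes CPU utilization as a percentage of the containers' CPU requests, so every container in the target Deployment (Deployment/test-88 here) must declare one. A hedged sketch of the fragment the HPA needs; the container name, image and values are placeholders:

containers:
  - name: test-88              # hypothetical container name
    image: example.org/app:tag # placeholder image
    resources:
      requests:
        cpu: 250m              # required for a CPU-utilization HPA target
        memory: 256Mi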

serg4kostiuk avatar Sep 08 '22 09:09 serg4kostiuk

Hi @serg4kostiuk, thanks for the feedback. You can refer to the known issues below to check whether the metrics obtained from the kubelet are incomplete:

https://github.com/kubernetes-sigs/metrics-server/blob/master/KNOWN_ISSUES.md#kubelet-doesnt-report-metrics-for-all-or-subset-of-nodes
https://github.com/kubernetes-sigs/metrics-server/blob/master/KNOWN_ISSUES.md#kubelet-doesnt-report-pod-metrics

yangjunmyfm192085 avatar Sep 08 '22 10:09 yangjunmyfm192085

Thanks, @yangjunmyfm192085, I already did that. The output:

sk@mac ~ % kubectl top nodes
W0908 14:58:55.031967   33507 top_node.go:119] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-100-100-174.ec2.internal   196m         2%     12382Mi         85%       
ip-10-100-100-8.ec2.internal     218m         5%     3413Mi          51%       
ip-10-100-101-169.ec2.internal   103m         1%     3165Mi          21%       
ip-10-100-101-229.ec2.internal   121m         1%     7005Mi          48%    

metrics

sk@mac ~ % NODE_NAME=ip-10-100-100-174.ec2.internal
kubectl get --raw /api/v1/nodes/$NODE_NAME/proxy/metrics/resource
# HELP container_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the container in core-seconds
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="agent",namespace="newrelic",pod="newrelic-nrk8s-kubelet-bvjsd"} 418.524933302 1662638501027
container_cpu_usage_seconds_total{container="aws-node",namespace="kube-system",pod="aws-node-npt8s"} 312.666112546 1662638509159
container_cpu_usage_seconds_total{container="aws-node-termination-handler",namespace="kube-system",pod="aws-node-termination-handler-vskng"} 41.500486192 1662638510942
container_cpu_usage_seconds_total{container="kube-proxy",namespace="kube-system",pod="kube-proxy-5bb56"} 35.132398838 1662638498393
container_cpu_usage_seconds_total{container="kubelet",namespace="newrelic",pod="newrelic-nrk8s-kubelet-bvjsd"} 809.146088806 1662638508346
container_cpu_usage_seconds_total{container="newrelic-logging",namespace="newrelic",pod="newrelic-newrelic-logging-8ztfv"} 666.912120419 1662638506883
container_cpu_usage_seconds_total{container="test",namespace="voidnew",pod="test-5799b7b769-hncfb"} 951.648087092 1662638501570
container_cpu_usage_seconds_total{container="test-admin",namespace="bart",pod="test-admin-5b47b858f4-88cmt"} 15.241756556 1662638499313
container_cpu_usage_seconds_total{container="test-admin",namespace="voidnew",pod="test-admin-659cf556cc-qclcz"} 594.401490141 1662638502666
container_cpu_usage_seconds_total{container="test-nginx",namespace="bart",pod="test-admin-5b47b858f4-88cmt"} 0.073064288 1662638507147
container_cpu_usage_seconds_total{container="test-nginx",namespace="voidnew",pod="test-5799b7b769-hncfb"} 20.366364436 1662638503771
container_cpu_usage_seconds_total{container="test-nginx",namespace="voidnew",pod="test-admin-659cf556cc-qclcz"} 4.621754426 1662638499878
container_cpu_usage_seconds_total{container="test-shoryuken",namespace="voidnew",pod="test-shoryuken-6bb9c696c6-lqdl4"} 45.107665199 1662638499624
container_cpu_usage_seconds_total{container="test-sidekiq-api-stats",namespace="bart",pod="test-sidekiq-api-stats-6c5b94756c-fchqw"} 104.431471208 1662638508859
container_cpu_usage_seconds_total{container="test-sidekiq-api-stats",namespace="voidnew",pod="test-sidekiq-api-stats-676d9bcf67-m2429"} 88.972119973 1662638505879
container_cpu_usage_seconds_total{container="test-sidekiq-default",namespace="bart",pod="test-sidekiq-default-6b5b648b85-nfv6v"} 400.918358602 1662638510012
container_cpu_usage_seconds_total{container="test-sidekiq-default",namespace="voidnew",pod="test-sidekiq-default-7f6556f77f-mr2dx"} 686.062540092 1662638503859
container_cpu_usage_seconds_total{container="test-sidekiq-export",namespace="voidnew",pod="test-sidekiq-export-557dd4cbf6-l955f"} 243.546512455 1662638511639
container_cpu_usage_seconds_total{container="test-sidekiq-mailer",namespace="bart",pod="test-sidekiq-mailer-56f59cdcc6-zhhqt"} 515.512617825 1662638497654
container_cpu_usage_seconds_total{container="test-sidekiq-mailer",namespace="voidnew",pod="test-sidekiq-mailer-8576d8fdd9-94bbt"} 449.133985296 1662638501244
container_cpu_usage_seconds_total{container="test-sidekiq-page-export",namespace="bart",pod="test-sidekiq-page-export-544fcb6bc-c4hpk"} 214.001744333 1662638509405
container_cpu_usage_seconds_total{container="test-sidekiq-page-export",namespace="voidnew",pod="test-sidekiq-page-export-7b9cdd66c9-d4dlg"} 226.575217255 1662638497931
container_cpu_usage_seconds_total{container="test-sidekiq-reporting",namespace="bart",pod="test-sidekiq-reporting-577864846-jfnwx"} 293.053909902 1662638506392
container_cpu_usage_seconds_total{container="test-sidekiq-reporting",namespace="voidnew",pod="test-sidekiq-reporting-76c8bd94fb-nh7ct"} 333.585208812 1662638511944
container_cpu_usage_seconds_total{container="test-sidekiq-support",namespace="bart",pod="test-sidekiq-support-6bdb69567d-gxnwv"} 152.612587484 1662638507206
container_cpu_usage_seconds_total{container="test-sidekiq-support",namespace="voidnew",pod="test-sidekiq-support-57dbb94478-t8x69"} 133.418589795 1662638511717
container_cpu_usage_seconds_total{container="test-sidekiq-webhook",namespace="bart",pod="test-sidekiq-webhook-c4b779697-khmcg"} 189.322319754 1662638503264
container_cpu_usage_seconds_total{container="test-sidekiq-webhook",namespace="voidnew",pod="test-sidekiq-webhook-54687bf855-45jj8"} 20.553327681 1662638507288
# HELP container_memory_working_set_bytes [ALPHA] Current working set of the container in bytes
# TYPE container_memory_working_set_bytes gauge
container_memory_working_set_bytes{container="agent",namespace="newrelic",pod="newrelic-nrk8s-kubelet-bvjsd"} 3.0031872e+07 1662638501027
container_memory_working_set_bytes{container="aws-node",namespace="kube-system",pod="aws-node-npt8s"} 5.1376128e+07 1662638509159
container_memory_working_set_bytes{container="aws-node-termination-handler",namespace="kube-system",pod="aws-node-termination-handler-vskng"} 1.3918208e+07 1662638510942
container_memory_working_set_bytes{container="kube-proxy",namespace="kube-system",pod="kube-proxy-5bb56"} 2.3269376e+07 1662638498393
container_memory_working_set_bytes{container="kubelet",namespace="newrelic",pod="newrelic-nrk8s-kubelet-bvjsd"} 2.6693632e+07 1662638508346
container_memory_working_set_bytes{container="newrelic-logging",namespace="newrelic",pod="newrelic-newrelic-logging-8ztfv"} 3.442688e+07 1662638506883
container_memory_working_set_bytes{container="test",namespace="voidnew",pod="test-5799b7b769-hncfb"} 2.48608768e+09 1662638501570
container_memory_working_set_bytes{container="test-admin",namespace="bart",pod="test-admin-5b47b858f4-88cmt"} 1.324457984e+09 1662638499313
container_memory_working_set_bytes{container="test-admin",namespace="voidnew",pod="test-admin-659cf556cc-qclcz"} 2.555981824e+09 1662638502666
container_memory_working_set_bytes{container="test-nginx",namespace="bart",pod="test-admin-5b47b858f4-88cmt"} 3.473408e+06 1662638507147
container_memory_working_set_bytes{container="test-nginx",namespace="voidnew",pod="test-5799b7b769-hncfb"} 3.977216e+06 1662638503771
container_memory_working_set_bytes{container="test-nginx",namespace="voidnew",pod="test-admin-659cf556cc-qclcz"} 4.56704e+06 1662638499878
container_memory_working_set_bytes{container="test-shoryuken",namespace="voidnew",pod="test-shoryuken-6bb9c696c6-lqdl4"} 3.3134592e+08 1662638499624
container_memory_working_set_bytes{container="test-sidekiq-api-stats",namespace="bart",pod="test-sidekiq-api-stats-6c5b94756c-fchqw"} 3.41594112e+08 1662638508859
container_memory_working_set_bytes{container="test-sidekiq-api-stats",namespace="voidnew",pod="test-sidekiq-api-stats-676d9bcf67-m2429"} 3.31546624e+08 1662638505879
container_memory_working_set_bytes{container="test-sidekiq-default",namespace="bart",pod="test-sidekiq-default-6b5b648b85-nfv6v"} 3.4945024e+08 1662638510012
container_memory_working_set_bytes{container="test-sidekiq-default",namespace="voidnew",pod="test-sidekiq-default-7f6556f77f-mr2dx"} 6.06142464e+08 1662638503859
container_memory_working_set_bytes{container="test-sidekiq-export",namespace="voidnew",pod="test-sidekiq-export-557dd4cbf6-l955f"} 3.58084608e+08 1662638511639
container_memory_working_set_bytes{container="test-sidekiq-mailer",namespace="bart",pod="test-sidekiq-mailer-56f59cdcc6-zhhqt"} 3.61381888e+08 1662638497654
container_memory_working_set_bytes{container="test-sidekiq-mailer",namespace="voidnew",pod="test-sidekiq-mailer-8576d8fdd9-94bbt"} 3.6489216e+08 1662638501244
container_memory_working_set_bytes{container="test-sidekiq-page-export",namespace="bart",pod="test-sidekiq-page-export-544fcb6bc-c4hpk"} 3.70192384e+08 1662638509405
container_memory_working_set_bytes{container="test-sidekiq-page-export",namespace="voidnew",pod="test-sidekiq-page-export-7b9cdd66c9-d4dlg"} 3.95972608e+08 1662638497931
container_memory_working_set_bytes{container="test-sidekiq-reporting",namespace="bart",pod="test-sidekiq-reporting-577864846-jfnwx"} 3.48819456e+08 1662638506392
container_memory_working_set_bytes{container="test-sidekiq-reporting",namespace="voidnew",pod="test-sidekiq-reporting-76c8bd94fb-nh7ct"} 3.26725632e+08 1662638511944
container_memory_working_set_bytes{container="test-sidekiq-support",namespace="bart",pod="test-sidekiq-support-6bdb69567d-gxnwv"} 3.33975552e+08 1662638507206
container_memory_working_set_bytes{container="test-sidekiq-support",namespace="voidnew",pod="test-sidekiq-support-57dbb94478-t8x69"} 3.49630464e+08 1662638511717
container_memory_working_set_bytes{container="test-sidekiq-webhook",namespace="bart",pod="test-sidekiq-webhook-c4b779697-khmcg"} 3.43396352e+08 1662638503264
container_memory_working_set_bytes{container="test-sidekiq-webhook",namespace="voidnew",pod="test-sidekiq-webhook-54687bf855-45jj8"} 3.424256e+08 1662638507288
# HELP node_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the node in core-seconds
# TYPE node_cpu_usage_seconds_total counter
node_cpu_usage_seconds_total 22426.988613964 1662638508463
# HELP node_memory_working_set_bytes [ALPHA] Current working set of the node in bytes
# TYPE node_memory_working_set_bytes gauge
node_memory_working_set_bytes 1.2950503424e+10 1662638508463
# HELP pod_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the pod in core-seconds
# TYPE pod_cpu_usage_seconds_total counter
pod_cpu_usage_seconds_total{namespace="bart",pod="test-admin-5b47b858f4-88cmt"} 15.499669223 1662638511093
pod_cpu_usage_seconds_total{namespace="bart",pod="test-sidekiq-api-stats-6c5b94756c-fchqw"} 104.654907175 1662638501088
pod_cpu_usage_seconds_total{namespace="bart",pod="test-sidekiq-default-6b5b648b85-nfv6v"} 401.04078175 1662638500826
pod_cpu_usage_seconds_total{namespace="bart",pod="test-sidekiq-mailer-56f59cdcc6-zhhqt"} 515.771736458 1662638507253
pod_cpu_usage_seconds_total{namespace="bart",pod="test-sidekiq-page-export-544fcb6bc-c4hpk"} 214.154657375 1662638506045
pod_cpu_usage_seconds_total{namespace="bart",pod="test-sidekiq-reporting-577864846-jfnwx"} 293.230341888 1662638495754
pod_cpu_usage_seconds_total{namespace="bart",pod="test-sidekiq-support-6bdb69567d-gxnwv"} 152.754955701 1662638500843
pod_cpu_usage_seconds_total{namespace="bart",pod="test-sidekiq-webhook-c4b779697-khmcg"} 189.489380615 1662638503680
pod_cpu_usage_seconds_total{namespace="kube-system",pod="aws-node-npt8s"} 312.706103434 1662638501756
pod_cpu_usage_seconds_total{namespace="kube-system",pod="aws-node-termination-handler-vskng"} 41.653263798 1662638501680
pod_cpu_usage_seconds_total{namespace="kube-system",pod="kube-proxy-5bb56"} 35.150185947 1662638505392
pod_cpu_usage_seconds_total{namespace="newrelic",pod="newrelic-newrelic-logging-8ztfv"} 667.08169285 1662638507505
pod_cpu_usage_seconds_total{namespace="newrelic",pod="newrelic-nrk8s-kubelet-bvjsd"} 1227.605221778 1662638504474
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-5799b7b769-hncfb"} 972.175565237 1662638502790
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-admin-659cf556cc-qclcz"} 599.177898867 1662638502685
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-shoryuken-6bb9c696c6-lqdl4"} 45.27940431 1662638505950
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-sidekiq-api-stats-676d9bcf67-m2429"} 105.305951433 1662638507380
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-sidekiq-default-7f6556f77f-mr2dx"} 686.245633742 1662638506965
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-sidekiq-export-557dd4cbf6-l955f"} 243.701396206 1662638504528
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-sidekiq-mailer-8576d8fdd9-94bbt"} 449.349414714 1662638502155
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-sidekiq-page-export-7b9cdd66c9-d4dlg"} 226.830506739 1662638509789
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-sidekiq-reporting-76c8bd94fb-nh7ct"} 333.764302034 1662638503290
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-sidekiq-support-57dbb94478-t8x69"} 133.610564259 1662638500707
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test-sidekiq-webhook-54687bf855-45jj8"} 225.202388868 1662638507478
# HELP pod_memory_working_set_bytes [ALPHA] Current working set of the pod in bytes
# TYPE pod_memory_working_set_bytes gauge
pod_memory_working_set_bytes{namespace="bart",pod="test-admin-5b47b858f4-88cmt"} 1.329102848e+09 1662638511093
pod_memory_working_set_bytes{namespace="bart",pod="test-sidekiq-api-stats-6c5b94756c-fchqw"} 3.42798336e+08 1662638501088
pod_memory_working_set_bytes{namespace="bart",pod="test-sidekiq-default-6b5b648b85-nfv6v"} 3.48680192e+08 1662638500826
pod_memory_working_set_bytes{namespace="bart",pod="test-sidekiq-mailer-56f59cdcc6-zhhqt"} 3.6145152e+08 1662638507253
pod_memory_working_set_bytes{namespace="bart",pod="test-sidekiq-page-export-544fcb6bc-c4hpk"} 3.7117952e+08 1662638506045
pod_memory_working_set_bytes{namespace="bart",pod="test-sidekiq-reporting-577864846-jfnwx"} 3.5037184e+08 1662638495754
pod_memory_working_set_bytes{namespace="bart",pod="test-sidekiq-support-6bdb69567d-gxnwv"} 3.34823424e+08 1662638500843
pod_memory_working_set_bytes{namespace="bart",pod="test-sidekiq-webhook-c4b779697-khmcg"} 3.44469504e+08 1662638503680
pod_memory_working_set_bytes{namespace="kube-system",pod="aws-node-npt8s"} 5.255168e+07 1662638501756
pod_memory_working_set_bytes{namespace="kube-system",pod="aws-node-termination-handler-vskng"} 1.4897152e+07 1662638501680
pod_memory_working_set_bytes{namespace="kube-system",pod="kube-proxy-5bb56"} 2.4203264e+07 1662638505392
pod_memory_working_set_bytes{namespace="newrelic",pod="newrelic-newrelic-logging-8ztfv"} 3.5241984e+07 1662638507505
pod_memory_working_set_bytes{namespace="newrelic",pod="newrelic-nrk8s-kubelet-bvjsd"} 5.3981184e+07 1662638504474
pod_memory_working_set_bytes{namespace="voidnew",pod="test-5799b7b769-hncfb"} 2.491006976e+09 1662638502790
pod_memory_working_set_bytes{namespace="voidnew",pod="test-admin-659cf556cc-qclcz"} 2.561667072e+09 1662638502685
pod_memory_working_set_bytes{namespace="voidnew",pod="test-shoryuken-6bb9c696c6-lqdl4"} 3.32480512e+08 1662638505950
pod_memory_working_set_bytes{namespace="voidnew",pod="test-sidekiq-api-stats-676d9bcf67-m2429"} 3.3452032e+08 1662638507380
pod_memory_working_set_bytes{namespace="voidnew",pod="test-sidekiq-default-7f6556f77f-mr2dx"} 6.06650368e+08 1662638506965
pod_memory_working_set_bytes{namespace="voidnew",pod="test-sidekiq-export-557dd4cbf6-l955f"} 3.59264256e+08 1662638504528
pod_memory_working_set_bytes{namespace="voidnew",pod="test-sidekiq-mailer-8576d8fdd9-94bbt"} 3.66575616e+08 1662638502155
pod_memory_working_set_bytes{namespace="voidnew",pod="test-sidekiq-page-export-7b9cdd66c9-d4dlg"} 3.9723008e+08 1662638509789
pod_memory_working_set_bytes{namespace="voidnew",pod="test-sidekiq-reporting-76c8bd94fb-nh7ct"} 3.25435392e+08 1662638503290
pod_memory_working_set_bytes{namespace="voidnew",pod="test-sidekiq-support-57dbb94478-t8x69"} 3.50785536e+08 1662638500707
pod_memory_working_set_bytes{namespace="voidnew",pod="test-sidekiq-webhook-54687bf855-45jj8"} 3.43441408e+08 1662638507478
# HELP scrape_error [ALPHA] 1 if there was an error while getting container metrics, 0 otherwise
# TYPE scrape_error gauge
scrape_error 0

I actually use metrics-server version 0.6.1, but the output of the 0.5.1-style check is the following:

sk@mac ~ % kubectl get --raw /api/v1/nodes/$NODE_NAME/proxy/stats/summary | jq '{cpu: .node.cpu, memory: .node.memory}'

{
  "cpu": {
    "time": "2022-09-08T12:05:59Z",
    "usageNanoCores": 144794016,
    "usageCoreNanoSeconds": 22477182020253
  },
  "memory": {
    "time": "2022-09-08T12:05:59Z",
    "availableBytes": 3352186880,
    "usageBytes": 14265958400,
    "workingSetBytes": 12950142976,
    "rssBytes": 12782141440,
    "pageFaults": 3177570,
    "majorPageFaults": 924
  }
}

And the cgroups check:

[root@ip-10-100-100-174 /]# mount | grep group
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)

serg4kostiuk avatar Sep 08 '22 12:09 serg4kostiuk

Hi @serg4kostiuk, from the data obtained above it looks normal. Does kubectl top pod -n bart display pod metrics correctly?

yangjunmyfm192085 avatar Sep 08 '22 14:09 yangjunmyfm192085

sure, @yangjunmyfm192085

sk@mac ~ % kubectl top pod -n bart
W0909 14:12:14.449355    4909 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                                           CPU(cores)   MEMORY(bytes)   
aaa-web-88dc87987-m7p95                        3m           1414Mi          
aaa-web-88dc87987-sc25d                        2m           1040Mi          
admin-7c69c54b65-64lpx                         3m           959Mi           
admin-7c69c54b65-c2gbc                         4m           975Mi                       
test-sidekiq-api-stats-6c5b94756c-cps8s    2m           335Mi           
test-sidekiq-default-6b5b648b85-hd78g      4m           334Mi           
test-sidekiq-export-777878665c-j7kh6       2m           345Mi           
test-sidekiq-heavy-54c966b8fd-zclzs        2m           319Mi           
test-sidekiq-mailer-56f59cdcc6-hnrvs       5m           347Mi           
test-sidekiq-page-export-544fcb6bc-2rlrq   2m           322Mi           
test-sidekiq-reporting-577864846-9gmks     3m           340Mi           
test-sidekiq-support-6bdb69567d-q8nnv      2m           307Mi           
test-sidekiq-webhook-c4b779697-cmws8       2m           324Mi 
sk@mac ~ % kubectl top nodes      
W0909 14:13:22.379838    4925 top_node.go:119] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-100-100-208.ec2.internal   131m         1%     3411Mi          23%       
ip-10-100-100-8.ec2.internal     238m         6%     3846Mi          58%       
ip-10-100-100-81.ec2.internal    156m         1%     7801Mi          53%       
ip-10-100-101-138.ec2.internal   95m          1%     1766Mi          12%       
ip-10-100-101-241.ec2.internal   615m         7%     9177Mi          63%   

Plus a few new errors in the metrics-server pod:

http2: server connection error from 10.100.101.182:41370: connection error: PROTOCOL_ERROR
http2: server connection error from 10.100.101.182:41370: connection error: PROTOCOL_ERROR
http2: server connection error from 10.100.101.182:39542: connection error: PROTOCOL_ERROR
E0909 07:14:50.107099       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.100.15:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-100-15.ec2.internal"
E0909 07:30:33.603372       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.100.15:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-100-15.ec2.internal"
E0909 07:30:48.603828       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.100.15:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-100-15.ec2.internal"
E0909 07:31:03.603037       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.100.15:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-100-15.ec2.internal"
E0909 07:31:18.603641       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.100.15:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-100-15.ec2.internal"
E0909 07:31:33.603760       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.100.15:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-100-15.ec2.internal"
E0909 09:55:20.132726       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.100.81:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-100-81.ec2.internal"
E0909 10:01:05.127457       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.83:10250/metrics/resource\": dial tcp 10.100.101.83:10250: connect: connection refused" node="ip-10-100-101-83.ec2.internal"
E0909 10:01:33.603429       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.83:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-83.ec2.internal"
E0909 10:01:48.603599       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.83:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-83.ec2.internal"
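
The remote error: tls: internal error lines usually mean the kubelet on that node could not present a serving certificate, which on clusters using kubelet serving-certificate rotation is often a CertificateSigningRequest stuck in Pending. A hedged way to check and, only if it matches your cluster's policy, approve (the CSR name is a placeholder):

kubectl get csr | grep kubelet-serving       # look for Pending serving-certificate requests
kubectl certificate approve <csr-name>       # only if manual approval is expected in your cluster

The dial tcp ...: connect: connection refused line, by contrast, suggests the kubelet on 10.100.101.83 was not listening at all at that moment (for example, a node still starting up or being terminated).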

My metrics-server deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      labels:
        k8s-app: metrics-server
    spec:
      serviceAccountName: metrics-server
      volumes:
      # mount in tmp so we can safely use from-scratch images and/or read-only containers
      - name: tmp-dir
        emptyDir: { }
      priorityClassName: system-cluster-critical
      containers:
      - name: metrics-server
        image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
        imagePullPolicy: IfNotPresent
        args:
          - --cert-dir=/tmp
          - --secure-port=4443
          - --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
          - --kubelet-use-node-status-port
          - --metric-resolution=15s
          - --kubelet-insecure-tls=true
        resources:
          requests:
            cpu:    200m
            memory: 300Mi
        ports:
        - name: https
          containerPort: 4443
          protocol:      TCP
        readinessProbe:
          httpGet:
            path:   /readyz
            port:   https
            scheme: HTTPS
          periodSeconds: 10
          failureThreshold: 3
          initialDelaySeconds: 20
        livenessProbe:
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          failureThreshold: 3
          periodSeconds: 10
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          allowPrivilegeEscalation: false
        volumeMounts:
        - name: tmp-dir
          mountPath: /tmp
      hostNetwork: true
      nodeSelector:
        kubernetes.io/os: linux

serg4kostiuk avatar Sep 09 '22 11:09 serg4kostiuk

Thanks for the feedback. metrics-server appears to be in a normal state. Is there still any problem with the HPA?
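
If the HPA in the bart namespace is still unhappy even though kubectl top works, it can also help to query the resource metrics API directly, since that is exactly what the HPA consumes (a sketch; jq is only for readability):

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/bart/pods" | jq .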

yangjunmyfm192085 avatar Sep 09 '22 12:09 yangjunmyfm192085

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 08 '22 13:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 07 '23 14:01 k8s-triage-robot

E0909 10:01:05.127457       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.83:10250/metrics/resource\": dial tcp 10.100.101.83:10250: connect: connection refused" node="ip-10-100-101-83.ec2.internal"

@serg4kostiuk Are you positive that 10250 is exposed?
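
One way to verify that from inside the cluster is a host-network debug pod on the node, checking for a listener on 10250 (a sketch; kubectl debug node requires a reasonably recent kubectl, busybox is just an example image, and the node name is taken from the log above):

kubectl debug node/ip-10-100-101-83.ec2.internal -it --image=busybox -- sh -c 'netstat -tln | grep 10250'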

rexagod avatar Jan 30 '23 12:01 rexagod

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 01 '23 12:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 01 '23 12:03 k8s-ci-robot

I am having a similar issue, but it seems to me that metrics-server is actually looking for the wrong (non-existent) nodes. Where should it take the list of nodes to scrape from? The kubelet service?

ricardosilva86 avatar Mar 13 '23 12:03 ricardosilva86

Hello there, I'm having kind of the same issue. I have these logs:

PS C:\windows\system32> kubectl logs -f metrics-server-58b7f877fc-67txx -n kube-system
I0315 10:08:23.309378       1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0315 10:08:23.681150       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0315 10:08:23.681186       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0315 10:08:23.681160       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0315 10:08:23.681197       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0315 10:08:23.681161       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0315 10:08:23.681304       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0315 10:08:23.681454       1 secure_serving.go:266] Serving securely on [::]:10250
W0315 10:08:23.681489       1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0315 10:08:23.681525       1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0315 10:08:23.681535       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0315 10:08:23.781829       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0315 10:08:23.781854       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0315 10:08:23.781842       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0315 10:08:42.689261       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0315 10:08:52.689153       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0315 10:08:53.688761       1 scraper.go:140] "Failed to scrape node" err="Get "https://10.49.161.59:10250/metrics/resource": dial tcp 10.49.161.59:10250: i/o timeout" node="ip-10-49-161-59.eu-west-3.compute.internal"
E0315 10:08:53.691870       1 scraper.go:140] "Failed to scrape node" err="Get "https://10.49.161.177:10250/metrics/resource": dial tcp 10.49.161.177:10250: i/o timeout" node="ip-10-49-161-177.eu-west-3.compute.internal"
E0315 10:08:53.692942       1 scraper.go:140] "Failed to scrape node" err="Get "https://10.49.161.214:10250/metrics/resource": dial tcp 10.49.161.214:10250: i/o timeout" node="ip-10-49-161-214.eu-west-3.compute.internal"
E0315 10:08:53.708202       1 scraper.go:140] "Failed to scrape node" err="Get "https://10.49.161

metrics-server doesn't seem to see the other nodes in the cluster, although they all have the same configuration and they all have TCP port 10250 allowed in the security group.

PS C:\windows\system32> kubectl top nodes -n kube-system
NAME                                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-49-161-64.eu-west-3.compute.internal    55m          1%     1959Mi          13%
ip-10-49-161-125.eu-west-3.compute.internal
ip-10-49-161-160.eu-west-3.compute.internal
ip-10-49-161-177.eu-west-3.compute.internal
ip-10-49-161-252.eu-west-3.compute.internal
ip-10-49-161-59.eu-west-3.compute.internal

I just want to add that we're running metrics-server 0.6.1 and EKS 1.25. I've already applied all the hacks and workarounds mentioned here (metric-resolution, kubelet-preferred-address-types, --kubelet-insecure-tls=true) and none of them solve the issue. Can anyone here help?
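
For EKS, dial tcp ...:10250: i/o timeout from metrics-server typically means the worker-node security group does not allow inbound TCP 10250 from wherever the metrics-server traffic originates (usually the cluster security group). A hedged sketch with placeholder security group IDs:

aws ec2 authorize-security-group-ingress \
  --group-id sg-WORKER_NODE_SG \
  --protocol tcp --port 10250 \
  --source-group sg-CLUSTER_SG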

sichiba avatar Mar 15 '23 12:03 sichiba

@sichiba try this:

helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm install metrics-server \
  metrics-server/metrics-server \
  -n kube-system \
  --version 3.9.0 \
  -f metrics-server/values.yaml \
  --wait

metrics-server/values.yaml (the file referenced by -f above):

# Fixes for error:
# "couldn't get current server API group list: the server has asked for the client to provide credentials"
# inspired by https://stackoverflow.com/a/75266326
# related discussion https://github.com/kubernetes-sigs/metrics-server/issues/157
hostNetwork:
  enabled: true
args:
  - --kubelet-insecure-tls

At least this helped me.

I still get:

couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized

but it's not blocking.

Hronom avatar Apr 09 '23 00:04 Hronom

Also having the same issue where metrics-server is trying to scrape old nodes which have been terminated. It occasionally scrapes the new nodes and receives a remote error: tls: internal error.
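
metrics-server builds its scrape list from the Node objects registered in the API server, so if terminated instances were never deregistered it will keep trying to scrape them. A quick check, plus a manual cleanup for nodes that are definitely gone (the node name is a placeholder):

kubectl get nodes
kubectl delete node ip-10-0-0-1.ec2.internal   # only for nodes that no longer exist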

skan-splunk avatar Apr 20 '23 04:04 skan-splunk

also having the same issue where metrics server is trying to scrape old nodes which have been terminated. It occasionally scrapes the new nodes and receives a remote error: tls: internal error

Exactly the same for me. It seems to occur when a new node appears, and then nothing.

gmolaire avatar Feb 12 '24 15:02 gmolaire