metrics-server icon indicating copy to clipboard operation
metrics-server copied to clipboard

Metrics-server doesn't share CPU metrics for second HPA

Open serg4kostiuk opened this issue 3 years ago • 3 comments

Hi guys, my case is next: I have:

  • two different deployments (two pods and two containers inside each pod).
  • two HPAs for each deployment.
  • one metrics-server pod in the same (with deployments) namespace. The problem is - the metrics server doesn't send the CPU metric to one of the HPA. I say CPU, because during my test I tried to set the autoscaler threshold to check CPU+Memory and memory values were shared with the HPA by the metrics server. below are my deployments\debug and some versions.

kubectl version

sk@mac talkable % kubectl version --output=yaml
clientVersion:
  buildDate: "2022-06-15T14:14:10Z"
  compiler: gc
  gitCommit: f66044f4361b9f1f96f0053dd46cb7dce5e990a8
  gitTreeState: clean
  gitVersion: v1.24.2
  goVersion: go1.18.3
  major: "1"
  minor: "24"
  platform: darwin/arm64
kustomizeVersion: v4.5.4
serverVersion:
  buildDate: "2022-07-06T18:06:50Z"
  compiler: gc
  gitCommit: ac73613dfd25370c18cbbbc6bfc65449397b35c7
  gitTreeState: clean
  gitVersion: v1.21.14-eks-18ef993
  goVersion: go1.16.15
  major: "1"
  minor: 21+
  platform: linux/amd64

my metrics-server deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  uid: f603713a-e6c4-45b5-a4f4-2bfa3a228150
  resourceVersion: '66802670'
  generation: 3
  creationTimestamp: '2022-09-08T13:49:19Z'
  labels:
    k8s-app: metrics-server
  annotations:
    deployment.kubernetes.io/revision: '1'
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"k8s-app":"metrics-server"},"name":"metrics-server","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"k8s-app":"metrics-server"}},"strategy":{"rollingUpdate":{"maxUnavailable":0}},"template":{"metadata":{"labels":{"k8s-app":"metrics-server"}},"spec":{"containers":[{"args":["--cert-dir=/tmp","--secure-port=4443","--kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP","--kubelet-use-node-status-port","--metric-resolution=15s","--kubelet-insecure-tls=true"],"image":"k8s.gcr.io/metrics-server/metrics-server:v0.6.1","imagePullPolicy":"IfNotPresent","livenessProbe":{"failureThreshold":3,"httpGet":{"path":"/livez","port":"https","scheme":"HTTPS"},"periodSeconds":10},"name":"metrics-server","ports":[{"containerPort":4443,"name":"https","protocol":"TCP"}],"readinessProbe":{"failureThreshold":3,"httpGet":{"path":"/readyz","port":"https","scheme":"HTTPS"},"initialDelaySeconds":20,"periodSeconds":10},"resources":{"requests":{"cpu":"200m","memory":"300Mi"}},"securityContext":{"allowPrivilegeEscalation":false,"readOnlyRootFilesystem":true,"runAsNonRoot":true,"runAsUser":1000},"volumeMounts":[{"mountPath":"/tmp","name":"tmp-dir"}]}],"hostNetwork":true,"nodeSelector":{"kubernetes.io/os":"linux"},"priorityClassName":"system-cluster-critical","serviceAccountName":"metrics-server","volumes":[{"emptyDir":{},"name":"tmp-dir"}]}}}}
  managedFields:
    - manager: kubectl-client-side-apply
      operation: Update
      apiVersion: apps/v1
      time: '2022-09-08T13:49:19Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:labels:
            .: {}
            f:k8s-app: {}
        f:spec:
          f:progressDeadlineSeconds: {}
          f:replicas: {}
          f:revisionHistoryLimit: {}
          f:selector: {}
          f:strategy:
            f:rollingUpdate:
              .: {}
              f:maxSurge: {}
              f:maxUnavailable: {}
            f:type: {}
          f:template:
            f:metadata:
              f:labels:
                .: {}
                f:k8s-app: {}
            f:spec:
              f:containers:
                k:{"name":"metrics-server"}:
                  .: {}
                  f:args: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:livenessProbe:
                    .: {}
                    f:failureThreshold: {}
                    f:httpGet:
                      .: {}
                      f:path: {}
                      f:port: {}
                      f:scheme: {}
                    f:periodSeconds: {}
                    f:successThreshold: {}
                    f:timeoutSeconds: {}
                  f:name: {}
                  f:ports:
                    .: {}
                    k:{"containerPort":4443,"protocol":"TCP"}:
                      .: {}
                      f:containerPort: {}
                      f:hostPort: {}
                      f:name: {}
                      f:protocol: {}
                  f:readinessProbe:
                    .: {}
                    f:failureThreshold: {}
                    f:httpGet:
                      .: {}
                      f:path: {}
                      f:port: {}
                      f:scheme: {}
                    f:initialDelaySeconds: {}
                    f:periodSeconds: {}
                    f:successThreshold: {}
                    f:timeoutSeconds: {}
                  f:resources:
                    .: {}
                    f:requests:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                  f:securityContext:
                    .: {}
                    f:allowPrivilegeEscalation: {}
                    f:readOnlyRootFilesystem: {}
                    f:runAsNonRoot: {}
                    f:runAsUser: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
                  f:volumeMounts:
                    .: {}
                    k:{"mountPath":"/tmp"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
              f:dnsPolicy: {}
              f:hostNetwork: {}
              f:nodeSelector:
                .: {}
                f:kubernetes.io/os: {}
              f:priorityClassName: {}
              f:restartPolicy: {}
              f:schedulerName: {}
              f:securityContext: {}
              f:serviceAccount: {}
              f:serviceAccountName: {}
              f:terminationGracePeriodSeconds: {}
              f:volumes:
                .: {}
                k:{"name":"tmp-dir"}:
                  .: {}
                  f:emptyDir: {}
                  f:name: {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: apps/v1
      time: '2022-09-14T10:19:49Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:deployment.kubernetes.io/revision: {}
        f:status:
          f:availableReplicas: {}
          f:conditions:
            .: {}
            k:{"type":"Available"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Progressing"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:observedGeneration: {}
          f:readyReplicas: {}
          f:replicas: {}
          f:updatedReplicas: {}
  selfLink: /apis/apps/v1/namespaces/kube-system/deployments/metrics-server
status:
  observedGeneration: 3
  replicas: 5
  updatedReplicas: 5
  readyReplicas: 5
  availableReplicas: 5
  conditions:
    - type: Progressing
      status: 'True'
      lastUpdateTime: '2022-09-08T13:49:49Z'
      lastTransitionTime: '2022-09-08T13:49:19Z'
      reason: NewReplicaSetAvailable
      message: ReplicaSet "metrics-server-d4db696ff" has successfully progressed.
    - type: Available
      status: 'True'
      lastUpdateTime: '2022-09-14T10:19:49Z'
      lastTransitionTime: '2022-09-14T10:19:49Z'
      reason: MinimumReplicasAvailable
      message: Deployment has minimum availability.
spec:
  replicas: 5
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: metrics-server
    spec:
      volumes:
        - name: tmp-dir
          emptyDir: {}
      containers:
        - name: metrics-server
          image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
          args:
            - '--cert-dir=/tmp'
            - '--secure-port=4443'
            - >-
              --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
            - '--kubelet-use-node-status-port'
            - '--metric-resolution=15s'
            - '--kubelet-insecure-tls=true'
          ports:
            - name: https
              hostPort: 4443
              containerPort: 4443
              protocol: TCP
          resources:
            requests:
              cpu: 200m
              memory: 300Mi
          volumeMounts:
            - name: tmp-dir
              mountPath: /tmp
          livenessProbe:
            httpGet:
              path: /livez
              port: https
              scheme: HTTPS
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: https
              scheme: HTTPS
            initialDelaySeconds: 20
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            runAsUser: 1000
            runAsNonRoot: true
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: metrics-server
      serviceAccount: metrics-server
      hostNetwork: true
      securityContext: {}
      schedulerName: default-scheduler
      priorityClassName: system-cluster-critical
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600

metrics-server pods log

http2: server connection error from 10.100.100.79:35886: connection error: PROTOCOL_ERROR
http2: server connection error from 10.100.101.182:59154: connection error: PROTOCOL_ERROR
http2: server connection error from 10.100.101.182:59154: connection error: PROTOCOL_ERROR
http2: server connection error from 10.100.100.79:46480: connection error: PROTOCOL_ERROR
http2: server connection error from 10.100.100.79:46480: connection error: PROTOCOL_ERROR
E0914 04:31:20.140554       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.20:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-101-20.ec2.internal"
E0914 04:31:35.114571       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.165:10250/metrics/resource\": dial tcp 10.100.101.165:10250: connect: connection refused" node="ip-10-100-101-165.ec2.internal"
E0914 04:32:03.603806       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.165:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-165.ec2.internal"
E0914 04:32:18.604001       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.165:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-165.ec2.internal"
E0914 04:32:33.604932       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.165:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-165.ec2.internal"
E0914 04:36:20.109588       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.234:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-101-234.ec2.internal"
http2: server connection error from 10.100.101.182:35710: connection error: PROTOCOL_ERROR
E0914 04:51:48.603743       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.234:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-234.ec2.internal"
E0914 04:52:03.603706       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.234:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-234.ec2.internal"
E0914 04:52:18.603738       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.234:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-234.ec2.internal"
E0914 04:52:33.602756       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.234:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-234.ec2.internal"
E0914 04:52:48.602962       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.234:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-234.ec2.internal"
http2: server connection error from 10.100.100.79:40706: connection error: PROTOCOL_ERROR
http2: server connection error from 10.100.101.182:33332: connection error: PROTOCOL_ERROR
E0914 09:58:35.117983       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.7:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-101-7.ec2.internal"
E0914 09:59:05.111799       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.100.56:10250/metrics/resource\": dial tcp 10.100.100.56:10250: connect: connection refused" node="ip-10-100-100-56.ec2.internal"
E0914 09:59:33.604484       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.100.56:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-100-56.ec2.internal"
E0914 09:59:48.603800       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.100.56:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-100-56.ec2.internal"

my deployment test777-admin-hpa

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: test777-admin-hpa
  namespace: bart
  uid: 9ad88361-29f6-47cb-bd5f-951d82382616
  resourceVersion: '66845935'
  creationTimestamp: '2022-09-14T06:47:32Z'
  labels:
    app.kubernetes.io/managed-by: Helm
  annotations:
    meta.helm.sh/release-name: test777-bart
    meta.helm.sh/release-namespace: bart
  managedFields:
    - manager: helm
      operation: Update
      apiVersion: autoscaling/v2beta2
      time: '2022-09-14T06:47:32Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:meta.helm.sh/release-name: {}
            f:meta.helm.sh/release-namespace: {}
          f:labels:
            .: {}
            f:app.kubernetes.io/managed-by: {}
        f:spec:
          f:maxReplicas: {}
          f:metrics: {}
          f:minReplicas: {}
          f:scaleTargetRef:
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: autoscaling/v1
      time: '2022-09-14T06:49:03Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:autoscaling.alpha.kubernetes.io/conditions: {}
            f:autoscaling.alpha.kubernetes.io/current-metrics: {}
        f:status:
          f:currentCPUUtilizationPercentage: {}
          f:currentReplicas: {}
          f:desiredReplicas: {}
  selfLink: >-
    /apis/autoscaling/v2beta1/namespaces/bart/horizontalpodautoscalers/test777-admin-hpa
status:
  currentReplicas: 2
  desiredReplicas: 2
  currentMetrics:
    - type: Resource
      resource:
        name: cpu
        currentAverageUtilization: 1
        currentAverageValue: 4m
  conditions:
    - type: AbleToScale
      status: 'True'
      lastTransitionTime: '2022-09-14T06:47:47Z'
      reason: ReadyForNewScale
      message: recommended size matches current size
    - type: ScalingActive
      status: 'True'
      lastTransitionTime: '2022-09-14T07:52:13Z'
      reason: ValidMetricFound
      message: >-
        the HPA was able to successfully calculate a replica count from cpu
        resource utilization (percentage of request)
    - type: ScalingLimited
      status: 'True'
      lastTransitionTime: '2022-09-14T10:24:33Z'
      reason: TooFewReplicas
      message: the desired replica count is less than the minimum replica count
spec:
  scaleTargetRef:
    kind: Deployment
    name: test777-admin
    apiVersion: apps/v1
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 60

test777-admin-hpa

sk@mac ~ % kubectl describe hpa test777-admin-hpa -n bart
Name:                                                  test777-admin-hpa
Namespace:                                             bart
Labels:                                                app.kubernetes.io/managed-by=Helm
Annotations:                                           meta.helm.sh/release-name: test777-bart
                                                       meta.helm.sh/release-namespace: bart
CreationTimestamp:                                     Wed, 14 Sep 2022 09:47:32 +0300
Reference:                                             Deployment/test777-admin
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  1% (4m) / 60%
Min replicas:                                          2
Max replicas:                                          10
Deployment pods:                                       2 current / 2 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count

my deployment test777-web-hpa

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: test777-web-hpa
  namespace: bart
  uid: 92be51b4-d9dd-4f64-8f45-355612a98717
  resourceVersion: '66802167'
  creationTimestamp: '2022-09-14T06:47:32Z'
  labels:
    app.kubernetes.io/managed-by: Helm
  annotations:
    meta.helm.sh/release-name: test777-bart
    meta.helm.sh/release-namespace: bart
  managedFields:
    - manager: helm
      operation: Update
      apiVersion: autoscaling/v2beta2
      time: '2022-09-14T06:47:32Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:meta.helm.sh/release-name: {}
            f:meta.helm.sh/release-namespace: {}
          f:labels:
            .: {}
            f:app.kubernetes.io/managed-by: {}
        f:spec:
          f:maxReplicas: {}
          f:metrics: {}
          f:minReplicas: {}
          f:scaleTargetRef:
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: autoscaling/v1
      time: '2022-09-14T06:58:49Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:autoscaling.alpha.kubernetes.io/conditions: {}
            f:autoscaling.alpha.kubernetes.io/current-metrics: {}
        f:status:
          f:currentReplicas: {}
          f:desiredReplicas: {}
  selfLink: >-
    /apis/autoscaling/v2beta1/namespaces/bart/horizontalpodautoscalers/test777-web-hpa
status:
  currentReplicas: 2
  desiredReplicas: 2
  currentMetrics:
    - type: Resource
      resource:
        name: memory
        currentAverageValue: '1275105280'
    - type: ''
  conditions:
    - type: AbleToScale
      status: 'True'
      lastTransitionTime: '2022-09-14T06:47:47Z'
      reason: SucceededGetScale
      message: the HPA controller was able to get the target's current scale
    - type: ScalingActive
      status: 'False'
      lastTransitionTime: '2022-09-14T07:51:28Z'
      reason: FailedGetResourceMetric
      message: >-
        the HPA was unable to compute the replica count: failed to get cpu
        utilization: missing request for cpu
    - type: ScalingLimited
      status: 'False'
      lastTransitionTime: '2022-09-14T06:58:49Z'
      reason: DesiredWithinRange
      message: the desired count is within the acceptable range
spec:
  scaleTargetRef:
    kind: Deployment
    name: test777-web
    apiVersion: apps/v1
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 60

test777-web-hpa (the HPA with metric from metrics-server)

sk@mac ~ % kubectl describe hpa test777-web-hpa -n bart
Name:                                                  test777-web-hpa
Namespace:                                             bart
Labels:                                                app.kubernetes.io/managed-by=Helm
Annotations:                                           meta.helm.sh/release-name: test777-bart
                                                       meta.helm.sh/release-namespace: bart
CreationTimestamp:                                     Wed, 14 Sep 2022 09:47:32 +0300
Reference:                                             Deployment/test777-web
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 60%
Min replicas:                                          2
Max replicas:                                          20
Deployment pods:                                       2 current / 2 desired
Conditions:
  Type            Status  Reason                   Message
  ----            ------  ------                   -------
  AbleToScale     True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetResourceMetric  the HPA was unable to compute the replica count: failed to get cpu utilization: missing request for cpu
  ScalingLimited  False   DesiredWithinRange       the desired count is within the acceptable range
Events:
  Type     Reason                   Age                       From                       Message
  ----     ------                   ----                      ----                       -------
  Warning  FailedGetResourceMetric  4m13s (x1530 over 6h28m)  horizontal-pod-autoscaler  failed to get cpu utilization: missing request for CPU

and debug:

sk@mac ~ % kubectl top nodes
W0914 16:27:41.333865   46184 top_node.go:119] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-100-100-145.ec2.internal   91m          1%     2426Mi          16%       
ip-10-100-100-172.ec2.internal   191m         2%     10825Mi         74%       
ip-10-100-100-8.ec2.internal     254m         6%     3680Mi          55%       
ip-10-100-101-20.ec2.internal    93m          1%     4004Mi          27%       
ip-10-100-101-7.ec2.internal     103m         1%     5319Mi          36%   
sk@mac test777 % NODE_NAME=ip-10-100-101-7.ec2.internal
kubectl get --raw /api/v1/nodes/$NODE_NAME/proxy/metrics/resource
# HELP container_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the container in core-seconds
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="agent",namespace="newrelic",pod="newrelic-nrk8s-kubelet-7sj7p"} 38.90565046 1663163146734
container_cpu_usage_seconds_total{container="aws-node",namespace="kube-system",pod="aws-node-h9glf"} 40.953685422 1663163145869
container_cpu_usage_seconds_total{container="aws-node-termination-handler",namespace="kube-system",pod="aws-node-termination-handler-bk9tf"} 5.409004909 1663163139909
container_cpu_usage_seconds_total{container="kube-proxy",namespace="kube-system",pod="kube-proxy-x6ls4"} 4.57357219 1663163140416
container_cpu_usage_seconds_total{container="kubelet",namespace="newrelic",pod="newrelic-nrk8s-kubelet-7sj7p"} 49.65748969 1663163147833
container_cpu_usage_seconds_total{container="metrics-server",namespace="kube-system",pod="metrics-server-5d76bc7fb9-4fq48"} 0.848519497 1663163137506
container_cpu_usage_seconds_total{container="newrelic-logging",namespace="newrelic",pod="newrelic-newrelic-logging-6sp87"} 78.950603304 1663163147499
container_cpu_usage_seconds_total{container="test777",namespace="voidnew",pod="test777-web-6d46d657b8-wnq56"} 213.64960915 1663163147938
container_cpu_usage_seconds_total{container="test777-admin",namespace="voidnew",pod="test777-admin-5cb67bf7c9-wlk2x"} 139.70000229 1663163150146
container_cpu_usage_seconds_total{container="test777-nginx",namespace="voidnew",pod="test777-admin-5cb67bf7c9-wlk2x"} 1.098650398 1663163142615
container_cpu_usage_seconds_total{container="test777-nginx",namespace="voidnew",pod="test777-web-6d46d657b8-wnq56"} 3.834383147 1663163136242
container_cpu_usage_seconds_total{container="test777-sidekiq-api-stats",namespace="voidnew",pod="test777-sidekiq-api-stats-6744f4d475-fqnzd"} 20.872339072 1663163135417
container_cpu_usage_seconds_total{container="test777-sidekiq-default",namespace="voidnew",pod="test777-sidekiq-default-65cdf6d68c-hpptv"} 107.027204477 1663163144046
# HELP container_memory_working_set_bytes [ALPHA] Current working set of the container in bytes
# TYPE container_memory_working_set_bytes gauge
container_memory_working_set_bytes{container="agent",namespace="newrelic",pod="newrelic-nrk8s-kubelet-7sj7p"} 2.279424e+07 1663163146734
container_memory_working_set_bytes{container="aws-node",namespace="kube-system",pod="aws-node-h9glf"} 4.93568e+07 1663163145869
container_memory_working_set_bytes{container="aws-node-termination-handler",namespace="kube-system",pod="aws-node-termination-handler-bk9tf"} 1.3234176e+07 1663163139909
container_memory_working_set_bytes{container="kube-proxy",namespace="kube-system",pod="kube-proxy-x6ls4"} 2.1225472e+07 1663163140416
container_memory_working_set_bytes{container="kubelet",namespace="newrelic",pod="newrelic-nrk8s-kubelet-7sj7p"} 1.2808192e+07 1663163147833
container_memory_working_set_bytes{container="metrics-server",namespace="kube-system",pod="metrics-server-5d76bc7fb9-4fq48"} 1.8219008e+07 1663163137506
container_memory_working_set_bytes{container="newrelic-logging",namespace="newrelic",pod="newrelic-newrelic-logging-6sp87"} 2.6406912e+07 1663163147499
container_memory_working_set_bytes{container="test777",namespace="voidnew",pod="test777-web-6d46d657b8-wnq56"} 2.015215616e+09 1663163147938
container_memory_working_set_bytes{container="test777-admin",namespace="voidnew",pod="test777-admin-5cb67bf7c9-wlk2x"} 2.112708608e+09 1663163150146
container_memory_working_set_bytes{container="test777-nginx",namespace="voidnew",pod="test777-admin-5cb67bf7c9-wlk2x"} 3.989504e+06 1663163142615
container_memory_working_set_bytes{container="test777-nginx",namespace="voidnew",pod="test777-web-6d46d657b8-wnq56"} 3.8912e+06 1663163136242
container_memory_working_set_bytes{container="test777-sidekiq-api-stats",namespace="voidnew",pod="test777-sidekiq-api-stats-6744f4d475-fqnzd"} 3.42626304e+08 1663163135417
container_memory_working_set_bytes{container="test777-sidekiq-default",namespace="voidnew",pod="test777-sidekiq-default-65cdf6d68c-hpptv"} 3.86850816e+08 1663163144046
# HELP node_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the node in core-seconds
# TYPE node_cpu_usage_seconds_total counter
node_cpu_usage_seconds_total 1927.972810339 1663163151057
# HELP node_memory_working_set_bytes [ALPHA] Current working set of the node in bytes
# TYPE node_memory_working_set_bytes gauge
node_memory_working_set_bytes 5.550145536e+09 1663163151057
# HELP pod_cpu_usage_seconds_total [ALPHA] Cumulative cpu time consumed by the pod in core-seconds
# TYPE pod_cpu_usage_seconds_total counter
pod_cpu_usage_seconds_total{namespace="kube-system",pod="aws-node-h9glf"} 41.021928699 1663163144187
pod_cpu_usage_seconds_total{namespace="kube-system",pod="aws-node-termination-handler-bk9tf"} 5.568402876 1663163151346
pod_cpu_usage_seconds_total{namespace="kube-system",pod="kube-proxy-x6ls4"} 4.582493498 1663163136940
pod_cpu_usage_seconds_total{namespace="kube-system",pod="metrics-server-5d76bc7fb9-4fq48"} 1.027940438 1663163147544
pod_cpu_usage_seconds_total{namespace="newrelic",pod="newrelic-newrelic-logging-6sp87"} 79.082432117 1663163146109
pod_cpu_usage_seconds_total{namespace="newrelic",pod="newrelic-nrk8s-kubelet-7sj7p"} 88.495799851 1663163140332
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test777-admin-5cb67bf7c9-wlk2x"} 140.750414508 1663163144358
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test777-sidekiq-api-stats-6744f4d475-fqnzd"} 21.028948976 1663163135699
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test777-sidekiq-default-65cdf6d68c-hpptv"} 107.171260406 1663163140584
pod_cpu_usage_seconds_total{namespace="voidnew",pod="test777-web-6d46d657b8-wnq56"} 217.721807448 1663163151213
# HELP pod_memory_working_set_bytes [ALPHA] Current working set of the pod in bytes
# TYPE pod_memory_working_set_bytes gauge
pod_memory_working_set_bytes{namespace="kube-system",pod="aws-node-h9glf"} 5.548032e+07 1663163144187
pod_memory_working_set_bytes{namespace="kube-system",pod="aws-node-termination-handler-bk9tf"} 1.4176256e+07 1663163151346
pod_memory_working_set_bytes{namespace="kube-system",pod="kube-proxy-x6ls4"} 2.2171648e+07 1663163136940
pod_memory_working_set_bytes{namespace="kube-system",pod="metrics-server-5d76bc7fb9-4fq48"} 1.9361792e+07 1663163147544
pod_memory_working_set_bytes{namespace="newrelic",pod="newrelic-newrelic-logging-6sp87"} 2.7332608e+07 1663163146109
pod_memory_working_set_bytes{namespace="newrelic",pod="newrelic-nrk8s-kubelet-7sj7p"} 3.9682048e+07 1663163140332
pod_memory_working_set_bytes{namespace="voidnew",pod="test777-admin-5cb67bf7c9-wlk2x"} 2.117742592e+09 1663163144358
pod_memory_working_set_bytes{namespace="voidnew",pod="test777-sidekiq-api-stats-6744f4d475-fqnzd"} 3.43740416e+08 1663163135699
pod_memory_working_set_bytes{namespace="voidnew",pod="test777-sidekiq-default-65cdf6d68c-hpptv"} 3.86633728e+08 1663163140584
pod_memory_working_set_bytes{namespace="voidnew",pod="test777-web-6d46d657b8-wnq56"} 2.01990144e+09 1663163151213
# HELP scrape_error [ALPHA] 1 if there was an error while getting container metrics, 0 otherwise
# TYPE scrape_error gauge
scrape_error 0
sk@mac talkable % kubectl get --raw /api/v1/nodes/$NODE_NAME/proxy/stats/summary | jq '{cpu: .node.cpu, memory: .node.memory}'
{
  "cpu": {
    "time": "2022-09-14T13:47:01Z",
    "usageNanoCores": 100871342,
    "usageCoreNanoSeconds": 1948665238173
  },
  "memory": {
    "time": "2022-09-14T13:47:01Z",
    "availableBytes": 10741108736,
    "usageBytes": 10238730240,
    "workingSetBytes": 5561221120,
    "rssBytes": 5365252096,
    "pageFaults": 3195654,
    "majorPageFaults": 924
  }
}
[root@ip-10-100-101-7 /]# mount | grep group
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)

image

Also cannot understand why I still getting some protocol and scraper errors in metrics-server logs. Like this:

E0914 04:32:03.603806       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.165:10250/metrics/resource\": context deadline exceeded" node="ip-10-100-101-165.ec2.internal"

E0914 04:31:20.140554       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.100.101.20:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-100-101-20.ec2.internal"

http2: server connection error from 10.100.100.79:46480: connection error: PROTOCOL_ERROR

From my point of view - it looks like the metrics server couldn't handle the CPU sharing for two HPAs or can do it partially, on and off. Is it possible to test my theory?

Appreciate any help guys.

/kind support

serg4kostiuk avatar Sep 14 '22 13:09 serg4kostiuk

Hi, @serg4kostiuk, Thanks for the feedback. The prerequisite for metrics-server to work properly is that the metrics/resource endpoint of nodes can be accessed normally. But judging from the metrics-server logs provided, there is a probability of failure to access the metrics/resource endpoints of some nodes. There are four main types of errors

  • PROTOCOL_ERROR node 10.100.100.79 and node 10.100.101.182

  • remote error: tls: internal error" node 10.100.101.20 10.100.101.234 and 10.100.101.7

  • connect: connection refused node 10.100.101.165 and 10.100.100.56

  • context deadline exceeded node 10.100.101.165 10.100.101.234 and 10.100.100.56

I am not clear about the status of the cluster. Maybe there is a problem with the network status?

yangjunmyfm192085 avatar Sep 15 '22 03:09 yangjunmyfm192085

I'm having a similar problem. When I first start the cluster the metrics-server works fine. I can do stuff like kubectl top nodes / pods and also the HPA configurations seem to work.

The problem begins when new nodes are created. The metrics-server loses track of new nodes. When I do kubectl top nodes I get unknown values for the new added node.

I also noticed through the logs, that the metrics server tries to reach the new node in a different port, for example the first working node is at port 4443, but the new node fails at port 10250.

dirodriguezm avatar Oct 13 '22 20:10 dirodriguezm

I'm having a similar problem. When I first start the cluster the metrics-server works fine. I can do stuff like kubectl top nodes / pods and also the HPA configurations seem to work.

Any steps can reproduce it? what's version about k8s and metrics-server.

liangyuanpeng avatar Oct 20 '22 06:10 liangyuanpeng

I've resolved the issue.

The issue was that I'd set labels and annotations for cronJob and Jobs accordingly. In HPA we have set a reference to the scaled object - scaleTargetRef-kind- Deployment. But for some reason, HPA picks up additional resources by label (including cronJobs, Jobs) and as these resources cannot be scaled here is the error. So, the following changes were made for cronJob and Job kinds:

Glad if it helps. Good luck!

serg4kostiuk avatar Jan 03 '23 11:01 serg4kostiuk