
EKS Fargate Metrics-server fails to scrape itself

Open Paddy-CH opened this issue 1 year ago • 9 comments

What happened: Logs from the metrics-server pod show this error repeatedly: E0216 11:45:59.265624 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.6.194.69:10250/metrics/resource\": dial tcp 10.6.194.69:10250: connect: connection refused" node="fargate-ip-10-6-194-69.eu-west-2.compute.internal"

What you expected to happen: To be able to scrape itself.

Anything else we need to know?: The secure port and container port are set to 4443. If I change them to 10250, as the call requires, the error changes to 'forbidden'. I also get 'error: Metrics API not available' from kubectl when I try to access it.

Environment:

  • Kubernetes distribution EKS Fargate

  • Kubernetes version 1.29

  • Metrics Server manifest

spoiler for Metrics Server manifest:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-view: "true"
  name: system:aggregated-metrics-reader
rules:
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-server
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    k8s-app: metrics-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        k8s-app: metrics-server
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        image: registry.k8s.io/metrics-server/metrics-server:v0.7.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
        name: metrics-server
        ports:
        - containerPort: 4443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 100m
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /tmp
          name: tmp-dir
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      serviceAccountName: metrics-server
      volumes:
      - emptyDir: {}
        name: tmp-dir
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    k8s-app: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100

  • Kubelet config:
spoiler for Kubelet config:
  • Metrics server logs:
spoiler for Metrics Server logs:
  • Status of Metrics API:
spoiler for Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io

/kind bug

Paddy-CH avatar Feb 16 '24 12:02 Paddy-CH

"Failed to scrape node" err="Get \"https://10.6.194.69:10250/metrics/resource\": This error means metrics-server could not reach the kubelet's metrics/resource endpoint. Please check whether a firewall is blocking access to the kubelet port 10250, or whether the kubelet is listening on a port other than 10250.
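As a rough illustration of that check, the sketch below tries a plain TCP connection and reports whether the port is open, actively refused, or silently dropped (a timeout, which often points to a firewall or security group). The function name `probe_tcp` is hypothetical and not part of metrics-server or kubectl. To be meaningful, it would have to run from a pod with the same network path as metrics-server.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt (hypothetical helper).

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on this port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout; packets may be dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, for example "no route to host".
        return f"error: {exc}"


# Example call, using the address from the log line in this issue:
#   print(probe_tcp("10.6.194.69", 10250))
```

A "refused" result matches the "connection refused" error in the log, while "timeout" would suggest traffic being filtered before it reaches the kubelet.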

yangjunmyfm192085 avatar Feb 19 '24 00:02 yangjunmyfm192085

Hi, initially I had the secure port and container port set to 4443. When I saw the error, I changed them to 10250. After that, the error changed to 'forbidden' when metrics-server tried to scrape itself, and kubectl returned 'Metrics API not available'.

Paddy-CH avatar Feb 19 '24 09:02 Paddy-CH

Could you use the command kubectl get node fargate-ip-10-6-194-69.eu-west-2.compute.internal -oyaml to check the value of kubeletEndpoint?
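If the node object is fetched as JSON instead (for example with `-o json`), the same value can be read programmatically. This is a minimal sketch, assuming the port sits at `status.daemonEndpoints.kubeletEndpoint.Port` in the node object; the helper name `kubelet_port` and the trimmed sample document are illustrative only.

```python
import json


def kubelet_port(node_json: str) -> int:
    """Return the kubelet port advertised in a Node object's status.

    Hypothetical helper: expects the JSON text of a single Node.
    """
    node = json.loads(node_json)
    return int(node["status"]["daemonEndpoints"]["kubeletEndpoint"]["Port"])


# Trimmed, illustrative Node document; a real one has many more fields.
SAMPLE_NODE = """
{
  "kind": "Node",
  "metadata": {"name": "fargate-ip-10-6-194-69.eu-west-2.compute.internal"},
  "status": {"daemonEndpoints": {"kubeletEndpoint": {"Port": 10250}}}
}
"""

print(kubelet_port(SAMPLE_NODE))
```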

yangjunmyfm192085 avatar Feb 19 '24 10:02 yangjunmyfm192085

It returns daemonEndpoints: kubeletEndpoint: Port: 10250

Paddy-CH avatar Feb 19 '24 10:02 Paddy-CH

Hi @Paddy-CH, it looks like metrics-server cannot reach the kubelet's port 10250 in this EKS environment. This should not be an issue with metrics-server itself. Could you also check the security policy of the environment?

yangjunmyfm192085 avatar Feb 20 '24 00:02 yangjunmyfm192085

/kind support

yangjunmyfm192085 avatar Feb 20 '24 00:02 yangjunmyfm192085

/remove-kind bug

yangjunmyfm192085 avatar Feb 20 '24 00:02 yangjunmyfm192085

/assign @yangjunmyfm192085 /triage accepted

dashpole avatar Feb 22 '24 17:02 dashpole

Related to https://github.com/aws/containers-roadmap/issues/1798

honarkhah avatar Apr 19 '24 15:04 honarkhah