CNIMetricsHelper erroring on polling pods. "Failed to grab CNI endpoint"
I am having an issue similar to the one reported in https://github.com/aws/amazon-vpc-cni-k8s/issues/1912.
I installed the cni-metrics-helper (1.18.5) via the Helm chart on a 1.30 EKS cluster; all of the addons were updated no more than about a week ago.
My logs show a failure when the helper attempts to pull metrics from the aws-node pods:
{"level":"info","ts":"2024-10-14T20:06:11.547Z","caller":"cni-metrics-helper/main.go:69","msg":"Constructed new logger instance"}
{"level":"info","ts":"2024-10-14T20:06:11.548Z","caller":"runtime/proc.go:271","msg":"Starting CNIMetricsHelper. Sending metrics to CloudWatch: false, Prometheus: true, LogLevel DEBUG, me
tricUpdateInterval 30"}
{"level":"info","ts":"2024-10-14T20:06:41.588Z","caller":"runtime/proc.go:271","msg":"Collecting metrics ..."}
{"level":"info","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/cni_metrics.go:211","msg":"Total aws-node pod count: 5"}
{"level":"debug","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/metrics.go:439","msg":"Total TargetList pod count: 5"}
{"level":"error","ts":"2024-10-14T20:08:51.287Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-n929t:61678)"}
{"level":"error","ts":"2024-10-14T20:11:02.359Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-xlz6m:61678)"}
{"level":"error","ts":"2024-10-14T20:13:13.431Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-6kvmk:61678)"}
{"level":"error","ts":"2024-10-14T20:15:24.503Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-8gnpw:61678)"}
{"level":"error","ts":"2024-10-14T20:17:35.575Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-fj6n5:61678)"}
{"level":"info","ts":"2024-10-14T20:17:35.575Z","caller":"runtime/proc.go:271","msg":"Collecting metrics ..."}
{"level":"info","ts":"2024-10-14T20:17:35.576Z","caller":"metrics/cni_metrics.go:211","msg":"Total aws-node pod count: 5"}
{"level":"debug","ts":"2024-10-14T20:17:35.576Z","caller":"metrics/metrics.go:439","msg":"Total TargetList pod count: 5"}
{"level":"error","ts":"2024-10-14T20:19:46.647Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-6kvmk:61678)"}
{"level":"error","ts":"2024-10-14T20:21:57.719Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-8gnpw:61678)"}
My Helm config is:
env:
USE_CLOUDWATCH: "false"
USE_PROMETHEUS: "true"
AWS_VPC_K8S_CNI_LOGLEVEL: "DEBUG"
Other than that, there is very little configuration. The Helm release targets the kube-system namespace.
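For reference, the install itself is just the cni-metrics-helper chart from eks-charts, roughly along these lines (the repo alias and exact flags here are from memory, so treat them as approximate):
$ helm repo add eks https://aws.github.io/eks-charts
$ helm upgrade --install cni-metrics-helper eks/cni-metrics-helper \
    --namespace kube-system \
    --set env.USE_CLOUDWATCH=false \
    --set env.USE_PROMETHEUS=true \
    --set env.AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG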
The ClusterRoleBinding seems correct:
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cni-metrics-helper
subjects:
- kind: ServiceAccount
name: cni-metrics-helper
namespace: kube-system
The ServiceAccount is in the right place:
$ k get sa -n kube-system cni-metrics-helper
NAME                 SECRETS   AGE
cni-metrics-helper   0         48m
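For what it's worth, a check along these lines should show whether the binding actually grants get on the pods/proxy subresource, which is what the helper needs for the proxy call (a plain get on pods alone would not be enough):
$ kubectl auth can-i get pods --subresource=proxy -n kube-system \
    --as=system:serviceaccount:kube-system:cni-metrics-helper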
I traced the call down to https://github.com/aws/amazon-vpc-cni-k8s/blob/master/cmd/cni-metrics-helper/metrics/metrics.go#L89
rawOutput, err := k8sClient.CoreV1().RESTClient().Get().
Namespace(namespace).
Resource("pods").
SubResource("proxy").
Name(fmt.Sprintf("%v:%v", podName, port)).
Suffix("metrics").
Do(ctx).Raw()
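If I am reading that request builder correctly, it is just the API server's pod proxy subresource, so the equivalent manual call (using one of the pod names from the logs above) should be something like:
$ kubectl get --raw "/api/v1/namespaces/kube-system/pods/aws-node-n929t:61678/proxy/metrics"
The "server is currently unable to handle the request" message looks like the API server answering that proxy request with a 503, rather than the helper failing on its own.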
We have Istio installed on this cluster, but it is not in the kube-system namespace.
The other issue mentioned needing to set the region and cluster ID. I set those manually to see if it would help; no dice:
- name: AWS_CLUSTER_ID
value: k8s-wl-snd-use1-default
- name: AWS_REGION
value: us-east-1
- name: AWS_VPC_K8S_CNI_LOGLEVEL
value: DEBUG
- name: USE_CLOUDWATCH
value: 'false'
- name: USE_PROMETHEUS
value: 'true'
There is no security group rule that blocks inter-node communication on ports above 1024.
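In case the problem is on the path from the API server to the pod rather than in the helper itself, a port-forward should let me hit the metrics endpoint while skipping the proxy subresource (again using a pod name from the logs; aws-node runs with host networking, so this forwards to the node's 61678):
$ kubectl port-forward -n kube-system pod/aws-node-n929t 61678:61678   # in one terminal
$ curl -s http://localhost:61678/metrics | head                        # in another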
Thanks!