
CNIMetricsHelper erroring on polling pods. "Failed to grab CNI endpoint"


I am having a similar issue as reported in https://github.com/aws/amazon-vpc-cni-k8s/issues/1912

I installed the cni-metrics-helper via the Helm chart (v1.18.5) on a 1.30 EKS cluster; all of the add-ons were updated within roughly the last week.

My logs show a failure whenever the helper attempts to pull metrics from the aws-node pods:

{"level":"info","ts":"2024-10-14T20:06:11.547Z","caller":"cni-metrics-helper/main.go:69","msg":"Constructed new logger instance"}                                                          
{"level":"info","ts":"2024-10-14T20:06:11.548Z","caller":"runtime/proc.go:271","msg":"Starting CNIMetricsHelper. Sending metrics to CloudWatch: false, Prometheus: true, LogLevel DEBUG, me
tricUpdateInterval 30"}                                                                                                                                                                    
{"level":"info","ts":"2024-10-14T20:06:41.588Z","caller":"runtime/proc.go:271","msg":"Collecting metrics ..."}                                                                             
{"level":"info","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/cni_metrics.go:211","msg":"Total aws-node pod count: 5"}                                                                 
{"level":"debug","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/metrics.go:439","msg":"Total TargetList pod count: 5"}                                                                  
{"level":"error","ts":"2024-10-14T20:08:51.287Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-n929t:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:11:02.359Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-xlz6m:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:13:13.431Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-6kvmk:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:15:24.503Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-8gnpw:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:17:35.575Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-fj6n5:61678)"}                                                                                                                                                  
{"level":"info","ts":"2024-10-14T20:17:35.575Z","caller":"runtime/proc.go:271","msg":"Collecting metrics ..."}                                                                             
{"level":"info","ts":"2024-10-14T20:17:35.576Z","caller":"metrics/cni_metrics.go:211","msg":"Total aws-node pod count: 5"}                                                                 
{"level":"debug","ts":"2024-10-14T20:17:35.576Z","caller":"metrics/metrics.go:439","msg":"Total TargetList pod count: 5"}                                                                  
{"level":"error","ts":"2024-10-14T20:19:46.647Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-6kvmk:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:21:57.719Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-8gnpw:61678)"}        

My Helm config is:

env:
  USE_CLOUDWATCH: "false"
  USE_PROMETHEUS: "true"
  AWS_VPC_K8S_CNI_LOGLEVEL: "DEBUG"
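(For completeness, an install along these lines reproduces my setup; this is a sketch and assumes the chart comes from the standard eks-charts repo.)

helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install cni-metrics-helper eks/cni-metrics-helper \
  --namespace kube-system \
  --set env.USE_CLOUDWATCH=false \
  --set env.USE_PROMETHEUS=true \
  --set env.AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG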

Other than that, there is little additional configuration. The Helm release targets the kube-system namespace.

The ClusterRoleBinding seems correct:

roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cni-metrics-helper
subjects:
  - kind: ServiceAccount
    name: cni-metrics-helper
    namespace: kube-system

The ServiceAccount is in the right place:

$ k get sa -n kube-system cni-metrics-helper
NAME                 SECRETS   AGE
cni-metrics-helper   0         48m
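For reference, the pods/proxy permission can also be checked via impersonation (sketch; it assumes the ServiceAccount name shown above):

$ kubectl auth can-i get pods --subresource=proxy -n kube-system \
    --as=system:serviceaccount:kube-system:cni-metrics-helper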

I traced the call down to https://github.com/aws/amazon-vpc-cni-k8s/blob/master/cmd/cni-metrics-helper/metrics/metrics.go#L89

	rawOutput, err := k8sClient.CoreV1().RESTClient().Get().
		Namespace(namespace).
		Resource("pods").
		SubResource("proxy").
		Name(fmt.Sprintf("%v:%v", podName, port)).
		Suffix("metrics").
		Do(ctx).Raw()
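If I'm reading that right, it resolves to a GET on the pods/proxy subresource, which I can reproduce by hand (pod name is one of mine; substitute any aws-node pod):

$ kubectl get --raw "/api/v1/namespaces/kube-system/pods/aws-node-n929t:61678/proxy/metrics"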

We have Istio installed on this cluster, but it is not in the kube-system namespace.

In the other issue there was talk about needing to set the region and cluster ID. I set those manually to see if it would help; no dice:

        - name: AWS_CLUSTER_ID
          value: k8s-wl-snd-use1-default
        - name: AWS_REGION
          value: us-east-1
        - name: AWS_VPC_K8S_CNI_LOGLEVEL
          value: DEBUG
        - name: USE_CLOUDWATCH
          value: 'false'
        - name: USE_PROMETHEUS
          value: 'true'
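(To confirm those overrides actually landed, the rendered env can be read straight from the Deployment spec; this assumes the chart's default Deployment name.)

$ kubectl get deploy -n kube-system cni-metrics-helper \
    -o jsonpath='{.spec.template.spec.containers[0].env}'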

There isn't a security group rule that blocks inter-node communication on ports above 1024.
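In case it matters, this is roughly how I'd double-check the cluster security group rules (sketch; it assumes the EKS cluster name matches the AWS_CLUSTER_ID above):

SG=$(aws eks describe-cluster --name k8s-wl-snd-use1-default \
      --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text)
aws ec2 describe-security-group-rules --filters Name=group-id,Values="$SG"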

Thanks!

taer · Oct 14 '24, 20:10