prometheus-adapter
Unable to see Node Metrics - error: missing CPU for node "XXX", skipping
What happened?: kubectl top nodes is not returning a proper response; it errors out with: error: metrics not available yet
What did you expect to happen?: It should return a proper response.
Please provide the prometheus-adapter config:
prometheus-adapter config
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2024-02-16T02:52:24Z"
  generation: 1
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/instance: pipeline-monitoring-b9cb893b
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.11.2
  name: prometheus-adapter
  namespace: monitoring-system
  resourceVersion: "40411"
  uid: fe95f92d-788d-494b-8579-06dda771455c
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: metrics-adapter
      app.kubernetes.io/name: prometheus-adapter
      app.kubernetes.io/part-of: kube-prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        checksum.config/md5: 3b1ebf7df0232d1675896f67b66373db
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: metrics-adapter
        app.kubernetes.io/name: prometheus-adapter
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 0.11.2
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: metrics-adapter
                  app.kubernetes.io/name: prometheus-adapter
                  app.kubernetes.io/part-of: kube-prometheus
              namespaces:
              - monitoring-system
              topologyKey: kubernetes.io/hostname
            weight: 100
      automountServiceAccountToken: true
      containers:
      - args:
        - --cert-dir=/var/run/serving-cert
        - --config=/etc/adapter/config.yaml
        - --metrics-relist-interval=1m
        - --prometheus-url=http://prometheus-k8s.monitoring-system.svc:9090/
        - --secure-port=6443
        - --tls-cipher-suites=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.11.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: prometheus-adapter
        ports:
        - containerPort: 6443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 250m
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
        startupProbe:
          failureThreshold: 18
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: tmpfs
        - mountPath: /var/run/serving-cert
          name: volume-serving-cert
        - mountPath: /etc/adapter
          name: config
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: prometheus-adapter
      serviceAccountName: prometheus-adapter
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: tmpfs
      - emptyDir: {}
        name: volume-serving-cert
      - configMap:
          defaultMode: 420
          name: adapter-config
        name: config
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2024-02-16T02:52:24Z"
    lastUpdateTime: "2024-02-16T02:54:10Z"
    message: ReplicaSet "prometheus-adapter-fc7bc9c4d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-02-16T03:22:06Z"
    lastUpdateTime: "2024-02-16T03:22:06Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2
Please provide the HPA resource used for autoscaling:
HPA yaml
Please provide the HPA status:
Please provide the prometheus-adapter logs with -v=6 around the time the issue happened:
prometheus-adapter logs
I0215 04:38:57.171364 1 handler.go:143] prometheus-metrics-adapter: GET "/apis/metrics.k8s.io/v1beta1/nodes" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0215 04:38:57.174856 1 api.go:88] GET http://prometheus-k8s.monitoring-system.svc:9090/api/v1/query?query=sum+by+%28node%29+%28%0A++1+-+irate%28%0A++++node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B60s%5D%0A++%29%0A++%2A+on%28namespace%2C+pod%29+group_left%28node%29+%28%0A++++node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A++%29%0A%29%0Aor+sum+by+%28node%29+%28%0A++1+-+irate%28%0A++++windows_cpu_time_total%7Bmode%3D%22idle%22%2C+job%3D%22windows-exporter%22%2Cnode%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%5B4m%5D%0A++%29%0A%29%0A&time=1707971937.171 200 OK
I0215 04:38:57.174974 1 api.go:107] Response Body: {"status":"success","data":{"resultType":"vector","result":[]}}
I0215 04:38:57.178115 1 api.go:88] GET http://prometheus-k8s.monitoring-system.svc:9090/api/v1/query?query=sum+by+%28instance%29+%28%0A++node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%2Cinstance%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A++-%0A++node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%2Cinstance%3D~%22ip-10-161-218-141.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A%29%0Aor+sum+by+%28instance%29+%28%0A++windows_cs_physical_memory_bytes%7Bjob%3D%22windows-exporter%22%2Cinstance%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-10-161-216-168.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A++-%0A++windows_memory_available_bytes%7Bjob%3D%22windows-exporter%22%2Cinstance%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A%29%0A&time=1707971937.171 200 OK
I0215 04:38:57.178231 1 api.go:107] Response Body: {"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"1563303936"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"3037470720"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"2406985728"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"1566240768"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"3543482368"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"1575890944"]}]}}
I0215 04:38:57.178655 1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178670 1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178676 1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178682 1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178688 1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178694 1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178918 1 httplog.go:132] "HTTP" verb="LIST" URI="/apis/metrics.k8s.io/v1beta1/nodes" latency="8.096979ms" userAgent="kubectl/v1.26.7 (linux/amd64) kubernetes/89a3d86" audit-ID="4d2e7d1c-5107-4e3a-8caa-67c0c4e2ff6f" srcIP="10.161.221.68:53746" resp=200
I0215 04:38:57.221383 1 handler.go:153] prometheus-metrics-adapter: GET "/livez" satisfied by nonGoRestful
I0215 04:38:57.221410 1 pathrecorder.go:241] prometheus-metrics-adapter: "/livez" satisfied by exact match
I0215 04:38:57.221511 1 handler.go:153] prometheus-metrics-adapter: GET "/readyz" satisfied by nonGoRestful
I0215 04:38:57.221539 1 pathrecorder.go:241] prometheus-metrics-adapter: "/readyz" satisfied by exact match
I0215 04:38:57.221513 1 httplog.go:132] "HTTP" verb="GET" URI="/livez" latency="238.051µs" userAgent="kube-probe/1.26+" audit-ID="ac3d0baa-550b-4ea1-a156-563e84503bf4" srcIP="10.161.218.141:24110" resp=200
I0215 04:38:57.221592 1 shared_informer.go:341] caches populated
I0215 04:38:57.221668 1 httplog.go:132] "HTTP" verb="GET" URI="/readyz" latency="255.747µs" userAgent="kube-probe/1.26+" audit-ID="0813d31c-df53-4066-b23d-0edea4e8db04" srcIP="10.161.218.141:24112" resp=200
Anything else we need to know?:
Environment:
- prometheus-adapter version: v0.11.2
- prometheus version: v0.71.2
- Kubernetes version (use
kubectl version
): 1.26 - Cloud provider or hardware configuration: AWS EKS
- Other info:
/assign @dgrisonnet
/triage accepted
Hi, I also encountered a similar problem. (I referred to this, so strictly speaking mine is a bit different.)
I think you are using this ConfigMap. If so, the node_cpu_usage_seconds_total and node_memory_working_set_bytes metrics used there are exposed by the kubelet's /metrics/resource endpoint, which is relatively new. Is this endpoint scraped by Prometheus? If not, NodeMetrics cannot be generated. (See the ServiceMonitor sketch below for one way to scrape it.)
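If you are running the prometheus-operator / kube-prometheus stack, a minimal sketch of a ServiceMonitor that scrapes this endpoint could look like the following. The name, namespace, and selector labels here are assumptions; match them to whatever your kubelet Service actually carries:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet-resource            # hypothetical name
  namespace: monitoring-system
spec:
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kubelet   # assumed label on the kubelet Service
  endpoints:
  - port: https-metrics
    scheme: https
    path: /metrics/resource             # exposes node_cpu_usage_seconds_total etc.
    interval: 30s
    honorLabels: true
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
    relabelings:
    # Record which kubelet endpoint a series came from, so queries can
    # tell it apart from /metrics/cadvisor (see the note below).
    - action: replace
      sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path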
On the other hand, PodMetrics does seem to be coming through, so I think /metrics/cadvisor is already scraped by Prometheus. However, if both /metrics/cadvisor and /metrics/resource are scraped, this ConfigMap will generate incorrect PodMetrics values. This is because container_cpu_usage_seconds_total and container_memory_working_set_bytes, which it uses, are exposed by both endpoints, resulting in double counting. You will therefore need to take a measure such as assigning a metrics_path label using relabel_configs and narrowing each query down to one endpoint (a sketch follows).
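If you manage scrape configs directly rather than via ServiceMonitors, a minimal sketch of such a relabeling for the cAdvisor job could look like this (the job name is illustrative):

scrape_configs:
- job_name: kubelet-cadvisor          # illustrative name
  scheme: https
  metrics_path: /metrics/cadvisor
  kubernetes_sd_configs:
  - role: node
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  # Stamp each series with the endpoint it was scraped from, so the
  # adapter's queries can filter on metrics_path and count each
  # container exactly once.
  - action: replace
    source_labels: [__metrics_path__]
    target_label: metrics_path

With that label in place, the adapter's containerQuery can add a matcher such as metrics_path="/metrics/cadvisor" so only one endpoint contributes to each value.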
As an alternative solution that avoids /metrics/resource entirely, you could follow the approach mentioned here (I used it). In that case, however, you'll need the Prometheus Node Exporter, and you'll also need to assign node names to the node label using relabel_configs, roughly as sketched below.
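A minimal sketch of that relabeling for a node-exporter scrape job, assuming endpoints-based Kubernetes service discovery (the keep-regex is an assumption; match whatever your node-exporter Service is actually named):

scrape_configs:
- job_name: node-exporter
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only node-exporter targets (assumed Service name).
  - action: keep
    source_labels: [__meta_kubernetes_endpoints_name]
    regex: node-exporter
  # Copy the Kubernetes node name into a `node` label so queries that
  # aggregate with `sum by (node) (...)` can attribute usage to nodes.
  - action: replace
    source_labels: [__meta_kubernetes_pod_node_name]
    target_label: node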
Hi, I had a similar issue with CPU metrics for the nodes, and this solution helped me. (In my case, the node label was not being added to the node_cpu_seconds_total metric.)