kube-prometheus: one of five nodes always has its metrics target down

Get "https://x.x.x.58:10250/metrics": context deadline exceeded

Sometimes it is x.x.x.58 (node04).

Sometimes it is x.x.x.56 (node02).

One of the nodes is always down.

Oct 27 '22 07:10 sd3428783

Got a similar error with the node_exporter deployment. The kube-rbac-proxy gets "context deadline exceeded". After the pod is restarted, it works again.

Oct 27 '22 13:10 faabsen

Got a similar error with the node_exporter deployment. The kube-rbac-proxy gets "context deadline exceeded". After the pod is restarted, it works again.

Hi, is kube-rbac-proxy a pod? I cannot find it. Can I use "delete pod" to restart it?

Oct 28 '22 08:10 sd3428783

So when using kube-prometheus, the node_exporter deployment defines a sidecar with a kube-rbac-proxy container. However, it looks like my error is related to node_exporter#2500.

It looks like a kernel version is causing problems:

Similar issue on Amazon Linux 2 5.4.214-120 whereas 5.4.209-116 works - just hangs with nothing in logs & not responding to probes/healthchecks, so gets restarted by Kubernetes. Hence can't debug much either...

Regarding your question:

Is kube-rbac-proxy a pod? Can I use "delete pod" to restart it?

It is defined in the pod as a sidecar (another container). You can either kill the pod or restart the entire deployment/daemonset.
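The two restart options above can be sketched as follows. This is only a minimal sketch that builds the kubectl command lines; the namespace `monitoring`, the DaemonSet name `node-exporter`, and the pod name `node-exporter-abcde` are assumptions (typical kube-prometheus defaults and a made-up pod name), not values given in this thread.

```python
import subprocess
from typing import List


def rollout_restart_cmd(kind: str, name: str, namespace: str) -> List[str]:
    """Build a `kubectl rollout restart` command for a workload."""
    return ["kubectl", "rollout", "restart", f"{kind}/{name}", "--namespace", namespace]


def delete_pod_cmd(pod: str, namespace: str) -> List[str]:
    """Build a `kubectl delete pod` command; the controller recreates the pod."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


if __name__ == "__main__":
    # Assumed names: adjust namespace, workload, and pod to your cluster.
    restart_all = rollout_restart_cmd("daemonset", "node-exporter", "monitoring")
    restart_one = delete_pod_cmd("node-exporter-abcde", "monitoring")
    print(" ".join(restart_all))
    print(" ".join(restart_one))
    # To actually run one of them (requires kubectl and cluster access):
    # subprocess.run(restart_all, check=True)
```

Because kube-rbac-proxy is a container inside the node_exporter pod, either command restarts both containers together.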

Oct 28 '22 09:10 faabsen

I am also seeing one node's node_exporter target go down. My kernel is 5.4.188. I suspected the kernel had an impact, but the problem also occurs after upgrading to 5.4.225.
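To compare a node's kernel against the versions mentioned so far, something like the following might help. It is a rough sketch: the table only records what posters in this thread reported (from different clusters, and not all of them necessarily on Amazon Linux), so it is anecdotal and not a verified compatibility list. The function names and the `THREAD_REPORTS` table are made up for illustration.

```python
import platform
import re
from typing import Tuple

# Outcomes as described by posters in this thread; anecdotal only.
THREAD_REPORTS = {
    "5.4.188": "affected",
    "5.4.209-116": "works",
    "5.4.214-120": "affected",
    "5.4.225": "affected",
}

_RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def kernel_version(release: str) -> Tuple[int, int, int]:
    """Return (major, minor, patch) from a release string such as '5.4.214-120'."""
    match = _RELEASE_RE.match(release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch))


def thread_report(release: str) -> str:
    """Look up what the thread reported for this (major, minor, patch)."""
    wanted = kernel_version(release)
    for reported, outcome in THREAD_REPORTS.items():
        if kernel_version(reported) == wanted:
            return outcome
    return "not reported"


if __name__ == "__main__":
    # Print the reports in version order.
    for reported in sorted(THREAD_REPORTS, key=kernel_version):
        print(reported, THREAD_REPORTS[reported])
    # platform.release() gives the running kernel release on Linux.
    local = platform.release()
    try:
        print(local, thread_report(local))
    except ValueError:
        print(local, "could not be parsed")
```

Sorted by version, the reports are not monotonic (5.4.188 affected, 5.4.209 working, 5.4.214 and 5.4.225 affected), so a simple "broken since version X" rule does not fit what has been posted here.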

Dec 02 '22 01:12 myname855

I also have the same problem. Is this a bug?

Dec 02 '22 18:12 wyq1229409654

We're also seeing this on prometheus when scraping external metrics and cadvisor.

E.g. targets:

https://.../api/v1/nodes/ip-10-10-10-10/proxy/metrics
https://.../api/v1/nodes/ip-10-10-10-10/proxy/metrics/cadvisor

2 of 8 nodes are always at "context deadline exceeded". Not always the same ones.

Very inconsistent behaviour. Trying to work out what's going on.
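One way to track which targets are down at a given moment is to read the Prometheus `/api/v1/targets` endpoint and keep only the unhealthy entries. The sketch below assumes Prometheus is reachable at `http://localhost:9090` (for example through a port-forward); that URL and the sample addresses are placeholders, not values from this thread.

```python
import json
import urllib.request
from typing import Dict, List, Tuple


def fetch_targets(base_url: str, timeout: float = 10.0) -> Dict:
    """Fetch the target list from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def down_targets(payload: Dict) -> List[Tuple[str, str]]:
    """Return (scrapeUrl, lastError) for every active target whose health is 'down'."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in active
        if target.get("health") == "down"
    ]


if __name__ == "__main__":
    # Made-up sample in the shape of an /api/v1/targets response.
    sample = {
        "status": "success",
        "data": {
            "activeTargets": [
                {"scrapeUrl": "https://10.0.0.56:10250/metrics", "health": "up", "lastError": ""},
                {"scrapeUrl": "https://10.0.0.58:10250/metrics", "health": "down",
                 "lastError": "context deadline exceeded"},
            ]
        },
    }
    for url, error in down_targets(sample):
        print(url, error)
    # Against a live server (assumed URL):
    # print(down_targets(fetch_targets("http://localhost:9090")))
```

Running this periodically and logging the result would show whether the failing node really rotates or whether the same few nodes recur.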

Seems to be something to do with the Amazon Linux version the Prometheus container is running on: Amazon Linux 2 5.4.228-131.415.amzn2.x86_64

Feb 10 '23 06:02 dcarrion87

Try modifying the node-exporter-ds file.

Feb 10 '23 06:02 wyq1229409654

No difference with GOMAXPROCS=1.

Seems to have only started since we moved from ECS to EKS.

And if I delete a node and get a new one sometimes it's not an issue.

How bizarre.

What kernel are you using? Some kernels are known to cause trouble.

Feb 10 '23 10:02 faabsen
