kube-prometheus

One of five nodes always has its metrics target down

Open sd3428783 opened this issue 1 year ago • 8 comments

Get "https://x.x.x.58:10250/metrics": context deadline exceeded

Sometimes it is x.x.x.58 (node04), sometimes x.x.x.56 (node02).

At any given time, one of the nodes is always down.
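To see which node is timing out at a given moment, the endpoints can be probed by hand with a deadline similar to the one Prometheus applies. The following is a minimal Python sketch, not part of the original report. It assumes a 10-second timeout (the actual scrape timeout is not stated in this thread), skips TLS verification, and sends no bearer token, so an HTTP 401/403 reply still counts as "reachable". The function names `probe` and `probe_all` are hypothetical.

```python
import socket
import ssl
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Classify one metrics URL as ok, reachable, timeout, or error."""
    # Assumption: certificates are not verified, because kubelet
    # endpoints often use cluster-internal certificates.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx):
            return "ok"
    except urllib.error.HTTPError as exc:
        # The server answered in time, e.g. 401 because no token was sent.
        return f"reachable (HTTP {exc.code})"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return f"error: {exc.reason}"
    except OSError as exc:
        return f"error: {exc}"


def probe_all(urls, timeout=10.0, probe_fn=probe):
    """Probe every URL and return a {url: status} mapping."""
    return {url: probe_fn(url, timeout) for url in urls}
```

Calling `probe_all` repeatedly with the `:10250/metrics` URL of each node, from inside the cluster network, should show whether the "timeout" status moves between nodes the way the Prometheus target page suggests.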

sd3428783 avatar Oct 27 '22 07:10 sd3428783

I got a similar error with the node_exporter deployment: kube-rbac-proxy reports "context deadline exceeded". After the pod is restarted, it works again.

faabsen avatar Oct 27 '22 13:10 faabsen

> I got a similar error with the node_exporter deployment: kube-rbac-proxy reports "context deadline exceeded". After the pod is restarted, it works again.

Hi, is kube-rbac-proxy a pod? I cannot find it. Can I use "delete pod" to restart it?

sd3428783 avatar Oct 28 '22 08:10 sd3428783

When using kube-prometheus, the node_exporter deployment defines a sidecar with a kube-rbac-proxy container. However, it looks like my error is related to node_exporter#2500.

It looks like a kernel version is causing problems:

> Similar issue on Amazon Linux 2 5.4.214-120 whereas 5.4.209-116 works - just hangs with nothing in logs & not responding to probes/healthchecks, so gets restarted by Kubernetes. Hence can't debug much either...

Regarding your question:

> Is kube-rbac-proxy a pod? Can I use "delete pod" to restart it?

It is not a separate pod. It is defined in the node_exporter pod as a sidecar (another container). You can either kill the pod or restart the entire deployment/daemonset.
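A minimal sketch of the two restart options, written as small Python wrappers around `kubectl`. The namespace `monitoring` and the DaemonSet name `node-exporter` are assumptions based on the usual kube-prometheus defaults, and the pod name in the usage note is a hypothetical placeholder; adjust them to match your cluster.

```python
import subprocess


def delete_pod_cmd(pod, namespace="monitoring"):
    """Build the kubectl command that deletes a single pod.

    The controller recreates the pod, which restarts all of its
    containers, including the kube-rbac-proxy sidecar.
    """
    return ["kubectl", "-n", namespace, "delete", "pod", pod]


def rollout_restart_cmd(name="node-exporter", kind="daemonset",
                        namespace="monitoring"):
    """Build the kubectl command that restarts every pod of a workload."""
    return ["kubectl", "-n", namespace, "rollout", "restart", f"{kind}/{name}"]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
```

For example, `run(delete_pod_cmd("node-exporter-abcde"))` prints the command for a single (hypothetical) pod, and `run(rollout_restart_cmd(), dry_run=False)` restarts the whole DaemonSet.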

faabsen avatar Oct 28 '22 09:10 faabsen

I am also seeing one of the node_exporter targets go down. My kernel is 5.4.188. I suspected that the kernel was responsible, but the problem still occurs after upgrading to 5.4.225.

myname855 avatar Dec 02 '22 01:12 myname855

I also have the same problem. Is this a bug?

wyq1229409654 avatar Dec 02 '22 18:12 wyq1229409654

We're also seeing this in Prometheus when scraping external metrics and cAdvisor.

E.g. targets:

https://.../api/v1/nodes/ip-10-10-10-10/proxy/metrics

https://.../api/v1/nodes/ip-10-10-10-10/proxy/metrics/cadvisor

2 of 8 nodes always fail with "context deadline exceeded", and not always the same ones.

Very inconsistent behaviour. Trying to work out what's going on.

Seems to be something to do with the Amazon Linux version the Prometheus container is running on: Amazon Linux 2 5.4.228-131.415.amzn2.x86_64

dcarrion87 avatar Feb 10 '23 06:02 dcarrion87

Try modifying the node-exporter-ds file. (screenshot attached)

wyq1229409654 avatar Feb 10 '23 06:02 wyq1229409654

No difference with GOMAXPROCS=1.

Seems to have only started since we moved from ECS to EKS.

And if I delete a node and get a new one sometimes it's not an issue.

How bizarre.

Which kernel are you using? Some kernel versions are causing trouble.

faabsen avatar Feb 10 '23 10:02 faabsen