kube-prometheus
One of five nodes' metrics target is always down
Get "https://x.x.x.58:10250/metrics": context deadline exceeded
Sometimes it's .58 (node04), sometimes .56 (node02).
One of the nodes is always down.
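A quick way to see which node-exporter targets are currently down is to query Prometheus directly; this is a minimal sketch assuming the kube-prometheus defaults (namespace monitoring, service prometheus-k8s, job label node-exporter):

    # forward the Prometheus UI/API to localhost
    kubectl -n monitoring port-forward svc/prometheus-k8s 9090 &
    # list targets whose last scrape failed
    curl -s 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=up{job="node-exporter"} == 0'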
Got a similar error with the node_exporter deployment: the kube-rbac-proxy gets "context deadline exceeded". After the pod is restarted, it works again.
Hi, is kube-rbac-proxy a pod? I can't find it. Can I use "delete pod" to restart it?
So when using kube-prometheus, the node_exporter deployment defines a sidecar with a kube-rbac-proxy container. However, it looks like my error is related to node_exporter#2500
It looks like a kernel version is causing problems:
Similar issue on Amazon Linux 2 5.4.214-120, whereas 5.4.209-116 works: it just hangs with nothing in the logs and doesn't respond to probes/healthchecks, so it gets restarted by Kubernetes. Hence I can't debug much either...
Regarding your question:
is kube-rbac-proxy a pod? can I use "delete pod" to restart?
It is defined in the pod as a sidecar (another container). You can either kill the pod or restart the entire deployment/daemonset.
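A minimal sketch of both options, assuming the kube-prometheus defaults (namespace monitoring, daemonset node-exporter); the pod name is a placeholder:

    # confirm the kube-rbac-proxy sidecar is one of the pod's containers
    kubectl -n monitoring get pods -l app.kubernetes.io/name=node-exporter \
      -o jsonpath='{.items[0].spec.containers[*].name}'

    # kill a single stuck pod (the daemonset recreates it)
    kubectl -n monitoring delete pod <node-exporter-pod-name>

    # or restart the whole daemonset
    kubectl -n monitoring rollout restart daemonset node-exporter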
I also see one node_exporter target down. My kernel is 5.4.188. I suspected the kernel was the cause, but it still occurs after upgrading to 5.4.225.
I also have the same problem. Is this a bug?
We're also seeing this on Prometheus when scraping external metrics and cadvisor.
E.g. targets:
https://.../api/v1/nodes/ip-10-10-10-10/proxy/metrics
https://.../api/v1/nodes/ip-10-10-10-10/proxy/metrics/cadvisor
2 of 8 nodes are always context deadline exceeded. Not always the same ones.
Very inconsistent behaviour. Trying to work out what's going on.
Seems to be something to do with the Amazon Linux version the Prometheus container is running on: Amazon Linux 2 5.4.228-131.415.amzn2.x86_64
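One way to test those exact endpoints outside Prometheus is to hit them through the API server proxy with kubectl, reusing the node name from the targets above; if these also hang, the problem is on the kubelet/node side rather than in Prometheus:

    # kubelet metrics via the API server proxy
    kubectl get --raw /api/v1/nodes/ip-10-10-10-10/proxy/metrics
    # cadvisor metrics via the same path
    kubectl get --raw /api/v1/nodes/ip-10-10-10-10/proxy/metrics/cadvisor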
Try modifying the node-exporter-ds file.
No difference with GOMAXPROCS = 1
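For reference, one way to apply that setting without hand-editing the daemonset is kubectl set env; a sketch assuming the kube-prometheus defaults (namespace monitoring, daemonset and container both named node-exporter):

    # set GOMAXPROCS=1 on the node-exporter container only
    kubectl -n monitoring set env daemonset/node-exporter \
      -c node-exporter GOMAXPROCS=1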
Seems to have only started since we moved from ECS to EKS.
And if I delete a node and get a new one, sometimes it's not an issue.
How bizarre.
What kernel are you using? Some kernels are known to cause trouble.
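The kernel version for every node is visible straight from kubectl, which makes it easy to compare the affected and healthy nodes:

    # the KERNEL-VERSION column shows the running kernel per node
    kubectl get nodes -o wide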