kube-prometheus
node-exporter TargetDown
What did you do?
Installed prometheus-operator using the Helm chart found here: https://github.com/helm/charts/tree/master/stable/prometheus-operator . My GKE test cluster uses preemptible nodes, and after nodes are preempted I start getting alerts from Prometheus that node-exporter targets are down, even though all node-exporters are up and running (I see metrics when port-forwarding to them).
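(For anyone reproducing this: the install was presumably along the lines of the sketch below, assuming Helm 2; the release name `monitoring` is inferred from the pod names further down, so adjust it to your setup.)

```sh
# Install the stable/prometheus-operator chart (Helm 2 syntax).
# Release name "monitoring" is inferred from the pod names in this report.
helm install stable/prometheus-operator \
  --name monitoring \
  --namespace monitoring
```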
Labels | State | Active Since | Value
-- | -- | -- | --
alertname="TargetDown" job="node-exporter" severity="warning" | firing | 2018-11-21 10:46:39.941756819 +0000 UTC | 66.66666666666666
❯ kubectl get po -n monitoring
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-monitoring-prometheus-oper-alertmanager-0   2/2     Running   0          3h
monitoring-grafana-679df6bd5-vn2wz                       3/3     Running   4          1d
monitoring-kube-state-metrics-764d6d59df-k7829           1/1     Running   0          3h
monitoring-prometheus-node-exporter-4nnjq                1/1     Running   0          1d
monitoring-prometheus-node-exporter-ggvrs                1/1     Running   0          21h
monitoring-prometheus-node-exporter-hggzv                1/1     Running   0          21h
monitoring-prometheus-node-exporter-jwxcn                1/1     Running   0          1d
monitoring-prometheus-oper-operator-55564b6cbb-rdzqm     1/1     Running   0          2h
prometheus-monitoring-prometheus-oper-prometheus-0       3/3     Running   0          2h
What did you expect to see?
After nodes are removed/preempted and new nodes are added, node-exporter targets are refreshed and always pick up the new nodes.
What did you see instead? Under which circumstances?
At the moment I have 4 nodes in my cluster, but metrics in Prometheus are only available for 2 of them.
Environment
- Prometheus Operator version:
  quay.io/coreos/prometheus-operator:v0.25.0
- Kubernetes version information:
  kubectl version
  Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.0", GitCommit:"0ed33881dc4355495f623c6f22e7dd0b7632b7c0", GitTreeState:"clean", BuildDate:"2018-09-27T17:05:32Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"darwin/amd64"}
  Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-gke.11", GitCommit:"fa90543563c9cfafca69128ce8cd9ecd5941940f", GitTreeState:"clean", BuildDate:"2018-11-08T20:22:21Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster kind:
  GKE cluster created with Terraform
- Manifests:
  insert manifests relevant to the issue
- Prometheus Operator Logs:
  Haven't found any specific errors
~@hpio we just merged a PR last night that should fix this: https://github.com/coreos/prometheus-operator/pull/2146~
Sorry, this is a different issue.
Can anyone give some input? I'm not sure if that's a bug or possibly a configuration issue. Happy to provide more info if required.
I am using the same Kubernetes version and hit the same error when I add new nodes. It always alerts me about RBAC for the apiserver in the node-exporter pod.
Can you share what you see on the /targets page of the Prometheus UI?
Endpoint | State | Labels | Last Scrape | Error
-- | -- | -- | -- | --
http://172.21.1.2:9100/metrics | up | endpoint="metrics" instance="172.21.1.2:9100" namespace="monitoring" pod="monitoring-prometheus-node-exporter-bkb7k" service="monitoring-prometheus-node-exporter" | 8.432s ago |
http://172.21.1.3:9100/metrics | down | endpoint="metrics" instance="172.21.1.3:9100" namespace="monitoring" pod="monitoring-prometheus-node-exporter-fztb2" service="monitoring-prometheus-node-exporter" | 22.145s ago | context deadline exceeded
http://172.21.1.4:9100/metrics | up | endpoint="metrics" instance="172.21.1.4:9100" namespace="monitoring" pod="monitoring-prometheus-node-exporter-52v44" service="monitoring-prometheus-node-exporter" | 19.288s ago |
http://172.21.1.6:9100/metrics | up | endpoint="metrics" instance="172.21.1.6:9100" namespace="monitoring" pod="monitoring-prometheus-node-exporter-h867x" service="monitoring-prometheus-node-exporter" | 21.918s ago |
http://172.21.1.7:9100/metrics | up | endpoint="metrics" instance="172.21.1.7:9100" namespace="monitoring" pod="monitoring-prometheus-node-exporter-957sq" service="monitoring-prometheus-node-exporter" | 12.433s ago |
The target that is being reported as down no longer exists; there's no node with that IP at the moment.
Nodes in that specific cluster have the following IP addresses: 172.21.1.2, 172.21.1.4, 172.21.1.5, 172.21.1.6, 172.21.1.7.
As you can see, Prometheus has not discovered node 172.21.1.5, but it keeps a record of the non-existent 172.21.1.3.
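(For anyone debugging the same thing: one way to tell whether the staleness is on the Kubernetes side or the Prometheus side is to compare the node InternalIPs and the node-exporter Endpoints object against the /targets page. A sketch; the Service name is taken from the pod listing above, so adjust it to your release:)

```sh
# InternalIPs of the nodes the API server currently knows about
kubectl get nodes \
  -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'

# Endpoint IPs currently behind the node-exporter Service
kubectl get endpoints -n monitoring monitoring-prometheus-node-exporter \
  -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}'
```

If the Endpoints object already matches the live nodes but /targets still lists the old IP, the stale entry lives in Prometheus' service discovery rather than in Kubernetes.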
Which version of Prometheus are you running? I know there were a couple of versions where the target updating infrastructure had some deadlocks and/or was lagging behind.
Prometheus: 2.4.3
Prometheus Operator: quay.io/coreos/prometheus-operator:v0.25.0
Could you try the latest release candidate to see if this is fixed? That would be v2.6.0-rc.0.
Will send an update after the weekend, thanks
Hi @brancz, I have updated Prometheus to the latest stable 2.5.0, as v2.6.0-rc.0 was crashlooping (I know there's a fix to be merged to make it work).
With the latest stable I still see the issue; at the moment 2 out of 5 node-exporters are being reported down after nodes were preempted.
This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.
Are you still seeing this issue?
I still see issues similar to that, on Prometheus version 2.9.0:
Endpoint | State | Labels | Last Scrape | Scrape Duration
-- | -- | -- | -- | --
http://10.132.15.225:9100/metrics | DOWN | addonmanager_kubernetes_io_mode="Reconcile" instance="10.132.15.225:9100" job="kubernetes-service-endpoints" kubernetes_io_cluster_service="true" kubernetes_io_name="NodeExporter" kubernetes_name="node-exporter" kubernetes_namespace="monitoring" | 17.866s ago | 10s
http://10.132.15.226:9100/metrics | DOWN | addonmanager_kubernetes_io_mode="Reconcile" instance="10.132.15.226:9100" job="kubernetes-service-endpoints" kubernetes_io_cluster_service="true" kubernetes_io_name="NodeExporter" kubernetes_name="node-exporter" kubernetes_namespace="monitoring" | 17.867s ago | 10s
10.132.15.225 and 10.132.15.226 are no longer available in the GKE cluster.
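(If it helps, the same staleness can be confirmed via the Prometheus HTTP API instead of the UI. A sketch, assuming port-forwarded access on localhost:9090 and jq installed; `prometheus-operated` is the governing Service the operator creates, so adjust if your setup differs:)

```sh
# Expose Prometheus locally
kubectl -n monitoring port-forward svc/prometheus-operated 9090 &

# Dump every active target's scrape URL and health as Prometheus sees them
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.scrapeUrl) \(.health)"'
```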
This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.
/remove-lifecycle stale
Anyone got a workaround for this?
@vfiset what is the actual error you are seeing in the Prometheus targets page for the node-exporter instances in question?
@brancz this alert is triggering each time a node is replaced:
- alert: KubeletDown
  annotations:
    message: Kubelet has disappeared from Prometheus target discovery.
    runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown
  expr: |
    absent(up{job="kubelet", metrics_path="/metrics"} == 1)
  for: 15m
  labels:
    severity: critical
The same happens on GKE when using regular nodes: a GCP maintenance event occurs and one or many nodes are replaced.
Not quite sure what I should do. I guess everyone on a public cloud should be facing this problem, so I suspect I have the wrong approach ...
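(Not a fix for the stale discovery itself, but one workaround for churn-induced noise is to lengthen the alert's `for:` clause so short node-replacement windows don't trip it. A sketch of an overridden rule; the 30m value is illustrative, not an upstream recommendation:)

```yaml
- alert: KubeletDown
  annotations:
    message: Kubelet has disappeared from Prometheus target discovery.
  expr: |
    absent(up{job="kubelet", metrics_path="/metrics"} == 1)
  # Tolerate node replacement: require the kubelet job to be absent
  # for a full 30 minutes (instead of the stock 15m) before firing.
  for: 30m
  labels:
    severity: critical
```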
This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.
@brancz Hi, has this issue been fixed?
I think the issue is in the expression used: `100 * (count by(job) (up == 0) / count by(job) (up)) > 10`.
When my exporter nodes come back up, the expression `count by(job) (up == 0)` returns "no data".
Imho, we should have a 0 instead. Then the expression becomes `100 * (0 / count by(job) (up)) > 10`.
But again, the result returns "no data", because the expression `0 > 10` returns "no data".
It appears that when a correct expression gets 0 results, the default answer is "no data".
I may be wrong, but I think a mathematical expression should return 0 instead of "no data" in that case.
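(For what it's worth, PromQL can be coerced into producing that 0: the `or` operator backfills an empty result with a zero-valued vector carrying the matching `job` labels. A sketch of the rewritten TargetDown expression, untested here; note that for alerting, an empty result already means "not firing", so this mostly matters when graphing the ratio:)

```
# Count of down targets per job, defaulting to 0 when nothing is down:
# "or 0 * count by (job) (up)" supplies a zero-valued series per job.
100 * (
  (count by (job) (up == 0) or 0 * count by (job) (up))
  /
  count by (job) (up)
) > 10
```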
This issue has been automatically marked as stale because it has not had any activity in the last 60 days. Thank you for your contributions.
This issue was closed because it has not had any activity in the last 120 days. Please reopen if you feel this is still valid.