kube-prometheus

node-exporter TargetDown

hpio opened this issue on Nov 21 '18 · 21 comments

What did you do?

Installed prometheus-operator using the Helm chart found here: https://github.com/helm/charts/tree/master/stable/prometheus-operator . My GKE test cluster uses preemptible nodes, and after nodes are preempted I start getting alerts from Prometheus that node-exporter targets are down, even though all node-exporters are up and running (I see metrics when port-forwarding to them).

Labels | State | Active Since | Value
-- | -- | -- | --
alertname="TargetDown"   job="node-exporter"  severity="warning" | firing | 2018-11-21 10:46:39.941756819 +0000 UTC | 66.66666666666666
❯ kubectl get po -n monitoring
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-monitoring-prometheus-oper-alertmanager-0   2/2     Running   0          3h
monitoring-grafana-679df6bd5-vn2wz                       3/3     Running   4          1d
monitoring-kube-state-metrics-764d6d59df-k7829           1/1     Running   0          3h
monitoring-prometheus-node-exporter-4nnjq                1/1     Running   0          1d
monitoring-prometheus-node-exporter-ggvrs                1/1     Running   0          21h
monitoring-prometheus-node-exporter-hggzv                1/1     Running   0          21h
monitoring-prometheus-node-exporter-jwxcn                1/1     Running   0          1d
monitoring-prometheus-oper-operator-55564b6cbb-rdzqm     1/1     Running   0          2h
prometheus-monitoring-prometheus-oper-prometheus-0       3/3     Running   0          2h
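
(For context: the TargetDown alert firing above comes from the general alert group shipped with kube-prometheus and fires when more than 10% of a job's targets are down. A rough sketch of the rule as it looked around that time, using the expression discussed later in this thread; exact wording and timings vary between chart versions:)

    - alert: TargetDown
      annotations:
        message: '{{ $value }}% of the {{ $labels.job }} targets are down.'
      expr: 100 * (count by(job) (up == 0) / count by(job) (up)) > 10
      for: 10m
      labels:
        severity: warning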

What did you expect to see?

After nodes are removed/preempted and new nodes are added, node-exporter targets are refreshed and always pick up the new nodes.

What did you see instead? Under which circumstances?

At the moment I have 4 nodes in my cluster but metrics in Prometheus are only available for 2 of them.

Environment

  • Prometheus Operator version:

    quay.io/coreos/prometheus-operator:v0.25.0

  • Kubernetes version information:

kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.0", GitCommit:"0ed33881dc4355495f623c6f22e7dd0b7632b7c0", GitTreeState:"clean", BuildDate:"2018-09-27T17:05:32Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-gke.11", GitCommit:"fa90543563c9cfafca69128ce8cd9ecd5941940f", GitTreeState:"clean", BuildDate:"2018-11-08T20:22:21Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:

    GKE cluster created with terraform

  • Manifests:

(none provided)
  • Prometheus Operator Logs:
Haven't found any specific errors.

hpio commented on Nov 21 '18

~@hpio we just merged a PR last night that should fix this: https://github.com/coreos/prometheus-operator/pull/2146~

Sorry, this is a different issue.

squat commented on Nov 21 '18

Can anyone give some input? I'm not sure if this is a bug or a configuration issue. Happy to provide more info if required.

hpio commented on Nov 27 '18

I am using the same Kubernetes version and hit the same error when I add new nodes. It always alerts me about RBAC/apiserver issues on the node-exporter pods.

autherlj commented on Dec 06 '18

Can you share what you see on the /targets page of the Prometheus UI?

brancz commented on Dec 06 '18


Endpoint | State | Labels | Last Scrape | Error
-- | -- | -- | -- | --
http://172.21.1.2:9100/metrics | up | endpoint="metrics" instance="172.21.1.2:9100" namespace="monitoring" pod="monitoring-prometheus-node-exporter-bkb7k" service="monitoring-prometheus-node-exporter" | 8.432s ago |
http://172.21.1.3:9100/metrics | down | endpoint="metrics" instance="172.21.1.3:9100" namespace="monitoring" pod="monitoring-prometheus-node-exporter-fztb2" service="monitoring-prometheus-node-exporter" | 22.145s ago | context deadline exceeded
http://172.21.1.4:9100/metrics | up | endpoint="metrics" instance="172.21.1.4:9100" namespace="monitoring" pod="monitoring-prometheus-node-exporter-52v44" service="monitoring-prometheus-node-exporter" | 19.288s ago |
http://172.21.1.6:9100/metrics | up | endpoint="metrics" instance="172.21.1.6:9100" namespace="monitoring" pod="monitoring-prometheus-node-exporter-h867x" service="monitoring-prometheus-node-exporter" | 21.918s ago |
http://172.21.1.7:9100/metrics | up | endpoint="metrics" instance="172.21.1.7:9100" namespace="monitoring" pod="monitoring-prometheus-node-exporter-957sq" service="monitoring-prometheus-node-exporter" | 12.433s ago |

The target being reported as down no longer exists; there is no node with that IP at the moment.

Nodes in that specific cluster have the following IP addresses: 172.21.1.2, 172.21.1.4, 172.21.1.5, 172.21.1.6, 172.21.1.7

As you can see, Prometheus has not discovered node 172.21.1.5, but it keeps a record of the non-existent 172.21.1.3.

hpio commented on Dec 06 '18
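
(A debugging step worth trying at this point, not mentioned in the thread: compare what Kubernetes itself lists as endpoints for the node-exporter Service with what Prometheus shows on /targets. The Service name below is taken from the service="..." label in the table above. If the stale 172.21.1.3 address is already gone from the Endpoints object, the lag is inside Prometheus' service discovery rather than in Kubernetes:)

    ❯ kubectl -n monitoring get endpoints monitoring-prometheus-node-exporter -o yaml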

Which version of Prometheus are you running? I know there were a couple of versions where the target-updating infrastructure had deadlocks and/or was lagging behind.

brancz commented on Dec 06 '18

Prometheus: 2.4.3
Prometheus Operator: quay.io/coreos/prometheus-operator:v0.25.0

hpio commented on Dec 06 '18

Could you try the latest release candidate to see if this is fixed? That would be v2.6.0-rc.0.

brancz commented on Dec 06 '18

Will send an update after the weekend, thanks

hpio commented on Dec 07 '18

Hi @brancz, I have updated Prometheus to the latest stable 2.5.0, as v2.6.0-rc.0 was crashlooping (I know there's a fix to be merged to make it work).

With the latest stable I still see the issue; at the moment 2 out of 5 node exporters are being reported down after nodes were preempted.

hpio commented on Dec 11 '18

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

stale[bot] commented on Aug 14 '19

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

stale[bot] commented on Oct 20 '19

Are you still seeing this issue?

brancz commented on Oct 21 '19

I still see issues similar to this.

Prometheus version 2.9.0


Endpoint | State | Labels | Last Scrape | Scrape Duration
-- | -- | -- | -- | --
http://10.132.15.225:9100/metrics | DOWN | addonmanager_kubernetes_io_mode="Reconcile" instance="10.132.15.225:9100" job="kubernetes-service-endpoints" kubernetes_io_cluster_service="true" kubernetes_io_name="NodeExporter" kubernetes_name="node-exporter" kubernetes_namespace="monitoring" | 17.866s ago | 10s
http://10.132.15.226:9100/metrics | DOWN | addonmanager_kubernetes_io_mode="Reconcile" instance="10.132.15.226:9100" job="kubernetes-service-endpoints" kubernetes_io_cluster_service="true" kubernetes_io_name="NodeExporter" kubernetes_name="node-exporter" kubernetes_namespace="monitoring" | 17.867s ago | 10s

10.132.15.225 and 10.132.15.226 are no longer available in the GKE cluster.

csabaujvari commented on Oct 28 '19

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

stale[bot] commented on Dec 27 '19

/remove-lifecycle stale

Anyone got a workaround for this?

vfiset commented on Mar 25 '20

@vfiset what is the actual error you are seeing in the Prometheus targets page for the node-exporter instances in question?

brancz commented on Mar 26 '20

@brancz this alert is triggering each time a node is replaced:

    - alert: KubeletDown
      annotations:
        message: Kubelet has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown
      expr: |
        absent(up{job="kubelet", metrics_path="/metrics"} == 1)
      for: 15m
      labels:
        severity: critical

The same happens on GKE with regular nodes, when a GCP maintenance occurs and one or more nodes are replaced.

Not quite sure what I should do. I guess everyone on a public cloud must be facing this problem, so maybe I have the wrong approach...

vfiset commented on Mar 26 '20
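
(A common mitigation, offered here as a workaround rather than an official fix: lengthen the for: window or lower the severity of discovery-based alerts such as KubeletDown, so that a node replacement which resolves within the grace period never pages. A minimal sketch, assuming rules are managed through a PrometheusRule resource; the resource name is hypothetical, and you would also need to disable or patch the stock rule so two copies of the alert don't coexist:)

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: kubeletdown-override    # hypothetical name
      namespace: monitoring
    spec:
      groups:
      - name: kubelet-down.overrides
        rules:
        - alert: KubeletDown
          annotations:
            message: Kubelet has disappeared from Prometheus target discovery.
          expr: |
            absent(up{job="kubelet", metrics_path="/metrics"} == 1)
          for: 30m              # longer grace period to ride out node replacement
          labels:
            severity: warning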

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

stale[bot] commented on May 25 '20

@brancz Hi, has this issue been fixed?

frouzbeh commented on Sep 17 '21

I think the issue is in the expression used: 100 * (count by(job) (up == 0) / count by(job) (up)) > 10.

When my node-exporters come back up, the expression count by(job) (up == 0) returns "no data". IMHO, we should get a 0 instead. Then the expression becomes 100 * (0 / count by(job) (up)) > 10. But again, the result is "no data", because the expression 0 > 10 returns "no data". It appears that when a correct expression yields 0 results, the default answer is "no data".

I may be wrong, but I think a mathematical expression should return 0 instead of "no data" in that case.

trotro commented on Mar 03 '22
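
(For what it's worth: if the goal is for the ratio to evaluate to 0 rather than to an empty result when no target is down, the usual PromQL idiom is to union the numerator with a zero-valued vector via or. A sketch, not the expression shipped by kube-prometheus:)

    100 * (
      (count by (job) (up == 0) or count by (job) (up) * 0)
      /
      count by (job) (up)
    ) > 10

(For alerting this makes little practical difference: an empty result already means the alert is not firing, so TargetDown stops firing once the down targets disappear or recover; the rewritten form mainly matters if you want to graph the ratio as a continuous series.)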

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. Thank you for your contributions.

github-actions[bot] commented on Jan 13 '23

This issue was closed because it has not had any activity in the last 120 days. Please reopen if you feel this is still valid.

github-actions[bot] commented on May 14 '23