
node_meta metrics are messy on Prometheus console

Open emma-qi opened this issue 7 years ago • 6 comments

I have tried to use Prometheus to monitor two Docker swarms together, following your swarmprom guide. Since Prometheus is not in the same overlay network as the monitored nodes, I used static_configs instead of dns_sd_configs:

  1. Deploy node-exporter, cadvisor, and dockerd-exporter as global services on the two Docker swarms separately.
  2. Add all node-exporter, cadvisor, and dockerd-exporter targets using static_configs in prometheus.yml, e.g.:

        scrape_configs:
          - job_name: 'prometheus'
            static_configs:
              - targets: ['localhost:9090']
          - job_name: 'node-exporter'
            static_configs:
              - targets: ['infbjsrv35.cn.oracle.com:9100','infbjsrv36.cn.oracle.com:9100','infbjvm539.cn.oracle.com:9100','infbjvm223.cn.oracle.com:9100']
  3. Start Prometheus, alertmanager and unsee on another host (which is not a node of any swarm).

When I check the node_meta metric in the Prometheus console, the results are messy. Within each swarm, the node_meta data from one node is paired with every node-exporter instance, so the metadata and the instances are mismatched. For example, swarm "A" has two nodes: infbjsrv35.cn.oracle.com and infbjvm223.cn.oracle.com.

node_meta from http://infbjsrv35.cn.oracle.com:9100/metrics is:

    node_meta{container_label_com_docker_swarm_node_id="n9x7iwqhqe51y80c00a5c16fd",node_id="n9x7iwqhqe51y80c00a5c16fd",node_name="infbjsrv35.cn.oracle.com"} 1

node_meta from http://infbjvm223.cn.oracle.com:9100/metrics is:

    node_meta{container_label_com_docker_swarm_node_id="wx86gspnvhgdli8kq0k93m392",node_id="wx86gspnvhgdli8kq0k93m392",node_name="infbjvm223.cn.oracle.com"} 1

But in the Prometheus console, executing node_meta returns 4 series, with the instances and the node metadata mismatched:

    node_meta{container_label_com_docker_swarm_node_id="n9x7iwqhqe51y80c00a5c16fd",instance="infbjvm223.cn.oracle.com:9100",job="node-exporter",node_id="n9x7iwqhqe51y80c00a5c16fd",node_name="infbjsrv35.cn.oracle.com"}  1
    node_meta{container_label_com_docker_swarm_node_id="n9x7iwqhqe51y80c00a5c16fd",instance="infbjsrv35.cn.oracle.com:9100",job="node-exporter",node_id="n9x7iwqhqe51y80c00a5c16fd",node_name="infbjsrv35.cn.oracle.com"}  1
    node_meta{container_label_com_docker_swarm_node_id="wx86gspnvhgdli8kq0k93m392",instance="infbjvm223.cn.oracle.com:9100",job="node-exporter",node_id="wx86gspnvhgdli8kq0k93m392",node_name="infbjvm223.cn.oracle.com"}  1
    node_meta{container_label_com_docker_swarm_node_id="wx86gspnvhgdli8kq0k93m392",instance="infbjsrv35.cn.oracle.com:9100",job="node-exporter",node_id="wx86gspnvhgdli8kq0k93m392",node_name="infbjvm223.cn.oracle.com"}  1
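To make the mismatch concrete, here is a minimal Python sketch that flags the series whose `instance` host differs from its `node_name`. The label values are copied from the console output above; the `SERIES` list and the `find_mismatches` helper are hypothetical names used only for illustration, and the sketch says nothing about why the mismatch occurs.

```python
from typing import Dict, List

# The four node_meta series from the Prometheus console, reduced to the two
# labels needed for the comparison (values copied from the issue).
SERIES: List[Dict[str, str]] = [
    {"instance": "infbjvm223.cn.oracle.com:9100", "node_name": "infbjsrv35.cn.oracle.com"},
    {"instance": "infbjsrv35.cn.oracle.com:9100", "node_name": "infbjsrv35.cn.oracle.com"},
    {"instance": "infbjvm223.cn.oracle.com:9100", "node_name": "infbjvm223.cn.oracle.com"},
    {"instance": "infbjsrv35.cn.oracle.com:9100", "node_name": "infbjvm223.cn.oracle.com"},
]


def find_mismatches(series: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the series whose scraped host does not match node_name."""
    mismatched = []
    for labels in series:
        # Strip the ":9100" port from the instance label to get the host.
        host = labels["instance"].rsplit(":", 1)[0]
        if host != labels["node_name"]:
            mismatched.append(labels)
    return mismatched


if __name__ == "__main__":
    for labels in find_mismatches(SERIES):
        print(f"{labels['instance']} reports node_name={labels['node_name']}")
```

Half of the four series fail the check, which matches the description: each instance appears once with its own metadata and once with the other node's.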

I cannot understand why this happens, or why dns_sd_configs collects the right node metadata. Can you help me?

emma-qi, Dec 26 '17 08:12

Have you recreated the node-exporter services on those servers? If so, delete all node_meta metrics, wait 15s, and it should be OK.
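The reply does not say how to delete the series. One possible way, sketched below in Python, is the TSDB admin API's `delete_series` endpoint. This assumes Prometheus 2.1 or later started with `--web.enable-admin-api`, neither of which is confirmed in this thread; the base URL reuses `localhost:9090` from the config above, and `build_delete_request` is a hypothetical helper. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_delete_request(base_url: str, selector: str = "node_meta") -> Request:
    """Build (but do not send) a POST request that asks Prometheus to
    delete every series matching the given selector.

    Assumption: the server runs Prometheus 2.1+ with the admin API enabled.
    """
    query = urlencode({"match[]": selector})
    url = f"{base_url.rstrip('/')}/api/v1/admin/tsdb/delete_series?{query}"
    return Request(url, method="POST")


if __name__ == "__main__":
    request = build_delete_request("http://localhost:9090")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Deletion through this endpoint is destructive, so it is worth checking the selector with an ordinary query first.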

stefanprodan, Dec 26 '17 08:12

Yes, I modified prometheus.yml to add the static_configs targets, then rebuilt the Prometheus image and used it in the compose file to start the Prometheus container. But the node metadata still does not match the node-exporter instances correctly, as described above.

emma-qi, Dec 26 '17 09:12

Sorry, I don't quite understand. What do you mean by "recreate the node-exporter services"?

emma-qi, Dec 28 '17 01:12

I'm having the same issue with node-exporter. The only way I've been able to fix this is by deploying node-exporter with swarm-exec on the host network only. It seems that node-exporter, when deployed in swarm mode, collects metrics from all over an overlay or bridge network, which has been driving me nuts. Has anyone been able to figure out how to fix this? If it only works on the host network, then swarm mode is just not going to work.

nsaud01, Mar 14 '18 12:03

Yes, I'm having trouble with this metric as well. Mine shows two nodes with the same id and label, and the metrics are quite messy.

lcastrooliveira, Jan 29 '19 15:01

> Yes, I'm having trouble with this metric as well. Mine shows two nodes with the same id and label, and the metrics are quite messy.

This might be an 'oops' in the Grafana dashboard. It uses a table to show the results but does not have the "Instant" box checked in the metric query, so it shows the metadata for each node in the cluster once per 15s resolution step, for whatever time range you're looking at in Grafana. Check Instant and it will show only the latest values in the time range you've selected.
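The difference can be sketched in Python with the two Prometheus HTTP query endpoints, on the assumption that Grafana's Instant toggle roughly corresponds to an instant query (`/api/v1/query`) rather than a range query (`/api/v1/query_range`). The helper names and the 5-minute window are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, expr: str) -> str:
    """URL for an instant query: one sample per series."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def range_query_url(base_url: str, expr: str, start: int, end: int, step: int) -> str:
    """URL for a range query: one sample per series per step."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


def points_per_series(start: int, end: int, step: int) -> int:
    """Upper bound on samples per series over [start, end] at the given
    step in seconds, assuming the series exists for the whole window."""
    return (end - start) // step + 1


if __name__ == "__main__":
    base = "http://localhost:9090"  # assumed address, as in the config above
    print(instant_query_url(base, "node_meta"))
    print(range_query_url(base, "node_meta", 0, 300, 15))
    print(points_per_series(0, 300, 15), "rows per node over 5 minutes")
```

With a table panel fed by the range form, each node's metadata repeats once per step, which is consistent with the duplicated rows described above.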

hamiltont, Aug 07 '19 17:08