consul_exporter

Raft Peers isn't correct

Opened by lswith · 3 comments

The raft_peers metric isn't actually the number of peers.

Test

I tested this by removing a peer from our consul cluster and watching the metric. It stays identical to the configured peer set, which does not reflect the actual number of consul peers.

Solution

Looking into the consul codebase, the consul info command gives an accurate count of raft peers. It does this by hitting the /v1/agent/self HTTP endpoint and parsing the JSON output; a sketch of reading that field follows the output below.

"Stats": {
    "agent": {
      "check_monitors": "0",
      "check_ttls": "0",
      "checks": "6",
      "services": "4"
    },
    "build": {
      "prerelease": "",
      "revision": "'a9455cd",
      "version": "0.7.1"
    },
    "consul": {
      "bootstrap": "false",
      "known_datacenters": "1",
      "leader": "true",
      "leader_addr": "10.21.32.247:8300",
      "server": "true"
    },
    "raft": {
      "applied_index": "5738761",
      "commit_index": "5738761",
      "fsm_pending": "0",
      "last_contact": "never",
      "last_log_index": "5738761",
      "last_log_term": "14682",
      "last_snapshot_index": "5733069",
      "last_snapshot_term": "14682",
      "latest_configuration": "[{Suffrage:Voter ID:10.21.32.247:8300 Address:10.21.32.247:8300} {Suffrage:Voter ID:10.6.63.24:8300 Address:10.6.63.24:8300} {Suffrage:Voter ID:10.232.97.195:8300 Address:10.232.97.195:8300}]",
      "latest_configuration_index": "1",
      "num_peers": "2",
      "protocol_version": "1",
      "protocol_version_max": "3",
      "protocol_version_min": "0",
      "snapshot_version_max": "1",
      "snapshot_version_min": "0",
      "state": "Leader",
      "term": "14682"
    },
    "runtime": {
      "arch": "amd64",
      "cpu_count": "2",
      "goroutines": "948",
      "max_procs": "2",
      "os": "linux",
      "version": "go1.7.3"
    },
    "serf_lan": {
      "encrypted": "false",
      "event_queue": "0",
      "event_time": "585",
      "failed": "10",
      "health_score": "0",
      "intent_queue": "0",
      "left": "268",
      "member_time": "88458",
      "members": "494",
      "query_queue": "0",
      "query_time": "2"
    },
    "serf_wan": {
      "encrypted": "false",
      "event_queue": "0",
      "event_time": "1",
      "failed": "0",
      "health_score": "0",
      "intent_queue": "0",
      "left": "0",
      "member_time": "8",
      "members": "1",
      "query_queue": "0",
      "query_time": "1"
    }
  }
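
For illustration, here is a minimal Go sketch (not the exporter's actual code) of reading that value. Only the /v1/agent/self endpoint and the Stats.raft.num_peers key come from the output above; the struct names and agent address are assumptions. Note that num_peers appears to exclude the local server: it reads "2" above while latest_configuration lists three voters.

// Sketch: read the live peer count from Stats.raft.num_peers
// via /v1/agent/self. Struct names and agent address are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strconv"
)

// agentSelf models only the fields this sketch needs from the
// /v1/agent/self response shown above.
type agentSelf struct {
	Stats struct {
		Raft struct {
			NumPeers string `json:"num_peers"` // consul returns this as a string
		} `json:"raft"`
	} `json:"Stats"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:8500/v1/agent/self")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var self agentSelf
	if err := json.NewDecoder(resp.Body).Decode(&self); err != nil {
		log.Fatal(err)
	}

	// num_peers counts the other voting servers this node knows about.
	numPeers, err := strconv.Atoi(self.Stats.Raft.NumPeers)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("raft peers (excluding self):", numPeers)
}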

lswith · Mar 09 '17 00:03

This may be a bug in consul, however: https://github.com/hashicorp/consul/issues/1562

lswith · Mar 09 '17 00:03

We can only report what consul tells us. If there's another, better metric we're missing, you can send a PR to add it, but a quick look at that issue indicates that the value of this metric is correct.

brian-brazil · Mar 09 '17 01:03

It's not so much a question of whether we're doing this correctly, but rather: what is this metric used for?

I think the consul team is still trying to figure out if having zombie peers in your peer configuration is useful or not.

I think the purpose of this metric is to give the number of peers currently in the cluster, not the configured number of peers. That's why consul info hits a different endpoint.

If that isn't the purpose of this metric, then a new metric should be added to better capture this information.
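
As a hypothetical sketch only, a separate gauge fed from Stats.raft.num_peers could look something like the following with prometheus/client_golang. The metric name consul_raft_num_peers, the hard-coded value, and the listen address are placeholders, not a proposed final design; a real collector would refresh the value from /v1/agent/self on each scrape.

// Hypothetical sketch: exposing the live peer count as its own gauge.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var raftNumPeers = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "consul_raft_num_peers", // placeholder name
	Help: "Raft peers reported by the local agent (Stats.raft.num_peers).",
})

func main() {
	prometheus.MustRegister(raftNumPeers)

	// Placeholder: set once here; a real collector would update this
	// from /v1/agent/self during each scrape.
	raftNumPeers.Set(2)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9107", nil)) // illustrative listen address
}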

lswith · Mar 09 '17 01:03