consul_exporter

Raft Peers isn't correct

Opened by lswith · 3 comments

The raft_peers metric isn't actually the number of peers.

Test

I tested this by removing a peer from our consul cluster and watching the metric. It stays identical to the configured peer set, which does not reflect the actual number of consul peers.

Solution

Looking into the consul codebase, the consul info command gives an accurate count of raft peers. It does this by hitting the /v1/agent/self HTTP endpoint and parsing the JSON output; a sketch of reading that field follows the output below.

"Stats": {
    "agent": {
      "check_monitors": "0",
      "check_ttls": "0",
      "checks": "6",
      "services": "4"
    },
    "build": {
      "prerelease": "",
      "revision": "'a9455cd",
      "version": "0.7.1"
    },
    "consul": {
      "bootstrap": "false",
      "known_datacenters": "1",
      "leader": "true",
      "leader_addr": "10.21.32.247:8300",
      "server": "true"
    },
    "raft": {
      "applied_index": "5738761",
      "commit_index": "5738761",
      "fsm_pending": "0",
      "last_contact": "never",
      "last_log_index": "5738761",
      "last_log_term": "14682",
      "last_snapshot_index": "5733069",
      "last_snapshot_term": "14682",
      "latest_configuration": "[{Suffrage:Voter ID:10.21.32.247:8300 Address:10.21.32.247:8300} {Suffrage:Voter ID:10.6.63.24:8300 Address:10.6.63.24:8300} {Suffrage:Voter ID:10.232.97.195:8300 Address:10.232.97.195:8300}]",
      "latest_configuration_index": "1",
      "num_peers": "2",
      "protocol_version": "1",
      "protocol_version_max": "3",
      "protocol_version_min": "0",
      "snapshot_version_max": "1",
      "snapshot_version_min": "0",
      "state": "Leader",
      "term": "14682"
    },
    "runtime": {
      "arch": "amd64",
      "cpu_count": "2",
      "goroutines": "948",
      "max_procs": "2",
      "os": "linux",
      "version": "go1.7.3"
    },
    "serf_lan": {
      "encrypted": "false",
      "event_queue": "0",
      "event_time": "585",
      "failed": "10",
      "health_score": "0",
      "intent_queue": "0",
      "left": "268",
      "member_time": "88458",
      "members": "494",
      "query_queue": "0",
      "query_time": "2"
    },
    "serf_wan": {
      "encrypted": "false",
      "event_queue": "0",
      "event_time": "1",
      "failed": "0",
      "health_score": "0",
      "intent_queue": "0",
      "left": "0",
      "member_time": "8",
      "members": "1",
      "query_queue": "0",
      "query_time": "1"
    }
  }
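
For illustration, here is a minimal Go sketch (not the exporter's actual code) of reading that value. Only the /v1/agent/self endpoint and the Stats.raft.num_peers key come from the output above; the struct names and agent address are assumptions. Note that num_peers appears to exclude the local server: it reads "2" above while latest_configuration lists three voters.

// Sketch: read the live peer count from Stats.raft.num_peers
// via /v1/agent/self. Struct names and agent address are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strconv"
)

// agentSelf models only the fields this sketch needs from the
// /v1/agent/self response shown above.
type agentSelf struct {
	Stats struct {
		Raft struct {
			NumPeers string `json:"num_peers"` // consul returns this as a string
		} `json:"raft"`
	} `json:"Stats"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:8500/v1/agent/self")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var self agentSelf
	if err := json.NewDecoder(resp.Body).Decode(&self); err != nil {
		log.Fatal(err)
	}

	// num_peers counts the other voting servers this node knows about.
	numPeers, err := strconv.Atoi(self.Stats.Raft.NumPeers)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("raft peers (excluding self):", numPeers)
}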

lswith · Mar 09 '17 00:03

This may be a bug in consul, however: https://github.com/hashicorp/consul/issues/1562

lswith · Mar 09 '17 00:03

We can only report what consul tells us. If there's another, better metric we're missing, you can send a PR to add it, but a quick look at that issue indicates that the value of this metric is correct.

brian-brazil · Mar 09 '17 01:03

It's not so much a question of whether we're doing this correctly, but rather: what is this metric used for?

I think the consul team is still trying to figure out if having zombie peers in your peer configuration is useful or not.

I think the purpose of this metric is to give the number of peers currently in the cluster, not the configured number of peers. That's why consul info hits a different endpoint.

If that isn't the purpose of this metric, then a new metric should be added to better capture this information.
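
As a hypothetical sketch only, a separate gauge fed from Stats.raft.num_peers could look something like the following with prometheus/client_golang. The metric name consul_raft_num_peers, the hard-coded value, and the listen address are placeholders, not a proposed final design; a real collector would refresh the value from /v1/agent/self on each scrape.

// Hypothetical sketch: exposing the live peer count as its own gauge.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var raftNumPeers = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "consul_raft_num_peers", // placeholder name
	Help: "Raft peers reported by the local agent (Stats.raft.num_peers).",
})

func main() {
	prometheus.MustRegister(raftNumPeers)

	// Placeholder: set once here; a real collector would update this
	// from /v1/agent/self during each scrape.
	raftNumPeers.Set(2)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9107", nil)) // illustrative listen address
}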

lswith · Mar 09 '17 01:03