raft_peers metric goes missing 12h after start
Describe the bug: The vault_raft_peers metric goes missing 12h after start.
To Reproduce: Steps to reproduce the behavior: fetch the metrics URL 12h after start.
Expected behavior: The vault_raft_peers metric is present.
Environment: Vault v1.12.1 (e34f8a14fb7a88af4640b09f3ddbb5646b946d9c), built 2022-10-27T12:32:05Z
Vault server configuration file(s):
telemetry {
  disable_hostname = true
  prometheus_retention_time = "12h"
}
Maybe I am wrong, but as far as I understand prometheus_retention_time, it defines how long a metric is served if its value does not change (in theory, how long it is held in memory while unchanging). Thus it would make sense that the peers metric disappears after 12 hours (since prometheus_retention_time is configured as 12h), unless the peer set changes during that time. I assume it does not, since the metric represents the configured number of peers (a different value would require a configuration change).
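The retention behavior described above can be sketched as a time-based expiring store: a metric that is not re-emitted within the retention window drops out of the scrape output. This is a simplified illustrative model, not Vault's or go-metrics' actual implementation; all names here are hypothetical.

```python
class ExpiringMetricStore:
    """Simplified model of prometheus_retention_time: a metric that is not
    updated within the retention window is dropped from the scrape output."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self._values = {}  # metric name -> (value, last_update_timestamp)

    def set(self, name, value, now):
        # Re-emitting a metric refreshes its last-update timestamp.
        self._values[name] = (value, now)

    def scrape(self, now):
        # Only metrics updated within the retention window are served.
        return {name: value
                for name, (value, updated) in self._values.items()
                if now - updated <= self.retention}


store = ExpiringMetricStore(retention_seconds=12 * 3600)
store.set("vault_raft_peers", 3, now=0)
print(store.scrape(now=3600))       # within retention: metric present
print(store.scrape(now=13 * 3600))  # 13h later, never updated: metric gone
```

In this model, an unchanging gauge like vault_raft_peers survives only if something re-emits it periodically, which matches the disappearance observed in this issue.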
This is a bug that is somewhat similar to https://github.com/hashicorp/consul/issues/13498
I got an initial response from a HashiCorp person over there, but they've stopped responding.
There's actually a whole family of related work needed in https://github.com/hashicorp/go-metrics, Vault, and Consul to make Prometheus metrics in these products work in the way a Prometheus user would want them to.
Unfortunately Amier is no longer with the company - I'll see if we can think about this more holistically, across products. Thanks Max!
@heatherezell, any update? We defined an alert in Prometheus for vault_raft_peers and realized that the metric is simply gone... This is pretty dangerous; if this is the behavior for all other Prometheus metrics, then alerts should only rely on metrics that are changing, which is quite difficult when one wants to check whether the cluster has all its members.
@maxb, @skibadmitriy any workarounds that you found?
> any workarounds that you found?
Only to stop using metrics for this and poll the HTTP API for the information instead.
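The workaround above can be sketched by polling Vault's raft configuration endpoint (`GET /v1/sys/storage/raft/configuration`) instead of relying on the metric. The response shape assumed here (`data.config.servers`) follows the documented endpoint, but verify it against your Vault version; the function and variable names are illustrative.

```python
import json
import os
import urllib.request


def count_raft_peers(payload: dict) -> int:
    """Count servers in a /v1/sys/storage/raft/configuration response body."""
    return len(payload["data"]["config"]["servers"])


def fetch_raft_peers(vault_addr: str, token: str) -> int:
    # Query the raft configuration directly rather than scraping metrics,
    # since vault_raft_peers expires after prometheus_retention_time.
    req = urllib.request.Request(
        f"{vault_addr}/v1/sys/storage/raft/configuration",
        headers={"X-Vault-Token": token},
    )
    with urllib.request.urlopen(req) as resp:
        return count_raft_peers(json.load(resp))


if __name__ == "__main__":
    addr = os.environ.get("VAULT_ADDR", "http://127.0.0.1:8200")
    token = os.environ["VAULT_TOKEN"]
    print(f"raft peers: {fetch_raft_peers(addr, token)}")
```

A script like this could run on a schedule and push the peer count to an alerting system, sidestepping the expiring-metric problem entirely.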