raft_peers metric goes missing 12h after start

Open skibadmitriy opened this issue 3 years ago • 5 comments

Describe the bug: The vault_raft_peers metric goes missing 12h after start.

To Reproduce: Fetch the metrics URL 12h after start.

Expected behavior: The vault_raft_peers metric is present.

Environment: Vault v1.12.1 (e34f8a14fb7a88af4640b09f3ddbb5646b946d9c), built 2022-10-27T12:32:05Z

Vault server configuration file(s):

    telemetry {
      disable_hostname = true
      prometheus_retention_time = "12h"
    }

skibadmitriy avatar Dec 02 '22 13:12 skibadmitriy

Maybe I am wrong, but as far as I understand prometheus_retention_time, it defines how long a metric keeps being served if its value does not change (in theory, how long it is held in memory while unchanged). So it would make sense that the peers metric disappears after 12 hours (since you configured 12h in prometheus_retention_time), unless the peers change during that time. I guess that is not the case here, since the metric represents the configured number of peers (a different value would require a change in config).
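To make that concrete, here is a toy model of the behavior, not Vault's or go-metrics' actual code. It assumes expiry is keyed on when a gauge was last set; the class and method names (`ExpiringGaugeStore`, `set_gauge`, `collect`) are made up for illustration.

```python
import time


class ExpiringGaugeStore:
    """Toy in-memory gauge store that forgets gauges not set recently.

    Hypothetical model for illustration only; not Vault's implementation.
    """

    def __init__(self, retention_seconds, clock=time.monotonic):
        self.retention_seconds = retention_seconds
        self.clock = clock  # injectable so the example can fake time
        self._gauges = {}   # name -> (value, time of last set_gauge call)

    def set_gauge(self, name, value):
        # Every call refreshes the timestamp, even if the value is the same.
        self._gauges[name] = (value, self.clock())

    def collect(self):
        # Drop anything that has not been set within the retention window,
        # then return what is left, as a scrape would see it.
        now = self.clock()
        expired = [
            name
            for name, (_, last_set) in self._gauges.items()
            if now - last_set > self.retention_seconds
        ]
        for name in expired:
            del self._gauges[name]
        return {name: value for name, (value, _) in self._gauges.items()}
```

With a 12h retention and a gauge that is set once at startup and never again, `collect()` still returns it at 11h but returns nothing for it shortly after 12h, which matches what is reported above.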

f4z3r avatar Dec 02 '22 17:12 f4z3r

This is a bug that is somewhat similar to https://github.com/hashicorp/consul/issues/13498

I got an initial response from a HashiCorp person over there, but they've stopped responding.

There's actually a whole family of related work needed in https://github.com/hashicorp/go-metrics, Vault, and Consul to make Prometheus metrics in these products work in the way a Prometheus user would want them to.

maxb avatar Dec 03 '22 11:12 maxb

> This is a bug that is somewhat similar to hashicorp/consul#13498
>
> I got an initial response from a HashiCorp person over there, but they've stopped responding.
>
> There's actually a whole family of related work needed in https://github.com/hashicorp/go-metrics, Vault, and Consul to make Prometheus metrics in these products work in the way a Prometheus user would want them to.

Unfortunately Amier is no longer with the company - I'll see if we can think about this more holistically, across products. Thanks Max!

heatherezell avatar Dec 05 '22 18:12 heatherezell

@heatherezell, any update? We defined an alert in Prometheus for vault_raft_peers and realized that the metric is simply gone... This is pretty dangerous, and if this is the behavior for all other Prometheus metrics, then alerts can only rely on metrics that keep changing... which is pretty difficult when one wants to check whether the cluster still has all its members...

@maxb, @skibadmitriy any workarounds that you found?

Cajga avatar May 29 '25 08:05 Cajga

> any workarounds that you found?

Only stopping using metrics for this and polling the HTTP API for the information instead.
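A rough sketch of what such polling could look like. Which endpoint is meant is not stated above; this assumes Vault's `GET /v1/sys/storage/raft/configuration`, a token allowed to read it, and a response whose peer list sits under `data.config.servers`. The function names are made up for illustration.

```python
import json
import urllib.request


def count_raft_peers(payload):
    """Count servers in a parsed raft configuration response.

    Assumes the list is at payload["data"]["config"]["servers"].
    """
    data = payload.get("data") or {}
    config = data.get("config") or {}
    return len(config.get("servers") or [])


def fetch_raft_configuration(vault_addr, token, timeout=5.0):
    """Fetch the raft configuration from Vault's HTTP API as a dict."""
    request = urllib.request.Request(
        vault_addr.rstrip("/") + "/v1/sys/storage/raft/configuration",
        headers={"X-Vault-Token": token},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Something like `count_raft_peers(fetch_raft_configuration("http://127.0.0.1:8200", token))` can then be run on a schedule and compared against the expected cluster size, so the check does not depend on the metric still being retained.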

maxb avatar May 30 '25 06:05 maxb