Missing telemetry core.unsealed metrics on standby nodes

Open exo-cedric opened this issue 4 years ago • 16 comments

Describe the bug

This is a follow-up to the slightly different issue #9771.

To Reproduce

Steps to reproduce the behavior:

  1. Set up a Raft cluster
  2. Query metrics on a standby (Follower) node
  3. Observe that the vault.core.unsealed metric is missing

Expected behavior

The vault.core.unsealed metric should be present, as it is on the active (Leader) node:

# wget -qO- http://127.0.0.1:18200/v1/sys/metrics | jq '.Gauges[] | select(.Name=="vault.core.unsealed")'
{
  "Name": "vault.core.unsealed",
  "Value": 1
}

The lack of the core.unsealed metric on an HA standby node is problematic, since it makes it impossible to monitor the health of all HA nodes (and to make sure HA is actually still available).
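A per-node check along these lines could be sketched as follows. This is only an illustration: the helper name unsealed_value is hypothetical, and it assumes the metrics are fetched in Prometheus text format (the format=prometheus parameter used later in this thread) without trailing timestamps on the sample lines.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus-format metrics on stdin and print the
# value of vault_core_unsealed, or "missing" if the metric is absent.
# Assumes sample lines carry no trailing timestamp, so the last field is the value.
unsealed_value() {
  awk '
    /^vault_core_unsealed([{ ]|$)/ { print $NF; found = 1; exit }
    END { if (!found) print "missing" }
  '
}

# Sample line in the shape shown later in this thread.
printf '%s\n' 'vault_core_unsealed{cluster="pace-vault"} 1' | unsealed_value
```

Against a live node, the input could come from something like wget -qO- 'http://127.0.0.1:18200/v1/sys/metrics?format=prometheus' piped into unsealed_value; a standby node affected by this bug would then print "missing".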

Environment:

  • Vault Server Version (retrieve with vault status): 1.5.0
  • Vault CLI Version (retrieve with vault version): 1.5.0
  • Server Operating System/Architecture: Linux x86_64

Vault server configuration file(s):

N/A

Additional context

Quickly going through core.go, core_metrics.go and ha.go, it seems to me that emitMetrics (which spawns the metrics loop that refreshes the core.unsealed metric) is only called via postUnseal, which is not called for an HA standby node (in core.go); only the Leader/Active node actually calls postUnseal (in ha.go).

exo-cedric avatar Sep 22 '20 15:09 exo-cedric

Facing the same issue on 1.6.x versions as well. Is this something that will be fixed?

alwaysastudent avatar May 08 '21 00:05 alwaysastudent

Hello! Any update on this? This is blocking us from abandoning vault_exporter.

KawaiDesu avatar Oct 01 '21 15:10 KawaiDesu

This issue is also seen on Vault version 1.7.2. I'm using the statsite telemetry provider.

hellstrikes13 avatar Oct 12 '21 08:10 hellstrikes13

Wanting to chime in that we're still working on a resolution for this. Thanks for your patience!

heatherezell avatar Dec 15 '21 20:12 heatherezell

@hsimon-hashicorp Hi, have any updates on this? It's important to emit metrics on the standby node in HA mode.

Bowser1704 avatar Apr 14 '22 03:04 Bowser1704

We just upgraded to Vault 1.11.3. We saw all Vault replicas export vault_core_unsealed for 12h (the value of our prometheus_retention_time), but without the cluster label. The leader also exported one with the cluster label. After 12 hours, the unlabeled ones disappeared.

I'm going to guess they just hadn't finished determining they were a cluster yet, and as soon as they went into HA standby mode, the standbys started hitting this bug and not reporting the metric.

geekofalltrades avatar Sep 26 '22 16:09 geekofalltrades

Just my 2 cents: "vault.core.unsealed" is missing, but the very basic "vault.core.active" is also missing. Is the issue perhaps related to the whole vault.core telemetry namespace?

dguihal avatar Nov 24 '22 13:11 dguihal

I see the same behaviour with Vault 1.12.1 and missing vault_core_active metric after some time. We've used the absence of that metric to determine missing leaders and got alerted by prometheus many times in the past.

We have a 3 node Vault setup with Raft storage deployed in K8s. I've queried the metrics endpoint from each pod and the metric is missing everywhere. Also, the vault-active service does not include the metric (as expected if it's missing on the pods themselves.)

laugmanuel avatar Dec 09 '22 13:12 laugmanuel

@hsimon-hashicorp any updates about this issue? the lack of reliable core metrics makes it very difficult to properly monitor vault using prometheus.

claviola avatar Jan 24 '23 16:01 claviola

Also faced this problem; it also needs a resolution.

none0nfg avatar Feb 08 '23 01:02 none0nfg

+1

konstantin-921 avatar Feb 09 '23 11:02 konstantin-921

Hello, Any update? This really makes the unsealed metric useless. Thanks.

cadmuxe avatar May 18 '23 23:05 cadmuxe

Any update on this? It's been more than 3 years, and the issue is still open.

p-k-sharma avatar Oct 27 '23 09:10 p-k-sharma

Do I understand this right? There is no way of knowing with Prometheus whether a VM in an HA cluster is sealed as long as some are unsealed. Has anyone found a solution to this? I really do not want to wait until the whole Vault cluster is sealed before I get an alert. It defeats the purpose of an HA setup, where you can fix issues as they happen while keeping Vault unsealed. I just checked, and this seems to be the case for Vault Enterprise too.

ameflorenti avatar Dec 05 '23 16:12 ameflorenti

WORKAROUND:
While trying to get label values for the cluster, I noticed that Vault does not return metrics for sealed nodes. I then named and organized the Prometheus jobs per cluster, as I did in Vault, as a way of getting the list of nodes in a cluster even when they are sealed. Using this query, I can "deduce" that the nodes in the HA cluster that are not returning metrics are SEALED or UNAVAILABLE to Vault:

count by (instance)(up{job="$cluster"}) unless on(instance) count by (instance)(vault_core_unsealed{job="$cluster"})
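The "unless on(instance)" part of that query is a set difference: every instance that is a scrape target, minus every instance that exposes vault_core_unsealed. A minimal shell sketch of the same idea, with a hypothetical helper name and made-up instance names, might look like this:

```shell
#!/bin/sh
# not_reporting UP_FILE METRIC_FILE   (hypothetical helper)
# Prints every instance listed in UP_FILE that does not appear in METRIC_FILE,
# mirroring `up unless on(instance) vault_core_unsealed`.
not_reporting() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
  # grep exits 1 when nothing is printed; treat that as success, not an error.
  grep -vxF -f "$2" "$1" || [ "$?" -eq 1 ]
}

# Made-up instance names, for illustration only.
tmp=$(mktemp -d)
printf '%s\n' vault-0:8200 vault-1:8200 vault-2:8200 > "$tmp/up"
printf '%s\n' vault-0:8200 > "$tmp/unsealed"
not_reporting "$tmp/up" "$tmp/unsealed"   # lists vault-1:8200 and vault-2:8200
rm -rf "$tmp"
```

As with the PromQL version, an instance in the result is only known to be not reporting the metric; whether it is sealed, unreachable, or a standby hitting this bug has to be checked separately.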

Does that make sense?

ameflorenti avatar Dec 06 '23 15:12 ameflorenti

up{job="vault"} > 0 unless on(instance) vault_core_unsealed

for a warning alert (some Vault instances are sealed)

sum(vault_core_unsealed) < 1 or absent(vault_core_unsealed)

for a critical alert

I think this is a problem that needs to be solved, but for now I can only use this workaround.
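For anyone wanting to drop the two expressions above into a Prometheus rule file, a sketch could look like the following. The group name, alert names, and "for" durations are assumptions chosen for illustration; only the expressions come from this comment.

```yaml
groups:
  - name: vault-seal-status                       # hypothetical group name
    rules:
      - alert: VaultInstanceNotReportingUnsealed  # hypothetical alert name
        expr: up{job="vault"} > 0 unless on(instance) vault_core_unsealed
        for: 5m                                   # assumed duration
        labels:
          severity: warning
      - alert: VaultClusterSealed                 # hypothetical alert name
        expr: sum(vault_core_unsealed) < 1 or absent(vault_core_unsealed)
        for: 5m                                   # assumed duration
        labels:
          severity: critical
```

If sealed nodes report the metric with value 0 instead of omitting it, as a later comment in this thread shows, the warning expression would likely need vault_core_unsealed == 1 on the right-hand side to catch them.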

LeoQuote avatar Dec 19 '23 10:12 LeoQuote

Is this still an issue? I'm seeing vault_core_unsealed metrics even from standby nodes, but note that according to the docs, you need to enable unauthenticated access:

"The /v1/sys/metrics endpoint is only accessible on active nodes and automatically disabled on standby nodes. You can enable the /v1/sys/metrics endpoint on standby nodes by enabling unauthenticated metrics access."

This is on an HA setup in K8s with the Vault Helm chart v0.25.0 and Vault v1.14.0.

When all are sealed:

$ for POD in {0..2}; do echo -n "vault-$POD: "; k get pod vault-$POD -oyaml | grep vault-active || echo; done
vault-0:     vault-active: "false"
vault-1:     vault-active: "false"
vault-2:     vault-active: "false"

$ for POD in {0..2}; do echo -n "vault-$POD: "; k exec -it vault-$POD -- /bin/sh -c "wget -qO - localhost:8200/v1/sys/metrics?format=prometheus" | grep "^vault_core_unsealed" || echo; done
vault-0: vault_core_unsealed{cluster="pace-vault"} 0
vault-1: vault_core_unsealed{cluster="pace-vault"} 0
vault-2: vault_core_unsealed{cluster="pace-vault"} 0

When all are unsealed:

$ for POD in {0..2}; do echo -n "vault-$POD: "; k get pod vault-$POD -oyaml | grep vault-active || echo; done
vault-0:     vault-active: "true"
vault-1:     vault-active: "false"
vault-2:     vault-active: "false"

$ for POD in {0..2}; do echo -n "vault-$POD: "; k exec -it vault-$POD -- /bin/sh -c "wget -qO - localhost:8200/v1/sys/metrics?format=prometheus" | grep "^vault_core_unsealed" || echo; done
vault-0: vault_core_unsealed{cluster="pace-vault"} 1
vault-1: vault_core_unsealed{cluster="pace-vault"} 1
vault-2: vault_core_unsealed{cluster="pace-vault"} 1

Make sure to set the cluster_name field in the config to avoid duplicate metrics: https://github.com/hashicorp/vault/issues/11988
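The settings mentioned in this comment might look roughly like this in the server configuration. This is a fragment only: the listener address and retention value are assumptions, and storage, TLS, and other required settings are left out.

```hcl
# Top-level cluster name, so every node labels its metrics the same way.
cluster_name = "pace-vault"   # value taken from the output above

listener "tcp" {
  address = "0.0.0.0:8200"    # assumed listener address

  telemetry {
    # Allows /v1/sys/metrics to be read without a token, which the docs
    # quoted above say is needed for the endpoint to work on standby nodes.
    unauthenticated_metrics_access = true
  }
}

telemetry {
  prometheus_retention_time = "12h"   # assumed; same value as quoted earlier in this thread
}
```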

I ran into another issue specific to the Vault Helm chart that caused metrics to disappear when all Vault pods are sealed, which we had to work around: https://github.com/hashicorp/vault-helm/issues/990

I'm also running into another problem, which I'm about to file an issue for: the vault_core_unsealed metric specifically disappears after the prometheus_retention_time elapses, because apparently Vault doesn't periodically refresh the metric when it doesn't change.

cascadia-sati avatar Jan 09 '24 13:01 cascadia-sati

Hi folks. Just checking through older bugs. As @cascadia-sati mentioned, it seems this is no longer an issue. Does anyone on this thread still see this problem?

As far as I can see this was actually fixed by https://github.com/hashicorp/vault/pull/12166 a couple of years ago - I've looked through the code and can confirm that now runStandby calls metricsLoop which is what outputs this.

Closing for now, please let us know if someone is still seeing this on a version of Vault after 1.13.0.

banks avatar Jul 24 '24 14:07 banks