vault icon indicating copy to clipboard operation
vault copied to clipboard

One node not unsealing with auto-unseal

Open mbrancato opened this issue 4 years ago • 7 comments

Describe the bug

Using the Azure auto unseal (with Azure user-assigned identity), one node in a cluster did not become unsealed after vault operator init. A reboot of that node seemed to fix things. The backend storage is consul.

To Reproduce Steps to reproduce the behavior:

  1. Run vault operator init
  2. Run vault status
  3. See error

Expected behavior

All nodes should unseal

Environment:

  • Vault Server Version (retrieve with vault status): 1.3.4+prem
  • Vault CLI Version (retrieve with vault version): 1.4.0
  • Server Operating System/Architecture: RHEL 7, x86_64

Vault server configuration file(s):

storage "consul" {
  address          = "127.0.0.1:8500"
  path             = "vault"
  token            = "***"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  
  tls_disable = false
  tls_cert_file = "/path/vault.crt"
  tls_key_file = "/path/vault.key"
  
}

seal "azurekeyvault" {
  tenant_id     = "***"
  vault_name    = "mykeyvault"
  key_name      = "my-vault-key"
}

Additional context

Below are logs from the node that did not unseal properly.

2020-04-21T01:49:48.129Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
2020-04-21T01:49:48.129Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2020-04-21T01:49:48.129Z [INFO]  core: entering standby mode
2020-04-21T01:49:48.132Z [INFO]  core: vault is unsealed
2020-04-21T01:49:48.132Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
2020-04-21T01:49:48.178Z [INFO]  core: acquired lock, enabling active operation
2020-04-21T01:49:48.603Z [INFO]  core: post-unseal setup starting
2020-04-21T01:49:48.649Z [INFO]  replication.wal: no log directories found: root=wal/logs/
2020-04-21T01:49:49.160Z [WARN]  replication.index.perf: unexpected number of index pages found: pages_restored=1 expect=256
2020-04-21T01:49:49.164Z [INFO]  replication.index.perf: no checkpoint found
2020-04-21T01:49:49.164Z [INFO]  replication.index: no previous commit, forcing re-index: path-type=1
2020-04-21T01:49:49.880Z [WARN]  replication.index.local: unexpected number of index pages found: pages_restored=1 expect=256
2020-04-21T01:49:49.883Z [INFO]  replication.index.local: no checkpoint found
2020-04-21T01:49:49.883Z [INFO]  replication.index: no previous commit, forcing re-index: path-type=2
2020-04-21T01:49:49.908Z [INFO]  core: loaded wrapping token key
2020-04-21T01:49:49.908Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
2020-04-21T01:49:49.913Z [INFO]  core: successfully mounted backend: type=system path=sys/
2020-04-21T01:49:49.913Z [INFO]  core: successfully mounted backend: type=identity path=identity/
2020-04-21T01:49:49.913Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2020-04-21T01:49:49.939Z [INFO]  core: successfully enabled credential backend: type=token path=token/
2020-04-21T01:49:49.939Z [INFO]  core: restoring leases
2020-04-21T01:49:49.939Z [INFO]  rollback: starting rollback manager
2020-04-21T01:49:49.942Z [INFO]  expiration: lease restore complete
2020-04-21T01:49:49.946Z [INFO]  identity: entities restored
2020-04-21T01:49:49.948Z [INFO]  identity: groups restored
2020-04-21T01:49:49.950Z [INFO]  mfa: configurations restored
2020-04-21T01:49:49.975Z [INFO]  core: stopping replication
2020-04-21T01:49:49.975Z [INFO]  core: closed sync connection
2020-04-21T01:49:49.975Z [INFO]  core: replication stopped
2020-04-21T01:49:49.975Z [INFO]  core: setting up replication
2020-04-21T01:49:49.982Z [ERROR] core: a merkle tree reindex is required before replication can be started
2020-04-21T01:49:49.982Z [INFO]  core: reindex is running in background, replication will be started once reindex completes
2020-04-21T01:49:49.982Z [INFO]  core: replication setup finished
2020-04-21T01:49:49.983Z [INFO]  core: running replication reindexing: diff=false force=false toggle replication=true no flush=false trees=2
2020-04-21T01:49:50.099Z [WARN]  core: post-unseal upgrade seal keys failed: error="no recovery key found"
2020-04-21T01:49:50.137Z [INFO]  core: post-unseal setup complete
2020-04-21T01:49:50.146Z [INFO]  replication.index: starting storage scan
2020-04-21T01:49:50.217Z [INFO]  replication.index: key scanning complete: num_keys=14
2020-04-21T01:49:50.289Z [INFO]  replication.wal: prepared to replay logs: diff=1 commit_index=0 last_index=1
2020-04-21T01:49:50.291Z [INFO]  replication.index: re-index all keys, root hash mis-match
2020-04-21T01:49:50.291Z [INFO]  replication.index.perf: re-index all keys, root hash mis-match
2020-04-21T01:49:50.449Z [INFO]  replication.index.local: re-index all keys, root hash mis-match
2020-04-21T01:49:50.580Z [INFO]  core: replication index repaired, starting replication
2020-04-21T01:49:50.580Z [INFO]  core: stopping replication
2020-04-21T01:49:50.580Z [INFO]  core: closed sync connection
2020-04-21T01:49:50.580Z [INFO]  core: replication stopped
2020-04-21T01:49:50.580Z [INFO]  core: setting up replication
2020-04-21T01:49:50.585Z [INFO]  replication.index: clean merkle tree started
2020-04-21T01:49:50.585Z [WARN]  replication: replication is in a failure state, please reindex the merkle tree
2020-04-21T01:49:50.598Z [INFO]  replication.index: clean merkle tree complete: keys_cleaned=0
2020-04-21T01:49:50.619Z [INFO]  core: replicated cluster information not found or disabled, not activating client
2020-04-21T01:49:50.619Z [INFO]  core: replication setup finished
2020-04-21T01:49:50.662Z [INFO]  core: replication reindex complete: fixed_pages=11
2020-04-21T01:49:57.631Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery

mbrancato avatar Apr 21 '20 03:04 mbrancato

Hi, @mbrancato. I've not been successful in reproducing this issue given the steps provided. I'm able to use Azure Key Vault to successfully auto-unseal all 3 Vault instances in an HA cluster. I've done so using both user assigned managed identity and an app registration (using client_id and client_secret). Can you provide me with any additional context that could be of help here?

Some questions I have:

  • Does this consistently reproduce for you?
  • Can you provide logs from the other cluster members?
  • Are you using enterprise replication?
  • Where are each of the Vault instances running?

austingebauer avatar May 15 '20 17:05 austingebauer

Hey @austingebauer

It does seem this is inconsistent. We have tried 3 times in Azure and hit this issue twice. Additionally, we hit this same issue in Google GCE as well. I have the log files from the GCE nodes, I will try to replicate again in Azure soon.

We plan to use enterprise replication, but I think we need to unseal first - we haven't setup the performance replicas yet.

The instances are in the same region but different availability zones.

Logs are below. Node 0 is the new that ended up failing, I'm not sure which node served the initial operator init request. Timestamps below should be synchronized.

vault-gcp-0 vault: 2020-05-18T15:49:31.517Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-gcp-0 vault: 2020-05-18T15:49:31.519Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
vault-gcp-0 vault: 2020-05-18T15:49:33.428Z [INFO]  core: security barrier not initialized
vault-gcp-0 vault: 2020-05-18T15:49:33.604Z [INFO]  core: security barrier not initialized
vault-gcp-0 vault: 2020-05-18T15:49:35.417Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=barrier
vault-gcp-0 vault: 2020-05-18T15:49:35.417Z [ERROR] core: error checking health: error="core: barrier reports initialized but no seal configuration found"
vault-gcp-0 vault: 2020-05-18T15:49:36.519Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-gcp-0 vault: 2020-05-18T15:49:36.770Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
vault-gcp-0 vault: 2020-05-18T15:49:36.770Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
vault-gcp-0 vault: 2020-05-18T15:49:36.770Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:444
vault-gcp-0 vault: 2020-05-18T15:49:36.771Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:444
vault-gcp-0 vault: 2020-05-18T15:49:36.771Z [INFO]  core: entering standby mode
vault-gcp-0 vault: 2020-05-18T15:49:36.778Z [INFO]  core: vault is unsealed
vault-gcp-0 vault: 2020-05-18T15:49:36.778Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
vault-gcp-0 consul: 2020/05/18 15:49:36 [INFO] agent: Synced check "vault:10.213.95.182:443:vault-sealed-check"
vault-gcp-0 vault: 2020-05-18T15:49:36.842Z [INFO]  core: acquired lock, enabling active operation
vault-gcp-0 vault: 2020-05-18T15:49:37.275Z [INFO]  core: post-unseal setup starting
vault-gcp-0 vault: 2020-05-18T15:49:37.412Z [INFO]  replication.wal: wal range recovered: first_wal=1 last_wal=12
vault-gcp-0 vault: 2020-05-18T15:49:37.808Z [WARN]  replication.index.perf: unexpected number of index pages found: pages_restored=7 expect=256
vault-gcp-0 vault: 2020-05-18T15:49:37.811Z [INFO]  replication.index.perf: checkpoint recovery complete
vault-gcp-0 vault: 2020-05-18T15:49:38.174Z [WARN]  replication.index.local: unexpected number of index pages found: pages_restored=7 expect=256
vault-gcp-0 vault: 2020-05-18T15:49:38.177Z [INFO]  replication.index.local: checkpoint recovery complete
vault-gcp-0 vault: 2020-05-18T15:49:38.177Z [INFO]  replication.wal: prepared to replay logs: diff=18446744073709551612 commit_index=16 last_index=12
vault-gcp-0 vault: 2020-05-18T15:49:38.239Z [INFO]  core: loaded wrapping token key
vault-gcp-0 vault: 2020-05-18T15:49:38.239Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
vault-gcp-0 vault: 2020-05-18T15:49:38.245Z [INFO]  core: successfully mounted backend: type=system path=sys/
vault-gcp-0 vault: 2020-05-18T15:49:38.245Z [INFO]  core: successfully mounted backend: type=identity path=identity/
vault-gcp-0 vault: 2020-05-18T15:49:38.245Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
vault-gcp-0 vault: 2020-05-18T15:49:38.263Z [INFO]  core: successfully enabled credential backend: type=token path=token/
vault-gcp-0 vault: 2020-05-18T15:49:38.263Z [INFO]  core: restoring leases
vault-gcp-0 vault: 2020-05-18T15:49:38.263Z [INFO]  rollback: starting rollback manager
vault-gcp-0 vault: 2020-05-18T15:49:38.267Z [INFO]  expiration: lease restore complete
vault-gcp-0 vault: 2020-05-18T15:49:38.271Z [INFO]  identity: entities restored
vault-gcp-0 vault: 2020-05-18T15:49:38.273Z [INFO]  identity: groups restored
vault-gcp-0 vault: 2020-05-18T15:49:38.275Z [INFO]  mfa: configurations restored
vault-gcp-0 vault: 2020-05-18T15:49:38.307Z [INFO]  core: stopping replication
vault-gcp-0 vault: 2020-05-18T15:49:38.307Z [INFO]  core: closed sync connection
vault-gcp-0 vault: 2020-05-18T15:49:38.307Z [INFO]  core: replication stopped
vault-gcp-0 vault: 2020-05-18T15:49:38.307Z [INFO]  core: setting up replication
vault-gcp-0 vault: 2020-05-18T15:49:38.313Z [INFO]  replication.index: clean merkle tree started
vault-gcp-0 vault: 2020-05-18T15:49:38.410Z [INFO]  core: replicated cluster information not found or disabled, not activating client
vault-gcp-0 vault: 2020-05-18T15:49:38.410Z [INFO]  core: replication setup finished
vault-gcp-0 vault: 2020-05-18T15:49:38.455Z [INFO]  replication.index: clean merkle tree complete: keys_cleaned=0
vault-gcp-0 vault: 2020-05-18T15:49:38.456Z [WARN]  core: post-unseal upgrade seal keys failed: error="no recovery key found"
vault-gcp-0 vault: 2020-05-18T15:49:38.465Z [INFO]  core: post-unseal setup complete
vault-gcp-0 consul: 2020/05/18 15:49:38 [INFO] agent: Synced service "vault:10.213.95.182:443"
vault-gcp-0 consul: 2020/05/18 15:49:38 [INFO] agent: Synced check "vault:10.213.95.182:443:vault-sealed-check"
vault-gcp-0 vault: 2020-05-18T15:49:39.671Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:49:49.838Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:49:53.010Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:50:00.000Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:50:10.162Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:50:19.499Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:50:20.325Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:50:30.486Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:50:41.095Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:50:52.425Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:51:03.053Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:51:13.444Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:51:24.456Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-0 vault: 2020-05-18T15:51:35.213Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-gcp-1 vault: 2020-05-18T15:49:32.140Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-gcp-1 vault: 2020-05-18T15:49:32.142Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
vault-gcp-1 vault: 2020-05-18T15:49:37.142Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-gcp-1 vault: 2020-05-18T15:49:37.344Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
vault-gcp-1 vault: 2020-05-18T15:49:37.344Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
vault-gcp-1 vault: 2020-05-18T15:49:37.344Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:444
vault-gcp-1 vault: 2020-05-18T15:49:37.344Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:444
vault-gcp-1 vault: 2020-05-18T15:49:37.344Z [INFO]  core: entering standby mode
vault-gcp-1 vault: 2020-05-18T15:49:37.351Z [INFO]  core: vault is unsealed
vault-gcp-1 vault: 2020-05-18T15:49:37.351Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
vault-gcp-1 consul: 2020/05/18 15:49:37 [INFO] agent: Synced check "vault:10.213.95.156:443:vault-sealed-check"
vault-gcp-2 vault: 2020-05-18T15:49:16.379Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-gcp-2 vault: 2020-05-18T15:49:16.382Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
vault-gcp-2 vault: 2020-05-18T15:49:17.694Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:17.788Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:17.934Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:18.317Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:18.990Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:19.423Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:21.382Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-gcp-2 vault: 2020-05-18T15:49:21.385Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
vault-gcp-2 vault: 2020-05-18T15:49:26.385Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-gcp-2 vault: 2020-05-18T15:49:26.388Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
vault-gcp-2 vault: 2020-05-18T15:49:27.634Z [INFO]  core.autoseal: seal configuration missing, but cannot check old path as core is sealed: seal_type=recovery
vault-gcp-2 vault: 2020-05-18T15:49:27.695Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:27.791Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:27.935Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:28.318Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:28.992Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:29.424Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:31.388Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-gcp-2 vault: 2020-05-18T15:49:31.391Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
vault-gcp-2 vault: 2020-05-18T15:49:34.894Z [WARN]  core: stored keys supported on init, forcing shares/threshold to 1
vault-gcp-2 vault: 2020-05-18T15:49:34.898Z [INFO]  core: security barrier not initialized
vault-gcp-2 vault: 2020-05-18T15:49:35.332Z [INFO]  core: security barrier initialized: stored=1 shares=1 threshold=1
vault-gcp-2 vault: 2020-05-18T15:49:35.620Z [INFO]  core: post-unseal setup starting
vault-gcp-2 vault: 2020-05-18T15:49:35.756Z [INFO]  replication.wal: no log directories found: root=wal/logs/
vault-gcp-2 vault: 2020-05-18T15:49:36.061Z [INFO]  replication.index.local: no index pages found
vault-gcp-2 vault: 2020-05-18T15:49:36.064Z [INFO]  replication.index.local: no checkpoint found
vault-gcp-2 vault: 2020-05-18T15:49:36.064Z [INFO]  replication.index: no previous commit, forcing re-index: path-type=2
vault-gcp-2 vault: 2020-05-18T15:49:36.373Z [INFO]  replication.index.perf: no index pages found
vault-gcp-2 vault: 2020-05-18T15:49:36.376Z [INFO]  replication.index.perf: no checkpoint found
vault-gcp-2 vault: 2020-05-18T15:49:36.376Z [INFO]  replication.index: no previous commit, forcing re-index: path-type=1
vault-gcp-2 vault: 2020-05-18T15:49:36.669Z [INFO]  core: loaded wrapping token key
vault-gcp-2 vault: 2020-05-18T15:49:36.669Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
vault-gcp-2 vault: 2020-05-18T15:49:36.673Z [INFO]  core: no mounts; adding default mount table
vault-gcp-2 vault: 2020-05-18T15:49:36.693Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
vault-gcp-2 vault: 2020-05-18T15:49:36.693Z [INFO]  core: successfully mounted backend: type=system path=sys/
vault-gcp-2 vault: 2020-05-18T15:49:36.693Z [INFO]  core: successfully mounted backend: type=identity path=identity/
vault-gcp-2 vault: 2020-05-18T15:49:36.758Z [INFO]  core: successfully enabled credential backend: type=token path=token/
vault-gcp-2 vault: 2020-05-18T15:49:36.758Z [INFO]  core: restoring leases
vault-gcp-2 vault: 2020-05-18T15:49:36.758Z [INFO]  rollback: starting rollback manager
vault-gcp-2 vault: 2020-05-18T15:49:36.761Z [INFO]  expiration: lease restore complete
vault-gcp-2 vault: 2020-05-18T15:49:36.791Z [INFO]  identity: entities restored
vault-gcp-2 vault: 2020-05-18T15:49:36.793Z [INFO]  identity: groups restored
vault-gcp-2 vault: 2020-05-18T15:49:36.794Z [INFO]  mfa: configurations restored
vault-gcp-2 vault: 2020-05-18T15:49:36.796Z [INFO]  core: stopping replication
vault-gcp-2 vault: 2020-05-18T15:49:36.796Z [INFO]  core: closed sync connection
vault-gcp-2 vault: 2020-05-18T15:49:36.796Z [INFO]  core: replication stopped
vault-gcp-2 vault: 2020-05-18T15:49:36.796Z [INFO]  core: setting up replication
vault-gcp-2 vault: 2020-05-18T15:49:36.798Z [ERROR] core: a merkle tree reindex is required before replication can be started
vault-gcp-2 vault: 2020-05-18T15:49:36.798Z [INFO]  core: running replication reindexing: diff=false force=false toggle replication=false no flush=false trees=2
vault-gcp-2 vault: 2020-05-18T15:49:36.968Z [INFO]  replication.index: starting storage scan
vault-gcp-2 vault: 2020-05-18T15:49:36.981Z [INFO]  replication.index: key scanning complete: num_keys=10
vault-gcp-2 vault: 2020-05-18T15:49:37.366Z [INFO]  core: replication reindex complete: fixed_pages=0
vault-gcp-2 vault: 2020-05-18T15:49:37.384Z [INFO]  core: replicated cluster information not found or disabled, not activating client
vault-gcp-2 vault: 2020-05-18T15:49:37.384Z [INFO]  core: replication setup finished
vault-gcp-2 vault: 2020-05-18T15:49:37.449Z [WARN]  core: post-unseal upgrade seal keys failed: error="no recovery key found"
vault-gcp-2 vault: 2020-05-18T15:49:37.457Z [INFO]  core: post-unseal setup complete
vault-gcp-2 vault: 2020-05-18T15:49:37.776Z [INFO]  core: root token generated
vault-gcp-2 vault: 2020-05-18T15:49:37.776Z [INFO]  core: pre-seal teardown starting
vault-gcp-2 vault: 2020-05-18T15:49:37.776Z [INFO]  core: stopping replication
vault-gcp-2 vault: 2020-05-18T15:49:37.776Z [INFO]  core: closed sync connection
vault-gcp-2 vault: 2020-05-18T15:49:37.776Z [INFO]  core: replication stopped
vault-gcp-2 vault: 2020-05-18T15:49:37.776Z [INFO]  rollback: stopping rollback manager
vault-gcp-2 vault: 2020-05-18T15:49:37.811Z [INFO]  core: pre-seal teardown complete
vault-gcp-2 vault: 2020-05-18T15:49:37.811Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-gcp-2 vault: 2020-05-18T15:49:38.138Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
vault-gcp-2 vault: 2020-05-18T15:49:38.138Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
vault-gcp-2 vault: 2020-05-18T15:49:38.138Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:444
vault-gcp-2 vault: 2020-05-18T15:49:38.139Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:444
vault-gcp-2 vault: 2020-05-18T15:49:38.139Z [INFO]  core: entering standby mode
vault-gcp-2 vault: 2020-05-18T15:49:38.145Z [INFO]  core: vault is unsealed
vault-gcp-2 vault: 2020-05-18T15:49:38.145Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
vault-gcp-2 vault: 2020-05-18T15:49:38.145Z [WARN]  core: attempted unseal with stored keys, but vault is already unsealed
vault-gcp-2 consul: 2020/05/18 15:49:38 [INFO] agent: Synced check "vault:10.213.95.138:443:vault-sealed-check"

mbrancato avatar May 19 '20 17:05 mbrancato

I was able to reproduce this again on Azure. I'm less convinced this is a cloud platform-specific auto-unseal issue. Below are the logs for all nodes. The second node did not fully unseal.

vault-azure-0 vault: 2020-05-19T21:28:37.352Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
vault-azure-0 vault: 2020-05-19T21:28:40.524Z [INFO]  core.autoseal: seal configuration missing, but cannot check old path as core is sealed: seal_type=recovery
vault-azure-0 vault: 2020-05-19T21:28:40.527Z [INFO]  core: security barrier not initialized
vault-azure-0 vault: 2020-05-19T21:28:40.784Z [WARN]  core: stored keys supported on init, forcing shares/threshold to 1
vault-azure-0 vault: 2020-05-19T21:28:40.786Z [INFO]  core: security barrier not initialized
vault-azure-0 vault: 2020-05-19T21:28:41.180Z [INFO]  core: security barrier initialized: stored=1 shares=1 threshold=1
vault-azure-0 vault: 2020-05-19T21:28:41.306Z [INFO]  core: post-unseal setup starting
vault-azure-0 vault: 2020-05-19T21:28:41.379Z [INFO]  replication.wal: no log directories found: root=wal/logs/
vault-azure-0 vault: 2020-05-19T21:28:41.678Z [INFO]  replication.index.perf: no index pages found
vault-azure-0 vault: 2020-05-19T21:28:41.680Z [INFO]  replication.index.perf: no checkpoint found
vault-azure-0 vault: 2020-05-19T21:28:41.680Z [INFO]  replication.index: no previous commit, forcing re-index: path-type=1
vault-azure-0 vault: 2020-05-19T21:28:42.012Z [INFO]  replication.index.local: no index pages found
vault-azure-0 vault: 2020-05-19T21:28:42.015Z [INFO]  replication.index.local: no checkpoint found
vault-azure-0 vault: 2020-05-19T21:28:42.015Z [INFO]  replication.index: no previous commit, forcing re-index: path-type=2
vault-azure-0 vault: 2020-05-19T21:28:42.087Z [INFO]  core: loaded wrapping token key
vault-azure-0 vault: 2020-05-19T21:28:42.087Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
vault-azure-0 vault: 2020-05-19T21:28:42.090Z [INFO]  core: no mounts; adding default mount table
vault-azure-0 vault: 2020-05-19T21:28:42.139Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
vault-azure-0 vault: 2020-05-19T21:28:42.140Z [INFO]  core: successfully mounted backend: type=system path=sys/
vault-azure-0 vault: 2020-05-19T21:28:42.140Z [INFO]  core: successfully mounted backend: type=identity path=identity/
vault-azure-0 vault: 2020-05-19T21:28:42.287Z [INFO]  core: successfully enabled credential backend: type=token path=token/
vault-azure-0 vault: 2020-05-19T21:28:42.287Z [INFO]  core: restoring leases
vault-azure-0 vault: 2020-05-19T21:28:42.287Z [INFO]  rollback: starting rollback manager
vault-azure-0 vault: 2020-05-19T21:28:42.290Z [INFO]  expiration: lease restore complete
vault-azure-0 vault: 2020-05-19T21:28:42.345Z [INFO]  identity: entities restored
vault-azure-0 vault: 2020-05-19T21:28:42.346Z [INFO]  identity: groups restored
vault-azure-0 vault: 2020-05-19T21:28:42.348Z [INFO]  mfa: configurations restored
vault-azure-0 vault: 2020-05-19T21:28:42.350Z [INFO]  core: stopping replication
vault-azure-0 vault: 2020-05-19T21:28:42.350Z [INFO]  core: closed sync connection
vault-azure-0 vault: 2020-05-19T21:28:42.350Z [INFO]  core: replication stopped
vault-azure-0 vault: 2020-05-19T21:28:42.350Z [INFO]  core: setting up replication
vault-azure-0 vault: 2020-05-19T21:28:42.354Z [ERROR] core: a merkle tree reindex is required before replication can be started
vault-azure-0 vault: 2020-05-19T21:28:42.354Z [INFO]  core: running replication reindexing: diff=false force=false toggle replication=false no flush=false trees=2
vault-azure-0 vault: 2020-05-19T21:28:42.536Z [INFO]  replication.index: starting storage scan
vault-azure-0 vault: 2020-05-19T21:28:42.548Z [INFO]  replication.index: key scanning complete: num_keys=10
vault-azure-0 vault: 2020-05-19T21:28:42.995Z [INFO]  core: replication reindex complete: fixed_pages=0
vault-azure-0 vault: 2020-05-19T21:28:43.035Z [INFO]  core: replicated cluster information not found or disabled, not activating client
vault-azure-0 vault: 2020-05-19T21:28:43.035Z [INFO]  core: replication setup finished
vault-azure-0 vault: 2020-05-19T21:28:43.051Z [WARN]  core: post-unseal upgrade seal keys failed: error="no recovery key found"
vault-azure-0 vault: 2020-05-19T21:28:43.070Z [INFO]  core: post-unseal setup complete
vault-azure-0 vault: 2020-05-19T21:28:43.216Z [INFO]  core: root token generated
vault-azure-0 vault: 2020-05-19T21:28:43.216Z [INFO]  core: pre-seal teardown starting
vault-azure-0 vault: 2020-05-19T21:28:43.216Z [INFO]  core: stopping replication
vault-azure-0 vault: 2020-05-19T21:28:43.216Z [INFO]  core: closed sync connection
vault-azure-0 vault: 2020-05-19T21:28:43.216Z [INFO]  core: replication stopped
vault-azure-0 vault: 2020-05-19T21:28:43.216Z [INFO]  rollback: stopping rollback manager
vault-azure-0 vault: 2020-05-19T21:28:43.349Z [INFO]  core: pre-seal teardown complete
vault-azure-0 vault: 2020-05-19T21:28:43.349Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-azure-0 vault: 2020-05-19T21:28:43.411Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
vault-azure-0 vault: 2020-05-19T21:28:43.412Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
vault-azure-0 vault: 2020-05-19T21:28:43.412Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:444
vault-azure-0 vault: 2020-05-19T21:28:43.412Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:444
vault-azure-0 vault: 2020-05-19T21:28:43.412Z [INFO]  core: entering standby mode
vault-azure-0 vault: 2020-05-19T21:28:43.415Z [INFO]  core: vault is unsealed
vault-azure-0 vault: 2020-05-19T21:28:43.415Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
vault-azure-0 vault: 2020-05-19T21:28:43.415Z [WARN]  core: attempted unseal with stored keys, but vault is already unsealed
vault-azure-0 consul: 2020/05/19 21:28:43 [INFO] agent: Synced check "vault:10.210.1.69:443:vault-sealed-check"
vault-azure-1 vault: 2020-05-19T21:28:37.235Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
vault-azure-1 vault: 2020-05-19T21:28:42.235Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-azure-1 vault: 2020-05-19T21:28:42.380Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
vault-azure-1 vault: 2020-05-19T21:28:42.381Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
vault-azure-1 vault: 2020-05-19T21:28:42.381Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:444
vault-azure-1 vault: 2020-05-19T21:28:42.381Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:444
vault-azure-1 vault: 2020-05-19T21:28:42.381Z [INFO]  core: entering standby mode
vault-azure-1 vault: 2020-05-19T21:28:42.383Z [INFO]  core: vault is unsealed
vault-azure-1 vault: 2020-05-19T21:28:42.383Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
vault-azure-1 consul: 2020/05/19 21:28:42 [INFO] agent: Synced check "vault:10.210.1.70:443:vault-sealed-check"
vault-azure-1 vault: 2020-05-19T21:28:42.433Z [INFO]  core: acquired lock, enabling active operation
vault-azure-1 vault: 2020-05-19T21:28:42.652Z [INFO]  core: post-unseal setup starting
vault-azure-1 vault: 2020-05-19T21:28:42.690Z [INFO]  replication.wal: wal range recovered: first_wal=1 last_wal=10
vault-azure-1 vault: 2020-05-19T21:28:43.034Z [WARN]  replication.index.perf: unexpected number of index pages found: pages_restored=5 expect=256
vault-azure-1 vault: 2020-05-19T21:28:43.040Z [INFO]  replication.index.perf: no checkpoint found
vault-azure-1 vault: 2020-05-19T21:28:43.040Z [INFO]  replication.index: no previous commit, forcing re-index: path-type=1
vault-azure-1 vault: 2020-05-19T21:28:43.399Z [WARN]  replication.index.local: unexpected number of index pages found: pages_restored=6 expect=256
vault-azure-1 vault: 2020-05-19T21:28:43.401Z [INFO]  replication.index.local: checkpoint recovery complete
vault-azure-1 vault: 2020-05-19T21:28:43.401Z [INFO]  replication.wal: prepared to replay logs: diff=18446744073709551610 commit_index=16 last_index=10
vault-azure-1 vault: 2020-05-19T21:28:43.456Z [INFO]  core: loaded wrapping token key
vault-azure-1 vault: 2020-05-19T21:28:43.456Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
vault-azure-1 vault: 2020-05-19T21:28:43.460Z [INFO]  core: successfully mounted backend: type=system path=sys/
vault-azure-1 vault: 2020-05-19T21:28:43.460Z [INFO]  core: successfully mounted backend: type=identity path=identity/
vault-azure-1 vault: 2020-05-19T21:28:43.460Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
vault-azure-1 vault: 2020-05-19T21:28:43.478Z [INFO]  core: successfully enabled credential backend: type=token path=token/
vault-azure-1 vault: 2020-05-19T21:28:43.478Z [INFO]  core: restoring leases
vault-azure-1 vault: 2020-05-19T21:28:43.479Z [INFO]  rollback: starting rollback manager
vault-azure-1 vault: 2020-05-19T21:28:43.482Z [INFO]  expiration: lease restore complete
vault-azure-1 vault: 2020-05-19T21:28:43.484Z [INFO]  identity: entities restored
vault-azure-1 vault: 2020-05-19T21:28:43.485Z [INFO]  identity: groups restored
vault-azure-1 vault: 2020-05-19T21:28:43.486Z [INFO]  mfa: configurations restored
vault-azure-1 vault: 2020-05-19T21:28:43.514Z [INFO]  core: stopping replication
vault-azure-1 vault: 2020-05-19T21:28:43.514Z [INFO]  core: closed sync connection
vault-azure-1 vault: 2020-05-19T21:28:43.514Z [INFO]  core: replication stopped
vault-azure-1 vault: 2020-05-19T21:28:43.514Z [INFO]  core: setting up replication
vault-azure-1 vault: 2020-05-19T21:28:43.517Z [ERROR] core: a merkle tree reindex is required before replication can be started
vault-azure-1 vault: 2020-05-19T21:28:43.517Z [INFO]  core: reindex is running in background, replication will be started once reindex completes
vault-azure-1 vault: 2020-05-19T21:28:43.517Z [INFO]  core: replication setup finished
vault-azure-1 vault: 2020-05-19T21:28:43.517Z [INFO]  core: running replication reindexing: diff=false force=false toggle replication=true no flush=false trees=1
vault-azure-1 vault: 2020-05-19T21:28:43.551Z [WARN]  core: post-unseal upgrade seal keys failed: error="no recovery key found"
vault-azure-1 vault: 2020-05-19T21:28:43.574Z [INFO]  core: post-unseal setup complete
vault-azure-1 vault: 2020-05-19T21:28:43.581Z [INFO]  replication.index: starting storage scan
vault-azure-1 vault: 2020-05-19T21:28:43.611Z [INFO]  replication.index: key scanning complete: num_keys=7
vault-azure-1 consul: 2020/05/19 21:28:43 [INFO] agent: Synced service "vault:10.210.1.70:443"
vault-azure-1 consul: 2020/05/19 21:28:43 [INFO] agent: Synced check "vault:10.210.1.70:443:vault-sealed-check"
vault-azure-1 vault: 2020-05-19T21:28:43.660Z [INFO]  replication.index: re-index all keys, root hash mis-match
vault-azure-1 vault: 2020-05-19T21:28:43.660Z [INFO]  replication.index.perf: re-index all keys, root hash mis-match
vault-azure-1 vault: 2020-05-19T21:28:43.740Z [INFO]  core: replication index repaired, starting replication
vault-azure-1 vault: 2020-05-19T21:28:43.740Z [INFO]  core: stopping replication
vault-azure-1 vault: 2020-05-19T21:28:43.740Z [INFO]  core: closed sync connection
vault-azure-1 vault: 2020-05-19T21:28:43.740Z [INFO]  core: replication stopped
vault-azure-1 vault: 2020-05-19T21:28:43.740Z [INFO]  core: setting up replication
vault-azure-1 vault: 2020-05-19T21:28:43.742Z [INFO]  replication.index: clean merkle tree started
vault-azure-1 vault: 2020-05-19T21:28:43.742Z [WARN]  replication: replication is in a failure state, please reindex the merkle tree
vault-azure-1 vault: 2020-05-19T21:28:43.793Z [INFO]  core: replicated cluster information not found or disabled, not activating client
vault-azure-1 vault: 2020-05-19T21:28:43.793Z [INFO]  core: replication setup finished
vault-azure-1 vault: 2020-05-19T21:28:43.802Z [INFO]  replication.index: clean merkle tree complete: keys_cleaned=0
vault-azure-1 vault: 2020-05-19T21:28:43.832Z [INFO]  core: replication reindex complete: fixed_pages=1
vault-azure-1 vault: 2020-05-19T21:29:46.986Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-azure-1 vault: 2020-05-19T21:30:23.493Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-azure-1 vault: 2020-05-19T21:30:29.217Z [INFO]  core.autoseal: seal configuration missing, not initialized: seal_type=recovery
vault-azure-2 vault: 2020-05-19T21:28:37.200Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
vault-azure-2 vault: 2020-05-19T21:28:42.200Z [INFO]  core: stored unseal keys supported, attempting fetch
vault-azure-2 vault: 2020-05-19T21:28:42.354Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
vault-azure-2 vault: 2020-05-19T21:28:42.354Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
vault-azure-2 vault: 2020-05-19T21:28:42.354Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:444
vault-azure-2 vault: 2020-05-19T21:28:42.355Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:444
vault-azure-2 vault: 2020-05-19T21:28:42.355Z [INFO]  core: entering standby mode
vault-azure-2 vault: 2020-05-19T21:28:42.356Z [INFO]  core: vault is unsealed
vault-azure-2 vault: 2020-05-19T21:28:42.356Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
vault-azure-2 consul: 2020/05/19 21:28:42 [INFO] agent: Synced check "vault:10.210.1.71:443:vault-sealed-check"

Commands run on the second node (vault-azure-1) show opposing seal states:

$ curl -sk https://localhost:8200/v1/sys/seal-status | jq
{
  "type": "azurekeyvault",
  "initialized": false,
  "sealed": true,
  "t": 0,
  "n": 0,
  "progress": 0,
  "nonce": "",
  "version": "",
  "migration": false,
  "recovery_seal": true,
  "storage_type": "consul"
}
$ curl -sk https://localhost:8200/v1/sys/health | jq
{
  "initialized": true,
  "sealed": false,
  "standby": false,
  "performance_standby": false,
  "replication_performance_mode": "disabled",
  "replication_dr_mode": "disabled",
  "server_time_utc": 1589925036,
  "version": "1.3.4+prem",
  "cluster_name": "vault-cluster-01234567",
  "cluster_id": "591ea3b4-54dc-41a1-b3d6-6c6cc7544e62",
  "last_wal": 13
}

And this is what it looks like behind a load balancer that has sealedcode=200&uninitcode=200 on the health probe URL:

$ vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.3.4+prem
Cluster Name             vault-cluster-01234567
Cluster ID               591ea3b4-54dc-41a1-b3d6-6c6cc7544e62
HA Enabled               true
HA Cluster               https://10.210.1.70:444
HA Mode                  active
Last WAL                 13
$ vault status
Key                      Value
---                      -----
Recovery Seal Type       azurekeyvault
Initialized              false
Sealed                   true
Total Recovery Shares    0
Threshold                0
Unseal Progress          0/0
Unseal Nonce             n/a
Version                  n/a
HA Enabled               true

mbrancato avatar May 19 '20 22:05 mbrancato

Yeah, I suspect it's a general problem (not azure specific) related to bringing online many nodes for a new cluster at once. Usually (at least pre-k8s) people bring online one node, then init, then add nodes, rather than what you're doing here. My theory is that physical.Cache.Get is caching the nil value for the seal config because it's called prior to the init, and the cache entry isn't getting updated because a different node sees the init request. I think we should add the seal config to cacheExceptionPaths.

ncabatoff avatar May 20 '20 13:05 ncabatoff

Typically in these scenarios i try to reprovision all nodes completely anew (no local raft data) - assuming ofc that there is a leader node and that it is initialised - I then proceed to rejoin the others one at a time which typically resolve any other general issues that may be specific to all other standby nodes.

Hey @mbrancato - I was wondering is this issue still applicable for you and have you rested your flows in more recent versions?

aphorise avatar Sep 03 '22 14:09 aphorise

@aphorise no offense, but I think you may have missed the context here. These were newly provisioned VMs from a cold boot. They all boot at once and should auto-unseal - so we cannot rejoin others one at a time. They are all clones of each other using scripts / VM images initially provided by HashiCorp, but modified to work in our environment at the time. They were running on GCP managed instance groups and Azure virtual machine scale sets. These both work similarly where you have 1 VM image, and you scale a cluster to N nodes - it is the responsibility of each node to bootstrap and if part of a cluster, figure out how to join. When you start up a cluster, new VMs are created, all at once. There was some race happening here where sometimes, a node would join the cluster but not auto-unseal. As part if plans to use immutable infrastructure as well, these VMs were going to be replaced frequently.

Anyway, I'm not sure if this has ever been specifically fixed or investigated. At the time, we demoed and showed this to the HashiCorp account team. I'm no longer working on this.

mbrancato avatar Sep 03 '22 15:09 mbrancato

No offence taken :+1:

The point I was trying to make is that my experience as long as you start out with a leader on the cluster then scaling up / down from that 1 (+2 N) to 3, 5, 7 may be easier than trying to init / join everything together from the onset all at once.

@mbrancato - thanks for the update. Do you know if anything further is pending here?

aphorise avatar Sep 05 '22 20:09 aphorise

Hey @mbrancato if nothing further is expected here then let us know if your ok to proceed with closing this issue.

aphorise avatar Sep 17 '22 19:09 aphorise

Just to clarify, I opened this bug after confirming with our enterprise account team, now years ago. I'm no longer in that role but AFAIK this was never fixed. It does appear to be a valid race condition somewhere in auto-unseal. It's not impacting me currently because I've moved on, so in my opinion HashiCorp can handle this issue however it prefers.

mbrancato avatar Sep 17 '22 19:09 mbrancato

@mbrancato Thanks for the update! We had issues reproing the issue initially; @ncabatoff has a PR that may fix the issue, however, because we couldn't repro it we aren't 100% sure. We're willing to keep working on this on a "best effort" basis, and if anyone else who's also seeing this issue can help us reproduce it, that'd sure help a lot. Thanks!

heatherezell avatar Nov 04 '22 19:11 heatherezell

Yeah, I suspect it's a general problem (not azure specific) related to bringing online many nodes for a new cluster at once. Usually (at least pre-k8s) people bring online one node, then init, then add nodes, rather than what you're doing here. My theory is that physical.Cache.Get is caching the nil value for the seal config because it's called prior to the init, and the cache entry isn't getting updated because a different node sees the init request. I think we should add the seal config to cacheExceptionPaths.

We have merged PR #21223 with the fix suggested above.

Thank you again @mbrancato for filing this issue.

victorr avatar Jun 19 '23 14:06 victorr