
DynamoDB backed HA fails to release locks

Open dhumphries-sainsburys opened this issue 10 months ago • 0 comments

Describe the bug Our clusters use S3 for storage and DynamoDB for ha_storage in a 3-replica configuration on EKS. We have had a few instances where an underlying node failed, and when Vault happened to be running on that host, Vault itself failed. It appears that whenever the master node is disabled by any method other than someone terminating the pod, the other replicas fail to take control, and as a result Vault stops working. This looks to be due to the lock record in DynamoDB preventing a new node from taking over, despite the fact that the code suggests a TTL of 15s for it.
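To make the expectation concrete, here is a minimal model of how a TTL-based lock would be expected to behave. It is a sketch only and is not Vault's DynamoDB implementation: the `LockRecord` type, the `try_acquire` function, and the node names `vault-0` and `vault-1` are hypothetical, the 15s value is the TTL mentioned in this report, and a single shared clock is assumed.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the 15s TTL mentioned in this report.
LOCK_TTL_SECONDS = 15.0


@dataclass
class LockRecord:
    holder: str        # node that currently owns the lock (hypothetical names)
    expires_at: float  # time after which the lock counts as abandoned


def try_acquire(current: Optional[LockRecord], candidate: str, now: float) -> Optional[LockRecord]:
    """Return a fresh lock record if the candidate may take the lock, else None."""
    free = current is None
    renewing = current is not None and current.holder == candidate
    expired = current is not None and now >= current.expires_at
    if free or renewing or expired:
        return LockRecord(holder=candidate, expires_at=now + LOCK_TTL_SECONDS)
    return None


# The active node takes the lock at t=0, then its host dies without releasing it.
lock = try_acquire(None, "vault-0", now=0.0)

# A standby retrying 5s later is still blocked, which is correct.
blocked = try_acquire(lock, "vault-1", now=5.0)

# Once the TTL has elapsed with no renewal, the standby should win.
taken = try_acquire(lock, "vault-1", now=16.0)
```

Under this model a standby is locked out for at most one TTL after the last renewal. The behavior reported here looks as if the expiry branch never becomes true for the standbys.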

Things we have tried, all of which resulted in Vault failing:

  • Removing network interfaces from underlying hosts
  • Removing security groups
  • Disabling kubelet on underlying host

To Reproduce Steps to reproduce the behavior:

  1. Install vault using the included config
  2. Disable the underlying host that runs the current master by some means (removing the security groups that allow communication is probably the easiest)
  3. See that the other Vault nodes do not take over (I timed up to 30 minutes, but the docs seem to suggest a 15s TTL)
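Step 3 can be timed with a small poller run against one of the standby nodes. This is a rough sketch, not an official tool: it assumes the standby's `/v1/sys/leader` endpoint is reachable without a token and returns the `ha_enabled`, `is_self`, and `leader_address` fields, and it disables TLS verification in the same spirit as the `-tls-skip-verify` flag used below. The function names are illustrative.

```python
import json
import ssl
import time
import urllib.request


def leader_summary(body: str) -> tuple:
    """Extract (ha_enabled, is_self, leader_address) from a sys/leader JSON body."""
    data = json.loads(body)
    return (
        bool(data.get("ha_enabled")),
        bool(data.get("is_self")),
        data.get("leader_address", ""),
    )


def fetch_leader(addr: str, timeout: float = 3.0) -> tuple:
    """Query one node's leader status, skipping TLS verification."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before verify_mode is relaxed
    ctx.verify_mode = ssl.CERT_NONE
    url = f"{addr}/v1/sys/leader"
    with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
        return leader_summary(resp.read().decode("utf-8"))


def wait_for_takeover(addr: str, max_wait: float = 1800.0, interval: float = 1.0):
    """Return seconds until the node at addr reports itself active, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        try:
            _, is_self, _ = fetch_leader(addr)
            if is_self:
                return time.monotonic() - start
        except OSError:
            pass  # node unreachable or HTTP error; keep polling
        time.sleep(interval)
    return None


# Example (run on a standby right after disabling the master's host):
# elapsed = wait_for_takeover("https://127.0.0.1:8200")
```

With a 15s TTL the returned value would be expected to be well under a minute. In the scenario described here it would instead hit `max_wait` and return `None`.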

Expected behavior Within 15 seconds of the master Vault becoming unavailable or unable to service requests, one of the other replicas takes over.

Environment:

  • Vault Server Version (retrieve with vault status):
/vault/config # vault status -tls-skip-verify
WARNING! VAULT_ADDR and -address unset. Defaulting to https://127.0.0.1:8200.
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           5
Threshold              3
Version                1.16.1
Build Date             2024-04-03T12:35:53Z
Storage Type           s3
Cluster Name           vault-cluster-9c8fea11
Cluster ID             52acdb68-3327-bed2-f56a-f1eab38f7dbd
HA Enabled             true
HA Cluster             https://vault:8201
HA Mode                standby
Active Node Address    https://vault:8200
  • Vault CLI Version (retrieve with vault version):
/vault/config # vault version
Vault v1.16.1 (6b5986790d7748100de77f7f127119c4a0f78946), built 2024-04-03T12:35:53Z
  • Server Operating System/Architecture: bottlerocket-aws-k8s-1.28-x86_64-v1.19.4-4f0a078e

Vault server configuration file(s):

{
  "api_addr": "https://vault:8200",
  "default_lease_ttl": "4320h",
  "ha_storage": {
    "dynamodb": {
      "ha_enabled": "true",
      "region": "eu-west-1",
      "table": "vault-ha-storage"
    }
  },
  "listener": [
    {
      "tcp": {
        "address": "0.0.0.0:8200",
        "tls_cert_file": "/vault/tls/server.crt",
        "tls_key_file": "/vault/tls/server.key"
      }
    }
  ],
  "max_lease_ttl": "4320h",
  "service_registration": {
    "kubernetes": {
      "namespace": "vault"
    }
  },
  "storage": {
    "s3": {
      "bucket": "vault-lab-ie-core",
      "region": "eu-west-1"
    }
  },
  "telemetry": {
    "statsd_address": "localhost:9125"
  },
  "ui": true
}


dhumphries-sainsburys, Apr 22 '24 16:04