DynamoDB backed HA fails to release locks
Describe the bug Our clusters use S3 for storage and DynamoDB for ha_storage, in a 3-replica configuration on EKS. We have had a few instances where an underlying node failed while Vault was running on it, and each time Vault itself failed. It appears that whenever the master (active) node is disabled by any means other than someone terminating the pod, the other replicas fail to take control and Vault stops working. This looks to be caused by the lock entry in DynamoDB preventing a new node from taking over, even though the code suggests a TTL of 15s for that lock.
Things we have tried, all of which resulted in Vault failing:
- Removing network interfaces from underlying hosts
- Removing security groups
- Disabling kubelet on underlying host
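To make the expectation concrete, here is a minimal sketch of the takeover rule we assumed a 15s lock TTL implies. It is not Vault's DynamoDB backend code: `LockRecord`, `try_acquire`, and the node names `vault-0` and `vault-1` are hypothetical, and the clock is just a number of seconds.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

LOCK_TTL_SECONDS = 15.0  # TTL value mentioned in this report


@dataclass(frozen=True)
class LockRecord:
    """Hypothetical stand-in for the lock item kept in the HA table."""
    holder: str
    expires_at: float  # seconds on an arbitrary clock


def try_acquire(current: Optional[LockRecord], candidate: str, now: float,
                ttl: float = LOCK_TTL_SECONDS) -> Tuple[LockRecord, bool]:
    """Return (lock, acquired).

    A lock held by another node blocks the candidate only while it is
    unexpired; once `now` reaches `expires_at`, the candidate takes over.
    """
    if current is not None and current.holder != candidate and now < current.expires_at:
        return current, False
    return LockRecord(holder=candidate, expires_at=now + ttl), True


if __name__ == "__main__":
    lock, _ = try_acquire(None, "vault-0", now=0.0)
    _, ok = try_acquire(lock, "vault-1", now=10.0)
    print("takeover at 10s:", ok)   # lock still live, so this is refused
    lock, ok = try_acquire(lock, "vault-1", now=16.0)
    print("takeover at 16s:", ok, lock.holder)
```

Under this rule a standby should win the lock shortly after the old holder stops renewing it, which is not what we observe.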
To Reproduce Steps to reproduce the behavior:
- Install vault using the included config
- Disable the underlying host that runs the current master (removing the security groups that allow communication is probably the easiest way)
- See that the other Vault replicas do not take over (I timed it for up to 30 minutes, but the docs seem to suggest a 15s TTL)
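The timing in the last step can be reproduced with a small poller run from inside a standby pod. This is a sketch under assumptions: it polls the `/v1/sys/leader` endpoint and waits for `is_self` to become true, it uses the default address shown in the `vault status` warning, and it skips TLS verification to mirror `-tls-skip-verify`. The function names and the injectable `clock`/`sleep` parameters are illustrative only.

```python
import json
import ssl
import time
import urllib.request
from typing import Callable, Optional


def fetch_leader(addr: str = "https://127.0.0.1:8200") -> dict:
    """Read leader status from the local Vault node (TLS verification disabled)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(f"{addr}/v1/sys/leader", context=ctx, timeout=5) as resp:
        return json.load(resp)


def seconds_until_active(fetch: Callable[[], dict] = fetch_leader,
                         timeout: float = 1800.0,   # 30 minutes, as timed in this report
                         interval: float = 1.0,
                         clock: Callable[[], float] = time.monotonic,
                         sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Poll until this node reports itself as leader.

    Returns the elapsed seconds, or None if the timeout passes first.
    """
    start = clock()
    while clock() - start < timeout:
        try:
            if fetch().get("is_self"):
                return clock() - start
        except (OSError, ValueError):
            pass  # node unreachable or bad response; keep polling
        sleep(interval)
    return None
```

Start `seconds_until_active()` on a standby just before disabling the master's host; a result of `None` after the full timeout matches the behavior described here.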
Expected behavior Within 15 seconds of the master Vault becoming unavailable or unable to service requests, one of the other replicas takes over.
Environment:
- Vault Server Version (retrieve with vault status):
/vault/config # vault status -tls-skip-verify
WARNING! VAULT_ADDR and -address unset. Defaulting to https://127.0.0.1:8200.
Key Value
--- -----
Seal Type shamir
Initialized true
Sealed false
Total Shares 5
Threshold 3
Version 1.16.1
Build Date 2024-04-03T12:35:53Z
Storage Type s3
Cluster Name vault-cluster-9c8fea11
Cluster ID 52acdb68-3327-bed2-f56a-f1eab38f7dbd
HA Enabled true
HA Cluster https://vault:8201
HA Mode standby
Active Node Address https://vault:8200
- Vault CLI Version (retrieve with vault version):
/vault/config # vault version
Vault v1.16.1 (6b5986790d7748100de77f7f127119c4a0f78946), built 2024-04-03T12:35:53Z
- Server Operating System/Architecture:
bottlerocket-aws-k8s-1.28-x86_64-v1.19.4-4f0a078e
Vault server configuration file(s):
{"api_addr":"https://vault:8200","default_lease_ttl":"4320h","ha_storage":{"dynamodb":{"ha_enabled":"true","region":"eu-west-1","table":"vault-ha-storage"}},"listener":[{"tcp":{"address":"0.0.0.0:8200","tls_cert_file":"/vault/tls/server.crt","tls_key_file":"/vault/tls/server.key"}}],"max_lease_ttl":"4320h","service_registration":{"kubernetes":{"namespace":"vault"}},"storage":{"s3":{"bucket":"vault-lab-ie-core","region":"eu-west-1"}},"telemetry":{"statsd_address":"localhost:9125"},"ui":true}