Significantly increased DynamoDB write usage after upgrading from Vault 1.14.1 to 1.15.4
Describe the bug
After upgrading Vault from 1.14.1 to 1.15.4, we noticed significantly increased DynamoDB write usage, which also leads to service interruptions where Vault does not respond to HTTP requests.
We have run Vault for years without any major configuration changes, on 3 EC2 instances in an Auto Scaling group with an ALB in front and DynamoDB as the storage backend.
Before the version upgrade we never experienced issues with a provisioned write capacity of 20 and autoscaling enabled up to 100. Since the upgrade we see regular spikes to several hundred write units. We increased the provisioned write capacity to 50 and allowed autoscaling up to 1000, but the autoscaling reacts too slowly, so every spike causes a service interruption.
As part of the investigation, we added max_parallel to the storage configuration:
storage "dynamodb" {
ha_enabled = "true"
region = "eu-central-1"
table = "xxxxxxxxxxx"
max_parallel = "25"
}
but it didn't improve the write usage.
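My understanding is that max_parallel only caps the number of concurrent requests each Vault node sends to DynamoDB, not the aggregate write rate, so with 3 instances the cluster could still issue up to 3 × 25 concurrent calls. A sketch of one way to test that assumption (placeholder table name, per-node value lowered so the cluster-wide total stays near 25):

```hcl
# Hypothetical test: divide the desired cluster-wide concurrency by the
# number of Vault nodes (3 here), since max_parallel applies per node.
storage "dynamodb" {
  ha_enabled   = "true"
  region       = "eu-central-1"
  table        = "xxxxxxxxxxx"
  max_parallel = "8"   # roughly 25 concurrent requests across 3 nodes
}
```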
We have already set log_level = "trace" but have not yet found a trigger for the spikes.
Maybe it's a background task?
The DynamoDB table is not that big: item count 148,682, table size 68.6 MB. 6 Kubernetes clusters are connected, 4 of them with Consul PKI. Vault client count: 247.
During such a spike, which lasts up to an hour, the following keys of the DynamoDB table are accessed:
Expected behavior
No spikes in write usage, or at least some log messages to get an idea of what is happening at that time. The max_parallel setting in the DynamoDB storage config should be respected.
We had similar issues: we upgraded Vault from 0.10 to 1.13.3 and saw a massive increase in DynamoDB reads and writes. After upgrading to 1.15.4, writes went through the roof.
We are seeing a similar issue: we went from 1.7.0 to 1.15.6 and saw a big spike in write usage coupled with write-throttled requests, which (I believe) are causing:
```
Mar 05 17:21:46 ip-10-6-1-48 vault[416]: 2024-03-05T17:21:46.875Z [ERROR] storage.dynamodb: error renewing leadership lock:
Mar 05 17:21:46 ip-10-6-1-48 vault[416]: error=
Mar 05 17:21:46 ip-10-6-1-48 vault[416]:   | RequestError: send request failed
Mar 05 17:21:46 ip-10-6-1-48 vault[416]:   | caused by: Post "https://dynamodb.us-east-1.amazonaws.com/": read tcp 10.6.1.48:37918->3.218.180.13:443: read: connection reset by peer
Mar 05 17:21:46 ip-10-6-1-48 vault[416]:
Mar 05 19:21:51 ip-10-6-1-48 vault[416]: 2024-03-05T19:21:51.876Z [WARN] core: leadership lost, stopping active operation
```
This drops leadership; the secondary takes over, but by then any requests pending against Vault have failed and must be redone. It's possible that our write capacity is simply too low, but nothing like this ever occurred on the old version, and rolling back to it (using a pre-upgrade backup of the DynamoDB table) works cleanly. This seems directly related.
We have DynamoDB set up with provisioned scaling, minimum 5 write units and maximum 100 write units. At rest it sits around 0.1 units. Since this upgrade we see a spike up to 30 units (with accompanying throttles) almost instantly, within 1 minute; with 1-minute metric aggregation, for all we know it happens within a second. Presumably our throttling comes from the rate of these writes, and hopefully we can just increase our baseline units to accommodate, but that means massively over-provisioning for the occasions when we deploy or use Vault in this way, which is a waste of money during the inactive periods.
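For context, our write-capacity autoscaling is roughly the following (a Terraform sketch; table name, capacity bounds, and target utilization are illustrative placeholders, not a recommendation):

```hcl
# Illustrative Terraform for DynamoDB write-capacity autoscaling on the Vault table.
resource "aws_appautoscaling_target" "vault_table_write" {
  service_namespace  = "dynamodb"
  resource_id        = "table/vault-data"   # placeholder table name
  scalable_dimension = "dynamodb:table:WriteCapacityUnits"
  min_capacity       = 5
  max_capacity       = 100
}

resource "aws_appautoscaling_policy" "vault_table_write" {
  name               = "vault-table-write-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.vault_table_write.service_namespace
  resource_id        = aws_appautoscaling_target.vault_table_write.resource_id
  scalable_dimension = aws_appautoscaling_target.vault_table_write.scalable_dimension

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "DynamoDBWriteCapacityUtilization"
    }
    target_value = 70
  }
}
```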
Hello folks, we have also found similar issues when upgrading to versions > 1.15.x. We are currently running 1.16.2, and from what we can tell the issue is getting worse over time.
As soon as we bumped Vault from 1.14.4 to 1.15.0, we started seeing throttling of write requests on DynamoDB.
Deep-diving into Vault metrics, more specifically vault_core_handle_requests, we noticed a new peak on 6 February 2024, right after bumping to Vault 1.15.5. Since that date our latency has gotten steadily worse.
It is worth mentioning that other metrics such as vault_core_check_token and vault_core_fetch_acl_and_token also reflect the latency increase. I don't know whether this is a symptom or the root cause.
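For anyone who wants to watch the same counters, a minimal telemetry stanza to expose them looks roughly like this (the retention value is just an example):

```hcl
# Example Vault server telemetry stanza; the counters mentioned above can then
# be scraped in Prometheus format from the sys/metrics endpoint.
telemetry {
  prometheus_retention_time = "60s"
  disable_hostname          = true
}
```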
Considerations
- We are currently using DynamoDB On-Demand;
- We have not set max_parallel for the DynamoDB storage;
- The traffic we receive in Vault is pretty much the same as before;
- The number of existing entities is ~32k
Any advice would be extremely helpful. Thanks for the help
Same problem. Not sure if max_parallel is actually working as expected. We have it set to 30 with 3 Vault instances, but we are throttled big time. It feels like max_parallel is applied per Vault instance, and if you restart all three, they all try to read secrets at once. We are on the community version.
What helps is increasing the RCU and WCU for Vault's table.
Either there is a bug in Lock (https://github.com/hashicorp/vault/blob/2db2a9fb5d890b213a2c05aa5698c63560399774/physical/dynamodb/dynamodb.go#L938) or max_parallel is not well understood.
Our DynamoDB table is ~400 KB and 500 items.
DynamoDB scaling is slow. Our high-level understanding: the more items in Vault, the more RCU/WCU are needed. The question is why this is so noticeable in the latest versions.
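If the table is managed with Terraform, raising the baseline capacity looks roughly like this (a sketch; name and capacity numbers are placeholders, and the key schema follows what the Vault DynamoDB backend creates, if I remember correctly):

```hcl
# Illustrative Terraform for bumping the baseline RCU/WCU of the Vault table.
resource "aws_dynamodb_table" "vault" {
  name           = "vault-data"     # placeholder table name
  billing_mode   = "PROVISIONED"
  read_capacity  = 50
  write_capacity = 50
  hash_key       = "Path"
  range_key      = "Key"

  attribute {
    name = "Path"
    type = "S"
  }

  attribute {
    name = "Key"
    type = "S"
  }
}
```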
Same issue is happening in 1.17.2
Just upgraded to 1.15.5 and suffering from the same issue
We tested v1.18.2 in a test environment and still see the same DynamoDB throttling bug.