Vault disk full failure modes
When Vault is configured with a file-based audit backend and the disk fills up, the Vault leader becomes unhealthy but fails to step down despite having standby instances available.
Environment:
- Vault Version: 0.8.2
- Operating System/Architecture: CentOS x64
Vault Config File: N/A
Startup Log Output: N/A
Expected Behavior: When a Vault leader is no longer able to serve requests, it should step down and allow a standby to serve requests.
Actual Behavior:
A full disk prevented Vault from writing to the audit log, but it did not automatically step down. Issuing a manual vault step-down also failed because Vault was unable to write to the audit log. Killing the Vault leader process was the only way to recover.
Steps to Reproduce:
- Create a Vault HA cluster
- Enable the file audit backend (see the example command after this list)
- Fill the disk (e.g. cat /dev/urandom > /path/to/audit/log/volume/temp.bin)
- Observe the Vault leader failing to serve requests (e.g. vault read secret/foo)
- Observe that the leader cannot step down cleanly (e.g. vault step-down)
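For reference, enabling a file audit device with the current CLI looks roughly like the following; the file path is only an illustration, adjust it for your environment:

# enable the file audit device, ideally on a dedicated volume
vault audit enable file file_path=/var/log/vault_audit.log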
Important Factoids: N/A
References: N/A
Affects us with Vault version 1.5.5 as well
Just ran into this on Vault 1.7.2, definitely still an issue.
I ran into this on 1.7.2 as well. Not sure if the issue has been fixed in later versions.
Hey, aren't these typically OS / platform level concerns? If there are ELK beats or Prometheus exporters specific to the host/OS environment of the service (Vault), then any system excesses are monitored and alerted on separately.
There are also non-file outputs (socket) and stdout logging, as in the case of containers, where a full disk would be unlikely.
Another preventative measure could be rotating logs more frequently, especially since each rotation takes less time (CPU) when performed more often than once per day on an entire day's worth of logs.
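As a rough sketch of more frequent rotation for the file audit device (the path, retention, and scheduling below are assumptions; Vault re-opens the audit log file when sent SIGHUP):

# hypothetical logrotate snippet for the file audit device
cat > /etc/logrotate.d/vault-audit <<'EOF'
/var/log/vault_audit.log {
    rotate 24
    compress
    missingok
    notifempty
    postrotate
        /usr/bin/pkill -HUP -x vault
    endscript
}
EOF
# run logrotate from an hourly cron/systemd timer instead of the daily default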
I think reporting this consistently may be a challenge, particularly in cases where it's not possible to get accurate size reports, or where fluctuations in reported usage are not caused by audits or even by Vault. If it were possible, one way of doing it could be to combine available, free, and used space (audit + other Vault data) into some rough measure wherever storage details are available.
Hey aren't these typically OS / platform level concerns?
Yes, but the specific behaviour I'm expecting here is this: when Vault attempts to write to the audit log, encounters an error, and therefore refuses to service the request because audit logging failed, it should then trigger a leader election.
I'm not expecting Vault to be monitoring disk space/etc, just to step down as a leader if it encounters a fatal audit log error.
In my particular instance, regular monitoring and logrotate settings were fine for normal use; the issue only came up when some Vault clients had bugs that made a large number of requests in a very short amount of time (rate limit configuration is something that would also help a lot here). For example, https://github.com/hashicorp/vault/issues/12566 is one such issue I found with Vault Agent.
In the above example, if the server had stepped down as leader, it would at least have allowed a longer window before a total service outage occurred, by which time the bad clients could have been addressed or rate limits implemented.
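As a rough illustration of the rate-limit point (the quota name and rate below are placeholders; rate-limit quotas exist in Vault 1.5+):

# create a global rate-limit quota of 500 requests per second
vault write sys/quotas/rate-limit/global rate=500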
Another reason this would be good to have: in some virtualised environments, it's not unheard of for a failing VM host or similar event to cause a VM's disks to flip to read-only (after a few errors, Linux tends to give up and remount the filesystem read-only). I've thankfully not run into this on a Vault server before, but I have on a number of other VMs, and I'd expect that from Vault's perspective the log filesystem becoming read-only would be more or less the same as running out of disk space. Having to wait for an operator to manually kill a bad Vault VM rather than having it fail over automatically is less than ideal.
Note that in my instance I'm using Consul as the Vault storage backend, so other storage considerations were not an issue. I'm not sure what the current failure mode is when Vault is unable to write to the raft folder, but I imagine users would expect failures of raft storage to behave similarly (if they don't already). I'd like to imagine anyone running Vault with raft storage would keep separate filesystems for logs vs raft data, so that logs filling up cannot impact raft (in my setup this is also the case for logs vs the rest of the filesystem), but the other considerations still apply.
I think the important piece of this issue is that the Audit Log feature is both critical and blocking. Vault should not serve any requests if it is unable to accurately maintain the audit log. My opinion is this SHOULD include attempted requests because those are still relevant even if Vault was unable to fulfill them.
If Vault can't write to the audit log (disk full, disk failure, config error, etc.) then it should no longer be trusted to continue running as the leader node and should halt (e.g. os.Exit(74) // EX_IOERR). A graceful step-down would be nicer, but that includes writing the step-down event to the audit log... Since we can't rely on being able to write a clear error message to the logs, I think the best option is to exit with a well-defined error code to aid in debugging.
The longer Vault runs unsealed in a bad state without being able to log audit events, the bigger the risk becomes. I think we should fail closed here. This has the added advantage of better uptime for HA clusters as well.
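Making Vault itself halt would need a code change; in the meantime, here is a very rough operator-side approximation of the fail-closed idea above, sketched as an external watchdog. The probe path kv/health-probe and the matched error string are assumptions and will differ by setup and Vault version.

#!/usr/bin/env bash
# Hypothetical watchdog: if the active node fails a request and the error
# output mentions the audit log, kill the local Vault process so a standby
# can take over (this automates the manual recovery described in this issue).
set -u
if ! out=$(vault kv get -format=json kv/health-probe 2>&1); then
    if echo "$out" | grep -qi "audit"; then
        logger -t vault-watchdog "audit failure on active node; killing vault"
        pkill -x vault
    fi
fi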
Greetings! I appreciate all of the valuable input in this thread and will do my best to close the loop here. I have gone ahead and removed the bug label in favor of enhancement, as this appears to be a request for an improvement to the way Vault currently (intentionally) handles a failure scenario.
Before going into much detail about the rationale, I would like to point out the [currently subtle] recommendation to enable multiple audit devices, which should prevent this situation from being an issue.
I have a docs PR in progress (yet to be opened), which should improve the clarity around enabling audit devices and support the recommendation of having more than one audit device in production.
The assertion that Vault should fail closed when Vault fails to audit is completely justified and valid. The current behavior is attempting to achieve that, though it may fall short in some areas. There has been some internal discussion about the ways to solve some of the deficiencies with the current approach. Further design/discussion/prioritization needs to take place before any solution can be made into a reality.
Since this issue can immediately be resolved by increasing the redundancy of audit solutions, I am inclined to reclassify this as a UX "enhancement." Please stay tuned for the docs PR and any other future updates!
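For concreteness, a minimal sketch of the multiple-audit-device recommendation; the device paths, file locations, and syslog settings are illustrative only:

# primary file audit device on its own volume
vault audit enable file file_path=/var/log/vault/audit.log
# secondary file audit device on a different volume, under a distinct path
vault audit enable -path=file_secondary file file_path=/mnt/audit2/audit.log
# and/or a non-file device such as syslog
vault audit enable syslog tag=vault facility=AUTH

Vault only blocks a request when it cannot write to any enabled audit device, so a second device on separate storage keeps requests flowing if one disk fills up.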
Still an issue with version 1.12.1+ent.
When an audit device is configured (a file on the local disk) and that file is made unavailable, Vault stops working, but the health check still reports okay:
$ vault kv get kv/my-1
Error making API request.
URL: GET http://127.0.0.1:8200/v1/sys/internal/ui/mounts/kv/my-1
Code: 500. Errors:
* local node not active but active cluster node not found
The health-check returns healthy:
$ curl -v "$VAULT_ADDR/v1/sys/health?standbycode=200"
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8200 (#0)
> GET /v1/sys/health?standbycode=200 HTTP/1.1
> Host: 127.0.0.1:8200
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Cache-Control: no-store
< Content-Type: application/json
< Strict-Transport-Security: max-age=31536000; includeSubDomains
< Date: Thu, 05 Jan 2023 09:03:51 GMT
< Content-Length: 378
<
{"initialized":true,"sealed":false,"standby":true,"performance_standby":false,"replication_performance_mode":"unknown","replication_dr_mode":"unknown","server_time_utc":1672909431,"version":"1.12.1+ent","cluster_name":"vault_one_nodes","cluster_id":"4b9002db-a28b-982f-6413-3648d557cbd6","license":{"state":"autoloaded","expiry_time":"2023-01-28T07:04:53Z","terminated":false}}
* Connection #0 to host 127.0.0.1 left intact
I would like the health check to report a non-200 code here, maybe 502, since that code isn't currently used.
- Doing a discouraged feature/flag like VAULT_AUDIT_ERRORS_IGNORE, and/or an equivalent HCL parameter to the same effect, may be an option here (use at your own risk).
- On the HTTP response (200) of /v1/sys/health, an equivalent query-string parameter like ?auditerrorcode=503 that takes higher precedence could be provided (more pressing than ?standbycode=200); see the example request below.
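To make the second suggestion concrete, a sketch of how the probe might look; standbycode is an existing documented parameter, while auditerrorcode does not exist today and is purely the proposal above:

# existing: tune the status code returned for a standby node
curl "$VAULT_ADDR/v1/sys/health?standbycode=200"
# proposed (hypothetical): also surface audit-device failures as 503
curl "$VAULT_ADDR/v1/sys/health?standbycode=200&auditerrorcode=503"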
Maybe PRs can be drafted for these enhancements (ignoring audit failures & reporting them on the health check).