vault icon indicating copy to clipboard operation
vault copied to clipboard

Add option 'elide_list_responses' to audit backends

Open maxb opened this issue 1 year ago • 9 comments

This PR relates to a feature request logged through HashiCorp commercial support: https://app.asana.com/0/112957393038545/1201850823939444/f

Vault lacks pagination in its APIs. As a result, certain list operations can return very large responses. The user's chosen audit sinks may experience difficulty consuming audit records that swell to tens of megabytes of JSON.

In our case, one of the systems consuming audit log data could not cope, and failed.

The responses of list operations are typically not very interesting, as they are mostly lists of keys, or, even when they include a "key_info" field, are not returning confidential information. They become even less interesting once HMAC-ed by the audit system.

Some example Vault "list" operations that are prone to becoming very large in an active Vault installation are:

auth/token/accessors/
identity/entity/id/
identity/entity-alias/id/
pki/certs/

In response, I've coded a new option that can be applied to audit backends, elide_list_responses. When enabled, response data is elided from audit logs, only when the operation type is "list".

For added safety, the elision only applies to the "keys" and "key_info" fields within the response data - these are conventionally the only fields present in a list response - see logical.ListResponse, and logical.ListResponseWithInfo. However, other fields are technically possible if a plugin author writes unusual code, and these will be preserved in the audit log even with this option enabled.

The elision replaces the values of the "keys" and "key_info" fields with an integer count of the number of entries. This allows even the elided audit logs to still be useful for answering questions like "Was any data returned?" or "How many records were listed?".


~~There are obviously a few things missing in this PR - changelog, documentation, tests.~~

~~For now, the intent is to start a conversation with HashiCorp about the proposed implementation, and get that validated before filling in the rest.~~

maxb avatar Nov 28 '22 16:11 maxb

Hi Max,

This makes sense, and indeed I was in a discussion just last week re this issue. I wonder if rather than doing it globally on a per audit device basis, it would be better to make it configurable per request path? So just as you can tune a mount to specify keys in the request/response object that shouldn't be hmac'd, maybe you could tune a mount to specify LIST endpoints whose responses should be suppressed.

I'm not asking you to make this change at this point, this is just me thinking out loud. It's not mutually exclusive with your global approach anyway, maybe we want both.

ncabatoff avatar Nov 28 '22 16:11 ncabatoff

That's an interesting question...

In one way, it would make things easier for me, in that I'm already dreading how we manage to deploy this in production, even if it's merged ... because currently Vault has no way to reconfigure an existing audit device. That's a problem, since toggling this new option would require fully removing the existing audit device and re-adding it. This causes two bad things to happen: 1) loss of audit data whilst the removal and addition is ongoing; 2) loss of the previous audit salt, so old audit logs cannot have values looked up via the sys/audit-hash/ API.

(I have been wondering whether it would be practical to add live reconfiguration to audit devices...)

However, there are other reasons I think setting this at the audit device level is the better:

  • Some users may want to turn this on for some audit devices, but off for others - for example an audit stream being transmitted to something that can't handle giant data records might make use of this, but the user might still want to capture unelided records to a file on disk via another audit device.

  • I expect the usual motivation for this feature to be because a particular audit device handles giant records poorly, or breaks entirely. In such a case, you really want to be confident you've fully mitigated the problem, rather than having to reactively respond to particular cases of giant list responses, which may newly arise as usage patterns of Vault change. In particular, if delegated access to create new mounts - e.g. via Vault namespaces - is in use, it may be difficult to ensure that suitable mount tuning options are used for all mounts.

  • Not all mounts even allow tuning at all - e.g. today, even in Vault 1.12, identity/ and sys/ can't be tuned at all. I do remember seeing a PR where someone was working on fixing this, so maybe it'll no longer be an issue in future, but in the short term, this is also another issue with the mount tuning approach.

I do agree that more flexibility is often a good thing - people do tend to come up with use-cases that beyond what designers originally imagined - but if this was to be able to be set in multiple places, we'd have to figure out overriding/inheritance semantics, which could make it complex to understand how to use.

Ultimately, I think the reason that a simple boolean per audit device could be sufficient in this case, is because list operations are by design constrained to work in a fairly similar way throughout Vault, regardless of the particular backend involved - they contain only a keys (and in select cases key_info) field in data, and I can't think of any case where a list would contain strongly confidential data.

maxb avatar Nov 30 '22 23:11 maxb

I'm persuaded. We can leave finer-grained configuration for a future iteration if needed.

(I have been wondering whether it would be practical to add live reconfiguration to audit devices...)

We're open to the idea of this, i.e. I don't think there's any reason why it hasn't been done until now other than no one's asked for it. Should be its own PR of course, and since it will require changes on the enterprise repo side, we'd have to do some work even if you got the ball rolling, but I don't think that'll be a problem, it just may not happen as quickly as one might like.

ncabatoff avatar Dec 09 '22 21:12 ncabatoff

Just a note, but in the PKI plugin, not only would this be a good idea (as cert lists can get long), but finer grained config would be good too: we recommend un-HMACing the certificate field to allow requests for cert issuance to be audited including the full body of the certificate (which is generally public information anyways).

However, various endpoints like /cert/crl return non-certificate data in that parameter as well, and the CRL in particular can grow large enough to panic syslog-only audit setups. The CRL is global and isn't per-user, so there's no real need to un-HMAC the certificate field on this response. My 2c.

cipherboy avatar Dec 12 '22 16:12 cipherboy

@cipherboy Perhaps we should deprecate those endpoints in the PKI secrets engine and add replacements that are separate from the /cert/+ APIs? I had just assumed, that given how the audit_non_hmac_* options were designed, someone had at some point made a concious decision that plugin authors were required to exercise consistency in the naming of their request/response fields across different endpoints, so that the audit configuration complexity could be limited. I've not seen this written down anywhere, it's just a feeling I get from looking at the shape of the implementation. I fear much finer-grained configuration would start getting too cumbersome to use, especially as the precedent is to configure this via the API, rather than a configuration file.

maxb avatar Dec 20 '22 23:12 maxb

@maxb There are already replacements, under the newer /issuer/:ref endpoints, that do not write the CRL to the certificate field.

I fear that any automation which pulls from cert/crl's JSON endpoint would need to be migrated if we deprecate it, which would be less than ideal and prone to breaking (likely a large number of) customers.

cipherboy avatar Dec 20 '22 23:12 cipherboy

@maxb is attempting to deploy a commit to the HashiCorp Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar Dec 21 '22 10:12 vercel[bot]

I fear that any automation which pulls from cert/crl's JSON endpoint would need to be migrated if we deprecate it, which would be less than ideal and prone to breaking (likely a large number of) customers.

Well, I meant, deprecate and never remove, similar to how there's no plan that I know of to ever remove sys/renew even though it is canonically now at sys/leases/renew, or the undocumented obsolete identity/alias and identity/persona endpoints... but, this is beyond the scope of this PR, which deliberately avoids getting entangled in such wider complex configuration concerns! I'd be happy to continue the conversation in a separate issue if you like, though.


Back on topic for this PR, I've written tests... and, ahem, fixed the bug they revealed.

I'll work on the documentation update next.

maxb avatar Dec 21 '22 10:12 maxb

Documentation added.

I believe this PR is now ready for review and merge.

maxb avatar Dec 21 '22 14:12 maxb

Thanks for the feedback! Let me explain why I did it this way, and then you can tell me which trade-off you would prefer to prioritise.

Currently, there is a loose convention that every list response will contain a keys list, and a select few will also include a key_info map. I know of no list responses that contain other data fields, but there is no guarantee that this will always be the case - I wanted to have simple defined behaviour if other fields were present, and chose to just leave them entirely intact.

If I was to write a new field keys_count, I'd need to decide what should happen if the original response already contained a keys_count - overwrite it? Skip the elision process? I could dodge around this a bit by writing an elided_data_keys_count field into the root of the response (outside of the data map), I guess.

I also quite like the appeal of still being able to tell whether a particular response contained a key_info, or only a keys field. That is accomplished in quite a natural way in the current PR, but would need some kind of other additional field, if not done that way.

I do like the algorithmic simplicity of the current approach, but I'm OK with swapping it out for something a bit more complex and verbose, if you feel it's going to confuse some users too much.

maxb avatar Jan 10 '23 19:01 maxb

Currently, there is a loose convention that every list response will contain a keys list, and a select few will also include a key_info map. I know of no list responses that contain other data fields, but there is no guarantee that this will always be the case - I wanted to have simple defined behaviour if other fields were present, and chose to just leave them entirely intact.

That's fair, and it's certainly a concern I had too in proposing this. In my mind it seemed unlikely, but the future is always uncertain.

If I was to write a new field keys_count, I'd need to decide what should happen if the original response already contained a keys_count - overwrite it? Skip the elision process? I could dodge around this a bit by writing an elided_data_keys_count field into the root of the response (outside of the data map), I guess.

If we go this route, I'd say overwriting keys_count is a nonstarter, skipping elision is an option, and putting it outside the data map would be my first choice.

I also quite like the appeal of still being able to tell whether a particular response contained a key_info, or only a keys field. That is accomplished in quite a natural way in the current PR, but would need some kind of other additional field, if not done that way.

I do like the algorithmic simplicity of the current approach, but I'm OK with swapping it out for something a bit more complex and verbose, if you feel it's going to confuse some users too much.

No, I'm persuaded. The concern I have may not arise (people may not care enough about list responses to be digesting them in their pipelines), the new behaviour you're introducing is opt-in, so people using it can always adapt their ingestion code/schemas, and I don't want to impose extra complexity before we know that it's warranted.

ncabatoff avatar Jan 11 '23 21:01 ncabatoff