fix(vault): fix vault config neg_ttl behavior
As per the documentation, neg_ttl specifies how long to cache a vault miss. However, in the current implementation the secret miss is not cached for this duration and is fetched from the vault backend every minute.
This PR first fixes the check in the secret rotation timer so that it no longer fetches negatively cached values unconditionally, but only after neg_ttl has elapsed. It then changes the shdict TTL for negative cache entries from neg_ttl to neg_ttl + SECRETS_CACHE_MIN_TTL; otherwise the negative cache entry would expire from the shdict and there would be no chance to update it after neg_ttl.
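The two changes can be modeled with a small sketch. This is an illustrative Python model, not the code in this PR: `TtlDict`, `cache_miss`, `miss_due_for_refetch`, the `NEGATIVE` marker, and the value 60 for `SECRETS_CACHE_MIN_TTL` are all assumptions made for the example.

```python
import time

SECRETS_CACHE_MIN_TTL = 60  # assumed rotation interval, in seconds
NEGATIVE = ""               # hypothetical marker for a cached miss


class TtlDict:
    """Toy stand-in for a shared dict whose entries expire after a TTL."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.items = {}

    def set(self, key, value, ttl):
        self.items[key] = (value, self.clock() + ttl)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None or entry[1] <= self.clock():
            self.items.pop(key, None)  # expired entries are dropped
            return None
        return entry[0]


def cache_miss(cache, key, neg_ttl, now):
    # Keep the entry alive one rotation interval past neg_ttl, so the
    # rotation timer can still find it when the refetch becomes due.
    record = {"value": NEGATIVE, "refetch_at": now + neg_ttl}
    cache.set(key, record, ttl=neg_ttl + SECRETS_CACHE_MIN_TTL)


def miss_due_for_refetch(record, now):
    # A negatively cached entry is refetched only after neg_ttl has elapsed.
    return record["value"] == NEGATIVE and now >= record["refetch_at"]
```

With neg_ttl = 300 in this model, the timer skips the entry on every cycle before 300 seconds, refetches it on the first cycle at or after 300 seconds, and the entry only disappears from the dict at 360 seconds.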
Summary
Checklist
- [x] The Pull Request has tests
- [x] A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
- [ ] There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE
Issue reference
Fix FTI-6240
As per documentation, neg_ttl specifies the time to cache a vault miss. However in the current implementation the secret miss is not cached for this duration and is fetched from vault backend every minute.
I think the reason was that we don't have a clear picture of whether something is a miss or something else, like a network error. Thus we decided to fetch misses every rotation cycle. We have also talked about counting n failures, or growing the interval gradually on continuous failures, and ultimately removing the secret from the rotation.
I do not have strong feelings about either direction, though.
Yes, you are right. If there's a network error when fetching from the vault, the result should not be cached for a long time, but retried every minute (or with exponential backoff). How about handling these cases separately, i.e. adding a new error entry to the cache on failure? We would still refresh error entries every minute, but for a truly missing secret we would respect the neg_ttl config.
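A rough sketch of that split, written as an illustrative Python model rather than the vault code itself. The `Outcome` enum, `next_refetch_delay`, and the value 60 for `SECRETS_CACHE_MIN_TTL` are hypothetical names and values chosen for the example; how a fetch result would actually be classified as a miss or an error is left open.

```python
from enum import Enum

SECRETS_CACHE_MIN_TTL = 60  # assumed rotation interval, in seconds


class Outcome(Enum):
    FOUND = "found"
    MISSING = "missing"  # the vault answered: no such secret
    ERROR = "error"      # the fetch itself failed, e.g. a network error


def next_refetch_delay(outcome, ttl, neg_ttl):
    """Seconds to wait before the rotation timer refetches an entry."""
    if outcome is Outcome.ERROR:
        return SECRETS_CACHE_MIN_TTL  # retry on the next rotation cycle
    if outcome is Outcome.MISSING:
        return neg_ttl                # respect the configured negative TTL
    return ttl                        # found: use the regular TTL
```

In this model an error entry is retried after one rotation interval regardless of neg_ttl, while a true miss waits for the full neg_ttl.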
Fixed test and rebased onto master in the force push.
add a new error entry on failure to cache. We still refresh error caches every minute, but for true missing cache we respect neg_ttl config.
Yes, but how do we know whether it was an error or a missing vault key? We may need to check this for each vault implementation, if it is even possible. For example, does a 404 come from a misconfigured proxy or from the vault? We may need to check the payload, and since there is no standard, each vault may differ. But sure, if you want to explore this option, I have nothing against it.