fix(vault): fix vault config neg_ttl behavior

Open cshuaimin opened this issue 11 months ago • 6 comments

As per the documentation, neg_ttl specifies how long to cache a vault miss. However, in the current implementation the secret miss is not cached for this duration and is instead fetched from the vault backend every minute.

This PR first fixes the check in the secret rotation timer so that negatively cached values are no longer fetched unconditionally, but only after neg_ttl has elapsed. It then changes the shdict ttl for the negative cache from neg_ttl to neg_ttl + SECRETS_CACHE_MIN_TTL; otherwise the negative cache entry would expire from the shdict and there would be no chance to update it after neg_ttl.
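The two changes described above can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the PR's Lua code: `FakeShdict`, `cache_miss`, `should_refetch`, the `refetch_at` field, and the 60-second value of `SECRETS_CACHE_MIN_TTL` are all assumptions made for the example.

```python
SECRETS_CACHE_MIN_TTL = 60  # assumed: one rotation interval, in seconds


class FakeShdict:
    """Tiny stand-in for a shared dict with per-key expiry (hypothetical)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl, now):
        # Remember the value together with its absolute expiry time.
        self._data[key] = (value, now + ttl)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or now >= entry[1]:
            return None  # never stored, or already expired
        return entry[0]


def cache_miss(shdict, key, neg_ttl, now):
    # Record when the miss becomes eligible for a refetch, and keep the
    # entry alive for one extra rotation interval so that the rotation
    # timer can still find it once neg_ttl has elapsed.
    shdict.set(
        key,
        {"value": None, "refetch_at": now + neg_ttl},
        ttl=neg_ttl + SECRETS_CACHE_MIN_TTL,
        now=now,
    )


def should_refetch(shdict, key, now):
    entry = shdict.get(key, now)
    if entry is None:
        return False  # entry is gone, the timer has nothing to update
    # Refetch only after neg_ttl, not on every rotation cycle.
    return now >= entry["refetch_at"]
```

With `neg_ttl=300`, the miss is skipped by the timer at `now=60` and picked up at `now=300`. Had the entry been stored with a ttl of exactly `neg_ttl`, it would already have expired at that moment.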

Summary

Checklist

  • [x] The Pull Request has tests
  • [x] A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
  • [ ] There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix FTI-6240

cshuaimin avatar Jan 14 '25 08:01 cshuaimin

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Jan 14 '25 08:01 CLAassistant

As per the documentation, neg_ttl specifies how long to cache a vault miss. However, in the current implementation the secret miss is not cached for this duration and is instead fetched from the vault backend every minute.

I think the reason was that we don't have a clear picture of whether something is a miss or something else, like a network error. Thus we decided to fetch misses every rotation cycle. We have also talked about an n-number of failures, or growing the time gradually on continuous failures, and ultimately removing the secret from the rotation.

I do not have a strong feeling about any direction on this, though.
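The idea of growing the retry time on continuous failures and eventually dropping the secret could look roughly like the sketch below. It is only an illustration: `next_retry_delay` and the constants `BASE_INTERVAL`, `MAX_INTERVAL`, and `MAX_FAILURES` are hypothetical and do not come from Kong.

```python
BASE_INTERVAL = 60    # assumed: one rotation cycle, in seconds
MAX_INTERVAL = 3600   # assumed: upper bound for the retry delay
MAX_FAILURES = 8      # assumed: failures tolerated before giving up


def next_retry_delay(failures):
    """Return the delay in seconds before the next fetch attempt.

    `failures` is the number of consecutive failures so far (1 or more).
    Returns None once the secret should be removed from the rotation.
    """
    if failures > MAX_FAILURES:
        return None
    # Double the delay on every consecutive failure, up to the cap.
    return min(BASE_INTERVAL * 2 ** (failures - 1), MAX_INTERVAL)
```

The first failure keeps today's one-minute cadence, later failures back off up to the cap, and after `MAX_FAILURES` consecutive failures the caller would stop rotating the secret.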

bungle avatar Jan 14 '25 16:01 bungle

Yes, you are right. If there's a network error when fetching from the vault, it should not be cached for a long time, but retried every minute (or with exponential backoff). How about handling these cases separately, i.e. add a new error entry to the cache on failure? We would still refresh error entries every minute, but for a true miss we would respect the neg_ttl config.
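As a rough illustration of this proposal (not code from the PR), the cache entry could carry a kind, and the rotation timer could pick the refresh interval from it. `Kind`, `refresh_after`, `is_due`, and `ROTATION_INTERVAL` are hypothetical names for this sketch.

```python
from enum import Enum

ROTATION_INTERVAL = 60  # assumed: rotation timer period, in seconds


class Kind(Enum):
    MISS = "miss"    # the vault answered, but the secret does not exist
    ERROR = "error"  # the fetch itself failed, e.g. a network error


def refresh_after(kind, neg_ttl):
    # Errors are retried every rotation cycle; true misses honor neg_ttl.
    if kind is Kind.ERROR:
        return ROTATION_INTERVAL
    return neg_ttl


def is_due(kind, cached_at, now, neg_ttl):
    # True when the entry has been cached long enough to be refetched.
    return now - cached_at >= refresh_after(kind, neg_ttl)
```

With `neg_ttl=300`, an error entry cached at time 0 is due again at 60 seconds, while a miss entry is left alone until 300 seconds.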

cshuaimin avatar Jan 15 '25 03:01 cshuaimin

Fixed test and rebased onto master in the force push.

cshuaimin avatar Jan 15 '25 05:01 cshuaimin

add a new error entry on failure to cache. We still refresh error caches every minute, but for true missing cache we respect neg_ttl config.

Yes, but how do we know whether it was an error or a missing vault key (we may need to consult each vault implementation about it, if that is even possible)? E.g. does a 404 come from an ill-configured proxy or from the vault (thus we may need to check the payload; there is no standard, so each vault may be different)? But sure, if you want to explore this option, I have nothing against it.

bungle avatar Jan 15 '25 06:01 bungle