chore(cache): increased the default lock timeout to 10s that mlcache
Summary
Currently, this lock timeout is hardcoded and does not provide an optional configuration parameter out of the box. We have received multiple complaints from users with the following error in their logs:
failed to get from node cache: could not acquire callback lock: timeout
So, here we double the default timeout time
Checklist
- [ ] The Pull Request has tests
- [ ] A changelog file has been created under
changelog/unreleased/kongorskip-changeloglabel added on PR if changelog is unnecessary. README.md - [ ] There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE
Issue reference
Fix FTI-4299
How feasible it is to expect that if "the thing" doesn't happen in 5 secs, it will happen in 10 secs?
Are there any updates?
How feasible it is to expect that if "the thing" doesn't happen in 5 secs, it will happen in 10 secs?
I don't have any quantifiable data to provide strong support for the fix here, much like setting a TCP timeout, which depends on the specific environment of each individual. However, in practice, increasing the value here to 10s doesn't seem to pose significant risks. Additionally, we have several users who, after independently increasing this limit, no longer encounter issues.
So, if we prefer not to expose this configuration to users, increasing this limit could indeed solve some users' problems.
I don't have any quantifiable data to provide strong support for the fix here, much like setting a TCP timeout, which depends on the specific environment of each individual. However, in practice, increasing the value here to 10s doesn't seem to pose significant risks.
Yes, but where is the limit. Next time we increase it to 30 secs? Is it risk that this starts showing up in max latencies or p99 latencies etc.? More things piling up? More memory used because of piling up? I know that current limit is magic number, just like this proposal is. I would see we somehow fix the root cause or something. Why does retrieving certificate take more than 5 secs. Why it even takes time, why isn't it cached already etc.
Additionally, we have several users who, after independently increasing this limit, no longer encounter issues. So, if we prefer not to expose this configuration to users, increasing this limit could indeed solve some users' problems.
Yes, but at the same time hide the problems that we probably want to fix. Same question can be asked about any of our timeouts, why not just increase every timeout?