kong icon indicating copy to clipboard operation
kong copied to clipboard

chore(cache): increased the default lock timeout to 10s that mlcache

Open ms2008 opened this issue 1 year ago • 4 comments

Summary

Currently, this lock timeout is hardcoded and does not provide an optional configuration parameter out of the box. We have received multiple complaints from users with the following error in their logs:

failed to get from node cache: could not acquire callback lock: timeout

So, here we double the default timeout time

Checklist

  • [ ] The Pull Request has tests
  • [ ] A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
  • [ ] There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix FTI-4299

ms2008 avatar Apr 29 '24 08:04 ms2008

How feasible it is to expect that if "the thing" doesn't happen in 5 secs, it will happen in 10 secs?

bungle avatar Apr 29 '24 10:04 bungle

Are there any updates?

dubuqingfeng avatar May 07 '24 02:05 dubuqingfeng

How feasible it is to expect that if "the thing" doesn't happen in 5 secs, it will happen in 10 secs?

I don't have any quantifiable data to provide strong support for the fix here, much like setting a TCP timeout, which depends on the specific environment of each individual. However, in practice, increasing the value here to 10s doesn't seem to pose significant risks. Additionally, we have several users who, after independently increasing this limit, no longer encounter issues.

So, if we prefer not to expose this configuration to users, increasing this limit could indeed solve some users' problems.

ms2008 avatar May 11 '24 08:05 ms2008

I don't have any quantifiable data to provide strong support for the fix here, much like setting a TCP timeout, which depends on the specific environment of each individual. However, in practice, increasing the value here to 10s doesn't seem to pose significant risks.

Yes, but where is the limit. Next time we increase it to 30 secs? Is it risk that this starts showing up in max latencies or p99 latencies etc.? More things piling up? More memory used because of piling up? I know that current limit is magic number, just like this proposal is. I would see we somehow fix the root cause or something. Why does retrieving certificate take more than 5 secs. Why it even takes time, why isn't it cached already etc.

Additionally, we have several users who, after independently increasing this limit, no longer encounter issues. So, if we prefer not to expose this configuration to users, increasing this limit could indeed solve some users' problems.

Yes, but at the same time hide the problems that we probably want to fix. Same question can be asked about any of our timeouts, why not just increase every timeout?

bungle avatar May 13 '24 09:05 bungle