feat(conf): add concurrency_timeout option
Summary
Currently, Kong Gateway uses a fixed timeout value (60s) for concurrency jobs.
However, some concurrency jobs may take longer. For example, in a declarative config reconfigure case:
- The operator pushed config A, and Kong is applying it. This job will take 65s.
- The operator pushed config B (e.g. changed a plugin config). Kong waits for job A, but times out after 60s.
- The operator thought config B was applied, but it was not. The operator went for a cup of coffee.
Sets the timeout (in ms) for which a concurrency job (e.g. a declarative reconfigure) should wait for the mutex before giving up. After the timeout, Kong Gateway skips the job and logs an error message.
If you happen to see "timeout acquiring ... lock" log entries, try adjusting this value accordingly. Be careful: setting a large value might result in a long list of concurrency jobs in the queue, which could overflow the queue.
Timing out is fine when a job is safe to be overridden by subsequent jobs, as a declarative reconfigure is. Do not change this option unless you are clear about its implications.
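A minimal sketch of how the new option might be set in kong.conf; the value shown is illustrative, and the default is assumed to match the previously fixed 60s timeout described above:

```
# kong.conf -- illustrative value; concurrency_timeout is the option this PR adds
# Wait up to 120s for the concurrency mutex instead of the fixed 60s default.
concurrency_timeout = 120000
```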
Checklist
- [x] The Pull Request has tests
- [x] A changelog file has been created under `changelog/unreleased/kong` or the `skip-changelog` label added on the PR if a changelog is unnecessary. README.md
- [ ] There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE
Issue reference
Fix #FTI-5969
Duplicate of already closed PR: https://github.com/Kong/kong/pull/13027
If these take 60+ secs, we have bigger problems somewhere. And I have proposed fixes to make this faster.
If we want to do something, I would limit the change strictly to `reconfigure_timeout` or `_reconfigure_timeout` and call it that. Changing the concurrency library's default value can have unwanted side effects elsewhere. But I am not sure we need this at all. It is much better to make reconfigure not take 60+ seconds, e.g. by faster loading of entities (easy to do). And it definitely does not take 60+ secs in 99.99999% of cases. I have tested some pretty large configs on my laptop and it never gets close to those numbers. It is possible that something has gotten much slower, like router rebuilding or plugin iterator rebuilding.
Also, when this issue happens it is already too late to change this value. If the customer even asks, what do we tell them to raise `_reconfigure_timeout` to? 2000s?
I think it is rather easy to make `reconfigure_timeout` self-adjusting, e.g. make it `math.min(math.max(longest_rebuild_that_has_happened_on_this_worker * 1.5, 60), 600)`, as in the sketch below.
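A minimal Lua sketch of that idea, assuming a per-worker variable tracking the longest observed rebuild; the names are illustrative and this is not Kong's actual implementation:

```lua
-- Hypothetical self-adjusting reconfigure timeout, tracked per worker.
local longest_rebuild = 0  -- seconds; longest rebuild seen on this worker

-- Call after each rebuild with its measured duration in seconds.
local function record_rebuild_duration(duration)
  if duration > longest_rebuild then
    longest_rebuild = duration
  end
end

-- Scale the worst observed rebuild by 1.5x, clamped to [60, 600] seconds.
local function reconfigure_timeout()
  return math.min(math.max(longest_rebuild * 1.5, 60), 600)
end
```

This keeps the timeout at the 60s floor for typical configs, while a worker that has seen a slow rebuild automatically allows up to 1.5x that duration, never exceeding 600s.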
Why would you need to restart the node for this and make manual config changes when the machine can do it?
And if you never want to time out, then why do we even have a timeout? Should we just let these reconfigures pile up? Surely we have a timeout for some reason.