fix(balancer) fix upstreams reload every 10s
The upstreams module's load_upstreams_dict_into_memory returned non-cacheable value when upstreams table is empty, causing empty table reload in request context after 10s negative TTL's expiration.
Summary
See https://github.com/Kong/kong/issues/8970#issue-1273886114.
To fix this, empty table may be considered a valid value to cache.
Full changelog
- [Fix empty upstreams table reload every 10s]
Issue reference
Fix #8970
Hello, Sorry for last commit come and go, I was wondering if cache invalidation delay caused this unit test to fail.
Hello, I think I managed to run the failing unit test on my local machine, but the result is the same with both branches (release/2.8.x and fix/empty-upstreams-reload-every-10-seconds). I also added logging instructions on error level in both modified functions in kong/runloop/balancer/upstreams.lua but they didn't appear on unit test run. I tend to think the proposed fix isn't causing the unit test failure. Would someone more experienced than me have a look ? Thanks.
Hey @marc-charpentier, sorry for the late check-in on this (we're all a little bit swamped at the moment prepping the next release).
That test failure looks very similar to one we've been tracking internally that is known to be flaky, so I think it's okay for you to disregard for now.
This changeset looks good to me (thanks again for your efforts--much appreciated!). When I have the time I'll probably do a little exploratory work to see if adding a targeted integration test is feasible, though it might not be necessary in the end. Aside from that, just awaiting a second opinion from somebody else on the team.
Hi @flrgh , thank you for your answer, and no problem for the late check-in. Now all tests passed, although I don't know if someone triggered them again, nor why the previously failing one succeeded this time.
Difficult to rebase so I'm trying to cherry-pick it. Trying to find some way to test this. Sorry to mess up with the tags... I did not notice this is targeting 2.8.x. This fix should also work for the master.
This could be a perf PR or a bug-fixing PR. The former one needs not specific tests but we need to show improvement and the latter one needs tests. I tend to add tests.
Hmm. I was git blame-ing my way around looking for the reason why the upstream negative TTL is hardcoded to 10s, and I found these:
- #5831
- #7002
So it seems there at least was a reason to not cache an empty table because it was exacerbating another DNS balancer problem. Maybe it's not safe to revert this change? On the other hand, the original bugfix is also over 2 years old now, so it's also semi-plausible that this is not a problem anymore.
CC @javierguerragiraldez
Edit: I am now very tempted to believe that the bug fixed by #5831 might be due to this logic here:
https://github.com/Kong/kong/blob/f38b38ed53a9e0dbc967a69c6eda1e1d7481fc8d/kong/runloop/balancer/upstreams.lua#L96-L99
This doesn't appear to be correct usage of lua-resty-mlcache. According to the docs, a single nil return value will be treated as negative cache if not accompanied by a second (error) return value:
-- arg1, arg2, and arg3 are arguments forwarded to the callback from the
-- `get()` variadic arguments, like so:
-- cache:get(key, opts, callback, arg1, arg2, arg3)
local function callback(arg1, arg2, arg3)
-- I/O lookup logic
-- ...
-- value: the value to cache (Lua scalar or table)
-- err: if not `nil`, will abort get(), which will return `value` and `err`
-- ttl: override ttl for this value
-- If returned as `ttl >= 0`, it will override the instance
-- (or option) `ttl` or `neg_ttl`.
-- If returned as `ttl < 0`, `value` will be returned by get(),
-- but not cached. This return value will be ignored if not a number.
return value, err, ttl
end
In light of that I think this change should be accompanied by fixing the return semantics of the function when there's a DB-related error:
if err then
log(CRIT, "could not obtain list of upstreams: ", err)
- return nil
+ return nil, err
end
Hmm. I was
git blame-ing my way around looking for the reason why the upstream negative TTL is hardcoded to 10s, and I found these:
- fix(balancer) don't cache an empty upstream name dict #5831
- fix(balancer) don't cache an empty upstream name dict #7002
So it seems there at least was a reason to not cache an empty table because it was exacerbating another DNS balancer problem. Maybe it's not safe to revert this change? On the other hand, the original bugfix is also over 2 years old now, so it's also semi-plausible that this is not a problem anymore.
CC @javierguerragiraldez
Edit: I am now very tempted to believe that the bug fixed by #5831 might be due to this logic here:
https://github.com/Kong/kong/blob/f38b38ed53a9e0dbc967a69c6eda1e1d7481fc8d/kong/runloop/balancer/upstreams.lua#L96-L99
This doesn't appear to be correct usage of lua-resty-mlcache. According to the docs, a single
nilreturn value will be treated as negative cache if not accompanied by a second (error) return value:-- arg1, arg2, and arg3 are arguments forwarded to the callback from the -- `get()` variadic arguments, like so: -- cache:get(key, opts, callback, arg1, arg2, arg3) local function callback(arg1, arg2, arg3) -- I/O lookup logic -- ... -- value: the value to cache (Lua scalar or table) -- err: if not `nil`, will abort get(), which will return `value` and `err` -- ttl: override ttl for this value -- If returned as `ttl >= 0`, it will override the instance -- (or option) `ttl` or `neg_ttl`. -- If returned as `ttl < 0`, `value` will be returned by get(), -- but not cached. This return value will be ignored if not a number. return value, err, ttl endIn light of that I think this change should be accompanied by fixing the return semantics of the function when there's a DB-related error:
if err then log(CRIT, "could not obtain list of upstreams: ", err) - return nil + return nil, err end
Nice finding. Since we have now corrected the handling probably we could remove that option?
Since we have now corrected the handling probably we could remove that option?
Interestingly, the hardcoded negative TTL came from #4301. This thread is the explanation I was looking for.
Many of the concerns raised there seem to still be somewhat valid, though I tend to think we should trust that upstreams:each() will always return an error on an unexpected failure.
I think a valuable change would be to detect the case where there are zero upstreams and no error is encountered and add a debug log entry in this case. That way this scenario is at least observable if/when there's a reason to suspect that something isn't working correctly.
@locao what do you think about all of this?
Reviewing this is in my backlog, sorry for the long delay. I'll check it asap.