kong icon indicating copy to clipboard operation
kong copied to clipboard

fix(balancer) fix upstreams reload every 10s

Open marc-charpentier opened this issue 3 years ago • 5 comments

The upstreams module's load_upstreams_dict_into_memory returned non-cacheable value when upstreams table is empty, causing empty table reload in request context after 10s negative TTL's expiration.

Summary

See https://github.com/Kong/kong/issues/8970#issue-1273886114.

To fix this, empty table may be considered a valid value to cache.

Full changelog

  • [Fix empty upstreams table reload every 10s]

Issue reference

Fix #8970

marc-charpentier avatar Jun 17 '22 14:06 marc-charpentier

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Jun 17 '22 14:06 CLAassistant

Hello, Sorry for last commit come and go, I was wondering if cache invalidation delay caused this unit test to fail.

marc-charpentier avatar Jun 23 '22 14:06 marc-charpentier

Hello, I think I managed to run the failing unit test on my local machine, but the result is the same with both branches (release/2.8.x and fix/empty-upstreams-reload-every-10-seconds). I also added logging instructions on error level in both modified functions in kong/runloop/balancer/upstreams.lua but they didn't appear on unit test run. I tend to think the proposed fix isn't causing the unit test failure. Would someone more experienced than me have a look ? Thanks.

marc-charpentier avatar Jun 24 '22 13:06 marc-charpentier

Hey @marc-charpentier, sorry for the late check-in on this (we're all a little bit swamped at the moment prepping the next release).

That test failure looks very similar to one we've been tracking internally that is known to be flaky, so I think it's okay for you to disregard for now.

This changeset looks good to me (thanks again for your efforts--much appreciated!). When I have the time I'll probably do a little exploratory work to see if adding a targeted integration test is feasible, though it might not be necessary in the end. Aside from that, just awaiting a second opinion from somebody else on the team.

flrgh avatar Jun 24 '22 18:06 flrgh

Hi @flrgh , thank you for your answer, and no problem for the late check-in. Now all tests passed, although I don't know if someone triggered them again, nor why the previously failing one succeeded this time.

marc-charpentier avatar Jun 26 '22 14:06 marc-charpentier

Difficult to rebase so I'm trying to cherry-pick it. Trying to find some way to test this. Sorry to mess up with the tags... I did not notice this is targeting 2.8.x. This fix should also work for the master.

StarlightIbuki avatar Nov 10 '22 09:11 StarlightIbuki

This could be a perf PR or a bug-fixing PR. The former one needs not specific tests but we need to show improvement and the latter one needs tests. I tend to add tests.

StarlightIbuki avatar Nov 10 '22 09:11 StarlightIbuki

Hmm. I was git blame-ing my way around looking for the reason why the upstream negative TTL is hardcoded to 10s, and I found these:

  • #5831
  • #7002

So it seems there at least was a reason to not cache an empty table because it was exacerbating another DNS balancer problem. Maybe it's not safe to revert this change? On the other hand, the original bugfix is also over 2 years old now, so it's also semi-plausible that this is not a problem anymore.

CC @javierguerragiraldez


Edit: I am now very tempted to believe that the bug fixed by #5831 might be due to this logic here:

https://github.com/Kong/kong/blob/f38b38ed53a9e0dbc967a69c6eda1e1d7481fc8d/kong/runloop/balancer/upstreams.lua#L96-L99

This doesn't appear to be correct usage of lua-resty-mlcache. According to the docs, a single nil return value will be treated as negative cache if not accompanied by a second (error) return value:

-- arg1, arg2, and arg3 are arguments forwarded to the callback from the
-- `get()` variadic arguments, like so:
-- cache:get(key, opts, callback, arg1, arg2, arg3)

local function callback(arg1, arg2, arg3)
    -- I/O lookup logic
    -- ...

    -- value: the value to cache (Lua scalar or table)
    -- err: if not `nil`, will abort get(), which will return `value` and `err`
    -- ttl: override ttl for this value
    --      If returned as `ttl >= 0`, it will override the instance
    --      (or option) `ttl` or `neg_ttl`.
    --      If returned as `ttl < 0`, `value` will be returned by get(),
    --      but not cached. This return value will be ignored if not a number.
    return value, err, ttl
end

In light of that I think this change should be accompanied by fixing the return semantics of the function when there's a DB-related error:

    if err then
      log(CRIT, "could not obtain list of upstreams: ", err)
-      return nil
+      return nil, err
    end

flrgh avatar Nov 15 '22 22:11 flrgh

Hmm. I was git blame-ing my way around looking for the reason why the upstream negative TTL is hardcoded to 10s, and I found these:

So it seems there at least was a reason to not cache an empty table because it was exacerbating another DNS balancer problem. Maybe it's not safe to revert this change? On the other hand, the original bugfix is also over 2 years old now, so it's also semi-plausible that this is not a problem anymore.

CC @javierguerragiraldez

Edit: I am now very tempted to believe that the bug fixed by #5831 might be due to this logic here:

https://github.com/Kong/kong/blob/f38b38ed53a9e0dbc967a69c6eda1e1d7481fc8d/kong/runloop/balancer/upstreams.lua#L96-L99

This doesn't appear to be correct usage of lua-resty-mlcache. According to the docs, a single nil return value will be treated as negative cache if not accompanied by a second (error) return value:

-- arg1, arg2, and arg3 are arguments forwarded to the callback from the
-- `get()` variadic arguments, like so:
-- cache:get(key, opts, callback, arg1, arg2, arg3)

local function callback(arg1, arg2, arg3)
    -- I/O lookup logic
    -- ...

    -- value: the value to cache (Lua scalar or table)
    -- err: if not `nil`, will abort get(), which will return `value` and `err`
    -- ttl: override ttl for this value
    --      If returned as `ttl >= 0`, it will override the instance
    --      (or option) `ttl` or `neg_ttl`.
    --      If returned as `ttl < 0`, `value` will be returned by get(),
    --      but not cached. This return value will be ignored if not a number.
    return value, err, ttl
end

In light of that I think this change should be accompanied by fixing the return semantics of the function when there's a DB-related error:

    if err then
      log(CRIT, "could not obtain list of upstreams: ", err)
-      return nil
+      return nil, err
    end

Nice finding. Since we have now corrected the handling probably we could remove that option?

StarlightIbuki avatar Nov 16 '22 09:11 StarlightIbuki

Since we have now corrected the handling probably we could remove that option?

Interestingly, the hardcoded negative TTL came from #4301. This thread is the explanation I was looking for.

Many of the concerns raised there seem to still be somewhat valid, though I tend to think we should trust that upstreams:each() will always return an error on an unexpected failure.

I think a valuable change would be to detect the case where there are zero upstreams and no error is encountered and add a debug log entry in this case. That way this scenario is at least observable if/when there's a reason to suspect that something isn't working correctly.

@locao what do you think about all of this?

flrgh avatar Nov 16 '22 22:11 flrgh

Reviewing this is in my backlog, sorry for the long delay. I'll check it asap.

locao avatar Dec 09 '22 13:12 locao