fix upstream_dict nil while updating an upstream
Summary
This PR was originally submitted by @ranxuxin. I just rebased it to be part of the release 2.3.0, below is their original description:
Hi everyone,
clients requests return 500 while using admin API to change an upstream with "eventual" worker_consistency. Although (#6104) fixes this bug, clients requests return 503 instead of return 500. However, either 500 or 503 are not acceptable on production environment.
The reason is that singletons.core_cache:invalidate_local("balancer:upstreams") and singletons.core_cache:get("balancer:upstreams") are not atomic in do_upstream_event() and update_balancer_state(). When clients requests come between this two steps, kong throws stack traceback. As below:
2020/11/10 14:36:27 [error] 2625#0: *8424 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/kong/runloop/balancer.lua:672: attempt to index local 'upstreams_dict' (a nil value) stack traceback: coroutine 0: /usr/local/share/lua/5.1/kong/runloop/balancer.lua: in function 'get_upstream_by_name' /usr/local/share/lua/5.1/kong/runloop/balancer.lua:693: in function 'get_balancer' /usr/local/share/lua/5.1/kong/runloop/balancer.lua:1090: in function 'execute' /usr/local/share/lua/5.1/kong/runloop/handler.lua:891: in function 'balancer_execute' /usr/local/share/lua/5.1/kong/runloop/handler.lua:1264: in function 'after' /usr/local/share/lua/5.1/kong/init.lua:788: in function 'access' access_by_lua(nginx-kong.conf:80):2: in main chunk, client: 127.0.0.1, server: kong, request: "GET / HTTP/1.1", host: "www.test500a.com"
2020/11/10 18:17:10 [error] 20512#0: *1783168 [lua] events.lua:194: do_handlerlist(): worker-events: event callback failed; source=balancer, event=upstreams, pid=20513 error='/usr/local/share/lua/5.1/kong/runloop/balancer.lua:519: attempt to index local 'upstream' (a nil value) stack traceback: /usr/local/share/lua/5.1/kong/runloop/balancer.lua:519: in function 'create_balancer' /usr/local/share/lua/5.1/kong/runloop/balancer.lua:1050: in function 'do_upstream_event' /usr/local/share/lua/5.1/kong/runloop/balancer.lua:1066: in function 'on_upstream_event' /usr/local/share/lua/5.1/kong/runloop/handler.lua:406: in function </usr/local/share/lua/5.1/kong/runloop/handler.lua:401> [C]: in function 'xpcall' /usr/local/share/lua/5.1/resty/worker/events.lua:185: in function 'do_handlerlist' /usr/local/share/lua/5.1/resty/worker/events.lua:219: in function 'do_event_json' /usr/local/share/lua/5.1/resty/worker/events.lua:361: in function 'poll' /usr/local/share/lua/5.1/resty/worker/events.lua:380: in function </usr/local/share/lua/5.1/resty/worker/events.lua:375>', data={"operation":"update","entity":{"created_at":1604923597,"hash_on":"none","id":"cee3bfb3-fcd8-4f87-aa7a-2086c361d025","algorithm":"round-robin","name":"test500a","hash_on_cookie_path":"/","healthchecks":{"threshold":0,"active":{"unhealthy":{"http_statuses":[429,404,500,501,502,503,504,505],"tcp_failures":0,"timeouts":0,"http_failures":0,"interval":0},"http_path":"/","timeout":1,"type":"http","healthy":{"successes":0,"interval":0,"http_statuses":[200,302]},"https_verify_certificate":true,"concurrency":10},"passive":{"unhealthy":{"http_failures":0,"http_statuses":[429,500,503],"tcp_failures":0,"timeouts":0},"healthy":{"http_statuses":[200,201,202,203,204,205,206,207,208,226,300,301,302,303,304,305,306,307,308],"successes":0},"type":"http"}},"hash_fallback":"none","slots":2800}}, context: ngx.timer
Full changelog We defined "singletons.core_cache:set()" interface in cache.lua. And used it instead of singletons.core_cache:invalidate_local("balancer:upstreams") and singletons.core_cache:get("balancer:upstreams").
Issues resolved Fixes: #6561
Closes: #6569
Hi @locao,
I pushed the commit to resolve kikito's change request in branch fix/upstream_dict_nil of my repository (https://github.com/ranxuxin/kong.git). Could you help me to merge into this repository, please?
Best Regards, Ran Xuxin
@ranxuxin I applied all those suggestions and rebased the PR to make it follow our styling better. Feel free to take a look!
Also added a minor performance improvement by caching the empty opts table instead of creating one and throw away immediately after.
@dndx , thank you to apply these suggestions. It looks great.
@dndx , it seems the new commit didn't pass checks. Could you help me to fix it, please?
@ranxuxin Thanks for your contribution! 👍
@dndx Do you have time to solve the conflict to get this PR merged?
Closing because of lack of activity. We would like to accept this contribution if anyone wishes to pick this back up.
@hbagdi @locao @dndx Does anyone knows where are we now? I'm revisiting old issues.