
Kong routes requests to targets deleted via Admin API

Open mkornatz opened this issue 2 years ago • 7 comments

Is there an existing issue for this?

  • [x] #9019

Kong version ($ kong version)

2.8.1

Current Behavior

When deleting a target from an upstream via the Admin API, we see a recurring issue where the Kong Gateway proxy nodes continue to route traffic to the target. The Admin API shows the correct state (the target is removed), but we can confirm from the request logs that traffic is still being routed to the target long after the target is removed.
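For reference, the removal is a single Admin API call along these lines (a sketch only; the default Admin API port and placeholder IDs are assumed here, not our real values):

# assumes the Admin API on its default port 8001; <upstream_id> and <target_id> are placeholders
curl -i -X DELETE http://localhost:8001/upstreams/<upstream_id>/targets/<target_id>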

Our current workaround has been to restart the proxy nodes when we detect this issue, which resolves the issue, but is causing other side-effects, such as cutting off long running API requests.

About 1 second after the target is removed, using the DELETE /upstreams/{upstream_id}/targets/{target_id} Admin API, we see log entries like the following coming from the Kong Gateway. Note: these log lines may be out of order due to how our logging system collects entries with the same timestamp.

2022/05/02 15:10:39 [error] 1108#0: *20612 [lua] events.lua:194: do_handlerlist(): worker-events: event callback failed; source=balancer, event=targets, pid=1115 error='/usr/local/openresty/lualib/resty/lock.lua:153: API disabled in the context of log_by_lua*
log_by_lua(nginx-kong.conf:136):2: in main chunk', data={"operation":"create","entity":{"upstream":{"id":"943f01f4-60d0-4435-933d-c76adbbe7060"}}} while logging request, client: REDACTED, server: kong, request: "POST / HTTP/1.1", upstream: "http://REDACTED:80/", host: "REDACTED"
/usr/local/share/lua/5.1/kong/init.lua:1380: in function 'log'
/usr/local/share/lua/5.1/kong/runloop/handler.lua:1651: in function 'after'
...l/share/lua/5.1/kong/runloop/balancer/healthcheckers.lua:169: in function 'report_http_status'
/usr/local/share/lua/5.1/resty/healthcheck.lua:582: in function 'report_http_status'
/usr/local/share/lua/5.1/resty/healthcheck.lua:560: in function 'run_mutexed_fn'
[C]: in function 'pcall'
/usr/local/share/lua/5.1/resty/healthcheck.lua:676: in function </usr/local/share/lua/5.1/resty/healthcheck.lua:640>
/usr/local/share/lua/5.1/resty/healthcheck.lua:1242: in function 'raise_event'
/usr/local/share/lua/5.1/resty/worker/events.lua:361: in function 'post'
/usr/local/share/lua/5.1/resty/worker/events.lua:219: in function 'do_event_json'
/usr/local/share/lua/5.1/resty/worker/events.lua:185: in function 'do_handlerlist'
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/runloop/handler.lua:240: in function </usr/local/share/lua/5.1/kong/runloop/handler.lua:235>
/usr/local/share/lua/5.1/kong/runloop/balancer/targets.lua:154: in function 'on_target_event'
.../local/share/lua/5.1/kong/runloop/balancer/balancers.lua:198: in function 'create_balancer'
.../local/share/lua/5.1/kong/runloop/balancer/balancers.lua:105: in function 'create_balancer_exclusive'
/usr/local/share/lua/5.1/kong/cache/init.lua:168: in function 'fetch_targets'
/usr/local/share/lua/5.1/resty/mlcache.lua:685: in function 'get'
/usr/local/openresty/lualib/resty/lock.lua:153: in function 'lock'
[C]: in function 'sleep'
stack traceback:

Intermixed with those errors above, we also see several log entries like the following related to the given target/upstream.

1108#0: *26698 [lua] healthcheckers.lua:171: report_http_status(): [healthchecks] failed reporting status: true while logging request

This issue is intermittent, and doesn't always occur. It seems to be most prevalent when the targets being deleted are having issues and have become unresponsive or take a very long time (30s) to respond to requests.

Expected Behavior

When a target is removed from an upstream, we expect that traffic will no longer be routed to it. Given Kong's eventual worker consistency, we don't expect this change to be applied immediately. However, we do expect it to take effect within a few seconds of the target being removed.

Steps To Reproduce

We are still working to reproduce the issue in an isolated environment. However, in our production environment we can confirm that the following steps produce the behavior (a sketch of the corresponding Admin API calls follows the list).

1. In an environment with multiple gateway proxy nodes (at least 2)
2. Create a service and upstream and configure active and passive health checks with a standard configuration
3. Add target_1 and target_2 to the upstream
4. Force target_2 to become unresponsive or take a long time to respond to requests
5. Delete target_2 from the upstream
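As a sketch, the Admin API calls behind steps 2, 3 and 5 look roughly like the following (the port, names and target addresses are illustrative placeholders, and the healthcheck settings from step 2 are sketched separately under "Anything else?" below):

# step 2: create an upstream and a service that points at it (names are placeholders)
curl -i -X POST http://localhost:8001/upstreams --data name=example-upstream
curl -i -X POST http://localhost:8001/services --data name=example-service --data host=example-upstream
# (a route on example-service is also needed so proxy traffic actually flows; omitted here)

# step 3: add the two targets
curl -i -X POST http://localhost:8001/upstreams/example-upstream/targets --data target=target-1.internal:80
curl -i -X POST http://localhost:8001/upstreams/example-upstream/targets --data target=target-2.internal:80

# step 5: after target_2 has become unresponsive, delete it from the upstream
curl -i -X DELETE http://localhost:8001/upstreams/example-upstream/targets/target-2.internal:80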

Anything else?

  • We are using the following config values. We have tried multiple variations of these configs with no change in behavior.
KONG_DB_UPDATE_FREQUENCY: "5"
KONG_WORKER_STATE_UPDATE_FREQUENCY: "5"
KONG_WORKER_CONSISTENCY: "eventual"
  • In our environment, we often have targets that become unresponsive or start serving incorrect data when they are under heavy load. A background job in a separate system is responsible for deleting these targets when they get into a problematic state. This often leads to targets being added and removed every 5 minutes or so.
  • Both the admin and proxy nodes are running in a Kubernetes cluster
  • The logs included in the report above seem to be the only relevant logs
  • The upstream is configured to use both active and passive checks for the targets (an illustrative configuration is sketched below this list)
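By a "standard configuration" for the health checks we mean something along these lines (a sketch with representative values, not our exact production settings):

# example-upstream and /status are placeholders; values are representative only
curl -i -X PATCH http://localhost:8001/upstreams/example-upstream \
  --header 'Content-Type: application/json' \
  --data '{
    "healthchecks": {
      "active": {
        "type": "http",
        "http_path": "/status",
        "healthy":   { "interval": 5, "successes": 2 },
        "unhealthy": { "interval": 5, "http_failures": 3, "timeouts": 3 }
      },
      "passive": {
        "healthy":   { "successes": 2 },
        "unhealthy": { "http_failures": 3, "timeouts": 3 }
      }
    }
  }'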

mkornatz avatar May 04 '22 17:05 mkornatz

I'm wondering if it might be the same issue as, or at least related to, #8517. But in your case it at least logs a more or less meaningful error message. Maybe the same but different, but the same.

23henne avatar May 09 '22 11:05 23henne

@23henne Thanks for pointing me to that other issue. It does indeed look similar. I'm still having trouble reproducing in a non-production environment, but if I can, I'll also mention it in the other issue.

mkornatz avatar May 19 '22 21:05 mkornatz

Same issue here. Kong 2.8.1 logs show requests routed to the old IP, but the Admin API /targets endpoint shows only one target, with the new IP. Update: on checking the DB we found multiple entries for the same upstreams, one in the old format and one in the new format. After clearing all the older entries, the problem was fixed. This might have occurred because of an incomplete migration.

uvish avatar May 25 '22 09:05 uvish

This may be related to #8976

esatterwhite avatar Jun 20 '22 11:06 esatterwhite

Hi @mkornatz! Thanks for your report. The log you shared is missing the stack traceback. Can you share the lines that follow stack traceback:?

locao avatar Jul 08 '22 13:07 locao

Hello @locao, I'm Matt's colleague, and I believe that is the stack traceback. As Matt mentioned: "Note: these log lines may be out of order due to how our logging system collects entries with the same timestamp."

ybce avatar Jul 08 '22 18:07 ybce

Same issue, Kong version is 2.1.4.

We used the Kong Admin API to add one target (10...:47179) and remove another target (10...:57593). One of the Kong nodes continues to send requests to the removed target (10...:57593). Fetching the target list via the Admin API shows the correct state:

{
  "next": null,
  "data": [
    {
      "created_at": 1658732450.015,
      "id": "a2c1f98b-2142-43b4-b489-eb223b9ae970",
      "tags": null,
      "weight": 100,
      "target": "10...:47179",    <- new target (is live)
      "upstream": { "id": "fadc44a1-550e-410f-be86-de67ce1a3acb" }
    }
  ]
}

The problem may fix itself when the upstream is updated.

altnti avatar Jul 25 '22 09:07 altnti

The related PR is https://github.com/Kong/kong/pull/10384. Closing this issue.

raoxiaoyan avatar Aug 31 '23 02:08 raoxiaoyan

Hi Team, We are observing the exact same issue in Kong version 3.2.2 as well, i.e., when deleting a target from an upstream via the Admin API, we see a recurring issue where the Kong Gateway proxy nodes continue to route traffic to the target. We had instances in our production environment where some of the Kong nodes were proxying requests to an upstream target that had been terminated more than 2 days earlier. The Admin API shows the correct state of all targets (the target is removed), but from the request logs we see that traffic is still being proxied to the target long after it was removed.

We are trying to reproduce the issue in Non-Production environment, but we haven't been able to do so yet. We shall share more detail if we are able to reproduce the issue.

Kong configuration: version 3.2.2, traditional mode with a Postgres DB.

1madhu9 avatar Apr 07 '24 21:04 1madhu9