
One or more workers in one of 14 Kong nodes still use the old cache after the Kong API is updated

Open AndyEWang opened this issue 2 years ago • 13 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Kong version ($ kong version)

Kong 0.12.3

Current Behavior

There are 14 Kong nodes in our API gateway cluster, which uses PostgreSQL as its database. Each Kong node runs 4 nginx workers. All TTL configs remain at their defaults.

#db_update_frequency = 5
#db_update_propagation = 0
#db_cache_ttl = 3600

We added a new custom plugin to an existing Kong API. After testing with some HTTP requests, we found that some requests routed by one Kong node didn't execute the new plugin. We then logged in to that Kong node and curled 127.0.0.1 with the Host header to force the requests through that Kong API and node. Some requests executed the plugin and some didn't, so we suspect one or several nginx workers still use the old cache. After 8 hours the issue still existed, but after reloading the nginx processes everything worked well.

Expected Behavior

The newly added plugin should be propagated to all kong nodes and all workers.

Steps To Reproduce

The issue happens rarely.

Anything else?

We found the following logs after the plugin was added to the existing Kong API:

2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 314, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 315, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 316, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 317, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 318, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 319, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 320, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 321, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 322, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 323, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 324, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 325, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 326, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 327, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 328, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 329, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 330, context: ngx.timer
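For context, these messages appear to come from the inter-worker event bus (the lua-resty-worker-events library). Below is a minimal sketch, assuming a plain OpenResty setup with a lua_shared_dict named worker_events (an illustrative name), of how that bus propagates events between workers; when a polling worker cannot read another worker's event payload from the shared dict in time, it logs the "dropping event; waiting for event data timed out" error above and skips that event, which would leave its cache stale.

-- Minimal sketch of the inter-worker event flow in lua-resty-worker-events
-- (assumes `lua_shared_dict worker_events 5m;` in the nginx config).
local events = require "resty.worker.events"

-- Every worker configures the bus on top of the shared dict and polls it
-- periodically for events posted by the other workers.
local ok, err = events.configure {
  shm      = "worker_events", -- shared dict acting as the event bus
  interval = 1,               -- poll interval in seconds
}
if not ok then
  ngx.log(ngx.ERR, "failed to configure worker events: ", err)
end

-- Workers register handlers; Kong uses this mechanism to invalidate its
-- datastore cache when an entity changes ("crud"/"plugins" are illustrative
-- source/event names here, not necessarily Kong's exact ones).
events.register(function(data, event, source, pid)
  ngx.log(ngx.NOTICE, "invalidating cache key ", tostring(data),
          " (event ", event, " from worker pid ", pid, ")")
end, "crud", "plugins")

-- The worker that sees the change posts an event; the others pick it up on
-- their next poll. If a polling worker cannot read the event payload from
-- the shared dict within its wait window, it logs
-- "dropping event; waiting for event data timed out" and skips the event,
-- leaving its own cached configuration stale.
events.post("crud", "plugins", "plugins:my-plugin")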

AndyEWang avatar Mar 08 '22 02:03 AndyEWang

Hi @AndyEWang, thanks for reporting this issue.

From the error logs you provided, this seems related to worker-events. I am not sure whether this issue still exists in the latest Kong release; I recommend upgrading to the latest Kong release.

mayocream avatar Mar 08 '22 03:03 mayocream

I'm afraid no upgrades are planned in the near future. From the source code, Kong 0.12.3 uses its own mlcache rather than the lua-resty-mlcache library, right? And the db_cache_ttl config should force all Kong nodes to refresh their cache even when there are no updates, right?
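For reference, my understanding of how the lua-resty-mlcache library (which newer Kong versions build on, not the 0.12 code) handles TTL-driven refresh, as a minimal sketch. It assumes an OpenResty setup with a lua_shared_dict named kong_db_cache (an illustrative name); the lookup callback re-fetches from the database once the cached value's TTL expires, and a ttl of 0 means values never expire, so only explicit invalidation events refresh them.

-- Minimal lua-resty-mlcache sketch (for illustration; not Kong 0.12 code).
-- Assumes `lua_shared_dict kong_db_cache 128m;` in the nginx config.
local mlcache = require "resty.mlcache"

local cache, err = mlcache.new("db_cache", "kong_db_cache", {
  lru_size = 1000, -- per-worker LRU layer in front of the shared dict
  ttl      = 3600, -- positive hits expire after an hour (like db_cache_ttl = 3600)
  neg_ttl  = 300,  -- cached misses expire sooner
})
if not cache then
  error("failed to create cache: " .. err)
end

-- Placeholder for the real database lookup; only called on a cache miss or
-- after the TTL has expired. With ttl = 0 the value never expires, and only
-- an explicit invalidation can refresh it.
local function load_plugin_from_db(name)
  return { name = name, enabled = true }
end

local plugin, get_err = cache:get("plugins:my-plugin", nil, load_plugin_from_db, "my-plugin")
ngx.say("plugin enabled: ", plugin and plugin.enabled)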

AndyEWang avatar Mar 08 '22 05:03 AndyEWang

@AndyEWang, luckily I just found your issue here. Our logs don't show your messages, but we are struggling with the same issue on version 2.7.0. This is what our update config looks like:

db_update_frequency = 30
db_update_propagation = 5
#db_cache_ttl = 0

I fully agree with you that the problem seems to be worker-specific rather than node-specific. In the last few days there was a request to disable a specific API. We did so using the request-termination plugin, but roughly 40% of requests were still able to reach the API. A restart/reload fixes it in any case. We have already had problems with a custom plugin as well.

Is there any possibility to log or retrieve some kind of per-worker configuration stamp/hash to prove whether a worker's config is really outdated?
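Something like the following is what I have in mind: a hypothetical sketch of a tiny "stamp" plugin (not an official Kong feature), where the plugin schema is assumed to define a stamp field that we bump manually on every config change.

-- Hypothetical "stamp" plugin handler (not an official Kong feature): logs
-- the nginx worker PID plus a stamp value taken from the plugin config.
-- The stamp is bumped manually whenever the configuration changes, so a
-- worker that keeps logging the old stamp is provably serving stale config.
local StampHandler = {
  PRIORITY = 10,
  VERSION  = "0.1.0",
}

function StampHandler:access(conf)
  ngx.log(ngx.NOTICE, "worker pid ", ngx.worker.pid(),
          " serving config stamp ", conf.stamp or "unset")
end

return StampHandler

Watching the error log while sending test requests would then show which worker PIDs still report the old stamp.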

Thanks Henning

23henne avatar Apr 25 '22 08:04 23henne

Additionally, I would be interested in your current worker_consistency setting.

23henne avatar Apr 25 '22 09:04 23henne

@23henne there is no worker_consistency config in Kong 0.12.3.

AndyEWang avatar Apr 26 '22 08:04 AndyEWang

We have a similar issue, though for us it involves a change to a service target URL. We switched the target URL to a new one and the old target server was shut down. The requests were sent to the new target successfully for a while, but then a new worker was started on our Kong pod, and this worker tried sending the requests to the old service target URL only (which fails).

This behavior was noticed this morning, and stopped after we updated a different route (not part of this service), which probably (?) triggered a cache refresh/update/...?

We are using Kong 2.8.1, with Postgres. db_update_frequency = 5, db_update_propagation = 0, db_cache_ttl = 0, worker_consistency = strict are all at their defaults for us.

So we need to understand:

  • Why does a newly started worker use old values for the service target?
  • Why does an update to an unrelated config item "fix" this worker?

Benjamin Bertow [email protected], Mercedes-Benz Tech Innovation GmbH. https://github.com/mercedes-benz/foss/blob/master/PROVIDER_INFORMATION.md

Benny-Git avatar Apr 29 '22 08:04 Benny-Git

Does anybody have any ideas? This is really harming our business.

23henne avatar May 10 '22 12:05 23henne

I have been able to "reliably" recreate the issue in our DEV environment:

  • Create a service pointing to http://httpbin.utility:80/anything
  • Kill one of the Kong DEV pods
  • On my laptop run watch -n 1 curl <route_url>
  • Requests go to upstream: http://192.168.99.186:80/anything with worker 1140
  • Edit the service to have path /get instead of /anything
  • Observe via logs the upstream changed to http://192.168.99.186:80/get with worker 1140
  • ps shows active workers 1140 and 1141
  • Remove one with kill 1140. New worker 1145 is created
  • Observe upstream is still http://192.168.99.186:80/get with worker 1141
  • Remove another one with kill 1141. New worker 1146 is created.
  • Observe upstream is now http://192.168.99.186:80/anything served by worker 1145 ⚡
  • After running curl -XDELETE localhost:8001/cache worker 1145 also serves http://192.168.99.186:80/get

Interestingly enough, repeating the steps above while switching the path from /get to /anything does not reproduce the issue; it only reproduced reliably when switching the path from /anything to /get.

Are there any debug trace settings that could help pinpoint the root cause?

Benny-Git avatar May 10 '22 12:05 Benny-Git

Kong 2.8.1, same issue with cached targets.

uvish avatar May 25 '22 09:05 uvish

Did anyone have a chance to look into this? Our production system has been hit by this issue more and more over the last few weeks; right now even newly created routes get "lost" on newly spawned nginx workers. I could see that on one Kong instance 50% of our requests succeeded and 50% got a "no Route matched with those values" error. Our instance was running with 2 workers, and when the issue started, one of the workers had just been killed and recreated.

It looks like newly spawned workers receive a stale configuration into their cache.

Is there any debug setting that could help us nail this issue down?

Benny-Git avatar Jun 30 '22 10:06 Benny-Git

Same here. Kong still points to the lost targets instead of the updated targets. In the logs, many "connection refused" and "connection timeout" errors are generated.

zffocussss avatar Jul 07 '22 06:07 zffocussss

Hello, there are several important points on this issue:

@AndyEWang:

Kong 0.12.3
I'm afraid that no upgrades are planned in the near future.

We don't have the bandwidth to support Kong versions from 2 major versions ago (almost 3 at this point). Many changes have happened since then. If you can't upgrade, there is very little we can do to help you.

We added a new custom plugin to an existing Kong API

Writing custom plugins is difficult, especially with older versions of Kong; I am not sure 0.12.3 even had a PDK yet. I would suggest that if things were working fine before introducing this new custom plugin, then the new plugin is the most likely culprit.

@23henne, @Benny-Git:

same issue but using 2.7.0 version
Kong 2.8.1

The versions of Kong you are using might present similar symptoms to @AndyEWang's, but all the underlying tech has changed. Also, you are not using a custom plugin like @AndyEWang is.

Tracking your situation in the same issue @AndyEWang created will only make things more confusing. I recommend that you open a different ticket and include as much information relevant to your issue as possible.

kikito avatar Jul 13 '22 08:07 kikito

Created new issue https://github.com/Kong/kong/issues/9090 for this.

Benny-Git avatar Jul 13 '22 09:07 Benny-Git

I created a similar issue, but about targets: #10020.

gung-rgb avatar Dec 28 '22 16:12 gung-rgb