One or more workers in one of 14 Kong nodes still use the old cache after the Kong API is updated
Is there an existing issue for this?
- [X] I have searched the existing issues
Kong version (`$ kong version`)
Kong 0.12.3
Current Behavior
There are 14 Kong nodes in our API gateway cluster, which uses PostgreSQL as its datastore. Each Kong node runs 4 nginx workers. All TTL configs remain at their defaults:
```
#db_update_frequency = 5
#db_update_propagation = 0
#db_cache_ttl = 3600
```
We added a new custom plugin to an existing Kong API. While testing with some HTTP requests, we found that some requests routed by one Kong node didn't execute the new plugin. We then logged in to that Kong node and curled 127.0.0.1 with the Host header set, to force the requests through that Kong API on that specific node. Some requests executed the plugin and some didn't, so we suspect one or several nginx workers were still using the old cache. After 8 hours the issue still existed, but after reloading the nginx process everything worked well.
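For reference, pinning requests to a single node like this can be done with a small loop; `api.example.com` and `/echo` below are placeholders for the actual API host and path configured in Kong, not values from the report:

```shell
# Run on the affected Kong node. The Host header forces Kong to match
# the API under test even though we hit the node directly.
PROXY=http://127.0.0.1:8000

# Send 20 requests so they are spread across the 4 nginx workers;
# a stale worker shows up as inconsistent behavior between requests.
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -H "Host: api.example.com" "$PROXY/echo"
done
```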
Expected Behavior
The newly added plugin should be propagated to all kong nodes and all workers.
Steps To Reproduce
The issue happens rarely.
Anything else?
We found the following logs after the plugin was added to the existing Kong API (the same line repeats for event ids 314 through 330):

```
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 314, context: ngx.timer
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 315, context: ngx.timer
...
2022/03/07 03:02:40 [error] 30#30: *7171374869 [lua] events.lua:103: errlog(): worker-events: dropping event; waiting for event data timed out, id: 330, context: ngx.timer
```
Hi @AndyEWang, thanks for reporting this issue.
From the error logs you provided, this seems related to worker-events.
I am not sure whether this issue still exists on the latest Kong release, so I recommend upgrading to it.
I'm afraid that no upgrades are planned in the near future. From the source code, Kong 0.12.3 uses its own mlcache implementation rather than the lua-resty-mlcache library, right? And the db_cache_ttl config should force all Kong nodes to refresh their cache even when there are no updates, right?
@AndyEWang, luckily I just found your issue here. Our logs don't show your messages, but we are struggling with the same issue on version 2.7.0. This is what our update config looks like:
```
db_update_frequency = 30
db_update_propagation = 5
#db_cache_ttl = 0
```
I fully agree with you: the problem doesn't seem to be node-specific but worker-specific. A few days ago there was a request to disable a specific API. We did so using the request-termination plugin, but roughly 40% of requests were still able to reach the API. A restart/reload fixes it in every case. We have already had problems with a custom plugin as well.
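For context, disabling an API via request-termination looks roughly like this; a sketch assuming the Admin API on localhost:8001 and a hypothetical route name `my-route`:

```shell
ADMIN=http://localhost:8001

# Attach request-termination to the route so all traffic on it
# gets a 503 instead of reaching the upstream
curl -s -X POST "$ADMIN/routes/my-route/plugins" \
  --data name=request-termination \
  --data config.status_code=503 \
  --data config.message="API disabled for maintenance"
```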
Is there any possibility to log or retrieve some kind of per-worker configuration stamp/hash to prove that a worker's config is really outdated?
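As far as I know, traditional Postgres mode does not expose a per-worker configuration hash (DB-less deployments report a `configuration_hash` via `GET /status`, but that does not apply here). One low-tech workaround, a sketch using Kong's nginx directive injection rather than an official feature, is to add the nginx worker PID to the access log so stale responses can be correlated with a specific worker:

```
# kong.conf -- inject a custom access-log format that records the
# nginx worker PID ($pid) next to each proxied request
nginx_http_log_format = worker_trace '$pid $remote_addr "$request" $status'
proxy_access_log = logs/access.log worker_trace
```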
Thanks Henning
Additionally, I was interested in your current worker_consistency setting.
@23henne, there is no worker_consistency config in Kong 0.12.3.
We have a similar issue, though for us it is a change in a service target URL. We switched the target URL to a new one and the old target server was shut down. Requests were sent to the new target successfully for a while, but then a new worker was started on our Kong pod, and this worker only tried sending requests to the old service target URL (which fails).
This behavior was noticed this morning, and stopped after we updated a different route (not part of this service), which probably (?) triggered a cache refresh/update/...?
We are using Kong 2.8.1 with Postgres. The following are all at their defaults for us:

```
db_update_frequency = 5
db_update_propagation = 0
db_cache_ttl = 0
worker_consistency = strict
```
So we need to understand:
- Why does a newly started worker use old values for the service target?
- Why does an update to an unrelated config item "fix" this worker?
Benjamin Bertow [email protected], Mercedes-Benz Tech Innovation GmbH. https://github.com/mercedes-benz/foss/blob/master/PROVIDER_INFORMATION.md
Anybody any ideas? This is really harming our business.
I have been able to "reliably" recreate the issue in our DEV environment:

- Create a service pointing to `http://httpbin.utility:80/anything`
- Kill one of the Kong DEV pods
- On my laptop run `watch -n 1 curl <route_url>`
- Requests go to upstream `http://192.168.99.186:80/anything` via worker 1140
- Edit the service to have path `/get` instead of `/anything`
- Observe via logs that the upstream changed to `http://192.168.99.186:80/get`, still via worker 1140
- `ps` shows active workers 1140 and 1141; remove one with `kill 1140`. A new worker 1145 is created
- Observe the upstream is still `http://192.168.99.186:80/get`, served by worker 1141
- Remove the other one with `kill 1141`. A new worker 1146 is created
- Observe the upstream is now `http://192.168.99.186:80/anything`, served by worker 1145 ⚡
- After running `curl -XDELETE localhost:8001/cache`, worker 1145 also serves `http://192.168.99.186:80/get`

Interestingly enough, repeating the steps above while switching the path from `/get` to `/anything` does not result in the issue; it only worked reliably when switching the path from `/anything` to `/get`.
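The steps above can be sketched against the Admin API roughly as follows. This is a sketch under assumptions: Admin API on localhost:8001, proxy on localhost:8000, and hypothetical object names `demo-svc`/`demo-route`; the httpbin URL is taken from the report:

```shell
ADMIN=http://localhost:8001

# 1) Create a service pointing at /anything, plus a route for it
curl -s -X POST "$ADMIN/services" \
  --data name=demo-svc \
  --data url=http://httpbin.utility:80/anything
curl -s -X POST "$ADMIN/services/demo-svc/routes" \
  --data name=demo-route \
  --data 'paths[]=/demo'

# 2) Keep traffic flowing in another terminal:
#      watch -n 1 curl -s http://localhost:8000/demo

# 3) Flip the service path from /anything to /get
curl -s -X PATCH "$ADMIN/services/demo-svc" --data path=/get

# 4) On the Kong node, kill the nginx workers one at a time and watch
#    which upstream path the replacement workers log:
#      kill <worker_pid>; sleep 5

# 5) Flushing the node-local cache brings stale workers back in sync
curl -s -X DELETE "$ADMIN/cache"
```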
Are there any debug trace settings that could help pinpoint the root cause?
Kong 2.8.1, same issue with cached targets.
Did anyone have a chance to look into this? Our production system has been hit by this issue more and more over the last few weeks; right now even newly created routes get "lost" on newly spawned nginx workers. I could see that on one Kong instance 50% of our requests succeeded and 50% got a "no Route matched with those values" error. Our instance was running with 2 workers, and when the issue started, one of the workers had just been killed and recreated.
It looks like newly spawned workers receive a stale configuration into their cache.
Is there any debug setting that could help us nail this issue down?
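One option for log-based debugging (assuming a standard kong.conf-based install) is to raise Kong's log level to debug, which makes cache and worker-events activity visible in the error log; the log path below is the typical default prefix location and may differ in your install:

```shell
# Raise log verbosity; "debug" logs cache invalidations and
# worker-events traffic to the error log.
export KONG_LOG_LEVEL=debug
kong reload

# Then watch for cache/invalidation chatter, e.g.:
#   grep -E 'invalidat|worker-events' /usr/local/kong/logs/error.log
```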
Same here. Kong still points to the old targets instead of the updated ones; the logs show many "connection refused" and "connection timeout" errors.
Hello, there are several important points on this issue:
@AndyEWang:
> Kong 0.12.3
>
> I'm afraid that no upgrades are planned in the near future.
We don't have the bandwidth to support Kong versions from two major versions ago (almost three at this point). Many changes have happened since then. If you can't upgrade, there's very little we can do to help you.
> We added a new custom plugin to an existing Kong API
Writing custom plugins is difficult, especially with older versions of Kong; I am not sure 0.12.3 even had a PDK yet. I would suggest that if things were working fine before introducing this new custom plugin, then the new plugin is the main culprit.
@23henne, @Benny-Git:
> same issue but using 2.7.0 version
>
> Kong 2.8.1
The versions of Kong you are using might present symptoms similar to @AndyEWang's, but all the underlying tech has changed. Also, you are not using a custom plugin as @AndyEWang is.
Tracking your situation in the same issue @AndyEWang created will only make things more confusing. I recommend you open a different ticket with as much information relevant to your issue as possible.
Created new issue https://github.com/Kong/kong/issues/9090 for this.
I created a similar issue, but about targets: #10020.