Services unavailable after route updates in db-less mode
Is there an existing issue for this?
- [X] I have searched the existing issues
Kong version ($ kong version)
2.7.0
Current Behavior
When running Kong in DB-less mode with the ingress controller, updating ingress routes can cause all services to become unavailable for several minutes. Given the event-driven nature of Kubernetes, updates to ingress rules can happen in rapid succession and in close proximity. When this happens, the configuration sync + route update process between the ingress controller and the Kong proxy seems to get into a very bad state.
Every Service behind Kong returns a 503 "failure to get a peer from the ring-balancer"
for 3-5 minutes. It's particularly bad with larger Kubernetes deployments (100+ pods). In some situations it never completely recovers: some registered routes keep pointing to IPs that no longer exist and ~10% of requests result in a 503. The only way to recover at that point is to restart all instances of Kong.
- worker_consistency = strict
- worker_state_update_frequency = 5
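For context, these settings are applied to the proxy container through Kong's environment-variable convention (KONG_ plus the uppercased setting name); a rough sketch of how such a Deployment might look, with placeholder names and image tag (not our exact manifest):

```yaml
# Rough sketch of the proxy Deployment (name, labels and image tag are placeholders);
# only the three env vars are the point here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-kong
spec:
  selector:
    matchLabels:
      app: ingress-kong
  template:
    metadata:
      labels:
        app: ingress-kong
    spec:
      containers:
      - name: proxy
        image: kong:2.7.0-alpine
        env:
        - name: KONG_DATABASE
          value: "off"                            # DB-less mode
        - name: KONG_WORKER_CONSISTENCY
          value: "strict"                         # rebuild router/balancer state synchronously
        - name: KONG_WORKER_STATE_UPDATE_FREQUENCY
          value: "5"                              # seconds between worker state update checks
```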
Expected Behavior
When the configuration is rebuilt, existing routes to backing services should continue to work until they can safely be removed from the router. Bringing the entire system down for minutes at a time every time we touch the configuration isn't a viable production solution.
Steps To Reproduce
Here is a rough output of the ingress that is used to configure Kong in DB-less mode using the ingress controller.
In this particular case, we added everything under logs.domain-two.com; everything else had previously existed. logs.domain-two.com is a copy of logs.domain-one.com.
Note this is a rather large deployment with 300 pods/IPs behind it.
Name: ingress-kong
Namespace: default
Default backend: default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
Rules:
Host Path Backends
---- ---- --------
logs.domain-one.com
/logs service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/supertenant service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/webhooks service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/akamai service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/aptible service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/p/ping service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/healthcheck service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
logs.domain-two.com
/logs service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/supertenant service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/webhooks service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/akamai service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/aptible service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/ping service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/p/ping service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
/healthcheck service-one:80 (XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZZ:7080,XX.XX.YYY.ZZ:7080 + 297 more...)
heroku.domain-one.com
/heroku/logplex service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
/heroku-addon/logplex service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
heroku.domain-two.com
/heroku/logplex service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
/heroku-addon/logplex service-two:80 (XX.XX.TTT.FFF:7080,XX.XX.666.444:7080,XX.XX.222.333:7080 + 57 more...)
api.domain-one.com
/v1 service-three:80 (XX.XX.AAA.BBB:7080,XX.XX.ZZZ.66:7080,XX.XX.241.175:7080 + 7 more...)
/v2/export service-three:80 (XX.XX.AAA.BBB:7080,XX.XX.ZZZ.66:7080,XX.XX.241.175:7080 + 7 more...)
api.domain-two.com
/v1 service-three:80 (XX.XX.DDD.EEE:7080,XX.XX.ZZZ.AA:7080,XX.XX.GGG.HHH:7080 + 7 more...)
/v2/export service-three:80 (XX.XX.DDD.EEE:7080,XX.XX.ZZZ.AA:7080,XX.XX.GGG.HHH:7080 + 7 more...)
/p/ping service-three:80 (XX.XX.DDD.EEE:7080,XX.XX.ZZZ.AA:7080,XX.XX.GGG.HHH:7080 + 7 more...)
api2.domain-one.com
/v1 service-four:80 (XX.AA.BBB.CCC:7080)
/v2/export service-four:80 (XX.AA.BBB.CCC:7080)
api2.domain-two.com
/v1 service-four:80 (XX.AA.BBB.CCC:7080)
/v2/export service-four:80 (XX.AA.BBB.CCC:7080)
/p/ping service-four:80 (XX.AA.BBB.CCC:7080)
app.domain-one.com
/ service-five:80 (XX.XX.000.111:7080,XX.XX.222.333:7080,XX.XX.444.555:7080 + 7 more...)
app.domain-two.com
/ service-five:80 (XX.XX.000.111:7080,XX.XX.222.333:7080,XX.XX.444.555:7080 + 7 more...)
app2.domain-one.com
/ service-six:80 (XX.XX.AAA.BBB:7080)
app2.domain-two.com
/ service-four:80 (XX.XX.AAA.BBB:7080)
tail.domain-one.com
/ service-seven:80 (XX.XX.YYY.TTT:7080,XX.XX.130.37:7080,XX.XX.ZZZ.RRR:7080 + 37 more...)
tail.domain-two.com
/ service-seven:80 (XX.XX.YYY.TTT:7080,XX.XX.130.37:7080,XX.XX.ZZZ.RRR:7080 + 37 more...)
cc: @rainest
I think it may be partly due to the health checker. When running in Kubernetes the target IPs can change, sometimes frequently. If Kong is trying to keep up with pod IPs as they roll, rather than with the service name, that could produce the flaky behavior we are seeing.
I have a Kong deployment as well, with around 50 pods, and we are experiencing the same issue, or at least a very similar one.
Kong version ($ kong version)
2.8.0-b4d44dac8
I'm using a docker image with the following tag: kong:2.8.0-b4d44dac8-alpine
Current Behavior
Kong is running in DB-less mode. Sometimes (not always), when the ingress controller updates the routes / syncs a new configuration into Kong, some services (not all) become unavailable, generally until the next sync.
I've been able to correlate the proxy returning 503 "failure to get a peer from the ring-balancer" with the level=info msg="successfully synced configuration to kong." log messages.
For example, I have a first spike of 503s at 17:07 (UTC+1) (the Prometheus scraping interval is not precise enough)
And the logs show the following:
The 503 spikes stop around 17:16:30 (UTC+1)
And the logs show:
Steps To Reproduce
With my configuration it is not easy to reproduce; it happens sometimes. CPU/memory usage can't be correlated to the issue itself.
@fred-cardoso Good to know I'm not alone. This problem has been hard to replicate and has generally fixed itself, so we brushed it off as not critical. As we have moved more and larger services behind Kong, it has gotten worse and taken longer to fix.
In the most recent case, it never recovered even after 2 hours; we had to manually restart all of the Kong pods. Wondering if the sync + reconcile process is more accurate and stable in Hybrid mode?
For us it's getting critical since it affects frontends and those are clearly noticed by the users.
Wondering if the sync + reconcile process is more accurate and stable in Hybrid mode?
Unfortunately, in our setup it's not "easy" to change the deployment and deploy the DB, but maybe you are right. Definitely something worth testing, even though I think DB-less should work properly 😛
I agree, this was more a question for the Kong folks. Just looking for something that might help mitigate the problem.
Hello @esatterwhite,
thank you for reporting this issue. Are you seeing these periods of instability with all reconfigurations of the Kong gateway, or just with some of them, like @fred-cardoso?
Thank you! Hans
As an additional bit of information: we've found some situations in which multiple reconfiguration requests issued to Kong in short intervals could get it into an unstable state with overly long response times. A fix for this issue is being tested, but it is not yet certain when it will be released.
thank you for reporting this issue. Are you seeing these periods of instability with all reconfigurations of the Kong gateway, or just with some of them, like @fred-cardoso?
It is hard to say. We certainly notice the problem much more when we change ingress rules, as that doesn't cause Kong to restart. We run DB-less on Kubernetes, and changing anything about the deployment configuration, env vars, etc. causes the Kong pods to restart.
We don't use a lot of plugins as of yet.
Not sure if this is helpful as well, but I also noticed that the pods don't even get the requests from Kong. What I do see is a drop in requests, but the pods themselves don't fail, so it's really Kong not being able to connect to them.
As an additional bit of information: we've found some situations in which multiple reconfiguration requests issued to Kong in short intervals could get it into an unstable state with overly long response times.
Yes, I'm pretty sure this is what caused the lingering problem. Most of the 503s go away as the upstream targets are rebuilt, but there was a period of about 2 hours where about 10% of requests would 503. The only way to fix it was to restart the Kong instances.
In particular, when running on Kubernetes, scaling a deployment up/down or restarting one causes dozens of upstream target rebuilds, as all of the pod IPs change.
For a large deployment - 300 pods, restarting 10% of the pods at a time - as I understand it, that is 30 router rebuilds/reconfigures in a very short period of time. It honestly has everyone rather scared of running Kong in production at the moment. For a number of apps we've had to turn on the upstream service target option in the ingress controller so Kong isn't tracking IP addresses, but this means we lose the load-balancing behaviors of Kong and put more pressure on kube-proxy.
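For anyone following along, the "upstream service target" option above refers to what I believe is the KIC service-upstream annotation on the Kubernetes Service; a minimal sketch with a hypothetical Service name, so Kong proxies to the cluster IP instead of the individual pod IPs:

```yaml
# Hypothetical Service manifest: the annotation tells the Kong ingress controller
# to use the Service's cluster IP as the single upstream target instead of
# enumerating every pod IP (load balancing then falls back to kube-proxy).
apiVersion: v1
kind: Service
metadata:
  name: service-one                 # placeholder name
  annotations:
    ingress.kubernetes.io/service-upstream: "true"
spec:
  selector:
    app: service-one
  ports:
  - port: 80
    targetPort: 7080
```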
It happened again, where the synchronization was left in a bad state. We didn't change the configuration of anything directly, but some pods in our infrastructure that are associated with a Kong ingress restarted.
hey -n 1000 https://xxxx.com
Summary:
Total: 3.7963 secs
Slowest: 0.4579 secs
Fastest: 0.0348 secs
Average: 0.1398 secs
Requests/sec: 263.4139
Total data: 17110 bytes
Size/request: 17 bytes
Response time histogram:
0.035 [1] |
0.077 [266] |■■■■■■■■■■■■■■■■■■■■■■■■■
0.119 [9] |■
0.162 [434] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.204 [150] |■■■■■■■■■■■■■■
0.246 [59] |■■■■■
0.289 [25] |■■
0.331 [22] |■■
0.373 [29] |■■■
0.416 [4] |
0.458 [1] |
Latency distribution:
10% in 0.0411 secs
25% in 0.0479 secs
50% in 0.1413 secs
75% in 0.1726 secs
90% in 0.2341 secs
95% in 0.3038 secs
99% in 0.3591 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0089 secs, 0.0348 secs, 0.4579 secs
DNS-lookup: 0.0038 secs, 0.0000 secs, 0.0764 secs
req write: 0.0000 secs, 0.0000 secs, 0.0036 secs
resp wait: 0.0533 secs, 0.0348 secs, 0.4578 secs
resp read: 0.0002 secs, 0.0000 secs, 0.0067 secs
Status code distribution:
[200] 705 responses
[503] 295 responses
Restarting it was the only fix.
@hanshuebner @locao this was a community report that looked similar to the issue we were working on in EE PRs 3344 and 3212. Would it be possible to make an OSS image that includes those also?
#3207 sounds like the actual fix? The commit makes it sound like the fix is mainly to remove the error log. Does it prevent requests from being sent to the missing upstream?
The issue that we've recently fixed caused Kong to stall when a new reconfiguration cycle was started while another one was active.
https://github.com/Kong/kong/commit/95d704ee6095648e97c39a5d86ede8ec2b7f208b https://github.com/Kong/kong/commit/200f56ee2e011e5c60420909a84d99ebf08f3059 https://github.com/Kong/kong/commit/ef58cdd37129c0b2ca7ec60e0fe19fbf581ef8ee https://github.com/Kong/kong/commit/de37b510034108dc347bf62fa8adae9ca0748d2e
Oh, those commits are really new. We are going to rebuild the image with @fred-cardoso, test it, and see if it gets better.
@hanshuebner Thanks, this looks like it may be helpful, but I'm not sure the size of the configuration is entirely the problem. The problem seems to persist for long periods of time. After one or more reconfigurations happen, there seem to be invalid upstream targets lingering, which Kong keeps sending requests to even though the IPs do not exist anymore.
I would think the health checker would eventually remove those from the balancer, or at least stop sending requests to them.
This is an interesting point. Do you have health checks enabled, @esatterwhite? I don't have those. The health checks are only configured on the pods themselves; Kong is not doing them.
This is an interesting point. Do you have health checks enabled, @esatterwhite? I don't have those. The health checks are only configured on the pods themselves; Kong is not doing them.
That's a good point. The health-check configuration on the upstream exists, but the check interval is 0 by default, so I suppose you'd have to manually configure one.
If this problem continues, it sounds like we'd have to do that.
Although it feels unnecessary. Between Kubernetes and Kong, this feels like a behavior that shouldn't happen, let alone need to be configured explicitly.
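For anyone hitting the same thing, a rough sketch of what I believe configuring that would look like through the ingress controller; I haven't validated the exact field names against our KIC version, the resource name and health-check path are illustrative, and the healthchecks block is meant to mirror Kong's Upstream entity:

```yaml
# Sketch only: a KongIngress enabling active + passive health checks on the upstream.
# Attach it to the Kubernetes Service with the annotation
#   konghq.com/override: service-one-healthchecks
apiVersion: configuration.konghq.com/v1
kind: KongIngress
metadata:
  name: service-one-healthchecks       # illustrative name
upstream:
  healthchecks:
    active:
      type: http
      http_path: /healthcheck          # assumes the pods expose this path
      healthy:
        interval: 5                    # the interval that defaults to 0 (disabled)
        successes: 1
      unhealthy:
        interval: 5
        http_failures: 2
    passive:
      unhealthy:
        http_failures: 3
        timeouts: 3
```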
We also see this a lot when things restart:
ingress-kong-7fbc678578-4glkz kong-proxy 2022/06/22 15:30:03 [error] 1107#0: *6829049 [lua] balancers.lua:228: get_balancer(): balancer not found for test.default.80.svc, will create it, client: X.YYY.ZZ.11, server: kong, request: "GET /account/signin HTTP/1.1", host: "app.test.com"
The balancer loses track of everything for a while. Things are unusable while this is happening.
The issue that we've recently fixed caused Kong to stall when a new reconfiguration cycle was started while another one was active.
@hanshuebner Is there an image/tag we can pull down to try?
@hanshuebner Is there an image/tag we can pull down to try?
These commits are on the master branch, but they're not yet part of a release. If you want to try them, you'll have to build Kong yourself. Kong 2.8 is planned to be released soon, but I'm not able to give you an exact date.
@hanshuebner Is there an image/tag we can pull down to try?
With @fred-cardoso, we are currently testing kong/kong:2.8.0-d648489b6-alpine, which seems to be working great, but there are still some 50Xs. In the next few days we are going to decide whether to keep running the nightly build or revert back to the stable one.
If it is still reporting 503s what was the improvement?
If it is still reporting 503s what was the improvement?
It's reporting way fewer 502s, and I'm not sure those are related to Kong; we need to understand why they are happening. We have been at almost 0 rps of errors for the last day, so I would say it is an improvement. We need to give it 1-2 more days to be sure, since the bug was happening randomly.
For us it is pretty reproducible at scale. Restarting the Kubernetes deployment with 300 pods triggers many reconfigurations and IP changes in close proximity. Several, if not all, of the services registered to the Kong ingress become unavailable. Sometimes that lingers for several hours.
We know it's Kong because customers report getting the "failure to get a peer from the ring-balancer" error, and the logs indicate Kong sent a request to an IP that isn't there anymore (can't connect). All the pods are up and responsive.
We are going to come back to you in a few days with our conclusions. We also know that it's coming from Kong, for the same reasons as yours.
Hey folks,
Even though the behaviors are indeed similar, I'm afraid @esatterwhite and @fred-cardoso have different issues.
About Kong 2.7:
- "balancer not found for <upstream>, will create it" is mislabeled as an error message. It's informative and you can safely ignore it. It means that the load-balancers are still being created and this particular one has not been touched yet, but as it is needed, it will be created before its turn.
- "I think it may be partly due to the health checker. When running in Kubernetes the target IPs can change, sometimes frequently. If Kong is trying to keep up with pod IPs as they roll, rather than with the service name, that could produce the flaky behavior we are seeing." Do you mean the Gateway health-checker or the KIC health-checker? The Gateway health-checker does not resolve DNS records, so it doesn't matter whether it's active or not. The DNS records are resolved by the load-balancer; if there's a problem there, proxying will fail either way.
- Do you see any error-level log messages?
- I would recommend you try Kong 2.8.1, with health-checks enabled (see below). There are several improvements that may be related to the issues you are seeing, e.g. #8344, #8204 and #8634.
About Kong 2.8:
- Kong 2.8 has a big change on the health-checker side. Now the targets' health status is kept between config reloads, so if a target is unhealthy, making a change to the configuration doesn't make Kong start proxying to that target again. But with that we may have introduced an issue (not yet confirmed):
- If the health-checks are not being used by any target, they are not attached to the upstream.
- When they are attached, at load-balancer creation time, they ask the load-balancer to resolve the DNS records for its targets.
- If they are not attached, that doesn't happen immediately. So, if the DNS server takes a bit longer to resolve the addresses, the IPs may not yet be resolved when the balancer becomes available, hence the 503s.
@fred-cardoso should be able to get rid of the 503s by enabling health-checks on the upstream entities. Please note that this is still only a possibility; we were not able to reproduce the behavior locally. Here are the docs on enabling health-checks. Passive health-checking would be enough.
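Not from the docs above, just to illustrate what passive-only health-checking might look like on the Upstream entity itself; a minimal declarative-config sketch with an illustrative upstream name (with KIC, the same healthchecks block would be set through a KongIngress upstream section):

```yaml
# kong.yml sketch (DB-less declarative config), passive checks only:
# targets are marked unhealthy based on the traffic Kong is already proxying,
# with no extra probe requests.
_format_version: "2.1"
upstreams:
- name: service-one.default.80.svc     # illustrative upstream name
  healthchecks:
    passive:
      healthy:
        successes: 1                   # successful proxied responses to mark healthy
      unhealthy:
        http_failures: 3               # 5xx responses before marking unhealthy
        tcp_failures: 3                # connection failures before marking unhealthy
        timeouts: 3
```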
Hi, as promised, here I am with our conclusions from our tests with @fred-cardoso:
Looks like kong/kong:2.8.0-d648489b6-alpine fixed our issue. We don't have the "failure to get a peer from the ring-balancer" message anymore, which is good news :+1:!
Also, @locao, we aren't using Kong health-checks, only the k8s ones. If the 503s start again, we'll try to enable them.
For us, as far as we are aware, the issue is fixed, but we'll be monitoring the situation and this issue in case something new comes up.