ingress-nginx
Add least_connections load balancing algorithm.
This commit adds a new option to the load-balance setting, 'least_connections'. In this mode the controller keeps a count of connections to each upstream and routes incoming connections to the upstream with the fewest connections open. This is recommended for use with evenly resourced upstream servers when request processing times vary widely, for example when some requests require the upstream server to call a slow external service.
If all requests take a fairly similar time to process, or the upstreams serve at different speeds, then ewma or round_robin are likely more appropriate.
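For illustration, the selection step is conceptually equivalent to the simplified sketch below (plain Lua; the function name and the plain-table connection counter are invented for this example and are not the controller's real state):

```lua
-- Simplified, hypothetical least-connections pick. `endpoints` is a non-empty
-- list of endpoints and `open_connections` maps an endpoint to its current
-- open-connection count. Ties are broken at random.
local function least_connections_pick(endpoints, open_connections)
  local best, best_count = {}, math.huge
  for _, ep in ipairs(endpoints) do
    local count = open_connections[ep] or 0
    if count < best_count then
      best, best_count = { ep }, count
    elseif count == best_count then
      best[#best + 1] = ep
    end
  end
  return best[math.random(#best)]
end
```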
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] CVE Report (Scanner found CVE and adding report)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation only
How Has This Been Tested?
This PR includes Lua unit tests and end-to-end tests written using the Ginkgo framework. The framework has been extended somewhat to support end-to-end tests that require multiple upstreams with different configurations. All tests, including the existing suite, have been run and pass.
Checklist:
- [x] My change requires a change to the documentation.
- [x] I have updated the documentation accordingly.
- [x] I've read the CONTRIBUTION guide
- [x] I have added unit and/or e2e tests to cover my changes.
- [x] All new and existing tests passed.
- [x] Added Release Notes.
Does my pull request need a release note?
Added a new load-balancing algorithm, 'least_connections'. If the load-balance setting in the ConfigMap is set to 'least_connections', the controller will attempt to route incoming connections to one of the upstream endpoints with the fewest open connections.
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: alowde / name: Ashley Lowde (ce63bec742269c60fd03b0e6f0a0dcba0f7338bd)
@alowde: This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Welcome @alowde!
It looks like this is your first PR to kubernetes/ingress-nginx 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes/ingress-nginx has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
Hi @alowde. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: alowde / name: Ashley Lowde (e2b92f08d2f3e2851815298f7af0d8dff4817915)
Hey @adkafka, thanks very much for taking the time to review this. I've added commits addressing some of the issues and would be interested to discuss the rest further.
> Another worry I have with this implementation though, is we may end up choosing the same host over and over. I wonder if we could do least connections but by bucket, i.e. split the peers into N buckets (N=10?) and choose a random host within the bucket.
If there are multiple hosts with an equal lowest connection count, this implementation will select one at random. If one host is significantly faster than the others it will get repeated requests, but the slower hosts will still get new requests as they complete old ones.
Originally my thinking was that if one server is handling most of the requests while others are slow, then the fast server should receive the requests. Your comment prompted me though to think about cases where one server is rapidly responding to requests with an error response (because of misconfiguration, error condition, or similar) and others are taking more time but succeeding. In this case we would actually produce a much worse result unless and until a health check kicked in and removed the bad peer.
I can think of a few options for handling this scenario and I'd be interested to hear thoughts from anyone:
1. Ignore it, and document the issue.
2. Randomly select from the n servers with the lowest connection counts, where n is adjustable or adaptive (i.e. bucketing with only the lowest bucket considered). Partly mitigates the 'one bad peer' effect at the cost of less predictable behaviour.
3. Add an option for a minimum connection count, effectively only balancing on connection counts above x. Peers with fewer than x open requests would be considered equal, which would reduce the impact of a peer returning quickly with a bad response (a small sketch of this follows the list).
   - Doesn't actually resolve the issue, just dilutes it at the cost of extra complexity.
4. Add an option for a minimum request time, so that the open-request counter delays its decrement if a response returns too quickly.
   - Comes close to solving the problem, but requires substantial additional complexity and still requires the user to predict a good value for their workload.
5. Same as 4, but for 5xx responses only. Possibly a more user-friendly implementation, but it makes me uneasy to mix application-layer logic in here.
6. Go crazy: implement full health-checking of responses and adjust the weighting of peers based on (say) the percentage of successful requests. Significantly more complexity/risk, and almost certainly more than the ingress-nginx project maintainers are happy to take on.
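To make option 3 concrete (purely illustrative; `min_conns` stands in for the x above and is not a real setting):

```lua
-- Illustrative only: treat any endpoint with fewer than `min_conns` open
-- connections as equally loaded, so a peer that returns (possibly bad)
-- responses very quickly does not monopolise new requests.
local function clamped_count(open_connections, endpoint, min_conns)
  local count = open_connections[endpoint] or 0
  if count < min_conns then
    return 0
  end
  return count
end
```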
Out of these I think 1, 2, 3, and 5 are reasonable. I think 1 and 5 have the most predictable behaviour. 2 feels like it doesn't offer a complete enough solution to be worth the trade-off in predictability, but I'm certainly happy to discuss it further.
What do you think?
> Out of these I think 1, 2, 3, and 5 are reasonable. I think 1 and 5 have the most predictable behaviour. 2 feels like it doesn't offer a complete enough solution to be worth the trade-off in predictability, but I'm certainly happy to discuss it further.
> What do you think?
I think option #1 is best for the scope of this PR. It is the solution of least surprise. I am curious how other load balancers handle these failures though. I know NGINX and HAProxy both have a native least-connections LB. Maybe we can look into that a little bit?
Another idea (likely out of scope of this PR) is instead of choosing between the endpoints with the least connections, we place the endpoints in some data structure and choose an endpoint randomly such that the endpoints with fewer connections are more likely to be chosen. This seems like the best balance between converging towards even connection count and limiting the blast radius of a single bad backend. I'm not sure how to implement this yet though...
> I think option #1 is best for the scope of this PR. It is the solution of least surprise. I am curious how other load balancers handle these failures though. I know NGINX and HAProxy both have a native least-connections LB. Maybe we can look into that a little bit?
I've now looked at the documentation for both HAProxy and NGINX Plus and it's interesting to see that neither of them mentions this particular issue. I suspect the tighter integration of health checks and load balancing in both of them means it's not a major concern. NGINX Plus does have another load-balancing mode that combines least connections with a weighted moving average of response time, which would be less sensitive to this issue, but it still doesn't fully mitigate it.
> Another idea (likely out of scope of this PR) is instead of choosing between the endpoints with the least connections, we place the endpoints in some data structure and choose an endpoint randomly such that the endpoints with fewer connections are more likely to be chosen. This seems like the best balance between converging towards even connection count and limiting the blast radius of a single bad backend. I'm not sure how to implement this yet though...
Agree that it's not in scope for this PR, but I'm not against implementing a more interesting structure if we can show an advantage for it. If we identify performance issues with the current implementation at scale I would consider assessing peer connection counts asynchronously, and at that point it would be easy to update a second table with an appropriate set of peers to be chosen from randomly.
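For illustration only (nothing like this is in the PR; the names and the plain-table counter are invented), a simple version of that weighted pick could look like:

```lua
-- Illustrative weighted-random pick: weight each endpoint by
-- 1 / (open connections + 1) and sample proportionally, so lightly loaded
-- endpoints are more likely to be chosen but no single endpoint gets
-- every new request.
local function weighted_random_pick(endpoints, open_connections)
  local weights, total = {}, 0
  for i, ep in ipairs(endpoints) do
    weights[i] = 1 / ((open_connections[ep] or 0) + 1)
    total = total + weights[i]
  end
  local r = math.random() * total
  for i, ep in ipairs(endpoints) do
    r = r - weights[i]
    if r <= 0 then
      return ep
    end
  end
  return endpoints[#endpoints] -- guard against floating-point rounding
end
```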
Hey @alowde, I see this PR hasn't had any changes for a while now. What work remains? Anything I can do to help? I would love to use least_connections soon :)
Hey @adkafka, yeah this PR has gotten a little stale - I moved on from the company that was using this and haven't looked at it for a bit. I'd still love to get it fit for merging though, and hopefully I'm able to spend some more time on it now. Any help would be appreciated. I'll see if I can get it rebased this week, and then we'll need to review the existing tests and confirm everything passes. If I recall correctly, some of the test suite had started failing for unrelated features, and I didn't track down whether that was because of the code I introduced, upstream updates, or random fluctuations of the universe :). Feedback/critique would also be welcome on any other aspects of the PR.
Thanks! After you rebase this PR, feel free to ping me and I'll finish reviewing it.
Just a brief update - I've been actively working on this PR this week but rebasing is taking some work due to overlapping changes with other feature tests added since I started. Currently hoping to have an updated version ready to go within the next week or so, with new cleaner e2e tests.
@alowde Great to hear :)
I have been thinking a bit about the conversation thread we had about the downsides to this algorithm: https://github.com/kubernetes/ingress-nginx/pull/9025#issuecomment-1304728049. Specifically, the issue is that with a true "least connections" algorithm, all ingress-nginx pods will send traffic to the same pod if that pod just came up. This seems like a potential issue under at least two conditions:
- An existing deployment is scaled out horizontally, perhaps to handle more load. This will result in only the new pods receiving traffic, and potentially becoming overwhelmed. If we are scaling a deployment from 100 pods to 105, we will have 5 pods handling an incoming request rate that was previously handled by 100 pods (20x the request rate per pod).
- A pod gets in a bad state for some reason, and is not holding onto new connections. This will result in a complete outage, because all nginx pods send traffic to it.
I believe there is a good solution to this though. I learned recently about "Power of Two Choices" load balancing [1] [2]. It is a simpler algorithm, and strikes a good balance between round-robin / random and least connections. Instead of choosing the single backend with the least connections, we would choose two backends at random from the list and use the one with fewer open connections. The algorithm is simpler to implement than pure least connections because we don't need to sort through the entire list of backends; we just need to compare two.
I'm bringing this up because for the use case I have in mind, this "Power of Two Choices" + least-connections load balancing seems ideal. It strikes the right balance between spreading connections out over many pods and still distributing load somewhat evenly. By no means do we have to go this route in this PR, but I want to bring it up for discussion. I'm curious whether other use cases would also benefit from this approach, and what other folks think.
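To illustrate the selection step (hypothetical names and a plain-table counter, not tied to the existing balancer code), it could be as small as:

```lua
-- Illustrative "power of two choices" pick: sample two distinct endpoints at
-- random and keep the one with fewer open connections; no full sort needed.
local function p2c_pick(endpoints, open_connections)
  if #endpoints == 1 then return endpoints[1] end
  local i = math.random(#endpoints)
  local j = math.random(#endpoints - 1)
  if j >= i then j = j + 1 end -- make the second sample distinct from the first
  local a, b = endpoints[i], endpoints[j]
  if (open_connections[a] or 0) <= (open_connections[b] or 0) then
    return a
  end
  return b
end
```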
Why was this closed?
@reidmeyer to rebase this PR I ended up more or less dropping the branch and recreating it, as git rebase wasn't able to produce a clean set of changes. It looks like that caused GitHub to close the PR, but all good now :).
@tao12345666333 this PR is now ready for review, thank you!
@adkafka thanks for introducing "Power of Two Choices", that's very interesting. The existing behaviour of this algorithm is working fine for the intended use case, but it sounds like using that change could improve things further. Unless the maintainers have a different preference I'd like to leave this PR as is and then raise a new PR to either change the behaviour of the least_connections algorithm or (my preference) add a new algorithm option.
I need to read and digest the linked paper properly, but I'm wondering how random-two behaves under different distributions of requests. It seems like it could still be susceptible to the issue that originally prompted me to implement least_connections if a small proportion of the incoming requests take a substantially longer time to process. Happy to help test this though, or implement, or test another implementation :).
Chiming in with a use-case that would prefer a more strict least_connections approach: Long-running connections like websockets.
I think that could still benefit from some sort of ramp-up though.
> Chiming in with a use-case that would prefer a more strict least_connections approach: Long-running connections like websockets.
> I think that could still benefit from some sort of ramp-up though.
I have a similar use case: server sent events.
> I'd like to leave this PR as is and then raise a new PR to either change the behaviour of the least_connections algorithm or (my preference) add a new algorithm option
That sounds good to me! Perhaps, in a follow-up PR we (I'm happy to do so) can propose a new annotation on the ingress that toggles this behavior (choose two backends randomly, select the one with the fewest connections) on.
@adkafka,
> That sounds good to me! Perhaps, in a follow-up PR we (I'm happy to do so) can propose a new annotation on the ingress that toggles this behavior (choose two backends randomly, select the one with the fewest connections) on.
Sounds good. Please feel free to @ me if/when you do propose a new annotation and I'll be keen to help in any way I can. I think there's an opportunity here to write end-to-end tests that could exercise the load-balancing algorithms with a variety of different traffic patterns, and I'd be very interested to see the outcomes.
Several of the comment threads are marked "resolved", but the outstanding issue still remains.
I'm very sorry about this; in the rebase process it appears I was drawing on an old commit to recreate the PR. I've reviewed the conversation and re-resolved the issues raised. Thanks again for the review and your patience.
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
A few comments:
- This is not least-connections load balancing; it's load balancing based on the least number of in-flight requests per server. That's a decent approximation of connections, but I just wanted to point it out. You might want to rename the algorithm to better represent what it does.
- I'd like to see how this implementation behaves in a k8s cluster with rolling endpoints. How does that compare to EWMA?
- Similar to above, we need some benchmarking numbers.
- I'd suggest using the power-of-two-choices technique used in the existing ewma lb algorithm, but instead of using response time as the metric, just use the number of active/in-flight requests per server (roughly as sketched below). Moreover, I don't think you need to synchronize request stats across NGINX workers; you can keep them per worker. That's desired in ewma because the stats come from the endpoints in the response headers.
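Roughly (illustrative only; the names are made up and this isn't tied to the existing balancer code), the per-worker bookkeeping could be as simple as:

```lua
-- Per-worker in-flight request counters, with no cross-worker synchronization.
-- Increment when an endpoint is picked for a request, decrement when the
-- request finishes.
local in_flight = {} -- endpoint -> number of requests currently in flight

local function on_pick(endpoint)
  in_flight[endpoint] = (in_flight[endpoint] or 0) + 1
end

local function on_finish(endpoint)
  local n = (in_flight[endpoint] or 1) - 1
  in_flight[endpoint] = n > 0 and n or nil -- drop zeroed entries
end
```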
A couple things here:
- I agree with everything @ElvinEfendi says; I trust his judgement on Lua issues. I'd like to see what @tao12345666333 thinks as well.
- https://github.com/kubernetes/ingress-nginx/issues/9950
- https://github.com/kubernetes/ingress-nginx/issues/9949
/hold
Regarding what @ElvinEfendi pointed out about in-flight requests, I have a question: as mentioned, I hope to use least_conn for websockets; would those be considered in-flight requests here?
And regarding the algorithm, is there a reason you don't just mimic what already exists in nginx, on a high level? I'm not well-versed in nginx or ingress-nginx internals mind you :)
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: adkafka, alowde Once this PR has been reviewed and has the lgtm label, please assign puerco for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
> I'd like to see how this implementation behaves in a k8s cluster with rolling endpoints. How does that compare to EWMA?
@ElvinEfendi it doesn't seem to particularly like those at the moment. It creates the same "desyncing" situation I've posted about in a few threads.
/assign @tao12345666333
Sorry for the long delay.
I will be back on Tuesday
What is the status of this pull request? @tao12345666333