Remove backend from external backends if same backend name
What this PR does / why we need it:
Remove a backend from the external backends table if a new backend has the same name; this prevents the old cached external backend from being used when the backend's type changes.
When a Service object is changed from type ExternalName to ClusterIP, the corresponding entry in backends_with_external_name in the Lua balancer is never removed, so the external backend keeps serving traffic. This PR removes the backend from backends_with_external_name when a non-external backend with the same name is synced.
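In balancer.lua terms, the idea is roughly the following. This is a minimal sketch, not the exact diff: the placement and surrounding code are approximate, and the helper names (`sync_backend`, `is_backend_with_external_name`, `resolve_external_names`, `backends_with_external_name`) are the ones the Lua balancer uses as I understand it.

```lua
-- approximate sketch, not the exact diff: when a backend that is NOT of
-- type ExternalName is synced, clear any stale entry cached under the
-- same name in backends_with_external_name
local function sync_backend(backend)
  if not backend.endpoints or #backend.endpoints == 0 then
    balancers[backend.name] = nil
    return
  end

  if is_backend_with_external_name(backend) then
    backend = resolve_external_names(backend)
  elseif backends_with_external_name[backend.name] then
    -- the Service was changed from ExternalName to e.g. ClusterIP;
    -- without this, the old external backend keeps serving traffic
    backends_with_external_name[backend.name] = nil
  end

  -- ... rest of sync_backend unchanged ...
end
```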
Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation only
Which issue/s this PR fixes
fixes #8440
How Has This Been Tested?
- Updated balancer.lua with the fix
- Ran `make dev-env`
- Created a Deployment serving a static page using the `nginx:latest` image
- Created a Service object of type `ExternalName` pointing to an external website; traffic is routed appropriately
- Changed the Service object to type `ClusterIP` to route traffic to the Deployment pod; traffic is routed appropriately (prior to this PR, traffic was still routed to the external backend)
Checklist:
- [ ] My change requires a change to the documentation.
- [ ] I have updated the documentation accordingly.
- [x] I've read the CONTRIBUTION guide
- [ ] I have added tests to cover my changes.
- [ ] All new and existing tests passed.
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: freddyesteban / name: Freddy Esteban Perez (a826cea64c0612785b89f3a39f5b671334925fbd, fb2b55bd9dcd547bdff1036ccc959cfc8b73e66c)
@freddyesteban: This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @freddyesteban. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
- I think there is too much assumption here.
- It would be nice to see how a user ends up in a situation where a backend is left lingering.
- I think there was an issue raised about a similar situation, but there too, a detailed description and step-by-step instructions to reproduce the lingering-backend problem were not provided.
- It may well be true that there is a lingering backend that needs to be removed manually, but this comment is to get clarity on why someone would create an Ingress with an ExternalName type Service as the backend in the first place, and then edit it, instead of deleting the Ingress and creating a new one.
@longwuyuan thank you for taking the time to look at our PR.
Our use case for changing the Service type in flight is to spin down all pods when they are not in use (scale to zero) and point to an external service that signals to the user that the pods are spun down (a "please wait" page) until they wake up. Our automation "wakes up" the pods by scaling the Deployment replicas back to 1, and the Service object is switched back to ClusterIP. This behavior worked on an older version of the controller, although the nginx controller would trigger a reload for the change. This version of the controller attempts to perform the change dynamically using the Lua balancer, which is great because we can hopefully avoid the reload.
You've asked why we don't just delete the Ingress object. We could do that, but we found that updating only the Service type avoids a reload. Our clusters are relatively big, so this is very advantageous for our use case.
To replicate the issue, I've put together a step-by-step guide here.
To see the difference with our change, follow the step-by-step guide here.
Why is removing the cached external backend with the same name important to us? Because changing the Service type to ClusterIP is then a dynamic reconfiguration without a reload, as shown in the logs I provided here.
We're aware that we could delete the Ingress object, or even update it in place, and that would work, but it causes a reload. That might be acceptable, since the old ingress controller behaved this way, but we'd like to take advantage of the dynamic reconfiguration our fix provides.
Here is a step-by-step guide for the approach of deleting the Ingress object and recreating it; the logs show a backend reload, see here.
Here is a step-by-step guide for the approach of updating the Ingress object to use a separate Service object; the logs show a backend reload, see here.
Hi @freddyesteban, thanks for the detailed explanation and the reproduction procedure. It helps. My first request is that you create a new issue and put all the details you explained here in that issue.
- Please include all the data related to reproducing the problem in that issue.
- Link that issue here with the string `fixes <pound/hash symbol> <issuenumber>`
My next request is: please write tests. I think there should be some assurance that just checking for a pre-existing stale backend in Lua does not interfere with any other code path. Please write tests that you think will provide this assurance. I didn't expect any panic from an if-condition check, but a test should confirm that for users who don't create an ExternalName type Service and edit it later, there is not going to be any impact. I don't even know whether an ExternalName type Service that points not to the internet but to a custom destination makes a difference or not. Basically, please write all the tests that will provide the needed assurance.
/kind feature
@freddyesteban on a very different note, if you have already tried https://kubernetes.github.io/ingress-nginx/examples/customization/custom-errors/#custom-errors, please write a note on why custom-errors are not a preferred solution for the use case you described. It would be such a clean and supported solution to serve the "please wait" page from a custom backend.
Please sign CLA
/assign
Please sign CLA
@tao12345666333 We were under the impression that we, as a company, had already done that, but I think the project moved to EasyCLA and that's no longer the case. My manager has filed a ticket to get that fixed. Thank you.
@longwuyuan
Thank you. I created the issue and linked it to this PR. Regarding testing: the change only removes a backend from backends_with_external_name after sync_backend processes a non-external backend with the same name, so removing it from the externals table does not affect a user who has no external backends. The lookup of the backend name in the table is safe and would not panic. Changes between external destinations (e.g. an external backend reconfigured to point to a custom destination) are not affected either, because this code path is only reached when the user changes the Service type to a non-external type. I attempted to add tests before, but in order to test the backends_with_external_name table I'd have to break encapsulation, because the function updating the table and the table itself are not exported. I'm working on it anyway at the moment. Thoughts on exporting sync_backends and backends_with_external_name for testing purposes?
I'm new to Lua, so apologies if there's a better way to approach the testing. If you have any suggestions or could point me in the right direction, I'd appreciate it.
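To make the question concrete, the shape of the test I have in mind is roughly the following busted-style sketch. It assumes hypothetical test-only exports of sync_backend and backends_with_external_name (which is exactly what I'm asking about) and omits the ngx/configuration mock setup that the existing balancer tests use.

```lua
-- busted-style sketch; assumes hypothetical test-only exports of
-- sync_backend and backends_with_external_name, and reuses the mock
-- setup from the existing balancer tests (omitted here)
describe("sync_backend", function()
  it("drops the cached ExternalName entry when the backend becomes ClusterIP", function()
    local balancer = require("balancer")
    local name = "default-please-wait-svc-80"

    -- pretend the Service used to be of type ExternalName
    balancer.backends_with_external_name[name] = {
      name = name,
      service = { spec = { ["type"] = "ExternalName" } },
    }

    -- sync the same backend name, now backed by a ClusterIP Service
    balancer.sync_backend({
      name = name,
      service = { spec = { ["type"] = "ClusterIP" } },
      endpoints = { { address = "10.0.0.1", port = "8080" } },
      ["load-balance"] = "round_robin",
    })

    -- the stale external entry should be gone
    assert.is_nil(balancer.backends_with_external_name[name])
  end)
end)
```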
> @freddyesteban on a very different note, if you have already tried https://kubernetes.github.io/ingress-nginx/examples/customization/custom-errors/#custom-errors, please write a note on why custom-errors are not a preferred solution for the use case you described. It would be such a clean and supported solution to serve the "please wait" page from a custom backend.
For us, at least, it's more about routing to a particular external service when the Service changes than about creating a default backend that could handle our particular use case. With enough work, I think we could find multiple solutions, including your suggestion of deleting Ingress objects. We'd like to keep our "please wait" service decoupled from the nginx controller deployments, since it serves multiple clusters; that's just one factor.
Added unit tests. I needed to export some internals from balancer.lua to allow testing. Please let me know if this is not a desirable way of testing it.
@longwuyuan @tao12345666333 thoughts on the changes?
/ok-to-test
@tao12345666333 can you please take a look? It does make sense to me, but I'm a bit worried every time I mess with Lua code ;)
Thanks
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: freddyesteban
To complete the pull request process, please ask for approval from tao12345666333 after the PR has been reviewed.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Sure. It's on my queue.
/test pull-ingress-nginx-test-lua
/retest
The errors in CI are not related to the code changes; they may have something to do with test-infra. I will do a code review.
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/ingress-nginx/8430/pull-ingress-nginx-test-lua/1521739764124880896/build-log.txt
build/run-in-docker.sh: line 65: USER: unbound variable
build/run-in-docker.sh: line 65: USER: unbound variable
build/run-in-docker.sh: line 65: docker: command not found
make: *** [Makefile:146: lua-test] Error 127
> The errors in CI are not related to the code changes; they may have something to do with test-infra. I will do a code review.
Ricardo had to remove an if condition that checks for DIND, because Prow was failing e2e while local/laptop e2e was working.
Now you are reporting a run-in-docker.sh related error message. I hope we are aware of any underlying infra/Prow changes, to avoid spiralling out of control. There was no announcement, though, and I had success with e2e on my laptop in the last 24 hours, so surely it's related to Prow.
/retest
@tao12345666333 any updates on this or anything I should be doing to help get this over the line? TIA
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/remove-lifecycle stale
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: freddyesteban
Once this PR has been reviewed and has the lgtm label, please ask for approval from tao12345666333. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@freddyesteban: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-ingress-nginx-test-lua | 39ecf8b0382b5bd0964a9586f09e5a543d730227 | link | true | /test pull-ingress-nginx-test-lua |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.