Remove backend from external backends if same backend name
What this PR does / why we need it:
Remove a backend from the external backends table if a new backend has the same name; this prevents the old cached external backend from being used when the backend's type changes.
When a Service object is changed from type ExternalName to ClusterIP, the corresponding entry in backends_with_external_name in the Lua balancer is never removed, so the external backend keeps serving traffic. This PR removes the backend from backends_with_external_name when a non-external backend with the same name is synced.
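In balancer.lua terms, the idea is roughly the following. This is a minimal sketch, not the exact diff: the placement and surrounding code are approximate, and the helper names (`sync_backend`, `is_backend_with_external_name`, `resolve_external_names`, `backends_with_external_name`) are the ones the Lua balancer uses as I understand it.

```lua
-- approximate sketch, not the exact diff: when a backend that is NOT of
-- type ExternalName is synced, clear any stale entry cached under the
-- same name in backends_with_external_name
local function sync_backend(backend)
  if not backend.endpoints or #backend.endpoints == 0 then
    balancers[backend.name] = nil
    return
  end

  if is_backend_with_external_name(backend) then
    backend = resolve_external_names(backend)
  elseif backends_with_external_name[backend.name] then
    -- the Service was changed from ExternalName to e.g. ClusterIP;
    -- without this, the old external backend keeps serving traffic
    backends_with_external_name[backend.name] = nil
  end

  -- ... rest of sync_backend unchanged ...
end
```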
Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation only
Which issue/s this PR fixes
fixes #8440
How Has This Been Tested?
- Updated balancer.lua with the fix
- Ran `make dev-env`
- Created a Deployment serving a static page using the `nginx:latest` image
- Created a Service object of type `ExternalName` pointing to an external website; traffic is routed appropriately
- Changed the Service object to type `ClusterIP` to route traffic to the Deployment pod; traffic is routed appropriately (prior to this PR, traffic was still routed to the external backend)
Checklist:
- [ ] My change requires a change to the documentation.
- [ ] I have updated the documentation accordingly.
- [x] I've read the CONTRIBUTION guide
- [ ] I have added tests to cover my changes.
- [ ] All new and existing tests passed.
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: freddyesteban / name: Freddy Esteban Perez (a826cea64c0612785b89f3a39f5b671334925fbd, fb2b55bd9dcd547bdff1036ccc959cfc8b73e66c)
@freddyesteban: This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @freddyesteban. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
- I think there is too much assumption here.
- It would be nice to see how a user ends up in a situation where a backend is left lingering.
- I think there was an issue raised about a similar situation, but there too, a detailed description and step-by-step instructions to reproduce the lingering-backend problem were not provided.
- It may well be true that there is a lingering backend that needs to be removed manually, but this comment is to get clarity on why someone would create an Ingress with an ExternalName type Service as the backend in the first place, and then edit it, instead of deleting the Ingress and creating a new one.
@longwuyuan thank you for taking the time to look at our PR.
Our use case for changing the Service type in flight is to spin down all pods when they are not in use (scale to zero) and point to an external service that signals to the user that the pods are spun down (a "please wait" page) until they wake up. Our automation "wakes up" the pods by scaling the Deployment replicas back to 1, and the Service object is switched back to ClusterIP. This behavior worked on an older version of the controller, although the nginx controller would trigger a reload for the change. This version of the controller attempts to perform the change dynamically using the Lua balancer, which is great because we can hopefully avoid the reload.
You've asked why we don't just delete the Ingress object. We could do that, but we found that updating only the Service type avoids a reload. Our clusters are relatively big, so this is very advantageous for our use case.
To replicate the issue, I've put together a step-by-step guide here.
To see the difference with our change, follow the step-by-step guide here.
Why is removing the cached external backend with the same name important to us? Because changing the Service type to ClusterIP is then a dynamic reconfiguration without a reload, as shown in the logs I provided here.
We're aware that we could delete the Ingress object, or even update it in place, and that would work, but it causes a reload. That might be acceptable, since the old ingress controller behaved this way, but we'd like to take advantage of the dynamic reconfiguration our fix provides.
Here is a step-by-step guide for the approach of deleting the Ingress object and recreating it; the logs show a backend reload, see here.
Here is a step-by-step guide for the approach of updating the Ingress object to use a separate Service object; the logs show a backend reload, see here.
Hi @freddyesteban, thanks for the detailed explanation and the reproduction procedure. It helps. My first request is that you create a new issue and put all the details you explained here in that issue.
- Please include all the data related to reproducing the problem in that issue.
- Link that issue here with the string `fixes <pound/hash symbol> <issuenumber>`
My next request is: please write tests. I think there should be some assurance that just checking for a pre-existing stale backend in Lua does not interfere with any other code path. Please write tests that you think will provide this assurance. I didn't expect any panic from an if-condition check, but a test should confirm that for users who don't create an ExternalName type Service and edit it later, there is not going to be any impact. I don't even know whether an ExternalName type Service that points not to the internet but to a custom destination makes a difference or not. Basically, please write all the tests that will provide the needed assurance.
/kind feature
@freddyesteban on a very different note, if you have already tried https://kubernetes.github.io/ingress-nginx/examples/customization/custom-errors/#custom-errors, please write a note on why custom-errors are not a preferred solution for the use case you described. It would be such a clean and supported solution to serve the "please wait" page from a custom backend.
Please sign CLA
/assign
Please sign CLA
@tao12345666333 We were under the impression that we, as a company, had already done that, but I think the project moved to EasyCLA and that's no longer the case. My manager has filed a ticket to get that fixed. Thank you.
@longwuyuan
Thank you. I created the issue and linked it to this PR. Regarding testing: the change only removes a backend from backends_with_external_name after sync_backend processes a non-external backend with the same name, so removing it from the externals table does not affect a user who has no external backends. The lookup of the backend name in the table is safe and would not panic. Changes between external destinations (e.g. an external backend reconfigured to point to a custom destination) are not affected either, because this code path is only reached when the user changes the Service type to a non-external type. I attempted to add tests before, but in order to test the backends_with_external_name table I'd have to break encapsulation, because the function updating the table and the table itself are not exported. I'm working on it anyway at the moment. Thoughts on exporting sync_backends and backends_with_external_name for testing purposes?
I'm new to Lua, so apologies if there's a better way to approach the testing. If you have any suggestions or could point me in the right direction, I'd appreciate it.
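To make the question concrete, the shape of the test I have in mind is roughly the following busted-style sketch. It assumes hypothetical test-only exports of sync_backend and backends_with_external_name (which is exactly what I'm asking about) and omits the ngx/configuration mock setup that the existing balancer tests use.

```lua
-- busted-style sketch; assumes hypothetical test-only exports of
-- sync_backend and backends_with_external_name, and reuses the mock
-- setup from the existing balancer tests (omitted here)
describe("sync_backend", function()
  it("drops the cached ExternalName entry when the backend becomes ClusterIP", function()
    local balancer = require("balancer")
    local name = "default-please-wait-svc-80"

    -- pretend the Service used to be of type ExternalName
    balancer.backends_with_external_name[name] = {
      name = name,
      service = { spec = { ["type"] = "ExternalName" } },
    }

    -- sync the same backend name, now backed by a ClusterIP Service
    balancer.sync_backend({
      name = name,
      service = { spec = { ["type"] = "ClusterIP" } },
      endpoints = { { address = "10.0.0.1", port = "8080" } },
      ["load-balance"] = "round_robin",
    })

    -- the stale external entry should be gone
    assert.is_nil(balancer.backends_with_external_name[name])
  end)
end)
```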
> @freddyesteban on a very different note, if you have already tried https://kubernetes.github.io/ingress-nginx/examples/customization/custom-errors/#custom-errors, please write a note on why custom-errors are not a preferred solution for the use case you described. It would be such a clean and supported solution to serve the "please wait" page from a custom backend.
For us, at least, it's more about routing to a particular external service when the Service changes than about creating a default backend that could handle our particular use case. With enough work, I think we could find multiple solutions, including your suggestion of deleting Ingress objects. We'd like to keep our "please wait" service decoupled from the nginx controller deployments, since it serves multiple clusters; that's just one factor.
Added unit tests. I needed to export some internals from balancer.lua to allow testing. Please let me know if this is not a desirable way of testing it.
@longwuyuan @tao12345666333 thoughts on the changes?
/ok-to-test
@tao12345666333 can you please take a look? It does make sense to me, but I'm a bit worried every time I mess with Lua code ;)
Thanks
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: freddyesteban
To complete the pull request process, please ask for approval from tao12345666333 after the PR has been reviewed.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Sure. It's on my queue.
/test pull-ingress-nginx-test-lua
/retest
The errors in CI are not related to the code changes; they may have something to do with test-infra. I will do a code review.
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/ingress-nginx/8430/pull-ingress-nginx-test-lua/1521739764124880896/build-log.txt
build/run-in-docker.sh: line 65: USER: unbound variable
build/run-in-docker.sh: line 65: USER: unbound variable
build/run-in-docker.sh: line 65: docker: command not found
make: *** [Makefile:146: lua-test] Error 127
> The errors in CI are not related to the code changes; they may have something to do with test-infra. I will do a code review.
Ricardo had to remove an if condition that checks for DIND, because Prow was failing e2e while local/laptop e2e was working.
Now you are reporting a run-in-docker.sh related error message. I hope we are aware of any underlying infra/Prow changes, to avoid spiralling out of control. There was no announcement, though, and I had success with e2e on my laptop in the last 24 hours, so surely it's related to Prow.
/retest
@tao12345666333 any updates on this or anything I should be doing to help get this over the line? TIA
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/remove-lifecycle stale
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: freddyesteban
Once this PR has been reviewed and has the lgtm label, please ask for approval from tao12345666333. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@freddyesteban: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-ingress-nginx-test-lua | 39ecf8b0382b5bd0964a9586f09e5a543d730227 | link | true | /test pull-ingress-nginx-test-lua |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.