e2e/loadbalancer: added hairpin connection cases
What type of PR is this?
/kind bug /kind failing-test
What this PR does / why we need it:
Implementing the hairpin connection test cases, and exposing an issue on NLB with internal scheme which fails when the client is trying to access a service loadbalancer which is hosted in the same node.
The hairpin connection is caused by the client IP preservation attribute is set to true (default), and the service does not provide an interface to prevent the issue.
The e2e is expecting to pass to prevent permanent failures in CI, but it is tracked by an issue https://github.com/kubernetes/cloud-provider-aws/issues/1160.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Those tests are important to increase coverage of scenarios that CCM declares as supported.
I also believe we can remove the hairpin with scheme internet-facing (public) LBs because the source IPs would be traversing a VPC gateway (IGW/NGW) and masquerade the real source, not reproducing the problem we are trying to expose in https://github.com/kubernetes/cloud-provider-aws/issues/1160. Thoughts?
Does this PR introduce a user-facing change?:
NONE
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @mtulio. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @kmala , would you mind stamping ok to test those new jobs, please?
I also have a question in the description related to the public cases, I used in the begging of exploration, but looks like we don't need it, LMK WDYT. Thanks
cc @elmiko
/ok-to-test
@mtulio: The following test failed, say
/retestto rerun all failed tests or/retest-requiredto rerun all mandatory failed tests:Test name Commit Details Required Rerun command pull-cloud-provider-aws-e2e 24a0041 link true
/test pull-cloud-provider-aws-e2eFull PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
CI infra issue.
/test pull-cloud-provider-aws-e2e
I am observing a permanent failure on CI when launching the cluster trying to use an image that is no longer available: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/cloud-provider-aws/1161/pull-cloud-provider-aws-e2e/1934643980113285120#1:build-log.txt%3A751
s invalid: could not find Image for "099720109477/ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20250502.1"
Hi @kmala @elmiko, do you know if is this comes from the test framework or is it possible to use a valid image in CCM repo?
An issue has been opened to track the CI problem: https://github.com/kubernetes/cloud-provider-aws/issues/1167
/assign @elmiko @kmala
looks like pull-cloud-provider-aws-e2e is running (and stuck) in the last 48h. Just stopped it and trying to run again:
/test pull-cloud-provider-aws-e2e
/test pull-cloud-provider-aws-e2e-kubetest2
Looks like https://github.com/kubernetes/cloud-provider-aws/issues/1167 has been resolved. I manually stopped the running job (42h+); Triggering it again:
/test pull-cloud-provider-aws-e2e
Hi @kmala @dims , PTAL? Thanks
i think this mostly makes sense to me,
Thanks for reviewing.
curious to see the test output.
Me too! =] I just rebase and triggering the test again to try to get a good run. the tests so far are running til timeout.
/test pull-cloud-provider-aws-e2e
Looks like those tests aren't healthy yet to any k8s distribution, converting to draft to increase verbosity on distro used on CI.
Looks like those tests aren't healthy yet to any k8s distribution, converting to draft to increase verbosity on distro used on CI.
/test pull-cloud-provider-aws-e2e
/test pull-cloud-provider-aws-e2e
rebased and enhanced the SCC to prevent failing to schedule the test pod.
/test pull-cloud-provider-aws-e2e
tuning the timeout:
/test pull-cloud-provider-aws-e2e
/test all
updating improving the expected test failure to force to skip instead of passing:
/test pull-cloud-provider-aws-e2e
updating improving the expected test failure to force to skip instead of passing:
/test pull-cloud-provider-aws-e2e
I also removed the internet-facing tests that does not make sense to be tested in hairpin scenarios.
/test pull-cloud-provider-aws-e2e
This PR is ready for review. In the latest version I reviewed:
- SCC to fit CI constraints running with non-root
- improve the doc strings of new functions
- removed hairpin tests for internet-facing LBs as src IP is NATed and never hit frontend with node private IP to reproduce hairpin scenario
- Forcing to skip failed e2e on NLB-internal due known issue - preventing introducing permanent failure to CI
/assign dmis @kmala @elmiko
@mtulio: GitHub didn't allow me to assign the following users: dmis.
Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
In response to this:
This PR is ready for review. In the latest version I reviewed:
- SCC to fit CI constraints running with non-root
- improve the doc strings of new functions
- removed hairpin tests for internet-facing LBs as src IP is NATed and never hit frontend with node private IP to reproduce hairpin scenario
- Forcing to skip failed e2e on NLB-internal due known issue - preventing introducing permanent failure to CI
/assign dmis @kmala @elmiko
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
looking mostly good, i just have a question about the global variables.
Thanks, @elmiko, good suggestions. Fixed.
Hey @cartermckinnon , would you mind taking a look at this e2e test improvements to help us troubleshooting known issues? This won't fix https://github.com/kubernetes/cloud-provider-aws/issues/1160, but help us test it. Thanks!
@oliviassss can you take a look?
@mtulio Hi, thanks for the contribution. Just for my understanding, this PR mostly adds the test coverage for internal NLB hairpin issue, but in the test case itself we expect the test to fail and skip the failed test? What's the main purpose for adding the test cases?
From the AWS ELB doc I think the internal NLB will have hairpin issue with UDP or TCP_UDP protocol since the preserve client IP attribute cannot be disabled. (and I don't think CCM provide an annotation to disable this TG attributes anyway)
By default, client IP preservation is enabled (and can't be disabled) for instance and IP type target groups with UDP and TCP_UDP protocols. However, you can enable or disable client IP preservation for TCP and TLS target groups using the preserve_client_ip.enabled target group attribute.
ref: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/edit-target-group-attributes.html#client-ip-preservation
@mtulio Hi, thanks for the contribution. Just for my understanding, this PR mostly adds the test coverage for internal NLB hairpin issue, but in the test case itself we expect the test to fail and skip the failed test? What's the main purpose for adding the test cases?
From the AWS ELB doc I think the internal NLB will have hairpin issue with UDP or TCP_UDP protocol since the preserve client IP attribute cannot be disabled. (and I don't think CCM provide an annotation to disable this TG attributes anyway)
By default, client IP preservation is enabled (and can't be disabled) for instance and IP type target groups with UDP and TCP_UDP protocols. However, you can enable or disable client IP preservation for TCP and TLS target groups using the preserve_client_ip.enabled target group attribute.
ref: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/edit-target-group-attributes.html#client-ip-preservation
Hi @oliviassss , answering your questions:
his PR mostly adds the test coverage for internal NLB hairpin issue, but in the test case itself we expect the test to fail and skip the failed test?
Correct, we are skipping only the NLB test case as it works on CLB. https://github.com/kubernetes/cloud-provider-aws/pull/1161/files#diff-05b1c14f2de829d8a0c5f65b1b492a9ed9ab9d100ce6daa89d2d2347c8a14c77R122-R160
What's the main purpose for adding the test cases?
The main purpose is to expose the test scenario (hairpin connection) for both supported Load Balancer by CCM: CLB and NLB, skipping the NLB case to prevent know failures on CCM CI.
This problem was unknown until now, we are exposing this scenario as "known issue" to CCM in https://github.com/kubernetes/cloud-provider-aws/issues/1160, and using this e2e as a helper to reproduce and fix in follow up PRs.
Hi @oliviassss @kmala , would you mind also triage the related issue https://github.com/kubernetes/cloud-provider-aws/issues/1160, please?
/approve