cloud-provider-aws icon indicating copy to clipboard operation
cloud-provider-aws copied to clipboard

e2e/loadbalancer: added hairpin connection cases

Open mtulio opened this issue 6 months ago • 18 comments

What type of PR is this?

/kind bug /kind failing-test

What this PR does / why we need it:

Implementing the hairpin connection test cases, and exposing an issue on NLB with internal scheme which fails when the client is trying to access a service loadbalancer which is hosted in the same node.

The hairpin connection is caused by the client IP preservation attribute is set to true (default), and the service does not provide an interface to prevent the issue.

The e2e is expecting to pass to prevent permanent failures in CI, but it is tracked by an issue https://github.com/kubernetes/cloud-provider-aws/issues/1160.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Those tests are important to increase coverage of scenarios that CCM declares as supported.

I also believe we can remove the hairpin with scheme internet-facing (public) LBs because the source IPs would be traversing a VPC gateway (IGW/NGW) and masquerade the real source, not reproducing the problem we are trying to expose in https://github.com/kubernetes/cloud-provider-aws/issues/1160. Thoughts?

Does this PR introduce a user-facing change?:

NONE

mtulio avatar Jun 16 '25 02:06 mtulio

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 16 '25 02:06 k8s-ci-robot

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 16 '25 02:06 k8s-ci-robot

Hi @kmala , would you mind stamping ok to test those new jobs, please?

I also have a question in the description related to the public cases, I used in the begging of exploration, but looks like we don't need it, LMK WDYT. Thanks

cc @elmiko

mtulio avatar Jun 16 '25 02:06 mtulio

/ok-to-test

kmala avatar Jun 16 '25 06:06 kmala

@mtulio: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command pull-cloud-provider-aws-e2e 24a0041 link true /test pull-cloud-provider-aws-e2e Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

CI infra issue.

/test pull-cloud-provider-aws-e2e

mtulio avatar Jun 16 '25 12:06 mtulio

I am observing a permanent failure on CI when launching the cluster trying to use an image that is no longer available: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/cloud-provider-aws/1161/pull-cloud-provider-aws-e2e/1934643980113285120#1:build-log.txt%3A751

s invalid: could not find Image for "099720109477/ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20250502.1"

Hi @kmala @elmiko, do you know if is this comes from the test framework or is it possible to use a valid image in CCM repo?

mtulio avatar Jun 16 '25 17:06 mtulio

An issue has been opened to track the CI problem: https://github.com/kubernetes/cloud-provider-aws/issues/1167

mtulio avatar Jun 17 '25 21:06 mtulio

/assign @elmiko @kmala

mtulio avatar Jun 18 '25 17:06 mtulio

looks like pull-cloud-provider-aws-e2e is running (and stuck) in the last 48h. Just stopped it and trying to run again:

/test pull-cloud-provider-aws-e2e

mtulio avatar Jun 18 '25 20:06 mtulio

/test pull-cloud-provider-aws-e2e-kubetest2

mtulio avatar Jun 19 '25 05:06 mtulio

Looks like https://github.com/kubernetes/cloud-provider-aws/issues/1167 has been resolved. I manually stopped the running job (42h+); Triggering it again:

/test pull-cloud-provider-aws-e2e

mtulio avatar Jun 20 '25 18:06 mtulio

Hi @kmala @dims , PTAL? Thanks

mtulio avatar Jun 23 '25 12:06 mtulio

i think this mostly makes sense to me,

Thanks for reviewing.

curious to see the test output.

Me too! =] I just rebase and triggering the test again to try to get a good run. the tests so far are running til timeout.

/test pull-cloud-provider-aws-e2e

mtulio avatar Jun 27 '25 03:06 mtulio

Looks like those tests aren't healthy yet to any k8s distribution, converting to draft to increase verbosity on distro used on CI.

mtulio avatar Jun 28 '25 03:06 mtulio

Looks like those tests aren't healthy yet to any k8s distribution, converting to draft to increase verbosity on distro used on CI.

/test pull-cloud-provider-aws-e2e

mtulio avatar Jun 28 '25 03:06 mtulio

/test pull-cloud-provider-aws-e2e

mtulio avatar Jun 28 '25 06:06 mtulio

rebased and enhanced the SCC to prevent failing to schedule the test pod.

/test pull-cloud-provider-aws-e2e

mtulio avatar Jul 04 '25 19:07 mtulio

tuning the timeout:

/test pull-cloud-provider-aws-e2e

mtulio avatar Jul 04 '25 21:07 mtulio

/test all

mtulio avatar Jul 05 '25 00:07 mtulio

updating improving the expected test failure to force to skip instead of passing:

/test pull-cloud-provider-aws-e2e

mtulio avatar Jul 05 '25 01:07 mtulio

updating improving the expected test failure to force to skip instead of passing:

/test pull-cloud-provider-aws-e2e

I also removed the internet-facing tests that does not make sense to be tested in hairpin scenarios.

/test pull-cloud-provider-aws-e2e

mtulio avatar Jul 05 '25 01:07 mtulio

This PR is ready for review. In the latest version I reviewed:

  • SCC to fit CI constraints running with non-root
  • improve the doc strings of new functions
  • removed hairpin tests for internet-facing LBs as src IP is NATed and never hit frontend with node private IP to reproduce hairpin scenario
  • Forcing to skip failed e2e on NLB-internal due known issue - preventing introducing permanent failure to CI

/assign dmis @kmala @elmiko

mtulio avatar Jul 05 '25 10:07 mtulio

@mtulio: GitHub didn't allow me to assign the following users: dmis.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

This PR is ready for review. In the latest version I reviewed:

  • SCC to fit CI constraints running with non-root
  • improve the doc strings of new functions
  • removed hairpin tests for internet-facing LBs as src IP is NATed and never hit frontend with node private IP to reproduce hairpin scenario
  • Forcing to skip failed e2e on NLB-internal due known issue - preventing introducing permanent failure to CI

/assign dmis @kmala @elmiko

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 05 '25 10:07 k8s-ci-robot

looking mostly good, i just have a question about the global variables.

Thanks, @elmiko, good suggestions. Fixed.

mtulio avatar Jul 07 '25 20:07 mtulio

Hey @cartermckinnon , would you mind taking a look at this e2e test improvements to help us troubleshooting known issues? This won't fix https://github.com/kubernetes/cloud-provider-aws/issues/1160, but help us test it. Thanks!

mtulio avatar Jul 11 '25 15:07 mtulio

@oliviassss can you take a look?

kmala avatar Jul 14 '25 23:07 kmala

@mtulio Hi, thanks for the contribution. Just for my understanding, this PR mostly adds the test coverage for internal NLB hairpin issue, but in the test case itself we expect the test to fail and skip the failed test? What's the main purpose for adding the test cases?

From the AWS ELB doc I think the internal NLB will have hairpin issue with UDP or TCP_UDP protocol since the preserve client IP attribute cannot be disabled. (and I don't think CCM provide an annotation to disable this TG attributes anyway)

By default, client IP preservation is enabled (and can't be disabled) for instance and IP type target groups with UDP and TCP_UDP protocols. However, you can enable or disable client IP preservation for TCP and TLS target groups using the preserve_client_ip.enabled target group attribute.

ref: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/edit-target-group-attributes.html#client-ip-preservation

oliviassss avatar Jul 15 '25 21:07 oliviassss

@mtulio Hi, thanks for the contribution. Just for my understanding, this PR mostly adds the test coverage for internal NLB hairpin issue, but in the test case itself we expect the test to fail and skip the failed test? What's the main purpose for adding the test cases?

From the AWS ELB doc I think the internal NLB will have hairpin issue with UDP or TCP_UDP protocol since the preserve client IP attribute cannot be disabled. (and I don't think CCM provide an annotation to disable this TG attributes anyway)

By default, client IP preservation is enabled (and can't be disabled) for instance and IP type target groups with UDP and TCP_UDP protocols. However, you can enable or disable client IP preservation for TCP and TLS target groups using the preserve_client_ip.enabled target group attribute.

ref: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/edit-target-group-attributes.html#client-ip-preservation

Hi @oliviassss , answering your questions:

his PR mostly adds the test coverage for internal NLB hairpin issue, but in the test case itself we expect the test to fail and skip the failed test?

Correct, we are skipping only the NLB test case as it works on CLB. https://github.com/kubernetes/cloud-provider-aws/pull/1161/files#diff-05b1c14f2de829d8a0c5f65b1b492a9ed9ab9d100ce6daa89d2d2347c8a14c77R122-R160

What's the main purpose for adding the test cases?

The main purpose is to expose the test scenario (hairpin connection) for both supported Load Balancer by CCM: CLB and NLB, skipping the NLB case to prevent know failures on CCM CI.

This problem was unknown until now, we are exposing this scenario as "known issue" to CCM in https://github.com/kubernetes/cloud-provider-aws/issues/1160, and using this e2e as a helper to reproduce and fix in follow up PRs.

mtulio avatar Jul 15 '25 22:07 mtulio

Hi @oliviassss @kmala , would you mind also triage the related issue https://github.com/kubernetes/cloud-provider-aws/issues/1160, please?

mtulio avatar Jul 16 '25 15:07 mtulio

/approve

kmala avatar Jul 17 '25 04:07 kmala