
Add support for VPC Endpoint Services

Open hintofbasil opened this issue 3 years ago • 54 comments

Issue

https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1859

Description

This change adds support for creating, deleting, and updating VPC endpoint services along with their permissions. It supports the three suggested configuration options: allowed principals, acceptance required, and private DNS name.

An enable annotation was also added to support enabling/disabling the feature on a per-load-balancer basis; the proposed design only supported enabling/disabling this at the controller level.
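
As a rough sketch of the intended usage, with purely illustrative annotation keys (the exact names are defined in this PR's documentation and may differ), a Service might look like:

apiVersion: v1
kind: Service
metadata:
  name: example
  annotations:
    # Hypothetical annotation keys, for illustration only:
    service.beta.kubernetes.io/aws-load-balancer-endpoint-service-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-endpoint-service-acceptance-required: "true"
    service.beta.kubernetes.io/aws-load-balancer-endpoint-service-allowed-principals: arn:aws:iam::123456789012:root
    service.beta.kubernetes.io/aws-load-balancer-endpoint-service-private-dns-name: service.example.com
spec:
  type: LoadBalancer
  ports:
  - name: http
    port: 80
    targetPort: http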

Outputting the created VPC Endpoint Service parameters is deliberately not supported, as there are still open questions about the best approach, and this change is already far too big; adding more to it does not seem sensible. This feature can be added in a follow-up change.

This change has been manually tested on a Kubernetes service.

I apologise for the large PR. I couldn't determine suitable break points to split this into smaller changes, but I can do so if you can suggest them. I'm also fairly new to Golang, so I'm very happy for any feedback, as I know the code quality isn't amazing here.

Checklist

  • [x] Added tests that cover your change (if possible)
  • [x] Added/modified documentation as required (such as the README.md, or the docs directory)
  • [x] Manually tested
  • [x] Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! :exploding_head:

  • [ ] Backfilled missing tests for code in same general area :tada:
  • [ ] Refactored something and made the world a better place :star2:

hintofbasil avatar May 02 '22 19:05 hintofbasil

Hi @hintofbasil. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar May 02 '22 19:05 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hintofbasil

To complete the pull request process, please assign kishorj after the PR has been reviewed. You can assign the PR to them by writing /assign @kishorj in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar May 02 '22 19:05 k8s-ci-robot

@hintofbasil, thanks for your changes. We are currently finalizing the v2.4.2 patch release; I will review your changes after that release.

kishorj avatar May 05 '22 21:05 kishorj

Codecov Report

Patch coverage: 63.57% and project coverage change: +0.48% :tada:

Comparison is base (ff8c13d) 55.05% compared to head (afa4b7e) 55.53%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2636      +/-   ##
==========================================
+ Coverage   55.05%   55.53%   +0.48%     
==========================================
  Files         148      157       +9     
  Lines        8532     9002     +470     
==========================================
+ Hits         4697     4999     +302     
- Misses       3505     3662     +157     
- Partials      330      341      +11     
| Impacted Files | Coverage Δ |
|---|---|
| pkg/config/addons_config.go | 0.00% <0.00%> (ø) |
| pkg/deploy/elbv2/listener_manager.go | 1.08% <ø> (ø) |
| pkg/deploy/stack_deployer.go | 0.00% <0.00%> (ø) |
| pkg/deploy/tracking/provider_mocks.go | 0.00% <0.00%> (ø) |
| pkg/networking/vpc_endpoint_service_info.go | 0.00% <0.00%> (ø) |
| pkg/networking/vpc_endpoint_service_manager.go | 0.00% <0.00%> (ø) |
| ...g/networking/vpc_endpoint_service_manager_mocks.go | 0.00% <0.00%> (ø) |
| pkg/service/model_build_endpoint_service.go | 54.00% <54.00%> (ø) |
| pkg/service/model_builder.go | 86.86% <60.00%> (-1.43%) :arrow_down: |
| pkg/deploy/ec2/tagging_manager__mocks.go | 69.69% <69.69%> (ø) |

... and 5 more

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov-commenter avatar May 05 '22 21:05 codecov-commenter

Hi @kishorj, thanks for agreeing to take a look. A few people at Skyscanner have agreed to review it in the meantime; Matteo has already put one review in. Hopefully this makes things easier for you.

We are planning on deploying this internally in the next few weeks. We understand it likely won't be reviewed by then, but would it be possible to get the e2e tests run against this? It would make us feel much more confident in running a fork, and setting these up ourselves seems unrealistic.

hintofbasil avatar May 06 '22 11:05 hintofbasil

/ok-to-test

kishorj avatar May 11 '22 16:05 kishorj

This has been rebased and the formatting changes above implemented.

hintofbasil avatar May 23 '22 20:05 hintofbasil

I'm unable to replicate the e2e test failure.

I've added the following resources to a cluster running this branch of the controller.

apiVersion: v1
kind: Namespace
metadata:
  name: test
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: aws-load-balancer-controller-test
spec:
  controller: ingress.k8s.aws/alb
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: aws-load-balancer-controller-test
  namespace: test
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/subnets: subnet-id-1,subnet-id-2
spec:
  ingressClassName: aws-load-balancer-controller-test
  rules:
  - host: ""
    http:
      paths:
      - path: /path
        pathType: Exact
        backend:
          service:
            name: aws-load-balancer-controller-test
            port:
              name: http
---
apiVersion: v1
kind: Service
metadata:
  name: aws-load-balancer-controller-test
  namespace: test
spec:
  ports:
  - name: http
    port: 80
    targetPort: http
  selector:
    service: 'null'  # placeholder selector; no backing pods are needed for this test
  type: NodePort

This brings up a load balancer, which provisions successfully.

I then deleted the Ingress, and the load balancer deleted successfully. I then deleted the namespace, which also deleted successfully.

From what I can see this follows the exact logic of the test.

As such I'm going to trigger a rerun to see if it was a flake.

hintofbasil avatar May 24 '22 17:05 hintofbasil

/retest

hintofbasil avatar May 24 '22 17:05 hintofbasil

How does this approach differ from @rifelpet's initial pass in #1948?

seh avatar May 30 '22 02:05 seh

I didn't see it while reading the code, but I'm wondering whether there are more IAM actions that we'll need to allow in the IAM policy used by this controller.

seh avatar May 30 '22 02:05 seh

I didn't see it while reading the code, but I'm wondering whether there are more IAM actions that we'll need to allow in the IAM policy used by this controller.

You were correct; after adding these, the tests are passing. Thanks!
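
For anyone else running this branch, the additional allow-list is roughly the endpoint-service EC2 actions below. This is a sketch, not an authoritative set; the exact list is in this PR's policy changes.

{
    "Effect": "Allow",
    "Action": [
        "ec2:CreateVpcEndpointServiceConfiguration",
        "ec2:DeleteVpcEndpointServiceConfigurations",
        "ec2:ModifyVpcEndpointServiceConfiguration",
        "ec2:DescribeVpcEndpointServiceConfigurations",
        "ec2:ModifyVpcEndpointServicePermissions",
        "ec2:DescribeVpcEndpointServicePermissions"
    ],
    "Resource": "*"
}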

hintofbasil avatar May 30 '22 13:05 hintofbasil

How does this approach differ from @rifelpet's initial pass in https://github.com/kubernetes-sigs/aws-load-balancer-controller/pull/1948?

This is a continuation of their work. Rifelpet provided a rough outline for the code, mainly function definitions and structs; this change adds the actual implementation to make the feature work. It sticks closely to Rifelpet's original design, and even includes their commits showing how it was built up from there.

hintofbasil avatar May 30 '22 13:05 hintofbasil

This is a continuation of their work.

Thank you for the detailed explanation.

I was going to say earlier this morning that configuring the private DNS name on the VPC endpoint service side seemed backwards to me; I had assumed that the consumers creating interface VPC endpoints would decide on the name. But then I went back and read the AWS documentation on the feature, and I see that your design matches that feature's interface: the service provider chooses a DNS name. I still find this so confusing. That's not your fault, though.

seh avatar May 30 '22 14:05 seh

Thanks @rifelpet, seems quite a few of these typos escaped me. I've cleaned them all up in a single commit.

hintofbasil avatar May 30 '22 21:05 hintofbasil

Just wondering if it's ready to be merged? I would like to have this feature in the default branch.

shoukoo avatar Jul 06 '22 03:07 shoukoo

It looks to me like the tests timed out while still progressing. The failed test is given 10 attempts of up to 70 seconds each to call the load balancer (nearly 12 minutes in total), but the run timed out after approximately two and a half minutes.

I'm going to try a rerun, as I can't see a hard-coded timeout anywhere and I can't see how this failure is related to the changes in this PR.

hintofbasil avatar Jul 14 '22 15:07 hintofbasil

/retest

hintofbasil avatar Jul 14 '22 15:07 hintofbasil

I'll finish the review this week, and we can look to merge this after our upcoming v2.4.3 release.

M00nF1sh avatar Jul 14 '22 18:07 M00nF1sh

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 19 '22 14:12 k8s-triage-robot

/remove-lifecycle stale

z0rc avatar Dec 19 '22 15:12 z0rc

@hintofbasil - Do you have time/bandwidth to rebase this week? Need help? (Eager to try this.) Thanks!

gosunilgo avatar Feb 21 '23 20:02 gosunilgo

@hintofbasil - Do you have time/bandwidth to rebase this week? Need help? (Eager to try this.) Thanks!

Apologies. I didn't realise it had been so long since I looked at this.

I will try and put some time aside to rebase it this week.

We've been successfully running this branch in production for some time now, but it would be great to get it merged upstream so we can pick up all the recent updates too.

hintofbasil avatar Feb 22 '23 12:02 hintofbasil

It looks like the cluster failed to delete because one of the node groups wasn't deleted. I can't see any reason in the logs for the node group to stick around.

Open to any ideas on where to look if anyone wants to lend a hand.

hintofbasil avatar Mar 10 '23 12:03 hintofbasil

Going to try rerunning the tests in case the issue was a flake, as it has that feel about it.

hintofbasil avatar Mar 15 '23 10:03 hintofbasil

/retest

hintofbasil avatar Mar 15 '23 10:03 hintofbasil

@hintofbasil I am following your PR closely and am eager to try this. I hope you have time to finish this soon!

cgagnon-crakmn avatar Apr 27 '23 20:04 cgagnon-crakmn

I think that all the issues raised so far in this PR have now been addressed. If I have missed one then please raise it again.

The PR as a whole is ready for review again.

I will make sure to fix any merge conflicts more promptly and try my best to keep this in a review-ready state from now on.

hintofbasil avatar May 01 '23 13:05 hintofbasil

@hintofbasil I tested your PR, and it creates the VPCES, but deletion does not work. The LB controller is stuck with this error:

error":"ResourceInUse: Load balancer 'arn:...' cannot be deleted because it is currently associated with another service\n\tstatus code: 400

The ES needs to be deleted first. Before that, probably all endpoints need to be rejected.

I also cannot see the endpoint service name. As it cannot be specified, it would be better to attach it to the Service resource when the ES is created, maybe as a message in a condition.

trxa avatar May 15 '23 12:05 trxa

I tested your PR, and it creates the VPCES, but deletion does not work. The LB controller is stuck with this error:

error":"ResourceInUse: Load balancer 'arn:...' cannot be deleted because it is currently associated with another service\n\tstatus code: 400

The ES needs to be deleted first. Before that, probably all endpoints need to be rejected.

If you have connected an endpoint to your endpoint service, then there is little I can do here, as AWS doesn't let you delete an endpoint service with connected endpoints. As the endpoints are unknown to the controller, they cannot be automatically cleaned up. What would you imagine happening in this case?
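
For reference, connected endpoints can at least be inspected from the provider side with the AWS CLI; the service ID below is a placeholder:

# List the endpoint connections attached to the endpoint service
aws ec2 describe-vpc-endpoint-connections \
    --filters Name=service-id,Values=vpce-svc-0123456789abcdef0

The consumers would then need to delete those endpoints themselves before the endpoint service, and in turn the load balancer, can be deleted.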

If I've misunderstood the problem then let me know.

I also cannot see the endpoint service name. As it cannot be specified, it would be better to attach it to the Service resource when the ES is created, maybe as a message in a condition.

I agree it would be useful to expose this somewhere, but to avoid the PR ballooning even more, I would suggest this be added in a follow-up PR.

hintofbasil avatar May 16 '23 16:05 hintofbasil

You are right; rejecting the connections is not sufficient to delete the service. I didn't realise that. In any case, the deletion attempt for the ES should be made first, and the controller should wait until it succeeds before proceeding with the load balancer. That would allow proper cleanup of environments without manual use of the AWS API.

trxa avatar May 17 '23 20:05 trxa