cluster-api-provider-azure icon indicating copy to clipboard operation
cluster-api-provider-azure copied to clipboard

Enable adding of Service Endpoints to subnets

Open mtougeron opened this issue 3 years ago • 24 comments

What type of PR is this?

/kind feature

What this PR does / why we need it:

This implements the ability to add Service Endpoints to the subnets created/managed by CAPZ.

Which issue(s) this PR fixes:

Fixes https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/2621

TODOs:

  • [x] squashed commits
  • [x] includes documentation
  • [x] adds unit tests

Release note:

Enables adding of Virtual Network Service Endpoints to subnets created/managed by CAPZ

mtougeron avatar Sep 07 '22 15:09 mtougeron

Hi @mtougeron. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 07 '22 15:09 k8s-ci-robot

Current outstanding TODOs:

  • [x] Write new docs on how to use this features along with limitations
  • [x] Refactor the subnet reconcile function so it can update existing subnets with service endpoint changes
  • [x] fix the api version conversion logic

mtougeron avatar Sep 07 '22 15:09 mtougeron

When running the tests with the api version conversion changes I'm getting the following error and I'm not sure how to address it. If someone could help point me in the right direction I'd really appreciate it. I'm not sure if I'm overlooking something obvious or not.

--- FAIL: TestFuzzyConversion (16.31s)
    --- FAIL: TestFuzzyConversion/for_AzureCluster (3.78s)
        --- FAIL: TestFuzzyConversion/for_AzureCluster/hub-spoke-hub (0.00s)
            conversion.go:253: 
                  &v1beta1.AzureCluster{
                  	TypeMeta: {},
                  	ObjectMeta: v1.ObjectMeta{
                  		... // 9 identical fields
                  		DeletionGracePeriodSeconds: nil,
                  		Labels:                     {},
                - 		Annotations:                nil,
                + 		Annotations:                map[string]string{},
                  		OwnerReferences:            nil,
                  		Finalizers:                 {"B)i"},
                  		... // 2 identical fields
                  	},
                  	Spec: v1beta1.AzureClusterSpec{
                  		AzureClusterClassSpec: {SubscriptionID: "嚧虩êſȏ&爲ş?ȩ暧", Location: `_瞊戸唨w镅ʜɸ撑擁\?ɂƯ`, IdentityRef: &{Kind: "撈弶瞯袮_v棈%!"(MISSING), Namespace: "ĕ鹗ʟ傠kʝT栎", Name: "ȏ鲾殳;Ɉ馲知郤俢", UID: "囹e±ɻ潒岀炜ƈ騸Ȱ1ă9ʔȣ:Ɖ", ...}, AzureEnvironment: "Ƌ烯!涧Ȋ;ũ焅bȪ曳i楁V", ...},
                  		NetworkSpec:           {Vnet: {ResourceGroup: "Ȕs閼Ɋs儼", ID: "权Dʖȭƺɥ", Name: "曠ȡIb$¡瘎&诈", Peerings: {{ResourceGroup: `俿7Ɉ隳曆"6?qʭĤ鱜霸`, VnetPeeringClassSpec: {RemoteVnetName: "9/"}}}, ...}, Subnets: {}, APIServerLB: {ID: "ŵ鋾", Name: "ƫ'惡抢(oQ;u鄣N充羽·ɝ稄ıį", FrontendIPs: {{Name: "ĘjWȊmǎ.嗎", FrontendIPClass: {PrivateIPAddress: "x{Ǚ,I<s唜Dz"}}}, LoadBalancerClassSpec: {SKU: `Ŕ湞\/w貅g允ąKȶ芀瘳`, Type: "8关+¤zȄw齴珦髋", IdleTimeoutInMinutes: &-672696477}}, NodeOutboundLB: &{ID: "鷬Ń落]Cȥƙ鷢", Name: "5%!a(MISSING)ȭ凌饨衆ɷȔ恢泂哐Ƅ裇\u00a0骀WG", FrontendIPs: {}, LoadBalancerClassSpec: {SKU: "夔.qMɃț诂禫ɣ埓ǥʂ慃rȓ±ĪNj", Type: "3t5ʥƞ穮,ɾƔdÄ"}}, ...},
                  		ResourceGroup:         "Bȓ馎@Bq辮",
                  		BastionSpec: v1beta1.BastionSpec{
                  			AzureBastion: &v1beta1.AzureBastion{
                  				Name: "_ť袴T槅ɤ#葐",
                  				Subnet: v1beta1.SubnetSpec{
                  					... // 3 identical fields
                  					RouteTable:       {ID: "蜭XlG骾鍓[ɹĥŚȴ-ű", Name: "蹄{愡ǹ="},
                  					NatGateway:       {NatGatewayIP: {Name: "{ȲŻǷWT魜İ", DNSName: "G隸级S M肏sȔƌ«笣o"}, NatGatewayClassSpec: {Name: "'ʜȾČDž"}},
                - 					ServiceEndpoints: v1beta1.ServiceEndpoints{{Service: "_姣"}},
                + 					ServiceEndpoints: nil,
                  					SubnetClassSpec:  {Role: "燌~×Ĉ菏ɚ1PÙ<qr", CIDRBlocks: {"ľu陽Rc癜Ȁ ʐ´G"}},
                  				},
                  				PublicIP: {Name: "騂綍Ô菦顂村", DNSName: "癊}冱B曘õʡ垑(儆进G", IPTags: {{Type: "@yjvǣǯ縵e", Tag: "fǡM苔"}}},
                  			},
                  		},
                  		ControlPlaneEndpoint: {Host: "Uw嬷k", Port: 1478785315},
                  	},
                  	Status: {Conditions: {}, LongRunningOperationStates: {}},
                  }
                
                Expected
                    <bool>: false
                to be true

mtougeron avatar Sep 07 '22 15:09 mtougeron

/ok-to-test

CecileRobertMichon avatar Sep 19 '22 17:09 CecileRobertMichon

Per the slack conversation, the sticking point I'm currently running into is this make generate error. Otherwise things seem to be working.

/Users/tougeron/src/go/src/github.com/mtougeron/cluster-api-provider-azure/hack/tools/bin/conversion-gen-v0.23.1 \
		--input-dirs=./exp/api/v1alpha3 \
		--output-file-base=zz_generated.conversion \
		--go-header-file=./hack/boilerplate/boilerplate.generatego.txt --output-base=/Users/tougeron/src/go/src/github.com/mtougeron/cluster-api-provider-azure
E0906 08:11:50.114503   66080 conversion.go:755] Warning: could not find nor generate a final Conversion function for sigs.k8s.io/cluster-api-provider-azure/exp/api/v1beta1.ManagedControlPlaneSubnet -> github.com/mtougeron/cluster-api-provider-azure/exp/api/v1alpha3.ManagedControlPlaneSubnet
E0906 08:11:50.114678   66080 conversion.go:756]   the following fields need manual conversion:
E0906 08:11:50.114686   66080 conversion.go:758]       - ServiceEndpoints

Also need to figure out a refactoring for:

azure/services/subnets/spec.go:60:1: cyclomatic complexity 36 of func `(*SubnetSpec).Parameters` is high (> 30) (gocyclo)
func (s *SubnetSpec) Parameters(existing interface{}) (parameters interface{}, err error) {

mtougeron avatar Sep 19 '22 22:09 mtougeron

The more detailed apidiff is

sigs.k8s.io/cluster-api-provider-azure/api/v1beta1
  Compatible changes:
  - ServiceEndpointSpec: added
  - ServiceEndpoints: added
  - SubnetSpec.ServiceEndpoints: added

sigs.k8s.io/cluster-api-provider-azure/azure/services/subnets
  Compatible changes:
  - SubnetSpec.ServiceEndpoints: added

sigs.k8s.io/cluster-api-provider-azure/exp/api/v1beta1
  Incompatible changes:
  - ManagedControlPlaneSubnet: old is comparable, new is not
  - ManagedControlPlaneVirtualNetwork: old is comparable, new is not
  Compatible changes:
  - ManagedControlPlaneSubnet.ServiceEndpoints: added

but I'm not sure why it works fine for the AzureCluster but not ManagedControlPlane.

mtougeron avatar Sep 23 '22 20:09 mtougeron

/retest

mtougeron avatar Sep 25 '22 13:09 mtougeron

@nojnhuh / @jackfrancis I'm not sure what the apidiff job is telling me is wrong. I know it's tell me something is wrong but I don't know how to track down what it is since the ManagedControlPlane changes are the same as the AzureCluster changes (as far as I can tell).

mtougeron avatar Sep 26 '22 20:09 mtougeron

This is about 50/50 spitball/educated guess, but I think apidiff is complaining that these changes remove the ability to meaningfully compare those types with == since ManagedControlPlaneSubnet only contained two strings before but now includes a slice. I imagine it's not complaining about AzureCluster because those types were already incomparable before these changes.

nojnhuh avatar Sep 26 '22 21:09 nojnhuh

@mtougeron we don't have a lot of E2E coverage around BYO VNET for AzureManagedCluster scenarios, are you confident in your own manual tests that this is feature complete and ready to use? code lgtm

I'll run through another round of manual testing today to be doubly sure. Is there any sort of logging I can provide that would make you feel confident in the testing as well?

mtougeron avatar Sep 28 '22 13:09 mtougeron

@jackfrancis I tested this again for BYO VNET and it doesn't make any changes (as expected/desired).

However, as I look closer, there isn't any validation done on the AzureManagedControlPlane.spec.virtualNetwork (and sub fields like subnet) in the exp/api/v1beta1/azuremanagedcontrolplane_webhook.go code. Would you like me to add that new logic now or should I wait for it to be refactored to be more like the api/v1beta1/azurecluster_validation.go code?

As this relates to this PR, it just ignores the new serviceEndpoints field with BYO VNET and doesn't report anything to the user. While this is not the expected experience, it is similar to what happens when a user sets other fields that are not used with BYO VNET.

mtougeron avatar Sep 28 '22 17:09 mtougeron

/retest

mtougeron avatar Sep 29 '22 21:09 mtougeron

/retest

mtougeron avatar Sep 29 '22 23:09 mtougeron

Apparently I created a regression with my last round of fixes since the pull-cluster-api-provider-azure-e2e test keeps failing. I'll figure out what's different in my manual testing vs the automated test suites and get that fixed.

mtougeron avatar Sep 30 '22 14:09 mtougeron

/test pull-cluster-api-provider-azure-e2e

mtougeron avatar Sep 30 '22 16:09 mtougeron

Ah, finally tracked down the failing log message.

E0930 16:32:16.616238       1 controller.go:326]  "msg"="Reconciler error" "error"="admission webhook \"validation.azurecluster.infrastructure.cluster.x-k8s.io\" denied the request: AzureCluster.infrastructure.cluster.x-k8s.io \"capz-e2e-i7wx2l-public-custom-vnet\" is invalid: [spec.networkSpec.subnets[0].CIDRBlocks: Invalid value: []string{\"10.0.0.0/24\"}: field is immutable, spec.networkSpec.subnets[1].CIDRBlocks: Invalid value: []string{\"10.0.1.0/24\"}: field is immutable]" "azureCluster"={"name":"capz-e2e-i7wx2l-public-custom-vnet","namespace":"capz-e2e-i7wx2l"} "controller"="azurecluster" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureCluster" "name"="capz-e2e-i7wx2l-public-custom-vnet" "namespace"="capz-e2e-i7wx2l" "reconcileID"="2afbcc3e-fa11-444f-9bf3-019cc147134f" 

Looks like the spec gets updated with the cidr blocks from the custom vnet but my checks for changes to those fields are preventing that. Previously the validation only did the checks if it was a managed vnet which is why it was passing previously. I'll update and push the fix

mtougeron avatar Sep 30 '22 18:09 mtougeron

/test pull-cluster-api-provider-azure-e2e

mtougeron avatar Oct 02 '22 03:10 mtougeron

Not sure what's up with prow but I'm going to try again.

/test pull-cluster-api-provider-azure-e2e

mtougeron avatar Oct 02 '22 14:10 mtougeron

Added the SSA flags for the new field

mtougeron avatar Oct 12 '22 18:10 mtougeron

Prow flake

/test pull-cluster-api-provider-azure-test

mtougeron avatar Oct 12 '22 18:10 mtougeron

@mtougeron thanks for your patience with reviews here. This is a big and complex change, but FWIW I think we're really close. Just have a few more questions from my side.

CecileRobertMichon avatar Oct 14 '22 00:10 CecileRobertMichon

@mtougeron thanks for your patience with reviews here. This is a big and complex change, but FWIW I think we're really close. Just have a few more questions from my side.

No problem at all. I've learned a heck of a lot in this one. It's been a ton of fun. :)

mtougeron avatar Oct 18 '22 15:10 mtougeron

@mtougeron: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-azure-apidiff 369c7e47bf7b80b0c2e7c7e77b2bdde32f7b11b1 link false /test pull-cluster-api-provider-azure-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Oct 18 '22 15:10 k8s-ci-robot

/approve

jackfrancis avatar Oct 20 '22 01:10 jackfrancis

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Oct 20 '22 01:10 k8s-ci-robot