cluster-api-provider-vsphere icon indicating copy to clipboard operation
cluster-api-provider-vsphere copied to clipboard

✨ Add validating and mutating webhook for supervisor mode

Open zhanggbj opened this issue 1 year ago • 34 comments
trafficstars

What this PR does / why we need it:

Add validating and mutating webhook for supervisor mode:

- Refactor supervisor config folder and generate webhook config and manifest.
- Enable update validating for vm-operator VSphereMachine to block in-place update of following fields.
	- ImageName
	- ClassName
	- StorageClass
	- MinHardwareVersion

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #2595

zhanggbj avatar Jan 23 '24 09:01 zhanggbj

Codecov Report

Attention: Patch coverage is 72.72727% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 64.07%. Comparing base (b89c791) to head (4ec4228). Report is 2 commits behind head on main.

:exclamation: Current head 4ec4228 differs from pull request most recent head e910494. Consider uploading reports for the commit e910494 to get more accurate results

Files Patch % Lines
internal/webhooks/vmoperator/vspheremachine.go 72.09% 10 Missing and 2 partials :warning:
internal/test/helpers/envtest.go 25.00% 2 Missing and 1 partial :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2651      +/-   ##
==========================================
- Coverage   65.48%   64.07%   -1.42%     
==========================================
  Files         161      161              
  Lines        7525     9391    +1866     
==========================================
+ Hits         4928     6017    +1089     
- Misses       2134     2914     +780     
+ Partials      463      460       -3     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jan 23 '24 09:01 codecov[bot]

It's ready for a review. And I'll try to add some tests if possible to make codecov happy.

zhanggbj avatar Jan 24 '24 05:01 zhanggbj

It's ready for a review. And I'll try to add some tests if possible to make codecov happy.

Thanks, please ignore codecov, its not failing because of your PR, its the usual codecov flakyness when uploading the report 😭

chrischdi avatar Jan 24 '24 12:01 chrischdi

All comments addressed, it's ready for another review :-)

zhanggbj avatar Jan 25 '24 08:01 zhanggbj

/test pull-cluster-api-provider-vsphere-test-main

zhanggbj avatar Jan 25 '24 08:01 zhanggbj

Once we have the manifests working. Let's please do one manual test with tilt/kind or something (with a real Kubernetes cluster not just unit tests) to verify the webhooks are really active. It's very easy to make a mistake somewhere (e.g. incorrect / missing webhook manifests, incorrect resource match in the webhook config, ...)

sbueringer avatar Jan 25 '24 17:01 sbueringer

Once we have the manifests working. Let's please do one manual test with tilt/kind or something (with a real Kubernetes cluster not just unit tests) to verify the webhooks are really active. It's very easy to make a mistake somewhere (e.g. incorrect / missing webhook manifests, incorrect resource match in the webhook config, ...)

Make sense to me, will run tilt to verify it and update result later.

zhanggbj avatar Jan 29 '24 07:01 zhanggbj

/test pull-cluster-api-provider-vsphere-test-integration-main

zhanggbj avatar Feb 04 '24 08:02 zhanggbj

cc @zhanggbj (just in case you forgot about this one :))

sbueringer avatar Apr 09 '24 12:04 sbueringer

I'm testing locally with tilt for supervisor mode, but I think it's ready for an initial review while waiting for the test result.

zhanggbj avatar Apr 24 '24 04:04 zhanggbj

Deployed a supervisor mode env with tilt and the webhook configs and manifests works fine. Please let me know if any others need to validate :-)

  1. Checked webhook configs of capv-supervisor-mutating-webhook-configuration and capv-supervisor-validating-webhook-configuration.

  2. Updating imageName of vspheremachine will be blocked as expected.

# vspheremachines.vmware.infrastructure.cluster.x-k8s.io "cluster2-67cn6" was not valid:
# * spec.imageName: Invalid value: v1beta1.VSphereMachineSpec{ProviderID:(*string)(0x40005a4e40), FailureDomain:(*string)(nil), ImageName:"ubuntu-2204-kube-v1.29.0", ClassName:"vcsim-default-vm-class", StorageClass:"vcsim-default-storage-class", Volumes:[]v1beta1.VSphereMachineVolume(nil), PowerOffMode:"hard", MinHardwareVersion:""}: updating imageName is not allowed

CC @sbueringer @chrischdi

zhanggbj avatar May 08 '24 09:05 zhanggbj

/test pull-cluster-api-provider-vsphere-e2e-supervisor-main /test pull-cluster-api-provider-vsphere-e2e-govmomi-main

sbueringer avatar May 15 '24 11:05 sbueringer

We also need to change the path here:

https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/5e47949a28d0da488b0c2ba37e8c0a175cc2ceb6/internal/test/helpers/webhook.go#L86

We may also need the similar thing for the supervisor then?!

Perhaps we need to add them here controllers/vmware/test?

zhanggbj avatar May 15 '24 15:05 zhanggbj

@zhanggbj: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command pull-cluster-api-provider-vsphere-test-integration-main 4ec4228 link true /test pull-cluster-api-provider-vsphere-test-integration-main Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Interesting that Prow shows us a job run from February :) https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-vsphere/2651/pull-cluster-api-provider-vsphere-test-integration-main/1759866415801700352

(we deleted the job since then, so let's ignore)

sbueringer avatar May 16 '24 09:05 sbueringer

/lgtm

sbueringer avatar May 16 '24 09:05 sbueringer

LGTM label has been added.

Git tree hash: 0cbe2e1a85f3451505df183f9a362f7067af0325

k8s-ci-robot avatar May 16 '24 09:05 k8s-ci-robot

/test pull-cluster-api-provider-vsphere-e2e-govmomi-main /test pull-cluster-api-provider-vsphere-e2e-supervisor-main

sbueringer avatar May 16 '24 09:05 sbueringer

/test pull-cluster-api-provider-vsphere-e2e-govmomi-main

sbueringer avatar May 16 '24 10:05 sbueringer

This is nice :-)

/approve

chrischdi avatar May 16 '24 10:05 chrischdi

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrischdi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar May 16 '24 10:05 k8s-ci-robot

/hold

for some manual testing

sbueringer avatar May 16 '24 11:05 sbueringer

The govmomi test now failed 2 times in a row. I'm starting to think we broke something

/test pull-cluster-api-provider-vsphere-e2e-govmomi-main

sbueringer avatar May 16 '24 11:05 sbueringer

@zhanggbj: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-vsphere-test-integration-main 4ec4228ae5dd95aef2ec935889983c74f887c470 link true /test pull-cluster-api-provider-vsphere-test-integration-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot avatar May 16 '24 11:05 k8s-ci-robot

Okay 3 times in a row seems too often

sbueringer avatar May 16 '24 15:05 sbueringer

Checked the test artifacts, the failed cluster is quick-start-3x2kmb and it's blocked at WaitForOneKubeadmControlPlaneMachineToExist, which seems blocked due to no ready node. From the kubelet.log, the latest error is CNI not ready as below, I cannot think of any relationship with this PR change🧐

May 16 11:17:54.628315 quick-start-3x2kmb-7q8wn kubelet[1552]: E0516 11:17:54.628259    1552 kubelet.go:2900] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

zhanggbj avatar May 17 '24 07:05 zhanggbj

/test pull-cluster-api-provider-vsphere-e2e-govmomi-main

sbueringer avatar May 17 '24 07:05 sbueringer

Same, but the job is 100% green on main. Let's see if it fails again. We might have to figure out why exactly the job is failing

sbueringer avatar May 17 '24 07:05 sbueringer

The one at pull-cluster-api-provider-vsphere-e2e-govmomi-main was in fact a flake. We also have that on release branches.

xref triage

chrischdi avatar May 17 '24 07:05 chrischdi

Interesting... the flakes all failed at /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/controlplane_helpers.go:153

The one at pull-cluster-api-provider-vsphere-e2e-govmomi-main was in fact a flake. We also have that on release branches.

xref triage

zhanggbj avatar May 17 '24 09:05 zhanggbj

Kk. Maybe just bad luck :)

sbueringer avatar May 17 '24 11:05 sbueringer