
Use VMCache instead of VMICache to judge if the NAD is in use

Open mingshuoqiu opened this issue 1 year ago • 14 comments

The VMI will no longer exist once the VM is turned off from the GUI, so the VMI information cannot be relied on to determine whether a particular NAD is attached to any VM. Use the VMCache instead, to prevent the NAD from being deleted while its VM is merely turned off.

Problem: deleting a stopped VM fails if its network has already been deleted. Currently, a VM network can be deleted while the attached VM is off, which causes problems when the user tries to turn the VM back on.

Solution: do not allow a VM network to be removed while any VM still references it, even if that VM is off.
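A minimal sketch of the intended webhook check, using simplified stand-in types (the real implementation uses the KubeVirt VirtualMachine type and the controller's VM cache; `denyNADDeletion` and the field names here are hypothetical):

```go
package main

import "fmt"

// VirtualMachine is a simplified stand-in for the KubeVirt VM object;
// Networks holds the NAD names referenced in the VM spec.
type VirtualMachine struct {
	Name     string
	Networks []string
	Running  bool
}

// denyNADDeletion rejects the deletion if any VM still references the NAD.
// Checking VM objects (not VMIs, which vanish when a VM is stopped) is the
// point of the fix: a stopped VM still protects its NAD.
func denyNADDeletion(nadName string, vms []VirtualMachine) error {
	for _, vm := range vms {
		for _, n := range vm.Networks {
			if n == nadName {
				return fmt.Errorf("NAD %q is still used by VM %q (running=%v)",
					nadName, vm.Name, vm.Running)
			}
		}
	}
	return nil
}
```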

Related Issue: https://github.com/harvester/harvester/issues/6961

Test plan:

  1. Create a VM network.
  2. Create a VM and attach it to the created VM network.
  3. Turn the VM from Running to Off.
  4. Try to remove the VM network; the removal should not be allowed.

mingshuoqiu avatar Dec 04 '24 04:12 mingshuoqiu

@mingshuoqiu can you also add steps to the test plan for verifying deletion of the VLAN config with a stopped VM.

rrajendran17 avatar Dec 04 '24 19:12 rrajendran17

Thanks @mingshuoqiu for taking care of the comments. Overall LGTM, except for a few minor comments. Can you please validate the code changes for the NAD-deletion and VLAN-config-deletion scenarios with stopped VMs, and add the steps to the test plan as well? Thanks.

rrajendran17 avatar Dec 17 '24 20:12 rrajendran17

Need to re-implement WhoUseNad since indexeres.VMByNetworkIndex has been removed in Harvester v1.4. Will replace WhoUseNad with different logic.

mingshuoqiu avatar Dec 30 '24 15:12 mingshuoqiu

I have a general question regarding the fix: Do we still need the VmiGetter? Under what circumstances do we rely on VMIs to check whether an action is allowed or not? Maybe the logic in the WhoUseNad for VMI and VM is redundant, and we can completely replace VMI with VM. Please advise. Thank you.

I tried to replace all the VMI handlers with VM handlers, but I get the following log from the webhook pod. How should I add the resource to the access list? CRD? @starbops

W0116 09:04:53.736063       1 reflector.go:547] github.com/rancher/lasso/pkg/cache/cache.go:145: failed to list *v1.VirtualMachine: virtualmachines.kubevirt.io is forbidden: User "system:serviceaccount:harvester-system:harvester-network-webhook" cannot list resource "virtualmachines" in API group "kubevirt.io" at the cluster scope
E0116 09:04:53.736170       1 reflector.go:150] github.com/rancher/lasso/pkg/cache/cache.go:145: Failed to watch *v1.VirtualMachine: failed to list *v1.VirtualMachine: virtualmachines.kubevirt.io is forbidden: User "system:serviceaccount:harvester-system:harvester-network-webhook" cannot list resource "virtualmachines" in API group "kubevirt.io" at the cluster scope

mingshuoqiu avatar Jan 17 '25 03:01 mingshuoqiu

@mingshuoqiu Please also add a test-plan step for the NAD update scenario; the update should not be allowed for non-running VMs either. Thanks.

rrajendran17 avatar Feb 22 '25 00:02 rrajendran17

@mingshuoqiu What's the current status & plan of this PR? Will it be merged on v1.5.0? thanks.

https://github.com/harvester/network-controller-harvester/pull/149 is better rebased on top of this PR, so that it can have a unified check on NAD, VM, VMI, and storagenetwork.

w13915984028 avatar Feb 24 '25 13:02 w13915984028

harvester-network-webhook...

@mingshuoqiu Need to add vm to the clusterrole

https://github.com/harvester/charts/blob/761090116ddb657e4850308ebb551cf44c67f43f/charts/harvester-network-controller/templates/rbac.yaml#L124
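The forbidden errors above indicate the webhook's ClusterRole lacks list/watch permissions on KubeVirt VirtualMachines. A sketch of the rule that would need to go into the linked rbac.yaml (assumed shape, not the merged change):

```yaml
- apiGroups: ["kubevirt.io"]
  resources: ["virtualmachines"]
  verbs: ["get", "list", "watch"]
```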

w13915984028 avatar Feb 24 '25 13:02 w13915984028

I'm verifying with the updated RBAC. Will update when verification is done.

mingshuoqiu avatar Feb 25 '25 03:02 mingshuoqiu

BTW:

In this commit https://github.com/harvester/network-controller-harvester/pull/149/commits/9f06c5475073e2b82e67298efcbee9dea32b8f80#diff-900397cb9002ed992e8565ae266dc4ec231a187b1d3cf016f5fa0cd6ebf82d04, I added two additional checks to the NAD webhook:

checkClusterNetwork: a NAD can't refer to a non-existent clusternetwork

checkClusterNetworkUnchanged: a NAD can't change the clusternetwork it refers to

In this commit https://github.com/harvester/network-controller-harvester/pull/149/commits/7dfe2d6f3afed47d016c09ad2f23d4bfcd39e4f0#diff-900397cb9002ed992e8565ae266dc4ec231a187b1d3cf016f5fa0cd6ebf82d04

checkStorageNetwork: a storagenetwork NAD can't be changed or deleted by the user
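The first two checks could be sketched like this (simplified, hypothetical types; the real webhook operates on NetworkAttachmentDefinition objects and a clusternetwork cache):

```go
package main

import "fmt"

// NAD is a simplified stand-in for a NetworkAttachmentDefinition;
// ClusterNetwork is the clusternetwork the NAD refers to.
type NAD struct {
	Name           string
	ClusterNetwork string
}

// checkClusterNetwork rejects a NAD that refers to a clusternetwork
// not present in the given set of existing clusternetworks.
func checkClusterNetwork(nad NAD, existing map[string]bool) error {
	if !existing[nad.ClusterNetwork] {
		return fmt.Errorf("NAD %q refers to non-existent clusternetwork %q",
			nad.Name, nad.ClusterNetwork)
	}
	return nil
}

// checkClusterNetworkUnchanged rejects an update that moves a NAD to a
// different clusternetwork.
func checkClusterNetworkUnchanged(oldNAD, newNAD NAD) error {
	if oldNAD.ClusterNetwork != newNAD.ClusterNetwork {
		return fmt.Errorf("NAD %q: changing clusternetwork from %q to %q is not allowed",
			oldNAD.Name, oldNAD.ClusterNetwork, newNAD.ClusterNetwork)
	}
	return nil
}
```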

w13915984028 avatar Feb 25 '25 15:02 w13915984028

More about the scenarios:

At the current stage, the most important parameter of a NAD is the VID.

(1) Currently, a NAD is allowed to be changed as long as all attached VMIs are stopped.

If we require that VMs also remove their references to the NAD before the NAD can be changed, that is a breaking change. Scenarios that would break include:

(a) A user migrates a group of VMs from VID 100 to VID 200 due to a network re-plan, or changes the L3 IP subnets for this VID.

(b) A user creates a dedicated NAD for network testing, and creates a couple of VMs attached to this NAD. The user plans to open a new VID 300: stop this group of VMs, change the NAD, start the VMs, then test the new VID. The user can use this group of VMs to test any existing VID at any time.

(c) ...

(2) Changing the BrName (clusternetwork) of a NAD.

This is currently a bug in Harvester: the change can be made if (1) is met, but the related VMs are not updated to reflect it. Or maybe I missed something and Harvester has code to handle it; please remind me if you know.

Luckily, we have not received related bug reports so far, if it is a real bug.

We need to:

(i) Enhance (1) via this PR https://github.com/harvester/network-controller-harvester/pull/131, which adds an additional check on VMs when a NAD is deleted.

Be very careful not to introduce a breaking change; VMs are the core feature of Harvester.

(ii) Enhance (2) via issue https://github.com/harvester/harvester/issues/7701 and its related PR, which denies changes to the BrName (clusternetwork).


But IMO, we should allow such updates on NADs that are not referenced by any VM.

@rrajendran17 There are race conditions: if creating a VM and changing the NAD happen in parallel, issues may still arise.

When a NAD is attached to the wrong clusternetwork (which happens very rarely), delete-and-recreate is not that unacceptable. We may treat this as a corner case and document the limitation.

If some user really needs this, we can try to support it later.

w13915984028 avatar Feb 25 '25 19:02 w13915984028

FYI:

When deleting a NAD, the VM is now checked along with the VMI. This is done via https://github.com/harvester/network-controller-harvester/pull/149/commits/97eadd5c42b08dc6c431df2f16c7514c41fe0774 to cooperate with the other functions. Test code is also added.

https://github.com/harvester/charts/pull/339 is also filed to update the RBAC.

w13915984028 avatar Feb 27 '25 14:02 w13915984028

@w13915984028 Could you help review again? Thanks

mingshuoqiu avatar Mar 13 '25 03:03 mingshuoqiu

@mingshuoqiu This depends on the release strategy by @bk201 @starbops

For issue https://github.com/harvester/harvester/issues/6961, this PR https://github.com/harvester/network-controller-harvester/pull/131 can prevent new NAD deletions from leaving VMs dangling; for the existing dangling VMs, the error is still there.

For Harvester v1.5.0, if we want to add a fix, I would recommend https://github.com/harvester/harvester/pull/7743, which fixes the root cause of issue https://github.com/harvester/harvester/issues/6961. Any existing VMs with dangling NADs can then be deleted successfully.

Discussed with @bk201, he prefers to add this PR to v1.5.1/v1.6.0.

For v1.5.1/v1.6.0, checking VMs when deleting a NAD is already part of https://github.com/harvester/network-controller-harvester/pull/149, which includes more APIs and test code.

Please @bk201 @starbops take a new look, thanks.

w13915984028 avatar Mar 13 '25 06:03 w13915984028

let's do it in 1.5.1, thanks!

bk201 avatar Mar 13 '25 07:03 bk201