VR: remove old json config when start vmware/xenserver VPC VRs

Open weizhouapache opened this issue 3 years ago • 123 comments

Description

This PR fixes an issue where IPs are associated with the wrong interfaces when a VR is rebooted in a VMware/Xen environment. However, with this change the VR will be broken if it is rebooted from vCenter or XenCenter (rather than from CloudStack).

Steps to reproduce the issue:
(1) Create a VPC and a VPC tier.
(2) Acquire an IP in an additional IP range, and enable static NAT or PF/LB.
(3) Reboot the VR in CloudStack.

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • [ ] Major
  • [ ] Minor

Bug Severity

  • [ ] BLOCKER
  • [ ] Critical
  • [x] Major
  • [ ] Minor
  • [ ] Trivial

Screenshots (if appropriate):

How Has This Been Tested?

weizhouapache avatar Feb 04 '22 18:02 weizhouapache

@blueorangutan package

weizhouapache avatar Feb 04 '22 18:02 weizhouapache

@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan avatar Feb 04 '22 18:02 blueorangutan

Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian :heavy_check_mark: suse15. SL-JID 2465

blueorangutan avatar Feb 04 '22 19:02 blueorangutan

@sureshanaparti should this be considered on 4.16.1?

nvazquez avatar Feb 05 '22 17:02 nvazquez

@blueorangutan test centos7 xcpng82

weizhouapache avatar Feb 07 '22 07:02 weizhouapache

@weizhouapache a Trillian-Jenkins test job (centos7 mgmt + xcpng82) has been kicked to run smoke tests

blueorangutan avatar Feb 07 '22 07:02 blueorangutan

@weizhouapache doesn't this beat the whole idea of persistent configs in VRs? cc @sureshanaparti @DaanHoogland

rohityadavcloud avatar Feb 07 '22 09:02 rohityadavcloud

@rohityadavcloud

Each time a VR is started from CloudStack, the config files are regenerated, so persistent config is not required in this scenario. I understand that persistent config is helpful when a VR is rebooted from vCenter or XenCenter. However, it breaks the VPC VR in many scenarios when the VR is rebooted from CloudStack.

PS: this behaviour has already been applied to VPC VRs on KVM hosts, where there is no centralized management other than CloudStack.

weizhouapache avatar Feb 07 '22 11:02 weizhouapache

@blueorangutan test centos7 vmware-7u2

DaanHoogland avatar Feb 07 '22 13:02 DaanHoogland

@DaanHoogland unsupported parameters provided. Supported mgmt server os are: suse15, centos7, centos6, alma8, ubuntu18, ubuntu20, rocky8. Supported hypervisors are: kvm-centos6, kvm-centos7, kvm-rocky8, kvm-alma8, kvm-ubuntu18, kvm-ubuntu20, kvm-suse15, vmware-55u3, vmware-60u2, vmware-65u2, vmware-67u3, vmware-70u1, vmware-70u2, vmware-70u3, xenserver-65sp1, xenserver-71, xenserver-74, xcpng74, xcpng76, xcpng80, xcpng81, xcpng82

blueorangutan avatar Feb 07 '22 13:02 blueorangutan

@blueorangutan test centos7 vmware-70u2

DaanHoogland avatar Feb 07 '22 13:02 DaanHoogland

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + vmware-70u2) has been kicked to run smoke tests

blueorangutan avatar Feb 07 '22 13:02 blueorangutan

Trillian test result (tid-3196)
Environment: xcpng82 (x2), Advanced Networking with Mgmt server 7
Total time taken: 48954 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5938-t3196-xcpng82.zip
Smoke tests completed. 91 look OK, 1 have errors
Only failed tests results shown below:

Test | Result | Time (s) | Test File
test_01_sys_vm_start | Failure | 0.10 | test_secondary_storage.py

blueorangutan avatar Feb 07 '22 21:02 blueorangutan

Trillian test result (tid-3212)
Environment: vmware-70u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 35443 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5938-t3212-vmware-70u2.zip
Smoke tests completed. 92 look OK, 0 have errors
Only failed tests results shown below:

Test | Result | Time (s) | Test File

blueorangutan avatar Feb 08 '22 00:02 blueorangutan

@weizhouapache is this expected/intermittent? can you have a look?

DaanHoogland avatar Feb 08 '22 08:02 DaanHoogland

test_secondary_storage.py

@DaanHoogland It should not be related to this PR. I have seen it a few times before. I will have a look.

weizhouapache avatar Feb 08 '22 08:02 weizhouapache

@DaanHoogland @rohityadavcloud @sureshanaparti this conflicts with persistent config, which is useful when the VR is rebooted out-of-band (e.g. from vCenter, or by a command inside the VR). However, the NICs of a VPC VR are always plugged in the following order when the VPC VR is started from CloudStack:
(1) source NAT IP
(2) additional public IPs
(3) private gateway
(4) VPC tiers

This order is sometimes different from the order of the IPs in the JSON files inside the VR. This happens in many scenarios, for example:
(1) a public IP in an additional range is associated
(2) a private gateway is created after VPC tier creation
(3) a VPC tier is removed (not the last VPC tier)

When this happens, IPs are associated with the wrong interfaces when the VR is rebooted from CloudStack.

With this PR, CloudStack can ensure that the order is correct and that IPs are associated with the correct interfaces. But if the VR is rebooted out-of-band, it will no longer work, as the JSON config files are removed during bootstrap.
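
To make the ordering problem concrete, here is a minimal sketch of scenario (1). It is only a toy model, not CloudStack code: the function names, the `kind` labels, the example IPs, and the assumption that eth0 is the control NIC (so numbering starts at eth1) are all hypothetical.

```python
# Toy model of the NIC plug order described above (not CloudStack code).
PLUG_ORDER = ["source_nat", "additional_public", "private_gateway", "vpc_tier"]


def devices_after_start(nics):
    """Map each IP to the device it gets when the VR is started from CloudStack.

    Assumption: eth0 is the control NIC, so numbering starts at eth1.
    sorted() is stable, so NICs of the same kind keep their given order.
    """
    ordered = sorted(nics, key=lambda nic: PLUG_ORDER.index(nic["kind"]))
    return {nic["ip"]: f"eth{i}" for i, nic in enumerate(ordered, start=1)}


def misplaced_ips(stale_config, nics):
    """Return {ip: (device in stale JSON, device after start)} for mismatches."""
    actual = devices_after_start(nics)
    return {
        ip: (dev, actual[ip])
        for ip, dev in stale_config.items()
        if ip in actual and actual[ip] != dev
    }


# NICs in the order they were originally added: the additional public IP
# was hot-plugged last, after the VPC tier already existed.
nics = [
    {"ip": "203.0.113.10", "kind": "source_nat"},
    {"ip": "10.1.1.1", "kind": "vpc_tier"},
    {"ip": "198.51.100.20", "kind": "additional_public"},
]

# Device mapping as it would have been persisted inside the VR before reboot.
stale = {"203.0.113.10": "eth1", "10.1.1.1": "eth2", "198.51.100.20": "eth3"}

print(misplaced_ips(stale, nics))
```

In this model the tier gateway and the additional public IP swap devices after a restart from CloudStack, which is the mismatch the stale JSON files would reintroduce.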

We need to decide which case we should support better: rebooting the VR from CloudStack, or out-of-band.

weizhouapache avatar Feb 08 '22 10:02 weizhouapache

A feasible improvement: remove the JSON files only if it is a VPC VR, so that network VRs are not impacted.
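
As a rough illustration of that idea, the sketch below deletes JSON files only when the router is a VPC VR. It is not the PR's actual implementation: the function name, the `config_dir` parameter, and the `"vpcrouter"` type string are assumptions made for the example.

```python
import glob
import os


def remove_stale_json(config_dir, router_type):
    """Delete *.json files in config_dir, but only for a VPC VR.

    Assumption: the router type arrives as a string and a VPC VR is
    identified by "vpcrouter"; any other type is left untouched.
    Returns the list of removed paths.
    """
    if router_type != "vpcrouter":
        return []
    removed = []
    for path in sorted(glob.glob(os.path.join(config_dir, "*.json"))):
        os.remove(path)
        removed.append(path)
    return removed
```

With this shape, a router of any other type keeps its persisted JSON files and so still survives an out-of-band reboot.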

weizhouapache avatar Feb 08 '22 10:02 weizhouapache

@sureshanaparti @weizhouapache I think we should investigate further whether this can be solved in a non-conflicting way for both CloudStack-controlled and out-of-band reboots. I suggest moving this to milestone 4.17.

DaanHoogland avatar Feb 08 '22 12:02 DaanHoogland

@blueorangutan package

weizhouapache avatar Mar 10 '22 13:03 weizhouapache

@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan avatar Mar 10 '22 13:03 blueorangutan

Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian :heavy_check_mark: suse15. SL-JID 2833

blueorangutan avatar Mar 10 '22 14:03 blueorangutan

@blueorangutan test matrix

weizhouapache avatar Mar 10 '22 14:03 weizhouapache

@weizhouapache a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

blueorangutan avatar Mar 10 '22 14:03 blueorangutan

Trillian test result (tid-3561)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33673 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5938-t3561-xenserver-71.zip
Smoke tests completed. 92 look OK, 0 have errors
Only failed tests results shown below:

Test | Result | Time (s) | Test File

blueorangutan avatar Mar 11 '22 00:03 blueorangutan

@weizhouapache is this ready for review or still needs more work?

nvazquez avatar Mar 11 '22 03:03 nvazquez

@nvazquez this code is ready for review and testing. It requires manual testing of rebooting the VR from inside it (or out-of-band).

I am working on fixing component tests.

weizhouapache avatar Mar 11 '22 08:03 weizhouapache

clgtm, seems to do what it says on the tin. One question: for VMware we query and get only one VM and reconfigure it, while for Xen we get a list and iterate over it. Is this difference real or just a result of the API definitions, i.e. can more than one exist on Xen?

Sorry @DaanHoogland, can you clarify the question? The process of passing the cmdline to VRs differs between hypervisors.

weizhouapache avatar Mar 14 '22 09:03 weizhouapache

@blueorangutan package

weizhouapache avatar Mar 14 '22 09:03 weizhouapache

@weizhouapache a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Mar 14 '22 09:03 blueorangutan