Ensure affinity groups are honored when VMs are deployed in parallel
Description
Fixes #7202 #9110
This PR doesn't completely solve the issue. To completely resolving the issue, we will have to sacrifice the speed at which VMs can be deployed in parallel.
Types of changes
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
- [ ] Major
- [x] Minor
Bug Severity
- [ ] BLOCKER
- [ ] Critical
- [ ] Major
- [x] Minor
- [ ] Trivial
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?
@blueorangutan package
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Codecov Report
Attention: Patch coverage is 11.59420% with 61 lines in your changes missing coverage. Please review.
Project coverage is 15.67%. Comparing base (
3f2761e) to head (120ad26). Report is 90 commits behind head on 4.19.
Additional details and impacted files
@@ Coverage Diff @@
## 4.19 #9201 +/- ##
============================================
+ Coverage 14.96% 15.67% +0.71%
- Complexity 10993 10997 +4
============================================
Files 5373 5010 -363
Lines 469248 439976 -29272
Branches 58782 54849 -3933
============================================
- Hits 70210 68967 -1243
+ Misses 391266 363377 -27889
+ Partials 7772 7632 -140
| Flag | Coverage Δ | |
|---|---|---|
| uitests | ? |
|
| unittests | 15.67% <11.59%> (+<0.01%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9842
@blueorangutan test keepEnv
@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
[SF] Trillian test result (tid-10401) Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7 Total time taken: 43304 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9201-t10401-kvm-centos7.zip Smoke tests completed. 130 look OK, 1 have errors, 0 did not run Only failed and skipped tests results shown below:
| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_01_redundant_vpc_site2site_vpn | Failure |
387.83 | test_vpc_vpn.py |
@vishesh92 have you tested it ?
@vishesh92 have you tested it ?
I tested with simulator. I was able to reproduce and fix the issue for strict anti affinity.
@vishesh92 have you tested it ?
I tested with simulator. I was able to reproduce and fix the issue for strict anti affinity.
thanks @vishesh92 I will create a testing env to test it
@vishesh92 I tested host-affinity,
command="cmk deploy virtualmachine zoneid=5a6557c9-b1e0-4fbf-bca4-7f376a96df65 templateid=344e6034-28ae-11ef-8462-1e00c200016f serviceofferingid=2eb32c51-d604-472b-895e-019f31a2f146 networkids=39323a57-1e12-4f3d-b030-36ef7e850e22"
affinitygroup=$(cmk create affinitygroup type='host affinity' name=test-`uuidgen`)
affinitygroupid=$(echo $affinitygroup |jq -r '.affinitygroup.id')
# create 2 vms in parallel
for i in `seq 1 2`;do $command affinitygroupids=$affinitygroupid & done
I tried twice. first time it works, second try it did not
@vishesh92 I tested host-affinity,
command="cmk deploy virtualmachine zoneid=5a6557c9-b1e0-4fbf-bca4-7f376a96df65 templateid=344e6034-28ae-11ef-8462-1e00c200016f serviceofferingid=2eb32c51-d604-472b-895e-019f31a2f146 networkids=39323a57-1e12-4f3d-b030-36ef7e850e22" affinitygroup=$(cmk create affinitygroup type='host affinity' name=test-`uuidgen`) affinitygroupid=$(echo $affinitygroup |jq -r '.affinitygroup.id') # create 2 vms in parallel for i in `seq 1 2`;do $command affinitygroupids=$affinitygroupid & doneI tried twice. first time it works, second try it did not
Yes. This will reduce the occurrence of failures but not completely resolve the issue.
@vishesh92 I had a look of the process of vm deployment, would it make sense to lock the method below ?
https://github.com/apache/cloudstack/blob/cb9b3134f7fff972b63d8565a4d021f8ea918903/server/src/main/java/com/cloud/deploy/DeploymentPlanningManagerImpl.java#L303-L304
rename it to planDeploymentInternal and create a new planDeployment using Transaction ?
@vishesh92 I had a look of the process of vm deployment, would it make sense to lock the method below ?
https://github.com/apache/cloudstack/blob/cb9b3134f7fff972b63d8565a4d021f8ea918903/server/src/main/java/com/cloud/deploy/DeploymentPlanningManagerImpl.java#L303-L304
rename it to
planDeploymentInternaland create a newplanDeploymentusing Transaction ?
I am also wondering what processes should be locked, maybe until the host is determined ?
maybe we could introduce some mechanism, for example
- when create a vm , lock the
networktable (for example, to fix #7907 , cc @abh1sar ) andresource_limittable (to fix the resource issue ?) - when start a vm, lock the
affinity_grouptable and some other tables if needed ?
the problem is, it could lead to high latency if the platform is large.
@vishesh92 can you review and advise further (PR marked waiting-for-author)?
@vishesh92 can you review and advise further (PR marked waiting-for-author)?
@rohityadavcloud I am running the capc e2e test for affinity group. Let me share the results once the job is complete.
@weizhouapache @sureshanaparti I ran e2e test 5 times on an env with this patch. Here are the results: 2 - Successfully passed 2- timed out 1 - failure
@sureshanaparti , will we try to merge this in 4.19.1?
@DaanHoogland This PR is targeted for 4.19.2