cloudstack icon indicating copy to clipboard operation
cloudstack copied to clipboard

Ensure affinity groups are honored when VMs are deployed in parallel

Open vishesh92 opened this issue 1 year ago • 19 comments

Description

Fixes #7202 #9110

This PR doesn't completely solve the issue. To completely resolving the issue, we will have to sacrifice the speed at which VMs can be deployed in parallel.

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
  • [ ] build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • [ ] Major
  • [x] Minor

Bug Severity

  • [ ] BLOCKER
  • [ ] Critical
  • [ ] Major
  • [x] Minor
  • [ ] Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

vishesh92 avatar Jun 10 '24 12:06 vishesh92

@blueorangutan package

vishesh92 avatar Jun 10 '24 12:06 vishesh92

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Jun 10 '24 12:06 blueorangutan

Codecov Report

Attention: Patch coverage is 11.59420% with 61 lines in your changes missing coverage. Please review.

Project coverage is 15.67%. Comparing base (3f2761e) to head (120ad26). Report is 90 commits behind head on 4.19.

Files Patch % Lines
...cloudstack/affinity/HostAntiAffinityProcessor.java 0.00% 43 Missing :warning:
...che/cloudstack/affinity/HostAffinityProcessor.java 44.44% 9 Missing and 1 partial :warning:
.../cloudstack/affinity/dao/AffinityGroupDaoImpl.java 0.00% 8 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19    #9201      +/-   ##
============================================
+ Coverage     14.96%   15.67%   +0.71%     
- Complexity    10993    10997       +4     
============================================
  Files          5373     5010     -363     
  Lines        469248   439976   -29272     
  Branches      58782    54849    -3933     
============================================
- Hits          70210    68967    -1243     
+ Misses       391266   363377   -27889     
+ Partials       7772     7632     -140     
Flag Coverage Δ
uitests ?
unittests 15.67% <11.59%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jun 10 '24 13:06 codecov[bot]

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9842

blueorangutan avatar Jun 10 '24 13:06 blueorangutan

@blueorangutan test keepEnv

vishesh92 avatar Jun 10 '24 15:06 vishesh92

@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan avatar Jun 10 '24 15:06 blueorangutan

[SF] Trillian test result (tid-10401) Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7 Total time taken: 43304 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9201-t10401-kvm-centos7.zip Smoke tests completed. 130 look OK, 1 have errors, 0 did not run Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_redundant_vpc_site2site_vpn Failure 387.83 test_vpc_vpn.py

blueorangutan avatar Jun 11 '24 03:06 blueorangutan

@vishesh92 have you tested it ?

weizhouapache avatar Jun 12 '24 10:06 weizhouapache

@vishesh92 have you tested it ?

I tested with simulator. I was able to reproduce and fix the issue for strict anti affinity.

vishesh92 avatar Jun 12 '24 10:06 vishesh92

@vishesh92 have you tested it ?

I tested with simulator. I was able to reproduce and fix the issue for strict anti affinity.

thanks @vishesh92 I will create a testing env to test it

weizhouapache avatar Jun 12 '24 11:06 weizhouapache

@vishesh92 I tested host-affinity,

    command="cmk deploy virtualmachine zoneid=5a6557c9-b1e0-4fbf-bca4-7f376a96df65 templateid=344e6034-28ae-11ef-8462-1e00c200016f serviceofferingid=2eb32c51-d604-472b-895e-019f31a2f146 networkids=39323a57-1e12-4f3d-b030-36ef7e850e22"

    affinitygroup=$(cmk create affinitygroup type='host affinity' name=test-`uuidgen`)
    affinitygroupid=$(echo $affinitygroup |jq -r '.affinitygroup.id')

    # create 2 vms in parallel
    for i in `seq 1 2`;do $command affinitygroupids=$affinitygroupid & done

I tried twice. first time it works, second try it did not

image

weizhouapache avatar Jun 12 '24 13:06 weizhouapache

@vishesh92 I tested host-affinity,

    command="cmk deploy virtualmachine zoneid=5a6557c9-b1e0-4fbf-bca4-7f376a96df65 templateid=344e6034-28ae-11ef-8462-1e00c200016f serviceofferingid=2eb32c51-d604-472b-895e-019f31a2f146 networkids=39323a57-1e12-4f3d-b030-36ef7e850e22"

    affinitygroup=$(cmk create affinitygroup type='host affinity' name=test-`uuidgen`)
    affinitygroupid=$(echo $affinitygroup |jq -r '.affinitygroup.id')

    # create 2 vms in parallel
    for i in `seq 1 2`;do $command affinitygroupids=$affinitygroupid & done

I tried twice. first time it works, second try it did not

Yes. This will reduce the occurrence of failures but not completely resolve the issue.

vishesh92 avatar Jun 12 '24 13:06 vishesh92

@vishesh92 I had a look of the process of vm deployment, would it make sense to lock the method below ?

https://github.com/apache/cloudstack/blob/cb9b3134f7fff972b63d8565a4d021f8ea918903/server/src/main/java/com/cloud/deploy/DeploymentPlanningManagerImpl.java#L303-L304

rename it to planDeploymentInternal and create a new planDeployment using Transaction ?

weizhouapache avatar Jun 14 '24 13:06 weizhouapache

@vishesh92 I had a look of the process of vm deployment, would it make sense to lock the method below ?

https://github.com/apache/cloudstack/blob/cb9b3134f7fff972b63d8565a4d021f8ea918903/server/src/main/java/com/cloud/deploy/DeploymentPlanningManagerImpl.java#L303-L304

rename it to planDeploymentInternal and create a new planDeployment using Transaction ?

I am also wondering what processes should be locked, maybe until the host is determined ?

maybe we could introduce some mechanism, for example

  • when create a vm , lock the network table (for example, to fix #7907 , cc @abh1sar ) and resource_limit table (to fix the resource issue ?)
  • when start a vm, lock the affinity_group table and some other tables if needed ?

the problem is, it could lead to high latency if the platform is large.

weizhouapache avatar Jun 14 '24 14:06 weizhouapache

@vishesh92 can you review and advise further (PR marked waiting-for-author)?

rohityadavcloud avatar Jun 26 '24 09:06 rohityadavcloud

@vishesh92 can you review and advise further (PR marked waiting-for-author)?

@rohityadavcloud I am running the capc e2e test for affinity group. Let me share the results once the job is complete.

vishesh92 avatar Jun 26 '24 09:06 vishesh92

@weizhouapache @sureshanaparti I ran e2e test 5 times on an env with this patch. Here are the results: 2 - Successfully passed 2- timed out 1 - failure

vishesh92 avatar Jun 26 '24 12:06 vishesh92

@sureshanaparti , will we try to merge this in 4.19.1?

DaanHoogland avatar Jul 01 '24 08:07 DaanHoogland

@DaanHoogland This PR is targeted for 4.19.2

vishesh92 avatar Jul 01 '24 08:07 vishesh92