cloudstack Prevent infinite retries of autoscaling

Description

This PR fixes: https://github.com/apache/cloudstack/issues/9318

Types of changes

[ ] Breaking change (fix or feature that would cause existing functionality to change)
[ ] New feature (non-breaking change which adds functionality)
[ ] Bug fix (non-breaking change which fixes an issue)
[ ] Enhancement (improves an existing feature and functionality)
[ ] Cleanup (Code refactoring and cleanup, that may add test cases)
[ ] build/CI
[ ] test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

[ ] Major
[ ] Minor

Bug Severity

[ ] BLOCKER
[ ] Critical
[ ] Major
[ ] Minor
[ ] Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

Aug 22 '24 21:08 Pearl1594

@weizhouapache would like some advice on this issue. Do you think that if the number of all VMs including those in error and stopped states >= max size then we should stop scaling any further Or do you think if there are VMs in error state we need to retry for a few iterations?

Aug 22 '24 21:08 Pearl1594

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 4.30%. Comparing base (6e6a276) to head (463d2a6). Report is 302 commits behind head on 4.19.

:exclamation: There is a different number of reports uploaded between BASE (6e6a276) and HEAD (463d2a6). Click for more details.

HEAD has 1 upload less than BASE

Flag BASE (6e6a276) HEAD (463d2a6)

unittests 1 0

Flag	BASE (6e6a276)	HEAD (463d2a6)
unittests	1	0

Additional details and impacted files

@@             Coverage Diff              @@
##               4.19   #9574       +/-   ##
============================================
- Coverage     15.08%   4.30%   -10.79%     
============================================
  Files          5406     366     -5040     
  Lines        472889   29514   -443375     
  Branches      57738    5162    -52576     
============================================
- Hits          71352    1270    -70082     
+ Misses       393593   28100   -365493     
+ Partials       7944     144     -7800

Flag	Coverage Δ
uitests	`4.30% <ø> (ø)`
unittests	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
:package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Aug 22 '24 22:08 codecov[bot]

@Pearl1594 I am ok with excluding the vms in Error state in the vm count. should we consider Stopped vms as available in the group ?

my other advice would be

add a setting for max retries (as @shwstppr advised)
change to another state (for example Disabled, or new state Alert ?) after max retries so it will not be retried
users can change back to Enabled when the root cause issue is solved

cc @DaanHoogland @btzq

Jan 21 '25 14:01 weizhouapache

I added Stopped State imagining a scenario where in say for whatever reason a host goes down, all VMs belonging to the autoscale group, on that host would enter stopped state. This would cause additional VMs to be redeployed. I thought of it as a problematic scenario. But maybe that's how it should behave, I am not sure of the scope of autoscaling. Maybe you could shed some light on this @weizhouapache

Jan 21 '25 18:01 Pearl1594

@Pearl1594 I am ok with excluding the vms in Error state in the vm count. should we consider Stopped vms as available in the group ?

my other advice would be

add a setting for max retries (as @shwstppr advised)

change to another state (for example Disabled, or new state Alert ?) after max retries so it will not be retried

users can change back to Enabled when the root cause issue is solved

cc @DaanHoogland @btzq

Yea i think that works. As a user, even if there is anything wrong with the VM, id still like it to be considered under the ASG Group. If not, i would not know i had a faulty VM as i would only find out by going through my list of VMs outside the ASG.

Jan 28 '25 10:01 btzq

closing this as https://github.com/apache/cloudstack/pull/11244 addresses it

Jul 25 '25 13:07 Pearl1594