Prevent infinite retries of autoscaling
Description
This PR fixes: https://github.com/apache/cloudstack/issues/9318
Types of changes
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI
- [ ] test (unit or integration test code)
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
- [ ] Major
- [ ] Minor
Bug Severity
- [ ] BLOCKER
- [ ] Critical
- [ ] Major
- [ ] Minor
- [ ] Trivial
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?
@weizhouapache I would like some advice on this issue. Do you think that if the number of all VMs, including those in Error and Stopped states, is >= the max size, we should stop scaling any further? Or do you think that if there are VMs in Error state we should retry for a few iterations?
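As an illustration only, here is a minimal sketch of the two counting policies being weighed (hypothetical names, not CloudStack's actual code): option A counts Error/Stopped VMs against the group's max size, so scaling stops once the total reaches it; option B ignores them, so the group keeps trying to replace them.

```java
import java.util.List;

// Hypothetical sketch of the scale-up guard under discussion; names are illustrative.
enum VmState { RUNNING, STOPPED, ERROR }

class ScaleUpGuard {
    /**
     * Returns true if the autoscale group may provision another VM.
     * countErrorAndStopped = true: Error/Stopped VMs still occupy a slot,
     * so scaling stops once the total reaches maxMembers (option A).
     * countErrorAndStopped = false: only healthy VMs are counted, so the
     * group keeps retrying until it has maxMembers running VMs (option B).
     */
    static boolean canScaleUp(List<VmState> groupVms, int maxMembers,
                              boolean countErrorAndStopped) {
        long counted = groupVms.stream()
                .filter(s -> countErrorAndStopped
                        || (s != VmState.ERROR && s != VmState.STOPPED))
                .count();
        return counted < maxMembers;
    }
}
```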
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 4.30%. Comparing base (6e6a276) to head (463d2a6). Report is 302 commits behind head on 4.19.
:exclamation: There is a different number of reports uploaded between BASE (6e6a276) and HEAD (463d2a6): HEAD has 1 fewer upload than BASE.

| Flag | BASE (6e6a276) | HEAD (463d2a6) |
|---|---|---|
| unittests | 1 | 0 |
Additional details and impacted files
Coverage Diff

| | 4.19 (base) | #9574 (head) | +/- |
|---|---|---|---|
| Coverage | 15.08% | 4.30% | -10.79% |
| Files | 5406 | 366 | -5040 |
| Lines | 472889 | 29514 | -443375 |
| Branches | 57738 | 5162 | -52576 |
| Hits | 71352 | 1270 | -70082 |
| Misses | 393593 | 28100 | -365493 |
| Partials | 7944 | 144 | -7800 |
| Flag | Coverage Δ | |
|---|---|---|
| uitests | 4.30% <ø> (ø) | |
| unittests | ? | |

Flags with carried forward coverage won't be shown.
:umbrella: View full report in Codecov by Sentry.
@Pearl1594 I am OK with excluding the VMs in Error state from the VM count. Should we consider Stopped VMs as available in the group?
My other advice would be:
- add a setting for max retries (as @shwstppr advised)
- change the group to another state (for example Disabled, or a new Alert state?) after max retries, so it will not be retried (see the sketch after this comment)
- users can change it back to Enabled when the root-cause issue is solved
cc @DaanHoogland @btzq
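A rough sketch of how the max-retries setting and the state change could work together is below; the setting name, states, and class here are assumptions for illustration, not the actual CloudStack implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: track consecutive failed provisioning attempts per
// autoscale group and stop retrying once a configurable limit is reached.
enum GroupState { ENABLED, DISABLED }

class AutoScaleRetryLimiter {
    // Would be read from a global setting, e.g. "autoscale.max.retries" (hypothetical name).
    private final int maxRetries;
    private final Map<Long, Integer> failureCounts = new ConcurrentHashMap<>();

    AutoScaleRetryLimiter(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Called after each failed VM deployment for the group. */
    GroupState onProvisioningFailure(long groupId, GroupState current) {
        int failures = failureCounts.merge(groupId, 1, Integer::sum);
        if (failures >= maxRetries) {
            // Stop retrying; an operator can set the group back to Enabled
            // (which should call reset) once the root cause is fixed.
            return GroupState.DISABLED;
        }
        return current;
    }

    /** Called on a successful deployment or when the group is re-enabled. */
    void reset(long groupId) {
        failureCounts.remove(groupId);
    }
}
```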
I added the Stopped state imagining a scenario where, for whatever reason, a host goes down: all VMs belonging to the autoscale group on that host would enter the Stopped state, and additional VMs would then be deployed. I thought of it as a problematic scenario, but maybe that's how it should behave; I am not sure of the scope of autoscaling. Maybe you could shed some light on this, @weizhouapache.
Yeah, I think that works. As a user, even if there is anything wrong with a VM, I'd still like it to be considered part of the ASG group. If not, I would not know I had a faulty VM, as I would only find out by going through my list of VMs outside the ASG.
Closing this, as https://github.com/apache/cloudstack/pull/11244 addresses it.