cloudstack icon indicating copy to clipboard operation
cloudstack copied to clipboard

Prevent infinite retries of autoscaling

Open Pearl1594 opened this issue 1 year ago • 2 comments

Description

This PR fixes: https://github.com/apache/cloudstack/issues/9318

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
  • [ ] build/CI
  • [ ] test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • [ ] Major
  • [ ] Minor

Bug Severity

  • [ ] BLOCKER
  • [ ] Critical
  • [ ] Major
  • [ ] Minor
  • [ ] Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

Pearl1594 avatar Aug 22 '24 21:08 Pearl1594

@weizhouapache would like some advice on this issue. Do you think that if the number of all VMs including those in error and stopped states >= max size then we should stop scaling any further Or do you think if there are VMs in error state we need to retry for a few iterations?

Pearl1594 avatar Aug 22 '24 21:08 Pearl1594

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 4.30%. Comparing base (6e6a276) to head (463d2a6). Report is 302 commits behind head on 4.19.

:exclamation: There is a different number of reports uploaded between BASE (6e6a276) and HEAD (463d2a6). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (6e6a276) HEAD (463d2a6)
unittests 1 0
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #9574       +/-   ##
============================================
- Coverage     15.08%   4.30%   -10.79%     
============================================
  Files          5406     366     -5040     
  Lines        472889   29514   -443375     
  Branches      57738    5162    -52576     
============================================
- Hits          71352    1270    -70082     
+ Misses       393593   28100   -365493     
+ Partials       7944     144     -7800     
Flag Coverage Δ
uitests 4.30% <ø> (ø)
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Aug 22 '24 22:08 codecov[bot]

@Pearl1594 I am ok with excluding the vms in Error state in the vm count. should we consider Stopped vms as available in the group ?

my other advice would be

  • add a setting for max retries (as @shwstppr advised)
  • change to another state (for example Disabled, or new state Alert ?) after max retries so it will not be retried
  • users can change back to Enabled when the root cause issue is solved

cc @DaanHoogland @btzq

weizhouapache avatar Jan 21 '25 14:01 weizhouapache

I added Stopped State imagining a scenario where in say for whatever reason a host goes down, all VMs belonging to the autoscale group, on that host would enter stopped state. This would cause additional VMs to be redeployed. I thought of it as a problematic scenario. But maybe that's how it should behave, I am not sure of the scope of autoscaling. Maybe you could shed some light on this @weizhouapache

Pearl1594 avatar Jan 21 '25 18:01 Pearl1594

@Pearl1594 I am ok with excluding the vms in Error state in the vm count. should we consider Stopped vms as available in the group ?

my other advice would be

  • add a setting for max retries (as @shwstppr advised)
  • change to another state (for example Disabled, or new state Alert ?) after max retries so it will not be retried
  • users can change back to Enabled when the root cause issue is solved

cc @DaanHoogland @btzq

Yea i think that works. As a user, even if there is anything wrong with the VM, id still like it to be considered under the ASG Group. If not, i would not know i had a faulty VM as i would only find out by going through my list of VMs outside the ASG.

btzq avatar Jan 28 '25 10:01 btzq

closing this as https://github.com/apache/cloudstack/pull/11244 addresses it

Pearl1594 avatar Jul 25 '25 13:07 Pearl1594