CheckOnHostCommand: add missing timeout setting

Open rp- opened this issue 1 year ago • 8 comments

Description

The new CheckOnHostCommand constructor was missing a reasonable timeout value, so it fell back to the wait timeout (1800s). On a Linstor cluster this resulted in a wait of over 15 minutes before a host was recognized as down. With a timeout of 20s (as in the other constructor), it takes 4-5 minutes for a host to be recognized as down.
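To illustrate the kind of change described above, here is a minimal, self-contained sketch. It is not the actual CloudStack source: the `Command` base class, its 1800s default, the `String` host name, and the flag name `reportFailureIfOneStorageIsDown` are simplified, hypothetical stand-ins. Only the 20s and 1800s values come from the description.

```java
public class CheckOnHostTimeoutSketch {

    /** Hypothetical stand-in for the base command class, with the long wait mentioned in the description. */
    public static class Command {
        public static final int DEFAULT_WAIT_SECONDS = 1800;
        private int wait = DEFAULT_WAIT_SECONDS;

        public int getWait() {
            return wait;
        }

        public void setWait(int wait) {
            this.wait = wait;
        }
    }

    /** Simplified stand-in for CheckOnHostCommand; the fields are illustrative assumptions. */
    public static class CheckOnHostCommand extends Command {
        public static final int CHECK_TIMEOUT_SECONDS = 20;

        private final String hostName;
        private final boolean reportFailureIfOneStorageIsDown;

        /** Older constructor: already sets the short timeout. */
        public CheckOnHostCommand(String hostName) {
            this(hostName, false);
        }

        /** Newer constructor: without the setWait call it would keep the 1800s wait. */
        public CheckOnHostCommand(String hostName, boolean reportFailureIfOneStorageIsDown) {
            this.hostName = hostName;
            this.reportFailureIfOneStorageIsDown = reportFailureIfOneStorageIsDown;
            setWait(CHECK_TIMEOUT_SECONDS); // the kind of one-line fix described above
        }

        public String getHostName() {
            return hostName;
        }

        public boolean isReportFailureIfOneStorageIsDown() {
            return reportFailureIfOneStorageIsDown;
        }
    }

    public static void main(String[] args) {
        CheckOnHostCommand oldStyle = new CheckOnHostCommand("kvm-host-1");
        CheckOnHostCommand newStyle = new CheckOnHostCommand("kvm-host-1", true);
        System.out.println("old constructor wait: " + oldStyle.getWait() + "s");
        System.out.println("new constructor wait: " + newStyle.getWait() + "s");
    }
}
```

Delegating the one-argument constructor to the two-argument one keeps the timeout in a single place, which would avoid this kind of drift between constructors. Whether the real class is structured that way is not stated in this thread.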

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
  • [ ] build/CI
  • [ ] test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • [ ] Major
  • [ ] Minor

Bug Severity

  • [ ] BLOCKER
  • [ ] Critical
  • [ ] Major
  • [x] Minor
  • [ ] Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Failover tests (force shutdown of a host) in a Linstor cluster.

How did you try to break this feature and the system with this change?

rp- avatar Sep 13 '24 12:09 rp-

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 15.11%. Comparing base (a0932b0) to head (eca66f8). Report is 95 commits behind head on 4.19.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...n/java/com/cloud/agent/api/CheckOnHostCommand.java | 0.00% | 1 Missing :warning: |

Additional details and impacted files
@@             Coverage Diff              @@
##               4.19    #9677      +/-   ##
============================================
+ Coverage     15.08%   15.11%   +0.02%     
+ Complexity    11192    11190       -2     
============================================
  Files          5406     5406              
  Lines        473215   473214       -1     
  Branches      61680    58585    -3095     
============================================
+ Hits          71386    71521     +135     
- Misses       393880   393883       +3     
+ Partials       7949     7810     -139     

| Flag | Coverage Δ |
|---|---|
| uitests | 4.76% <ø> (+0.46%) :arrow_up: |
| unittests | 15.80% <0.00%> (-0.01%) :arrow_down: |

Flags with carried forward coverage won't be shown. Click here to find out more.

codecov[bot] avatar Sep 13 '24 12:09 codecov[bot]

@blueorangutan package

weizhouapache avatar Sep 18 '24 08:09 weizhouapache

@blueorangutan package

sureshanaparti avatar Sep 20 '24 12:09 sureshanaparti

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Sep 20 '24 12:09 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11163

blueorangutan avatar Sep 20 '24 13:09 blueorangutan

@blueorangutan package

rohityadavcloud avatar Oct 17 '24 09:10 rohityadavcloud

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Oct 17 '24 09:10 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11374

blueorangutan avatar Oct 17 '24 10:10 blueorangutan

@blueorangutan test

DaanHoogland avatar Oct 28 '24 14:10 DaanHoogland

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan avatar Oct 28 '24 14:10 blueorangutan

[SF] Trillian test result (tid-11709)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 43298 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9677-t11709-kvm-ol8.zip
Smoke tests completed. 133 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

blueorangutan avatar Oct 29 '24 02:10 blueorangutan

@DaanHoogland how to continue with this?

rp- avatar Jan 08 '25 12:01 rp-

> @DaanHoogland how to continue with this?

:D I think we can merge, unless we need more testing for this one-line change. Personally I think smoke tests must have hit this change multiple times, ...

DaanHoogland avatar Jan 08 '25 12:01 DaanHoogland

For me this is a regression fix. See also this discussion here: https://github.com/apache/cloudstack/discussions/10097

It can't be on purpose for CloudStack to take 15+ mins to detect a down host?

rp- avatar Jan 08 '25 12:01 rp-