CheckOnHostCommand: add missing timeout setting
Description
The new CheckOnHostCommand constructor did not set a timeout, so it fell back to the default wait timeout (1800s). On a Linstor cluster this meant it took more than 15 minutes for a host to be recognized as down. With a timeout of 20s (the same value the other constructor uses), a host is recognized as down within 4-5 minutes.
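To illustrate the failure mode, here is a minimal sketch using hypothetical stand-in classes (`SketchCommand`, `SketchCheckOnHostCommand`, and their fields are assumptions for illustration, not the actual CloudStack sources). It shows how a constructor that never sets a timeout silently inherits the 1800s default, while one that sets it explicitly gets 20s.

```java
// Hypothetical, simplified stand-ins; names and structure are assumptions,
// not the actual CloudStack classes.
class SketchCommand {
    // Default wait timeout from the description above (1800 s).
    static final int DEFAULT_WAIT_SECONDS = 1800;

    private int waitSeconds = DEFAULT_WAIT_SECONDS;

    void setWait(int seconds) {
        this.waitSeconds = seconds;
    }

    int getWait() {
        return waitSeconds;
    }
}

class SketchCheckOnHostCommand extends SketchCommand {
    static final int CHECK_TIMEOUT_SECONDS = 20;

    private final String hostName;
    private final boolean reportIfStorageDown;

    // Older constructor: sets the short timeout explicitly.
    SketchCheckOnHostCommand(String hostName) {
        this.hostName = hostName;
        this.reportIfStorageDown = false;
        setWait(CHECK_TIMEOUT_SECONDS);
    }

    // Newer constructor: the applyFix flag exists only in this sketch so
    // that both the buggy and the fixed behavior can be compared.
    SketchCheckOnHostCommand(String hostName, boolean reportIfStorageDown, boolean applyFix) {
        this.hostName = hostName;
        this.reportIfStorageDown = reportIfStorageDown;
        if (applyFix) {
            // The one-line fix: without it, getWait() stays at the default.
            setWait(CHECK_TIMEOUT_SECONDS);
        }
    }

    String getHostName() {
        return hostName;
    }

    boolean isReportIfStorageDown() {
        return reportIfStorageDown;
    }
}

class TimeoutSketch {
    public static void main(String[] args) {
        SketchCheckOnHostCommand buggy = new SketchCheckOnHostCommand("host-1", true, false);
        SketchCheckOnHostCommand fixed = new SketchCheckOnHostCommand("host-1", true, true);
        System.out.println("without fix: " + buggy.getWait() + "s");
        System.out.println("with fix: " + fixed.getWait() + "s");
    }
}
```

In this sketch the two constructors set the timeout independently, which is the pattern that allows one of them to miss it.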
Types of changes
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI
- [ ] test (unit or integration test code)
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
- [ ] Major
- [ ] Minor
Bug Severity
- [ ] BLOCKER
- [ ] Critical
- [ ] Major
- [x] Minor
- [ ] Trivial
Screenshots (if appropriate):
How Has This Been Tested?
Failover tests (force shutdown of a host) in a Linstor cluster.
How did you try to break this feature and the system with this change?
Codecov Report
Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
Project coverage is 15.11%. Comparing base (a0932b0) to head (eca66f8). Report is 95 commits behind head on 4.19.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...n/java/com/cloud/agent/api/CheckOnHostCommand.java | 0.00% | 1 Missing :warning: |
Additional details and impacted files
| Coverage Diff | 4.19 | #9677 | +/- |
|---|---|---|---|
| Coverage | 15.08% | 15.11% | +0.02% |
| Complexity | 11192 | 11190 | -2 |
| Files | 5406 | 5406 | |
| Lines | 473215 | 473214 | -1 |
| Branches | 61680 | 58585 | -3095 |
| Hits | 71386 | 71521 | +135 |
| Misses | 393880 | 393883 | +3 |
| Partials | 7949 | 7810 | -139 |
| Flag | Coverage Δ | |
|---|---|---|
| uitests | 4.76% <ø> (+0.46%) | :arrow_up: |
| unittests | 15.80% <0.00%> (-0.01%) | :arrow_down: |
@blueorangutan package
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11163
@blueorangutan package
@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11374
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-11709)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 43298 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9677-t11709-kvm-ol8.zip
Smoke tests completed. 133 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:
| Test | Result | Time (s) | Test File |
|---|---|---|---|
@DaanHoogland how to continue with this?
:D I think we can merge, unless we need more testing for this one-line change. Personally I think smoke tests must have hit this change multiple times, ...
For me this is a regression fix. See also the discussion here: https://github.com/apache/cloudstack/discussions/10097
Surely it can't be intentional for CloudStack to take 15+ minutes to detect a down host?