cloudstack
Disabled the setting `reboot.host.and.alert.management.on.heartbeat.timeout` by default
Description
This PR disables the setting `reboot.host.and.alert.management.on.heartbeat.timeout` by default. With the setting enabled, CloudStack reboots the host whenever there is a storage issue, even if high availability isn't enabled.
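To illustrate the effect of a default-off boolean setting, here is a minimal sketch. It is not CloudStack's actual implementation: the class `HeartbeatTimeoutPolicy` and the helper `shouldRebootOnHeartbeatTimeout` are hypothetical names introduced only for this example, and only the property name is taken from the PR title.

```java
import java.util.Properties;

public class HeartbeatTimeoutPolicy {
    // Property name as given in the PR title.
    static final String KEY = "reboot.host.and.alert.management.on.heartbeat.timeout";

    // Hypothetical helper (an assumption, not CloudStack code):
    // returns true only when the property is explicitly set to "true".
    // A missing property falls back to "false", mirroring the new default.
    static boolean shouldRebootOnHeartbeatTimeout(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        // No explicit value: the host would not be rebooted.
        Properties defaults = new Properties();
        System.out.println(shouldRebootOnHeartbeatTimeout(defaults)); // prints "false"

        // Operators who want the old behaviour opt in explicitly.
        Properties optIn = new Properties();
        optIn.setProperty(KEY, "true");
        System.out.println(shouldRebootOnHeartbeatTimeout(optIn)); // prints "true"
    }
}
```

Under this sketch, a storage heartbeat timeout leads to a reboot only when an operator has opted in, which is the behaviour change the PR proposes.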
Types of changes
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI
- [ ] test (unit or integration test code)
Bug Severity
- [ ] BLOCKER
- [ ] Critical
- [ ] Major
- [ ] Minor
- [ ] Trivial
Screenshots (if appropriate):
How Has This Been Tested?
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 15.12%. Comparing base (a2f2e87) to head (79a5f78).
Additional details and impacted files
@@ Coverage Diff @@
## 4.19 #10111 +/- ##
============================================
- Coverage 15.13% 15.12% -0.01%
+ Complexity 11268 11262 -6
============================================
Files 5408 5408
Lines 473867 473867
Branches 57778 57778
============================================
- Hits 71700 71684 -16
- Misses 394165 394185 +20
+ Partials 8002 7998 -4
| Flag | Coverage Δ | |
|---|---|---|
| uitests | 4.30% <ø> (ø) | |
| unittests | 15.84% <100.00%> (-0.01%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
@slavkap, have you tested this with HA enabled?
@slavkap, can you start a discussion on the dev/user mailing list?
This changes the current behaviour. IMHO, if there are no objections, we could merge it in 4.21 (the next major release), but not in 4.20/4.19.
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
@DaanHoogland, I've tested this with and without HA. @weizhouapache, sure, I'll start a discussion for this.
@slavkap, I changed the title. Hope you don't mind. It was a bit confusing to me. Are you still looking into this?
@DaanHoogland, I don't mind the change, thanks! Yes, I opened a discussion on the mailing list for this.
moved forward
@DaanHoogland, I rebased it on main, as @weizhouapache suggested possibly merging it in a major release.
We ran into this issue ourselves: it caused cascading reboots of all our hosts while the NFS server had no running VM. It was an operational nightmare that resulted in approximately 45 minutes of downtime. Changing the default value to false offers us more gain than loss. We adjusted it in our settings; thank you, Wei. This was simply catastrophic!
As someone who works with VMware products, I have never seen a host reboot because datastores were inaccessible. I believe changing the default for CloudStack to "false" is a great move.
@blueorangutan package
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13621
Packaging result [SF]: ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 13671
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13677
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
@sureshanaparti, I think we can merge this one, pending smoke tests. But it merits a note on the release notes page for the next version.
[SF] Trillian test result (tid-13502)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 55426 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10111-t13502-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:
| Test | Result | Time (s) | Test File |
|---|---|---|---|