
Disabled the setting `reboot.host.and.alert.management.on.heartbeat.timeout` by default

Open slavkap opened this issue 11 months ago • 20 comments

Description

This PR disables the setting `reboot.host.and.alert.management.on.heartbeat.timeout` by default. With the setting enabled, CloudStack reboots the host when there is a storage issue, even if high availability isn't enabled.
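The effect of flipping the default can be illustrated with a minimal sketch. This is not CloudStack's agent code: the helper names (`parse_properties`, `should_reboot_on_heartbeat_timeout`, `on_heartbeat_timeout`) and the action strings are hypothetical, and the only thing taken from this PR is the setting name and the idea that its default changes from enabled to disabled.

```python
# Illustrative sketch only; not CloudStack's actual implementation.
# All function names and action strings below are hypothetical.
SETTING = "reboot.host.and.alert.management.on.heartbeat.timeout"


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def should_reboot_on_heartbeat_timeout(props, default=False):
    """Use the explicit value if present, otherwise fall back to the default."""
    raw = props.get(SETTING)
    if raw is None:
        return default
    return raw.lower() == "true"


def on_heartbeat_timeout(props, default=False):
    """Hypothetical handler: decide what to do when the storage heartbeat times out."""
    if should_reboot_on_heartbeat_timeout(props, default):
        return "reboot-host"
    return "skip-reboot"
```

With `default=False`, a host whose configuration never mentions the setting is no longer rebooted on a heartbeat timeout, while an operator who sets it to `true` explicitly keeps the old behaviour.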

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [X] Bug fix (non-breaking change which fixes an issue)
  • [ ] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
  • [ ] build/CI
  • [ ] test (unit or integration test code)

Bug Severity

  • [ ] BLOCKER
  • [ ] Critical
  • [ ] Major
  • [ ] Minor
  • [ ] Trivial

Screenshots (if appropriate):

How Has This Been Tested?

slavkap avatar Dec 16 '24 10:12 slavkap

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 15.12%. Comparing base (a2f2e87) to head (79a5f78).

Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #10111      +/-   ##
============================================
- Coverage     15.13%   15.12%   -0.01%     
+ Complexity    11268    11262       -6     
============================================
  Files          5408     5408              
  Lines        473867   473867              
  Branches      57778    57778              
============================================
- Hits          71700    71684      -16     
- Misses       394165   394185      +20     
+ Partials       8002     7998       -4     
| Flag | Coverage Δ |
|---|---|
| uitests | 4.30% <ø> (ø) |
| unittests | 15.84% <100.00%> (-0.01%) :arrow_down: |

Flags with carried forward coverage won't be shown.

codecov[bot] avatar Dec 16 '24 10:12 codecov[bot]

@slavkap, have you tested this with HA enabled?

DaanHoogland avatar Dec 16 '24 13:12 DaanHoogland

@slavkap can you start a discussion on the dev/user mailing list?

This changes the current behaviour. IMHO, if there are no objections, we could merge it in 4.21 (the next major release), but not in 4.20/4.19.

weizhouapache avatar Dec 16 '24 13:12 weizhouapache

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar Dec 17 '24 09:12 github-actions[bot]

@DaanHoogland, I've tested this with and without HA.

@weizhouapache, sure, I'll start a discussion for this.

slavkap avatar Dec 17 '24 09:12 slavkap

@slavkap, I changed the title. Hope you don't mind. It was a bit confusing to me. Are you still looking into this?

DaanHoogland avatar Jan 08 '25 12:01 DaanHoogland

@DaanHoogland, I don't mind the change, thanks! Yes, I opened a discussion in the mailing list for this

slavkap avatar Jan 10 '25 08:01 slavkap

moved forward

DaanHoogland avatar Feb 03 '25 15:02 DaanHoogland

@DaanHoogland, I rebased it on main as @weizhouapache suggested merging it possibly in a major release.

slavkap avatar Feb 03 '25 18:02 slavkap

We ran into this issue ourselves: it caused cascading reboots of all our hosts while the NFS server had no running VM. It was an operational nightmare that resulted in approximately 45 minutes of downtime. Changing the default value to false offers us more gain than loss. We have adjusted it in our settings; thank you, Wei. This was simply catastrophic!

boubouX avatar Mar 28 '25 19:03 boubouX

As someone who works with VMware products, I have never had an experience where a host reboots when datastores are inaccessible. I believe changing the default for CloudStack to "false" is a great move.

hanisirfan avatar Mar 29 '25 06:03 hanisirfan

@blueorangutan package

sureshanaparti avatar Jun 05 '25 09:06 sureshanaparti

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Jun 05 '25 09:06 blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13621

blueorangutan avatar Jun 05 '25 11:06 blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 13671

blueorangutan avatar Jun 09 '25 14:06 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13677

blueorangutan avatar Jun 09 '25 16:06 blueorangutan

@blueorangutan test

DaanHoogland avatar Jun 11 '25 17:06 DaanHoogland

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan avatar Jun 11 '25 17:06 blueorangutan

@sureshanaparti, I think we can merge this one, pending smoke tests. But it merits a note on the release notes page for the next version.

DaanHoogland avatar Jun 12 '25 06:06 DaanHoogland

[SF] Trillian test result (tid-13502)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 55426 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10111-t13502-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

blueorangutan avatar Jun 12 '25 09:06 blueorangutan