Timeout for monitor operation

Open kvaps opened this issue 7 years ago • 7 comments

Description When a node has stuck operations, it can break OpenNebula itself, e.g. because of a broken disk subsystem, a disconnected target, or some other problem. OpenNebula runs many /var/lib/one/remotes/tm/<driver>/monitor operations, but they hang forever.

To Reproduce For example, right now I have a broken LUN and every lvm command hangs for ages. To reproduce:

  • Connect iSCSI target
  • Create LVM group
  • Create VM in this LVM storage
  • Run VM
  • Try to disconnect LUN

Now you have a broken host, and any lvm command will hang forever. Wait a while, then check ps aux on the OpenNebula server; you will see a lot of hung monitor commands.
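
One way to spot the hung monitor commands is to filter the process list by command line and elapsed run time. This is only a sketch: the function name list_old_procs is hypothetical, the 600-second threshold is an arbitrary example, and it assumes a procps-ng ps that supports the etimes column.

```shell
# Sketch: list processes whose command line contains PATTERN and that have
# been running for at least MIN_AGE seconds.
# Assumes procps-ng ps ("etimes" column); list_old_procs is a hypothetical name.
list_old_procs() {
    local pattern="$1" min_age="$2"
    # The trailing '=' on each column suppresses the header line.
    # etimes is the elapsed run time in seconds; a "D" in stat marks
    # uninterruptible sleep, typical for commands blocked on I/O.
    ps -eo pid=,etimes=,stat=,args= |
        awk -v pat="$pattern" -v min="$min_age" '
            # $4 is the executable name; skip this awk filter itself.
            $4 !~ /(^|\/)awk$/ && $2 >= min && index($0, pat) > 0 { print }'
}

# Example: monitor scripts that have been running for 10 minutes or more.
list_old_procs "/remotes/tm/" 600
```

Each output line holds the PID, the age in seconds, the process state and the full command line, so long-lived entries in state D are the likely stuck ones.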

Expected behavior OpenNebula returns ERROR for this host's monitoring and continues monitoring the remaining hosts.

Details

  • Affected Component: Storage Drivers
  • Hypervisor: KVM
  • Version: 5.6.1

Progress Status

  • [ ] Branch created
  • [ ] Code committed to development branch
  • [ ] Testing - QA
  • [ ] Documentation
  • [ ] Release notes - resolved issues, compatibility, known issues
  • [ ] Code committed to upstream release/hotfix branches
  • [ ] Documentation committed to upstream release/hotfix branches

kvaps avatar Dec 17 '18 00:12 kvaps

A similar problem was described here: https://github.com/OpenNebula/one/issues/1702

kvaps avatar Dec 17 '18 00:12 kvaps

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. The OpenNebula Dev Team

stale[bot] avatar Dec 17 '19 01:12 stale[bot]

This issue has been automatically closed due to lack of activity/feedback. Please reopen if you have further input or need to bump this. The OpenNebula Dev Team

stale[bot] avatar Jan 16 '20 01:01 stale[bot]

BTW, I've just started using the linstor_un driver instead of LVM, and everything started working as it should. Linstor has its own timeouts for LVM operations.

kvaps avatar Jan 16 '20 06:01 kvaps

this is a sound suggestion, reopening it

tinova avatar Jan 16 '20 09:01 tinova

@paczerny Verify this is still a problem with the new monitoring system

rsmontero avatar Nov 08 '21 10:11 rsmontero

The issue is still there. Looking into the linstor driver, it uses a nice command, timeout -10 monitor...; I suggest using the same approach.

The OpenNebula TM drivers use the methods monitor_and_log and ssh_monitor_and_log. We should create duplicates of these methods that take a timeout parameter and prepend the timeout command (and similarly for the ssh_ version). The drivers can then set individual timeouts.
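
A rough sketch of what such a duplicate could look like, assuming GNU coreutils timeout is available on the host. The name monitor_with_timeout, its argument order and the 5-second kill grace period are all hypothetical; this is not the existing monitor_and_log code.

```shell
# Sketch of a monitor helper with a time limit. All names are hypothetical;
# this does not reproduce the existing monitor_and_log implementation.
#
# Usage: monitor_with_timeout SECONDS "COMMAND" "ERROR MESSAGE"
# Prints the command output on success; on failure reports on stderr and
# returns the exit status (124 or 137 when the time limit was hit).
monitor_with_timeout() {
    local limit="$1" cmd="$2" msg="$3"
    local out rc

    # GNU timeout: send TERM after $limit seconds; if the command is still
    # alive 5 seconds later, send KILL (-k 5).
    out=$(timeout -k 5 "$limit" sh -c "$cmd" 2>&1)
    rc=$?

    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        # 124: stopped by TERM at the limit; 137: ended by KILL.
        echo "ERROR: $msg: timed out after ${limit}s" >&2
    elif [ "$rc" -ne 0 ]; then
        echo "ERROR: $msg: exit status $rc: $out" >&2
    else
        echo "$out"
    fi
    return "$rc"
}
```

A driver could then choose its own limit, for example monitor_with_timeout 10 "vgs" "Error monitoring LVM" (the vgs call is only an illustration). A process blocked in uninterruptible I/O may not react to either signal until the I/O returns, so the limit is best-effort rather than a hard guarantee.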

For LVM, it could be solved by the LVM refactor issue #5911.

Side note: The methods have some log_error calls, but the errors don't appear in oned.log; this may be caused by redirection of stderr to stdout.
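
To illustrate the mechanism suspected here (not a diagnosis of the actual drivers): when a command runs inside a command substitution with 2>&1, anything it writes to stderr ends up in the captured variable instead of on the caller's stderr. The log_error below is a hypothetical stand-in defined only for this example, and failing_monitor is likewise made up.

```shell
# Hypothetical stand-in for a helper that reports failures on stderr.
log_error() {
    echo "ERROR: $1" >&2
}

# Made-up monitor step that fails and reports through log_error.
failing_monitor() {
    log_error "volume group not reachable"
    return 1
}

# With 2>&1 inside the substitution, the error text is captured together
# with stdout and never reaches the caller's stderr.
captured=$(failing_monitor 2>&1) || true

# Without the redirection only stdout is captured; the error message
# passes through to the caller's stderr.
passed_through=$(failing_monitor) || true

echo "captured: '$captured'"
echo "passed through: '$passed_through'"
```

In the first case the message only becomes visible if the caller prints or logs the captured variable itself.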

paczerny avatar Sep 14 '23 17:09 paczerny