Timeout for monitor operation
Description
When a node has stuck operations, it can break OpenNebula itself. The cause may be a broken disk subsystem, a disconnected target, or some other problem.
OpenNebula runs a lot of /var/lib/one/remotes/tm/<driver>/monitor operations, and they stay stuck forever.
To Reproduce
For example, right now I have a broken LUN, and any lvm command hangs for ages. To reproduce:
- Connect iSCSI target
- Create LVM group
- Create VM in this LVM storage
- Run VM
- Disconnect the LUN
Now you have a broken host, and any lvm command will hang forever.
Wait for a while, then check ps aux on the OpenNebula front-end: you will see a lot of hung monitor commands.
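To make the check above easier to repeat, here is a minimal sketch that lists monitor processes older than a given age. The function names and the 300-second default are hypothetical and not part of OpenNebula. It assumes a procps-ng `ps` that supports the `etimes` (elapsed seconds) column, and it matches the /var/lib/one/remotes/tm/<driver>/monitor path mentioned in the description.

```shell
#!/bin/bash
# Hypothetical helpers, not part of OpenNebula.

# Read "PID ELAPSED_SECONDS COMMAND..." lines on stdin and print those that
# look like a TM monitor script and are at least $1 seconds old.
filter_stuck_monitors() {
    local min_age="${1:-300}"
    awk -v min="$min_age" '$2 >= min && $0 ~ "remotes/tm/[^/]+/monitor" { print }'
}

# List long-running monitor processes on this machine.
# Assumes procps-ng ps with the "etimes" column.
list_stuck_monitors() {
    ps -eo pid=,etimes=,args= | filter_stuck_monitors "$1"
}
```

Running `list_stuck_monitors 600` on the front-end should print any monitor process that has been running for ten minutes or more.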
Expected behavior
OpenNebula returns an ERROR for the monitoring of this host and continues monitoring the remaining hosts.
Details
- Affected Component: Storage Drivers
- Hypervisor: KVM
- Version: 5.6.1
Progress Status
- [ ] Branch created
- [ ] Code committed to development branch
- [ ] Testing - QA
- [ ] Documentation
- [ ] Release notes - resolved issues, compatibility, known issues
- [ ] Code committed to upstream release/hotfix branches
- [ ] Documentation committed to upstream release/hotfix branches
A similar problem was described here: https://github.com/OpenNebula/one/issues/1702
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. The OpenNebula Dev Team
This issue has been automatically closed due to lack of activity/feedback. Please reopen if you have further input or need to bump this. The OpenNebula Dev Team
BTW, I've just started using the linstor_un driver instead of LVM, and everything started working as it should. Linstor has its own timeouts for LVM operations.
This is a sound suggestion, reopening it.
@paczerny Verify this is still a problem with the new monitoring system
The issue is still there. Looking into the linstor driver, it uses a nice command, timeout -10 monitor..., and I suggest using the same approach.
The OpenNebula TM drivers use the methods monitor_and_log and ssh_monitor_and_log. We should create duplicates of these methods with a timeout parameter and prepend the timeout command to the monitor call; the same applies to the ssh_ version.
The drivers can then set individual timeouts.
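A rough sketch of what such a wrapper could look like, assuming GNU coreutils `timeout` is available. The name `monitor_with_timeout` and the 5-second kill grace period are hypothetical, and this does not reproduce the real monitor_and_log implementation.

```shell
#!/bin/bash
# Hypothetical helper, not part of OpenNebula: run a monitor command with an
# upper bound on its run time.
#   $1   - timeout in seconds
#   $2.. - command and its arguments
monitor_with_timeout() {
    local limit="$1"
    shift

    local output rc
    # GNU timeout sends TERM after $limit seconds and, because of -k 5,
    # KILL five seconds later if the command is still running.
    output=$(timeout -k 5 "$limit" "$@" 2>&1)
    rc=$?

    # 124: the command timed out and exited after TERM.
    # 137: the command was ended by KILL (most likely the -k escalation).
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        echo "ERROR: monitor timed out after ${limit}s: $*" >&2
        return 1
    elif [ "$rc" -ne 0 ]; then
        echo "ERROR: monitor failed (exit $rc): $output" >&2
        return "$rc"
    fi

    echo "$output"
}
```

A driver could then call something like `monitor_with_timeout 10 vgs`, picking its own limit. One caveat: a process blocked in uninterruptible I/O may not react to signals at all, so the wrapper should probably be tested against a really broken LUN before relying on it.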
For LVM, this could be solved by the LVM refactor in issue #5911.
Side note: the methods contain some log_error calls, but the errors don't appear in oned.log. This may be caused by the redirection of stderr to stdout.