fence-agents
fence_lpar: Handle machine that is stuck in powering off
When you power off an LPAR with the fence agent and the machine is still in the process of powering down, the on action does not succeed but also does not report an error.
The desired behaviour would be that the on-command reports that the machine is still powering off.
It looks like when the state of the machine is 'Error' it is considered powered off, so the power=off command does nothing, and power=on does nothing either because it is waiting for the machine to become powered off.
Sounds like you might need to do some manual intervention if it's in Error-state.
You can see the on/off status code-handling here: https://github.com/ClusterLabs/fence-agents/blob/main/agents/lpar/fence_lpar.py#L23-L26
I think adding 'Error' to the list of 'on' states will resolve the problem.
At least for error states that can be resolved by powering off the LPAR.
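To make the suggestion concrete, here is a minimal sketch of what such a state mapping could look like with 'Error' added. It is not copied from fence_lpar.py: the function name `normalize_status`, the `ON_STATES` list, and every state name other than 'Error' and 'Running' are assumptions for illustration, so check them against the linked lines before changing anything.

```python
# Hypothetical sketch, not the actual fence_lpar.py code.
# Assumed set of raw LPAR states that should count as "on".
# 'Error' is the proposed addition; the other entries are illustrative.
ON_STATES = ["Running", "Open Firmware", "Shutting Down", "Starting", "Error"]


def normalize_status(state):
    """Map a raw LPAR state string to the fence status 'on' or 'off'."""
    return "on" if state in ON_STATES else "off"


if __name__ == "__main__":
    # 'Not Activated' stands in for a state that is not in the list.
    for state in ["Running", "Error", "Not Activated"]:
        print(state, "->", normalize_status(state))
```

With a mapping like this, an LPAR in 'Error' would be reported as on, so a power=off request would actually be sent instead of being skipped.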
Thanks. We'll do some testing and come back to you.
The 'Error' state was observed after kernel panic.
Can anything be done to move this forward?
We haven't found a way to get our system into the 'Error' state yet, so if you know a way to do that, that would be great.
Indeed, it's difficult to reproduce.
While in that particular case the LPAR went to the "Error" state on kernel panic, most of the time it is listed as "Running".