fence-agents

fence_lpar: Handle machine that is stuck in powering off

Open SchoolGuy opened this issue 10 months ago • 9 comments

When you power off an LPAR with the fence agent and the machine is still in the process of powering down, the on action does not succeed but also does not report an error.

The desired behaviour would be that the on-command reports that the machine is still powering off.
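
A minimal sketch of that desired behaviour, not code from fence_lpar: the on action checks the raw state first and fails loudly if the LPAR is still in a transitional state. The names `StillPoweringOffError` and `check_ready_for_power_on` are hypothetical, and "Shutting Down" is an assumed state name.

```python
class StillPoweringOffError(RuntimeError):
    """Raised when an on action is requested while the LPAR is still powering off."""


def check_ready_for_power_on(raw_state, transitional_states=("Shutting Down",)):
    """Return True if power-on may proceed, otherwise raise a descriptive error.

    raw_state is the state string reported for the LPAR. The default
    transitional state name is an assumption, not taken from fence_lpar.
    """
    if raw_state in transitional_states:
        raise StillPoweringOffError(
            f"LPAR is still powering off (state: {raw_state!r}); retry the on action later"
        )
    return True
```

With a check like this the caller gets an explicit error instead of a silent no-op.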

SchoolGuy avatar Feb 04 '25 11:02 SchoolGuy

Looks like when the state of the machine is 'Error' it's considered powered off, so the power=off command does nothing, but power=on does nothing either because it's waiting for the machine to become powered off.

hramrach avatar Feb 04 '25 11:02 hramrach

Sounds like you might need to do some manual intervention if it's in Error-state.

You can see the on/off status code-handling here: https://github.com/ClusterLabs/fence-agents/blob/main/agents/lpar/fence_lpar.py#L23-L26

oalbrigt avatar Feb 04 '25 12:02 oalbrigt

I think adding 'Error' to the list of 'on' states will resolve the problem.

hramrach avatar Feb 04 '25 12:02 hramrach

At least for error states that can be resolved by powering off the LPAR.
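
A rough sketch of what that could look like, assuming the agent maps raw state strings to "on"/"off" roughly like this. The function name, the flag, and every state name other than "Running" and "Error" are assumptions, not the actual fence_lpar code.

```python
# Assumed state names; only "Running" and "Error" are mentioned in this thread.
ON_STATES = {"Running", "Open Firmware", "Shutting Down", "Starting"}

# Hypothetical: error states that a power-off is expected to clear.
RECOVERABLE_ERROR_STATES = {"Error"}


def normalize_status(status, treat_error_as_on=True):
    """Map a raw LPAR state string to "on" or "off".

    With treat_error_as_on=True, 'Error' counts as "on", so a power=off
    request is actually carried out instead of being skipped.
    """
    state = status.strip()
    if state in ON_STATES:
        return "on"
    if treat_error_as_on and state in RECOVERABLE_ERROR_STATES:
        return "on"
    return "off"
```

Keeping the error states in a separate set makes it easy to limit this to errors that powering off can resolve.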

hramrach avatar Feb 04 '25 12:02 hramrach

Thanks. We'll do some testing and come back to you.

oalbrigt avatar Feb 05 '25 09:02 oalbrigt

The 'Error' state was observed after kernel panic.

hramrach avatar Feb 05 '25 10:02 hramrach

Can anything be done to move this forward?

hramrach avatar Mar 21 '25 09:03 hramrach

We haven't found a way to get our system into the 'Error' state yet, so if you know a way to do that, that would be great.

oalbrigt avatar Mar 21 '25 12:03 oalbrigt

Indeed, it's difficult to reproduce.

While in that particular case the LPAR went to the "Error" state on kernel panic, most of the time it is listed as "Running".

hramrach avatar Mar 21 '25 14:03 hramrach