Fix: libpe_status: Don't fence a remote node due to failed migrate_from
@clumens @kgaillot This is just a demo for the unnecessary fencing part of T214 / RHEL-23399. I'm not requesting to merge this until we figure out the rest of the failure response/cluster-recheck-interval behavior.
The result of this patch is as follows. Scenario:
- ocf:pacemaker:remote resource with reconnect_interval=30s, cluster-recheck-interval=2min, and fencing configured.
- Remote connection resource prefers to run on cluster node 2 (location constraint) and is running there.
- Put node 2 in standby; remote connection resource migrates to cluster node 1.
- Block 3121/tcp on node 2.
- Take node 2 out of standby.
- Remote connection resource tries to migrate back to node 2. This times out due to the firewall block.
Before patch:
- The remote node is fenced.
- The remote connection resource is stopped on node 1 and node 2 due to the multiple-active policy.
- After fencing, Pacemaker does not attempt to start the remote connection resource until the cluster-recheck-interval expires (or until a new transition is initiated for another reason). Simply finishing fencing does not cause a new transition to run, which would cause the connection resource to try to start. (At least if reconnect_interval has passed.)
After patch:
- The remote node is not fenced.
- The remote connection resource immediately tries to recover on node 2 (where it just failed a migrate_from, since start-failure-is-fatal doesn't apply to migrate_from). This entails stopping on both nodes (due to multiple-active policy) and then trying to start on node 2. This will fail due to the firewall block.
- The resource recovers onto node 1 successfully.
- After reconnect_interval expires, the resource tries to migrate back to node 2 again, which will fail due to the firewall block. This will continue happening every reconnect_interval until migration-threshold is reached.
Fixes T214
Marking ready for review. This might be sufficient to fix the cluster-recheck-interval behavior too (rather than just masking it)... Since we no longer set pcmk_on_fail_reset_remote, we also don't set the role-after-failure to stopped anymore. We can recover right away instead of waiting for a later transition.
enum rsc_role_e
pcmk__role_after_failure(const pcmk_resource_t *rsc, const char *action_name,
                         enum action_fail_response on_fail, GHashTable *meta)
{
    ...
    // Set default for role after failure specially in certain circumstances
    switch (on_fail) {
        ...
        case pcmk_on_fail_reset_remote:
            if (rsc->remote_reconnect_ms != 0) {
                role = pcmk_role_stopped;
            }
            break;
If this isn't a viable solution (or close to it) as-is, @clumens or anyone else can feel free to take it and run with it themselves. I took a crack at it since I've been talking to Chris about it a lot last week.
Okay, there is at least one big wrinkle in this... if a resource is running on the remote node when the connection resource migrate_from fails, then we still fence the remote node, which results in the remote connection resource being stopped due to node availability until the timer pops :(
May 12 23:07:45.302 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item) notice: Actions: Fence (reboot) fastvm-fedora39-23 'dummy is thought to be active there'
May 12 23:07:45.303 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item) notice: Actions: Recover fastvm-fedora39-23 ( fastvm-fedora39-24 )
May 12 23:07:45.303 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item) notice: Actions: Move dummy ( fastvm-fedora39-23 -> fastvm-fedora39-22 )
...
May 12 23:07:48.230 fastvm-fedora39-22 pacemaker-fenced [8719] (finalize_op) notice: Operation 'reboot' targeting fastvm-fedora39-23 by fastvm-fedora39-22 for pacemaker-controld.8723@fastvm-fedora39-22: OK (complete) | id=b8a4cae2
...
# Transition abort due to connection resource monitor failure,
# presumably due to remote node fenced
May 12 23:07:48.257 fastvm-fedora39-22 pacemaker-controld [8723] (abort_transition_graph) info: Transition 8 aborted by status-1-fail-count-fastvm-fedora39-23.monitor_60000 doing create fail-count-fastvm-fedora39-23#monitor_60000=1: Transient attribute change | cib=0.115.19 source=abort_unless_down:305 path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'] complete=true
...
May 12 23:07:48.262 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item) notice: Actions: Stop fastvm-fedora39-23 ( fastvm-fedora39-24 ) due to node availability
May 12 23:07:48.263 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item) notice: Actions: Stop fastvm-fedora39-23 ( fastvm-fedora39-22 ) due to node availability
May 12 23:07:48.263 fastvm-fedora39-22 pacemaker-schedulerd[8722] (log_list_item) notice: Actions: Start dummy ( fastvm-fedora39-24 )
May 12 23:07:48.263 fastvm-fedora39-22 pacemaker-schedulerd[8722] (pcmk__log_transition_summary) error: Calculated transition 9 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-26.bz2
This is because the remote node is considered offline after the migration failure.
(determine_remote_online_status) trace: Remote node fastvm-fedora39-23 presumed ONLINE because connection resource is started
(determine_remote_online_status) trace: Remote node fastvm-fedora39-23 OFFLINE because connection resource failed
(determine_remote_online_status) trace: Remote node fastvm-fedora39-23 online=FALSE
I don't think we want to skip setting the pcmk_rsc_failed flag for the connection resource, so we'd need to somehow detect that the failure was for migrate_from and avoid marking the remote node offline in that case. Spitballing here, we could maybe overload partial_migration_{source,target} so that we have them even after a failed migration... Or add a new flag like failed_migrate_from. (migrate_to may not warrant this treatment.) Or maybe it's still more complicated.
A failed migrate_from is somewhere between a partial migration and a dangling migration. No stop has been run on the source. The migrate_from action completed (so not partial) but failed (so not dangling).
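To make the distinction above concrete, here is a minimal sketch of how the three cases could be told apart from the recorded results of migrate_to, migrate_from, and the stop on the source. Every type and function name in it is hypothetical and invented for illustration; it is not how the scheduler represents history, only a model of the reasoning in the previous paragraph.

```c
#include <assert.h>
#include <stddef.h>

/* Result of one recorded action (hypothetical names, not Pacemaker types) */
enum op_outcome {
    op_not_recorded,    /* no result in the history yet */
    op_succeeded,
    op_failed
};

/* Simplified view of a single migration attempt in the resource history */
struct migration_history {
    enum op_outcome migrate_to;     /* runs on the source node */
    enum op_outcome migrate_from;   /* runs on the target node */
    enum op_outcome stop_on_source; /* last step of a live migration */
};

enum migration_state {
    migration_none,         /* migrate_to has not been recorded */
    migration_failed_to,    /* migrate_to completed but failed */
    migration_partial,      /* migrate_to succeeded, migrate_from not recorded */
    migration_failed_from,  /* migrate_from completed but failed */
    migration_dangling,     /* migrate_from succeeded, source not yet stopped */
    migration_complete      /* all three steps succeeded */
};

/* Classify a migration attempt by walking the sequence in order */
enum migration_state
classify_migration(const struct migration_history *history)
{
    assert(history != NULL);

    switch (history->migrate_to) {
        case op_not_recorded:   return migration_none;
        case op_failed:         return migration_failed_to;
        case op_succeeded:      break;
    }
    switch (history->migrate_from) {
        case op_not_recorded:   return migration_partial;
        case op_failed:         return migration_failed_from;
        case op_succeeded:      break;
    }
    if (history->stop_on_source == op_succeeded) {
        return migration_complete;
    }
    return migration_dangling;
}
```

In this model a failed migrate_from gets its own state rather than being folded into partial or dangling, which is roughly what a dedicated failed_migrate_from flag would express.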
Still not working. Latest push rebases on main, adds another test, and adds some experimental commits on top.
I'm thinking the high-level behavior should be:
- Only stop failures and recurring monitor failures for the connection resource should cause the remote node to be fenced.
- Only recurring monitor failures of the connection resource should change the online state of the remote node (though it's fine if some others set it explicitly). Currently any failure sets it to offline.
  - start: offline -> offline
  - stop: online -> online
  - probe: X -> X. No info; the probe could be run on a node that can't connect to the remote node, while the resource is started or the probe succeeds on another node.
  - reload: X -> X (it's a no-op so doesn't really matter)
  - migrate_to/migrate_from: online -> online. (The migration sequence is migrate_to -> migrate_from -> stop on source, so we haven't stopped the resource yet at this point.)
  - recurring monitor: online -> offline
That would still leave some decisions about exactly how and where in the call chain to implement the online-state patch. The possibility of multiple failed actions in the history, occurring in various orders, gives me a headache.
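The proposed behavior can be written down as a small lookup table, which at least makes the intent testable independently of where it ends up in the call chain. This is only a sketch of the list above: the enum, struct, and function names are all hypothetical, and it handles a single failed action, not the ordering of multiple failures in the history.

```c
#include <assert.h>
#include <stdbool.h>

/* Actions that can fail for a remote connection resource (hypothetical names) */
enum conn_action {
    conn_start,
    conn_stop,
    conn_probe,
    conn_reload,
    conn_migrate_to,
    conn_migrate_from,
    conn_recurring_monitor,
    conn_action_count   /* number of entries, not a real action */
};

struct failure_policy {
    bool fence_remote;  /* should this failure fence the remote node? */
    bool mark_offline;  /* should this failure force the node offline? */
};

/* Proposed policy: only stop and recurring monitor failures fence, and only
 * recurring monitor failures change the online state.
 */
static const struct failure_policy proposed_policy[conn_action_count] = {
    [conn_start]             = { false, false },
    [conn_stop]              = { true,  false },
    [conn_probe]             = { false, false },
    [conn_reload]            = { false, false },
    [conn_migrate_to]        = { false, false },
    [conn_migrate_from]      = { false, false },
    [conn_recurring_monitor] = { true,  true  },
};

/* Return true if a failure of the given action should fence the remote node */
bool
failure_fences_remote(enum conn_action action)
{
    assert((int) action < conn_action_count);
    return proposed_policy[action].fence_remote;
}

/* Return the remote node's online state after a failure of the given action */
bool
remote_online_after_failure(enum conn_action action, bool online_before)
{
    assert((int) action < conn_action_count);
    return online_before && !proposed_policy[action].mark_offline;
}
```

With this table, a failed migrate_from leaves an online remote node online and does not fence it, while a failed recurring monitor does both.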
> After fencing, Pacemaker does not attempt to start the remote connection resource until the cluster-recheck-interval expires (or until a new transition is initiated for another reason). Simply finishing fencing does not cause a new transition to run, which would cause the connection resource to try to start. (At least if reconnect_interval has passed.)
Successful actions (including fencing) do not and should not trigger new transitions; every transition should be complete within itself, and a transition should need to be recalculated only if it's interrupted by a new event or some action fails.
What's missing in this case is a call to pe__update_recheck_time(), which says when a time-based interval will expire that might require recalculation. We use this for failure expiration, time-based rules, etc., to let the controller know when to recheck. The cluster-recheck-interval is a failsafe for bugs like this one.
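As a rough illustration of the idea, the sketch below models a scheduler that keeps the earliest future time at which its result should be recalculated, and registers the end of the reconnect interval with it. The struct and both functions are hypothetical stand-ins written for this example; they do not reproduce the signature or internals of pe__update_recheck_time(), and where exactly such a call belongs is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the scheduler's working data */
struct sched_model {
    time_t now;         /* time for which the transition is calculated */
    time_t recheck_by;  /* earliest time a recalculation is needed, 0 if none */
};

/* Remember the earliest future time at which the result could change */
void
model_update_recheck_time(struct sched_model *sched, time_t when)
{
    assert(sched != NULL);

    if (when <= sched->now) {
        return; /* already in the past, nothing to wait for */
    }
    if ((sched->recheck_by == 0) || (when < sched->recheck_by)) {
        sched->recheck_by = when;
    }
}

/* Assumption: a connection resource that failed at failed_at may be started
 * again once reconnect_interval_s seconds have passed, so that is when the
 * controller should ask for a new transition.
 */
void
model_note_reconnect(struct sched_model *sched, time_t failed_at,
                     unsigned int reconnect_interval_s)
{
    model_update_recheck_time(sched, failed_at + (time_t) reconnect_interval_s);
}
```

In the scenario from the description (reconnect_interval=30s, cluster-recheck-interval=2min), registering the reconnect time this way would have the recheck happen after 30 seconds instead of waiting for the two-minute failsafe.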
This will be good to revisit in the future, but closing for now due to the uncertainties and significant changes in the code base since. Worth summarizing the discussion on the task.