pacemaker
Fix: libpe_status: Don't fence a remote node due to failed migrate_from
@clumens @kgaillot This is just a demo for the unnecessary fencing part of T214 / RHEL-23399. I'm not requesting to merge this until we figure out the rest of the failure response/cluster-recheck-interval behavior.
The result of this patch is as follows. Scenario:
- `ocf:pacemaker:remote` resource with `reconnect_interval=30s`, `cluster-recheck-interval=2min`, and fencing configured.
- Remote connection resource prefers to run on cluster node 2 (location constraint) and is running there.
- Put node 2 in standby; remote connection resource migrates to cluster node 1.
- Block 3121/tcp on node 2.
- Take node 2 out of standby.
- Remote connection resource tries to migrate back to node 2. This times out due to the firewall block.
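For anyone who wants to reproduce this, the scenario above can be sketched roughly as follows. This is only a sketch under assumptions: the resource ID, remote host name, and node name are hypothetical placeholders, fencing is assumed to be configured already, the script is assumed to run on node 2 (so the firewall rule lands there), and blocking outbound traffic to 3121/tcp is my guess at what "block" means here. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: the names below are hypothetical placeholders, not from the PR.
RSC="${RSC:-remote-rsc}"                           # remote connection resource ID (assumed)
REMOTE_HOST="${REMOTE_HOST:-remote1.example.com}"  # remote node address (assumed)
NODE2="${NODE2:-node2}"                            # name of cluster node 2 (assumed)
DRY_RUN="${DRY_RUN:-1}"                            # 1 = only print the commands

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

setup() {
    # Fencing is assumed to be configured already (device-specific).
    run pcs property set cluster-recheck-interval=2min
    run pcs resource create "$RSC" ocf:pacemaker:remote \
        server="$REMOTE_HOST" reconnect_interval=30s
    run pcs constraint location "$RSC" prefers "$NODE2"
}

trigger() {
    # Standby moves the connection resource to node 1.
    run pcs node standby "$NODE2"
    # Let the cluster settle before touching the firewall.
    run crm_resource --wait
    # Assumption: block outbound connections from node 2 to the remote node.
    run iptables -I OUTPUT -p tcp --dport 3121 -j DROP
    # Leaving standby makes the resource try to migrate back to node 2.
    run pcs node unstandby "$NODE2"
}

setup
trigger
```

Run with `DRY_RUN=0` on a test cluster to actually execute the steps; remove the firewall rule afterwards with the matching `iptables -D` command.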
Before patch:
- The remote node is fenced.
- The remote connection resource is stopped on node 1 and node 2 due to the multiple-active policy.
- After fencing, Pacemaker does not attempt to start the remote connection resource until the cluster-recheck-interval expires (or until a new transition is initiated for another reason). Simply finishing fencing does not cause a new transition to run, which would cause the connection resource to try to start. (At least if `reconnect_interval` has passed.)
After patch:
- The remote node is not fenced.
- The remote connection resource immediately tries to recover on node 2 (where it just failed a `migrate_from`, since `start-failure-is-fatal` doesn't apply to `migrate_from`). This entails stopping on both nodes (due to multiple-active policy) and then trying to start on node 2. This will fail due to firewall block.
- The resource recovers onto node 1 successfully.
- After `reconnect_interval` expires, the resource tries to migrate back to node 2 again, which will fail due to the firewall block. This will continue happening every `reconnect_interval` until `migration-threshold` is reached.
Fixes T214