Fix: libpe_status: Don't fence a remote node due to failed migrate_from

Open nrwahl2 opened this issue 9 months ago • 4 comments

@clumens @kgaillot This is just a demo for the unnecessary fencing part of T214 / RHEL-23399. I'm not requesting to merge this until we figure out the rest of the failure response/cluster-recheck-interval behavior.

The result of this patch is as follows. Scenario:

  1. ocf:pacemaker:remote resource with reconnect_interval=30s, cluster-recheck-interval=2min, and fencing configured.
  2. Remote connection resource prefers to run on cluster node 2 (location constraint) and is running there.
  3. Put node 2 in standby; remote connection resource migrates to cluster node 1.
  4. Block 3121/tcp on node 2.
  5. Take node 2 out of standby.
  6. Remote connection resource tries to migrate back to node 2. This times out due to the firewall block.
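For anyone wanting to reproduce steps 1 through 5, the following is a rough sketch that only prints the commands rather than running them. It rests on several assumptions that are not stated above: that the cluster is managed with `pcs`, that the port is blocked with an `iptables` rule on node 2's OUTPUT chain (node 2 is the side that connects to the remote node on 3121/tcp), and that the resource and node are named `remote1` and `node2`. Fencing device setup is site-specific and left out.

```python
import shlex

# Hypothetical names, not taken from the PR; substitute your own.
RESOURCE = "remote1"     # ID of the ocf:pacemaker:remote connection resource
REMOTE_HOST = "remote1"  # address of the remote node
NODE2 = "node2"          # cluster node preferred by the location constraint


def scenario_commands():
    """Return (where_to_run, argv) pairs for scenario steps 1-5, in order."""
    anywhere = "any cluster node"
    return [
        # Step 1: recheck interval and the remote connection resource.
        (anywhere, ["pcs", "property", "set", "cluster-recheck-interval=2min"]),
        (anywhere, ["pcs", "resource", "create", RESOURCE, "ocf:pacemaker:remote",
                    f"server={REMOTE_HOST}", "reconnect_interval=30s"]),
        # Step 2: prefer node 2.
        (anywhere, ["pcs", "constraint", "location", RESOURCE, "prefers", NODE2]),
        # Step 3: standby node 2 (wait for the migration to node 1 to finish).
        (anywhere, ["pcs", "node", "standby", NODE2]),
        # Step 4: block 3121/tcp on node 2 (assumed: outbound rule via iptables).
        (NODE2, ["iptables", "-I", "OUTPUT", "-p", "tcp", "--dport", "3121", "-j", "DROP"]),
        # Step 5: take node 2 out of standby; step 6 then happens on its own.
        (anywhere, ["pcs", "node", "unstandby", NODE2]),
    ]


if __name__ == "__main__":
    for where, argv in scenario_commands():
        print(f"[{where}] {shlex.join(argv)}")
```

Printing rather than executing keeps the sketch safe to run anywhere; the commands would need to be issued by hand, with a pause after the standby step until the resource is active on node 1.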

Before patch:

  • The remote node is fenced.
  • The remote connection resource is stopped on node 1 and node 2 due to the multiple-active policy.
  • After fencing, Pacemaker does not attempt to start the remote connection resource until cluster-recheck-interval expires (or until a new transition is initiated for another reason). Completing fencing does not by itself trigger a new transition, which is what would make the connection resource try to start (at least if reconnect_interval has passed).

After patch:

  • The remote node is not fenced.
  • The remote connection resource immediately tries to recover on node 2 (where it just failed a migrate_from; start-failure-is-fatal doesn't apply to migrate_from). This entails stopping on both nodes (due to the multiple-active policy) and then trying to start on node 2, which fails due to the firewall block.
  • The resource recovers onto node 1 successfully.
  • After reconnect_interval expires, the resource tries to migrate back to node 2 again, which fails due to the firewall block. This repeats every reconnect_interval until migration-threshold is reached.
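The retry loop in the last bullet can be pictured with a toy model. This is only a sketch and not Pacemaker's scheduler logic: it assumes each retry cycle adds a fixed number of failures to node 2's fail count (`failures_per_cycle`, a hypothetical parameter), that operations take no time, and that retries stop once the count reaches migration-threshold. The threshold value used in the example is made up, since the scenario does not state one.

```python
def retry_schedule(reconnect_interval_s, migration_threshold, failures_per_cycle=1):
    """Return the times (seconds after the first failed cycle) of later retries.

    Toy model only: the first failed cycle is already counted, and each
    retry on node 2 is assumed to fail because of the firewall block.
    """
    times = []
    fail_count = failures_per_cycle  # failures from the initial cycle
    elapsed = 0
    while fail_count < migration_threshold:
        elapsed += reconnect_interval_s   # wait one reconnect_interval
        times.append(elapsed)             # retry on node 2 (fails again)
        fail_count += failures_per_cycle
    return times


if __name__ == "__main__":
    # reconnect_interval=30s from the scenario; threshold of 3 is hypothetical.
    print(retry_schedule(30, 3))
```

Under these assumptions a threshold of 3 gives retries at 30 s and 60 s, after which node 2 is no longer tried. If each cycle really records more than one failure (for example a migrate_from failure plus a start failure), raising `failures_per_cycle` shortens the schedule accordingly.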

Fixes T214

nrwahl2 • May 13 '24 04:05