fence_kdump: monitor action does not work correctly.

Open knakahira opened this issue 10 years ago • 5 comments

The fence_kdump monitor action checks only the local node and does not check the target node. (The commit log describes it as "monitor action checks if LOCAL node can enter kdump".)

That makes no sense, because fence_kdump has to check the target node's configuration. And it is difficult to check the target node without ssh or another remote shell command.

I have no idea how to resolve this issue. Does anyone have ideas?

knakahira, Mar 19 '15 07:03

It makes sense for cluster scenarios: if fence_kdump is configured, it is configured on every node. So checking it locally makes sense, because every target node will be checked that way.

I understand your concerns, but imho there is no real solution, because nothing is running before the problem occurs. Also, we test only the kernel args, but it would be better to check whether fence_kdump_send is contained in the kdump kernel and whether it will be executed.
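To illustrate what "we test just kernel args" could look like, here is a minimal Python sketch of a local-only check. It assumes the check amounts to looking for a `crashkernel=` parameter on the kernel command line; the function names are hypothetical and this is not the agent's actual code.

```python
def has_crashkernel(cmdline: str) -> bool:
    """Return True if the kernel command line contains a crashkernel= argument.

    Assumption: a crashkernel= reservation is what "can enter kdump" is
    taken to mean here. The real agent may check something else.
    """
    return any(arg.startswith("crashkernel=") for arg in cmdline.split())


def monitor_local_node(path: str = "/proc/cmdline") -> bool:
    """Run the check against the LOCAL node's kernel command line."""
    with open(path) as f:
        return has_crashkernel(f.read())
```

Note that a check like this says nothing about whether fence_kdump_send is actually present in the kdump environment, and nothing about any node other than the one it runs on.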

marxsk, Mar 19 '15 08:03

OK, I understand that a fence_kdump resource must start on every node to cover all cluster members. The administrator also has to keep in mind that a fence_kdump monitor error means a local node error, not a target node error (this differs from the behavior of other fence agents).

A 1+1 cluster seems to be no problem, but an N+M cluster has to take care with location constraints, because there is no guarantee that fence agents are distributed evenly throughout the cluster, and fail-over of the fence_kdump resource needs to be prohibited.

knakahira, Mar 20 '15 02:03

Yes, it is different from other agents.

But I'm not sure where you see a problem with bigger clusters. You should have fence_kdump on every node, so there is no issue at all. If (for whatever reason) you do not want to have fence_kdump installed on a particular node, just set pcmk_monitor_action="metadata", like it was before.

marxsk, Mar 20 '15 17:03

We configure STONITH resources as group resources like the following.

node1: grpSTONITH_node2 (fence_kdump_node2 + ipmi_node2)
node2: grpSTONITH_node3 (fence_kdump_node3 + ipmi_node3)
node3: grpSTONITH_node1 (fence_kdump_node1 + ipmi_node1)

If node2 crashes, grpSTONITH_node3 fails over to node1 (grpSTONITH_node3 cannot run on node3). After node2 is restored, grpSTONITH_node3 keeps running on node1, and fence_kdump cannot check the restored node2.

node1: grpSTONITH_node2 grpSTONITH_node3
node2: none
node3: grpSTONITH_node1

This is not a problem with other fence agents, because they can run on any node except the target node, so the administrator does not need to care where they run. fence_kdump is different.

This is why an N+M cluster has to take care with location constraints. fence_kdump needs either an automatic fail-back configuration or a manual fail-back operation.
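The coverage gap in this example can be sketched in a few lines of Python. This is only a model of the reasoning, assuming that every STONITH group contains a fence_kdump resource and that its monitor checks only the node it runs on; the function names are hypothetical.

```python
ALL_NODES = {"node1", "node2", "node3"}

# Placement before the crash: each node hosts one STONITH group.
initial = {
    "node1": ["grpSTONITH_node2"],
    "node2": ["grpSTONITH_node3"],
    "node3": ["grpSTONITH_node1"],
}

# Placement after node2 crashed and came back without fail-back.
after_failover = {
    "node1": ["grpSTONITH_node2", "grpSTONITH_node3"],
    "node2": [],
    "node3": ["grpSTONITH_node1"],
}


def locally_checked_nodes(placement):
    """Nodes hosting at least one group, i.e. nodes whose kdump setup gets monitored."""
    return {node for node, groups in placement.items() if groups}


def unchecked_nodes(placement, all_nodes=ALL_NODES):
    """Nodes on which no fence_kdump monitor runs, sorted for stable output."""
    return sorted(all_nodes - locally_checked_nodes(placement))


print(unchecked_nodes(initial))         # []
print(unchecked_nodes(after_failover))  # ['node2']
```

Under these assumptions, node2 is left without any local fence_kdump monitor until a group is moved back onto it, which is the case that fail-back or a location constraint would have to cover.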

knakahira, Mar 23 '15 01:03

I'm planning to configure a 3-node cluster with fence_kdump and fence_ipmilan. This configuration is the same as the one knakahira described above. What did you come up with to solve this problem?

ikedaj, Jul 13 '16 04:07