fence-agents
fence_heuristics_resource: add new fence-agent for dynamic delay fencing
In a two-node cluster, if a fence race occurs while the service is running on the standby node (the fence-agent delay having been set statically for the original active node) after a failover due to, for example, a monitoring failure on the active node, the standby node will be fenced and the service cannot continue. This is because the fencing delay setting is fixed to a particular node.
Proposal (this PR):
Implement a new heuristics agent that determines the node where the service (resource) is running and dynamically delays fencing.
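A minimal sketch of the idea (the helper names here are assumptions, not the PR's actual code; a real agent would locate the resource via `crm_resource --locate` or crm_mon's XML output):

```python
import subprocess

def node_hosts_resource(node, resource):
    """Return True if `resource` is active on `node`.

    Hypothetical helper: parses the output of
    `crm_resource --resource <res> --locate`.
    """
    out = subprocess.run(
        ["crm_resource", "--resource", resource, "--locate"],
        capture_output=True, text=True, check=False,
    ).stdout
    return node in out

def fence_delay(target, resource, delay_s, hosts=node_hosts_resource):
    """Delay fencing only when the fence target is hosting the resource,
    so the node running the service wins the fence race."""
    return delay_s if hosts(target, resource) else 0
```

The cluster query is injected (`hosts`) so the decision logic itself stays independent of the CLI tooling.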
Signed-off-by: Kumabuchi Kenji [email protected]
Can one of the admins verify this patch?
ok to test
@wenningerk Thank you for your feedback. I think it's not so difficult to fix the description and account for promotable resources, but I have no idea how to cover the self-fence case. Is there a good way to detect it?
Certainly, in the self-fence case a wasteful delay occurs, but honestly I think it's no different from a delay parameter fixed to a node.
Well, you know who you are (`crm_node --name`) and you know the target ;-)
I was talking to Ken about that approach yesterday as well, and he shares my feeling that doing something complicated like determining the location of a resource should be considered carefully as part of a fence agent. At least a note in the fence agent should state that this may put more weight on keeping the one instance running than on reliable recovery. If you put that fence agent on the same level as the real fence agent and determining the location fails or hangs, this node would never fence. So one possibility to think about would be to put it on a higher level than the real fence agent and to always make it fail. Then, if determining the location works, it may introduce a delay, and if it fails we would at least still have fencing. This could be made an option as well, so the user can select whether to weight reliable recovery higher than keeping the service running.
@wenningerk & Ken: Thank you very much for considering this proposal. I understand the concerns with the current approach.

> So one possibility to think about would be to put it on a higher level than the real fence agent and to always make it fail. Then, if determining the location works, it may introduce a delay, and if it fails we would at least still have fencing.
I certainly think that approach is safer. (In fact, I had considered it, but an implementation that always returns FALSE seemed strange, so I dropped it.) I will try to update the PR.
If you're uncomfortable with that, leave the default as is and make returning FALSE an option ...
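The always-fail layered approach discussed above could be sketched roughly like this (an illustrative assumption, not the PR's implementation): the heuristics device sits at an earlier stonith level, sleeps when the target hosts the service, and then reports failure so the real fence device at the next level is always attempted.

```python
import time

def heuristics_off(target, delay_s, hosts_service):
    """Hypothetical 'off' action for a heuristics-only fence device.

    Sleep only when fencing the node that hosts the service, then
    report failure so Pacemaker escalates to the real fence agent
    at the next stonith level.
    """
    if hosts_service(target):
        time.sleep(delay_s)
    return 1  # non-zero exit: "failed", escalate to the real agent
```

Because the action always fails, a hang or error while determining the resource location delays fencing at worst; it can no longer prevent it.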
Sorry for the late response, and thank you for the many important comments. I updated the PR and tested it on RHEL 8.0 / Pacemaker 2.0.1. But I'm not familiar with crm_mon's XML output, so please let me know if you have other comments.
Delaying fencing actions that target the nodes hosting more significant resources or resource instances (promoted), so that those nodes win fencing matches, has indeed been a very interesting topic for a lot of use cases, especially with regard to two-node clusters. Thanks for bringing up the topic, and with an implementation via a special fence agent ...
Actually, I've been thinking about the possibility of achieving this in pacemaker without requiring an additional fence agent or resource agent.
How about we introduce a couple of meta attributes for resources, for example:

`fencing-delay`
-- Delay fencing actions targeting the nodes that are hosting this resource for the specified period of time, so that they survive fencing matches.

`fencing-delay-promoted`
-- Delay fencing actions targeting the nodes that are hosting the promoted instances of this resource for the specified period of time, so that they survive fencing matches.

The fencer could either combine these delays with `pcmk_delay_base`/`pcmk_delay_max` to calculate the eventual delay, or pick the longest delay among them as the actual delay.
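The two candidate policies could be expressed as follows (illustrative only; `pcmk_delay_max` is an upper bound for a random delay, sampled here):

```python
import random

def eventual_delay(fencing_delay, pcmk_delay_base, pcmk_delay_max,
                   combine=True):
    """Two candidate policies for the eventual fence delay:
    either sum the meta-attribute delay with pcmk_delay_base/max,
    or take the longest of them."""
    random_part = random.uniform(0, pcmk_delay_max) if pcmk_delay_max else 0
    static = pcmk_delay_base + random_part
    return fencing_delay + static if combine else max(fencing_delay, static)
```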
What do you think?
That's an interesting idea -- I was hesitant about agents getting that involved in pacemaker's internal state, and that would avoid any such concerns.
We already have a "priority" meta-attribute for resources -- maybe it could be involved somehow. Instead of a per-resource fencing-delay, maybe a cluster-wide priority-fencing-delay that would get applied to the node with the highest total resource priority. I'm not sure how to prefer promoted instances, maybe a "promoted-priority" meta-attribute for clones that would get added to the base priority of promoted instances.
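The priority-based selection could be sketched like this (hypothetical data shapes; per the idea above, the delay would apply to the node with the highest total resource priority, with a small bonus for promoted instances):

```python
def node_priorities(resources_by_node, promoted_bonus=1):
    """Sum the `priority` meta-attribute of the resources active on
    each node, adding a small bonus for promoted clone instances."""
    totals = {}
    for node, resources in resources_by_node.items():
        totals[node] = sum(
            prio + (promoted_bonus if promoted else 0)
            for prio, promoted in resources  # (priority, is_promoted)
        )
    return totals

def delayed_node(resources_by_node):
    """The node with the highest total priority gets the fencing delay."""
    totals = node_priorities(resources_by_node)
    return max(totals, key=totals.get)
```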
The fencer uses the "quick location" flag with the scheduler, so I'm not sure offhand whether it has all the information needed.
I guess this brings up the old issue again that the view fenced has of the cluster state is only updated by a few rules, so it wouldn't know which delay to take; at the moment it isn't aware of whether a clone is promoted or not. But of course we could go a route where the scheduler pushes some additional delay into fenced, with fenced agnostic of how that delay is composed -- either as a parameter of the fence action or asynchronously via a side channel.
> But of course we could go a route where the scheduler pushes some additional delay into fenced, with fenced agnostic of how that delay is composed -- either as a parameter of the fence action or asynchronously via a side channel.
I like the idea of the scheduler calculating the delay and adding it to the graph action. However that would mean it would only apply to fencing initiated by the cluster (and not e.g. stonith_admin or dlm).
Probably somehow let the fencer check the cluster status and calculate the delay only when fencing is actually requested?
> We already have a "priority" meta-attribute for resources -- maybe it could be involved somehow. Instead of a per-resource fencing-delay, maybe a cluster-wide priority-fencing-delay that would get applied to the node with the highest total resource priority. I'm not sure how to prefer promoted instances, maybe a "promoted-priority" meta-attribute for clones that would get added to the base priority of promoted instances.
Good thinking. If we just involve the "priority" meta attribute, that'd be fine. I'm not sure whether introducing another "promoted-priority" attribute would become confusing... A simpler rule of course could be: promoted instances just take a slightly higher priority (+0.000001) than the base priority in the calculation, not higher than any other resource with a genuinely higher priority (>= +1).
> Probably somehow let the fencer check the cluster status and calculate the delay only when fencing is actually requested?
Up to now we've tried to avoid the overhead. The fencer checks the status at noncritical times, like when the CIB is changed. When fencing is requested, we want that path to be as streamlined as possible, both to avoid unnecessary delays and to reduce the chance that something could go wrong. If we want status at that point, we have to get the current CIB, so in the worst case we have to wait for that to time out before giving up.
But it would make it easier to have dynamic capabilities like rule processing and this.
If we did check status at fence time, we'd have to define what to do when reading the current CIB fails. Do we go with the most recent copy we have, or fail? If we use the most recent copy, do we disable the priority feature to avoid using outdated information?
> A simpler rule of course could be: promoted instances just take a slightly higher priority (+0.000001) than the base priority in the calculation, not higher than any other resource with a genuinely higher priority (>= +1).
That makes sense. Scores are integers, so I'd just go with +1 rather than converting back and forth from floating point. The +1 for implicit clone stickiness establishes a precedent. In my experience, users tend to use bigger jumps between configured scores anyway (at least 10).
Overall I lean to doing the calculations in the scheduler (and losing the feature for external fencing) rather than making the fencer more complex, but I can see arguments both ways.
> If we did check status at fence time, we'd have to define what to do when reading the current CIB fails. Do we go with the most recent copy we have, or fail? If we use the most recent copy, do we disable the priority feature to avoid using outdated information?
It'd be a difficult situation indeed ...
> That makes sense. Scores are integers, so I'd just go with +1 rather than converting back and forth from floating point. The +1 for implicit clone stickiness establishes a precedent. In my experience, users tend to use bigger jumps between configured scores anyway (at least 10).
Makes sense.
> Overall I lean to doing the calculations in the scheduler (and losing the feature for external fencing) rather than making the fencer more complex, but I can see arguments both ways.
Agreed. There probably always have to be trade-offs. I'll start with something on this.
And again a question: should the fencer combine `priority-fencing-delay` with `pcmk_delay_base`/`pcmk_delay_max`, or should it just pick the longest among them as the actual delay?
> And again a question: should the fencer combine `priority-fencing-delay` with `pcmk_delay_base`/`pcmk_delay_max`, or should it just pick the longest among them as the actual delay?
My first reaction was that they should combine, but thinking about possible scenarios, I think priority-fencing-delay should always have precedence.
The idea is that the user could configure priority-fencing-delay, which would always be used for cluster-initiated fencing, and then configure a static and/or random delay as a fallback to be used with external fencing. If the fencer receives a request with a priority delay specified, it would use that, otherwise it would use the normal delay parameters. That means the scheduler should always add it to fencing ops, even if it's 0 (i.e. targeting a lower-priority node), if the option is enabled.
The only scenario I can think of where combining them would be useful is when two nodes have equal priority. Someone could configure a static and/or random delay totaling less than the priority delay, which would only matter in that situation. But since, by definition, the user has no preference for which node gets fenced in that case, it should be acceptable to pick one node algorithmically (lowest ID or whatever) to get a +1 in that situation.
A mixed-version cluster might be another scenario where they're useful, but that should be a temporary situation.
I wonder if it would be worthwhile to have some exceptions, e.g. don't use a delay if we're fencing a node in our partition, and/or use the delay only if there's an even number of nodes. But presumably the user wouldn't configure a delay for an odd number of nodes.
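The precedence rule described above could be sketched as follows (an illustration of the proposal, not Pacemaker's actual code; parameter names are assumptions, and the random part of `pcmk_delay_max` is passed in pre-sampled):

```python
def request_delay(priority_delay, pcmk_delay_base, pcmk_delay_max_sample):
    """Precedence sketch: a priority delay carried in the fence request
    (even when it is 0, i.e. the target is the lower-priority node)
    overrides the device's static/random delay; only requests without
    one -- e.g. external fencing -- fall back to pcmk_delay_base/max."""
    if priority_delay is not None:
        return priority_delay
    return pcmk_delay_base + pcmk_delay_max_sample
```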
Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/fence-agents-pipeline/job/PR-308/1/input