resource-agents
Added agents to handle Oracle ASM
I was unsatisfied with the existing oraasm agent, so I have created two new agents:
- oraasmdg - ASM diskgroup mount: a more thorough approach, which uses the Oracle cluster (GI) to handle the ASM diskgroup and verifies it is mounted on the current node (the original oraasm did not do that).
- oraacfs - ASM clustered filesystem agent, which uses the Oracle cluster (GI) to handle the prerequisites and then mounts the filesystem.
Both agents offload the start task to the Oracle cluster, and by doing so avoid static timeout definitions. Depending on whether Oracle GI is running, the agents will either start the resources or simply verify that they are running and continue. In either case, any PCS resource depending on these resources will have Oracle ASM or Oracle ACFS available to it, or - if there is a problem - will not have its dependencies available and will attempt to start on another node.
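To illustrate the idea, here is a rough sketch (not the actual agent code - the parameter names and exact srvctl flags are illustrative and vary between GI versions):

```sh
#!/bin/sh
# Rough illustration only - not the actual agent code.
: "${OCF_FUNCTIONS_DIR:=${OCF_ROOT}/lib/heartbeat}"
. "${OCF_FUNCTIONS_DIR}/ocf-shellfuncs"

dg="$OCF_RESKEY_diskgroup"   # illustrative parameter name
gi_user="$OCF_RESKEY_user"   # e.g. 'grid'

dg_start() {
    # Offload the work to Oracle GI; srvctl blocks until GI is done,
    # so no static timeout has to be guessed on the PCS side.
    su - "$gi_user" -c "srvctl start diskgroup -g $dg" || return "$OCF_ERR_GENERIC"
    dg_monitor
}

dg_monitor() {
    # Verify the diskgroup is mounted on *this* node - something the
    # original oraasm agent did not do.
    su - "$gi_user" -c "srvctl status diskgroup -g $dg -n $(uname -n)" \
        | grep -q "is running" && return "$OCF_SUCCESS"
    return "$OCF_NOT_RUNNING"
}
```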
Can one of the admins verify this patch?
Can you also add the agents to doc/man/Makefile.am, so manpages are generated automatically?
> Can you also add the agents to doc/man/Makefile.am, so manpages are generated automatically?
Done.
Fixed oraacfs permissions, so Travis can complete checks.
ok to test
Regarding the lack of stop-action. Doesn't the node get fenced when you disable the resource and monitor still reports it as running?
> Regarding the lack of stop-action. Doesn't the node get fenced when you disable the resource and monitor still reports it as running?
It did not happen for me. However, if you have multiple clusters running on the same set of nodes, one of them should run without fencing. The best practice I have for combining these two clusters is to use the fencing mechanism of the Oracle GI cluster. That makes sense because I let it manage the disks of the system.
My test results showed that it worked flawlessly in this configuration: PCS attempted to "shut down" the ASM diskgroups (or ASM ACFS filesystems), got "success", and moved on.
It is imperative to understand that these two agents manage another cluster, not raw resources per se, which results in slightly different logic. Oracle GI handles split brain and network access well enough (it has many mechanisms for that). "My" PCS agents are required to keep up - either by waiting for Oracle GI to bring up the resources (or asking it nicely to bring them up) as a condition for any dependent resource, or by monitoring them and taking action based on that. The thing is that this dual-cluster configuration, weird as it may sound, can leave Oracle (database) instances managed not by PCS but by Oracle GI. We cannot force Oracle GI to stop resources we know nothing about...
This is a peculiar situation; it kind of stretches the boundaries of what pacemaker was designed for.
If I understand correctly, these agents are not so much managing services as serving as checkpoints that can be used with ordering constraints. On the other hand, they do have the ability to request a start.
My first suggestion would be to name them to reflect that they're not actually managing the relevant services, though I'm not sure what would be good. Something like ExternalOracleASM, OracleManagedASM, or CheckOracleASM maybe?
Regarding the stop, the main issue I see has to do with probes. The monitor action has no way of distinguishing whether the pacemaker resource (as opposed to the actual service) is active or not, which will likely cause all sorts of corner cases with the timing of probes. I would structure these agents like the Dummy agent, so that each uses the presence of a local file to know whether it itself is running. I.e. start would touch the file before proceeding with the current start implementation, and stop would remove the file. Then monitor would return "not running" if the file is not present, which would handle probes.
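Something along these lines (a minimal sketch; the state-file location just follows the usual HA_RSCTMP convention from ocf-shellfuncs):

```sh
# Minimal sketch of the Dummy-style state-file pattern.
STATEFILE="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.state"

agent_start() {
    touch "$STATEFILE" || return "$OCF_ERR_GENERIC"
    # ... then proceed with the current start implementation ...
    return "$OCF_SUCCESS"
}

agent_stop() {
    rm -f "$STATEFILE"
    return "$OCF_SUCCESS"
}

agent_monitor() {
    # Probes on nodes where this resource was never started return
    # "not running" here, keeping pacemaker's state tracking sane.
    [ -f "$STATEFILE" ] || return "$OCF_NOT_RUNNING"
    # ... only now check the actual Oracle resource ...
    return "$OCF_SUCCESS"
}
```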
Regarding fencing, I feel the proposed setup is unsafe. Pacemaker needs a way to fence a node in case of corosync loss or failed stop of some other resource. Probably the ideal solution would be to write a new fence agent that proxies requests to whatever fencing mechanism Oracle cluster offers. This would be similar to how existing fence agents for cloud services work.
Regarding waiting for Oracle GI to be up, that feels awkward to me. Since you are already writing agents for checkpointing Oracle resources, why not one more for checkpointing GI itself? Then you can configure a separate timeout for that so waiting for GI doesn't eat into the other resources' timeouts.
First - I agree with you. This is a peculiar situation. I derived this agent (and its name) from another agent called 'oraasm', which starts the Oracle cluster and doesn't wait for anything. Initially I wanted to use it, but it interfered with the Oracle cluster and rendered it mostly useless. Still, the name is derived from it, and if you conclude that a different name is required, it can be done - although, to be fair, the agent does ask Oracle GI to start the resource.
I have experienced no problems with PCS behaviour throughout my tests. I have failed nodes, booted nodes one at a time or together, and it worked well for me.

I am very unhappy with PCS's tendency to fence a node if a resource cannot stop. This is extremely troublesome when you add a new resource to an existing cluster carrying production workload, and due to some script or parameter error your node suddenly fences. In my opinion, fencing a node due to a 'stop' failure should not be the default, and should be configurable - especially since the op parameters for retry and failure handling are limited on initial resource creation. It has put me in some bad places on production systems...

The probes did not respond badly to OCF_SUCCESS on 'status' checks after the resource was supposed to be down. These probes are meaningful for an active/passive resource - there it is imperative to verify that the resource is active on a single node only (the default resource policy requires uniqueness). However, if the resource is created as a clone (as it should be!) on all the relevant nodes, then PCS takes it for granted that the resource is running.
I have elected to integrate the 'CRS start' command within this agent to reduce PCS complexity and to ease the process. Oracle GI (CRS) should start on boot; if it doesn't, the original oraasm agent can start it (it calls the main Oracle GI cluster orchestrator, 'has'), but that is not required unless used in stand-alone mode. The timeouts were selected according to my tests and should suffice - if CRS cannot start within them (and it mounts the ASM diskgroups as part of its startup, so the 'start' operation is just a safeguard), then there are no dependent resources we will want to run here, and the status of the resource will be 'stopped' (or failed to start), which is exactly the desired state (now fix your Oracle GI startup, and then clean up the PCS resources).
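For reference, the check involved is roughly this (a hedged sketch - the crsctl output matching is an assumption, as the message format varies between GI versions):

```sh
# Sketch: make sure Oracle GI is up before asking it for anything.
# 'has' is Oracle High Availability Services, the orchestrator the
# original oraasm agent drives; starting it here is only needed in
# stand-alone mode.
if ! crsctl check crs 2>/dev/null | grep -q "is online"; then
    crsctl start has
fi
```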
As said, this is a very specific agent for a very particular case. I can see several use-cases for it - none of them involving non-cloned resources. Such cases would be: an Oracle active/passive DB instance you do not want to manage through Oracle GI; a KVM virtual machine running over an ACFS clustered filesystem; an Apache web server using ACFS as shared storage; and so on. In all these cases, the shared filesystem can (and might) serve other applications/usages which are unrelated to the PCS cluster and should not necessarily be managed by it. That is the reason why I have chosen to use the 'fake' stop operation. And, as mentioned before, throughout my tests it showed no ill effects.
The Dummy approach should make it able to detect being stopped: run touch/rm in the start/stop actions, and `if [ ! -f ... ]; then return $OCF_NOT_RUNNING; fi` to return early from the monitor/status actions when the state file isn't present.
Link to Dummy code: https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Dummy#L132
Regarding naming, I think e.g. oracle-managed-asm-dg would be good here (also following our naming convention for new agents: https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc#is-there-a-naming-convention).
> I have experienced no problems with PCS behaviour throughout my tests. I have failed nodes, booted nodes one at a time or together, and it worked well for me.
>
> I am very unhappy with PCS's tendency to fence a node if a resource cannot stop. This is extremely troublesome when you add a new resource to an existing cluster carrying production workload, and due to some script or parameter error your node suddenly fences. In my opinion, fencing a node due to a 'stop' failure should not be the default, and should be configurable.
It is actually configurable. When you create the resource, you can configure the stop operation with on-fail="block", which will make the cluster not do anything further with the resource if the stop fails. With pcs it's like `pcs resource create ... op stop on-fail=block`.
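For example (the resource and agent names are placeholders):

```sh
pcs resource create asm_dg ocf:heartbeat:oraasmdg diskgroup=DATA \
    op stop on-fail=block
pcs resource clone asm_dg
```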
> The probes did not respond badly to OCF_SUCCESS on 'status' checks after the resource was supposed to be down. These probes are meaningful for an active/passive resource - there it is imperative to verify that the resource is active on a single node only (the default resource policy requires uniqueness). However, if the resource is created as a clone (as it should be!) on all the relevant nodes, then PCS takes it for granted that the resource is running.
There may not be any immediately obvious ill effects, but my instinct is that there are corner cases that will eventually cause problems.
If a probe finds the resource running when the resource needs to be started, then pacemaker will consider the resource already started, and not run the agent's start action. If a probe finds the resource running when the resource needs to be stopped, then pacemaker will schedule a stop action, and consider the resource stopped when it completes successfully -- yet if another probe is scheduled for any reason, it will find it running again, causing another stop to be scheduled. This sort of messing with pacemaker's internal resource state tracking is bound to cause problems somewhere.
Using a file to determine whether the pacemaker resource is active or not would allow the user to separate pacemaker's monitoring from the oracle resource itself. It would be possible to disable the pacemaker resource, which would stop pacemaker's monitoring as well as any resources dependent on that one, without having to stop the oracle resource.
We can't assume the clone will always be active, and always active on all nodes. A user can disable a resource, or ban a clone from one node.
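E.g. (placeholder names):

```sh
pcs resource disable asm_dg-clone    # stops the clone everywhere
pcs resource ban asm_dg-clone node2  # keeps the clone off one node
```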
It occurs to me that failure recovery is a (separate) concern. If the Oracle resource fails, by default pacemaker will respond by calling the agent's stop action and then its start action. That won't actually accomplish anything, so pacemaker may end up looping on that. The only other option I see is setting on-fail=block on the monitor operation, so that pacemaker freezes if the oracle resource fails.
I do not have this testing environment anymore. I performed start/stop operations repeatedly and, following a very thorough test, had a production-ready environment. I cannot perform more tests in the coming days; I would have to set up a testing lab now and see how I can build a similar cluster.

It is imperative to note that the agent is not meant to manage Oracle ASM resources, but just to allow 3rd-party applications to orchestrate based on their running state. PCS, much like other clusters, does not support agents with limited capabilities - start and monitor, but never stop; the design does not allow it. This is one of those specific cases where it was required. Using a 'fake lock' file would introduce a set of terrible corner cases, as you have mentioned - what if the file pre-exists but ASM is down? Then dependent resources would (attempt to) start, because there is no need to start a resource that is already up. Monitoring is going to be hell as well, and the number of edge cases I leave the reader to imagine.

This agent does not collide with any other agent, it does not hinder the use of any other agent, and it comes with a README explaining the use case and the limitations (as well as how it actually works behind the curtain). It can be a great add-on for mixed-cluster setups, assuming the integrator takes the required steps to set it up correctly. This is, in a way, the case with any agent. More than once or twice, a cluster node has fenced because some argument of a resource was set incorrectly, or because the cluster incorrectly identified a failed-to-start and then failed-to-stop state. It makes me shiver any time I update a working production cluster with a new resource.

In the rare cases where I need this agent, I will use it. I wish to make it available to others, with the respective "we take no responsibility" disclaimer, which is the case with most agents anyhow. I should also add that this is one of the least aggressive agents - it never fails to stop, and thus will not crash your cluster. So, assuming I cannot further test this agent in the near future, and assuming it worked well (as expected, including node failure, forced stop on the Oracle GI level, node standby/unstandby and so on) - are there any constructive changes anyone thinks I should make? As far as I have tested - about 6-7 hours of tests - this is a production-ready agent, given the documented constraints.
It still concerns me that a monitor failure of this resource will lead to pacemaker attempting to restart it, which won't accomplish anything, and that it's not possible to distinguish whether the monitoring resource is running versus whether the actual disk group is up.
This got me to thinking that the closest parallel is the ocf:pacemaker:ping agent, which sets a node attribute according to whether an IP address (not managed by the cluster) is pingable. Other resources can then use rule-based constraints according to the attribute, rather than constraining versus the ping resource itself.
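For reference, the usual ping setup looks like this ('pingd' is the attribute name the ping agent sets by default; 'my_db' is a placeholder):

```sh
pcs resource create ping ocf:pacemaker:ping host_list=192.168.1.1
pcs resource clone ping
pcs constraint location my_db rule score=-INFINITY pingd lt 1 or not_defined pingd
```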
I think that approach would be good here. The agent itself would be like a dummy, just using a file to indicate started/stopped. But it would also set a node attribute according to whether the disk group is up or not. Other resources could then be located where the attribute is good.
> It still concerns me that a monitor failure of this resource will lead to pacemaker attempting to restart it, which won't accomplish anything, and that it's not possible to distinguish whether the monitoring resource is running versus whether the actual disk group is up.

The agent will try to restart it, that is true. So monitoring works, and start works - if possible. Either the resource (the ASM diskgroup) starts or it does not. If it starts, the cluster has performed its task (calling GI to start the ASM diskgroup); if it does not, the failure to start will force all dependent resources to start somewhere else.
This resource is not too far from your suggested ping - it cannot orchestrate the whole GI cluster (which is a huge product), but it will 'push' GI to start the ASM diskgroup (which by default is defined to start only if requested to, or if a GI resource depends on it - and in our case this is not a must), and it will monitor its status, so dependent resources can schedule location/state according to it. The only thing not handled here is the 'stop' signal, which might fail if there are dependent GI resources PCS is not aware of, or might kill said resources violently. There are (rare) cases where an ASM diskgroup dismounts, and this resource will bring it up (actively!) and respond correctly.
> The agent will try to restart it, that is true. So monitoring works, and start works - if possible.
I mean pacemaker will restart the resource itself, not the disk group (i.e. call the agent's stop action and then start action). If the monitor detects a problem, doing that restart with the current code won't change anything.
> Either the resource (the ASM diskgroup) starts or it does not. If it starts, the cluster has performed its task (calling GI to start the ASM diskgroup); if it does not, the failure to start will force all dependent resources to start somewhere else.
> This resource is not too far from your suggested ping - it cannot orchestrate the whole GI cluster (which is a huge product), but it will 'push' GI to start the ASM diskgroup (which by default is defined to start only if requested to, or if a GI resource depends on it - and in our case this is not a must), and it will monitor its status, so dependent resources can schedule location/state according to it. The only thing not handled here is the 'stop' signal, which might fail if there are dependent GI resources PCS is not aware of, or might kill said resources violently.
Setting a node attribute would separate the status of the disk group itself, from the status of the remote check of the disk group. I.e.:
- start action = touch the state file, then do what it currently does
- monitor action = check whether the state file exists (which is the only thing that determines whether this resource is running or not running), and set a node attribute based on whether the disk group is active
- stop action = remove the state file
With that design, dependent resources would not be ordered after this resource, but instead located where the node attribute indicates the disk group is up.
That allows pacemaker to manage the resource correctly, i.e. stop actually results in the resource's state being "not running". The resource state would be separate from the disk group's state, which is indicated by the node attribute instead.
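A minimal sketch of that design (the attribute name asm_dg_up and the helper diskgroup_is_mounted_locally are illustrative; attrd_updater is the same tool ocf:pacemaker:ping uses to publish its attribute):

```sh
STATEFILE="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.state"

agent_monitor() {
    # "Is this pacemaker resource running?" is answered by the state
    # file alone...
    [ -f "$STATEFILE" ] || return "$OCF_NOT_RUNNING"
    # ...while the disk group's health is published as a node attribute.
    if diskgroup_is_mounted_locally; then   # illustrative helper
        attrd_updater -n asm_dg_up -U 1
    else
        attrd_updater -n asm_dg_up -U 0
    fi
    return "$OCF_SUCCESS"   # the resource itself is "running" either way
}

agent_stop() {
    rm -f "$STATEFILE"
    attrd_updater -n asm_dg_up -D    # clear the attribute on stop
    return "$OCF_SUCCESS"
}
```

Dependent resources would then be placed with something like `pcs constraint location my_db rule score=-INFINITY asm_dg_up ne 1 or not_defined asm_dg_up`.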
> There are (rare) cases where an ASM diskgroup dismounts, and this resource will bring it up (actively!) and respond correctly.
I recently had a 'clash' with PCS where it detected resources as alive when they were not; the 'monitor' command was buggy in that particular agent. The bottom line is that PCS behaves like this:
- If the resource is active, do not activate it (assume it's working because the 'monitor' has returned that it works)
- If the resource is inactive and is required to start - start it
- If the resource is active and needs to be stopped - stop it.
- If the resource is supposed to be off, do not monitor it (even if it's on - PCS doesn't know it)
This results in a scenario that matches my use-case: the resource is either on or off. If it's supposed to be on, PCS will either monitor it, or (attempt to) start it and monitor it later on. If the resource is off when it's supposed to be on, PCS will either start it or fail (which is the correct behaviour, as I see it). If the resource is supposed to be off, PCS will not start it and will not monitor it - so it doesn't matter what its "real" status is. If PCS is meant to take the resource down, it will run the 'stop' sub-command and leave it be.
Unfortunately, my test environment is not a test environment anymore, and since the customer has come to their senses (backed away from integrating the two clusters), I can't see how I can further test or develop this agent.

However, there is an existing agent for Oracle ASM which does not work well at all, and given its condition and its place in this repository, I think a better (albeit not perfect) agent should take its place. This agent was written for that purpose and was tested rather thoroughly. I can firmly say that when the resource is in 'stopped' mode, no ill effect happens to a cluster node or its dependent resources. Incorrect configuration (as a unique resource) will lead to problems, but only until the resource is configured as a clone - I have documented that in the attached README file.

I think this agent can allow the (very) few who need integration between both clusters to work it out, and it is a base for developing additional agents that coordinate timing between external (only partially controlled) resources and PCS resources, even just as a reference. What I'm saying is: it does no harm, and it would benefit from further tests - so state that this is not a fully production-ready agent just yet, and merge it already. Thanks