
RA: Filesystem: bind mount point unmounting postcondition check failure (for the path being a member of 2+ mounts in a misordered chaining scenario) has the power to unintentionally(!) kill all processes

jnpkrn opened this issue 6 years ago • 12 comments

Consider the following:

fs_stop() {
        local SUB=$1 timeout=$2 sig cnt
        for sig in TERM KILL; do
                cnt=$((timeout/2)) # try half time with TERM
                while [ $cnt -gt 0 ]; do
                        try_umount $SUB &&
                                return $OCF_SUCCESS
                        ocf_exit_reason "Couldn't unmount $SUB; trying cleanup with $sig"
                        signal_processes $SUB $sig
                        cnt=$((cnt-1))
                        sleep 1
                done
        done
        return $OCF_ERR_GENERIC
}

In the case of a bind mount, and in the rare circumstance that try_umount $SUB fails (e.g. because the source of the bind mount was itself a mount point that hung for some reason), the resource agent will proceed to kill every process touching the same filesystem as the target directory of this bind mount (as detected by fuser -m). At that point, though, this boils down to every process touching the root mount point /, which is practically the whole world (if for nothing else, every non-chrooted process under Linux touches that very filesystem by holding a reference to it as its "root").
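To illustrate the fuser -m resolution (a minimal, hypothetical demonstration; the path is an example):

# /mnt/B is not (or no longer) a mount point here, so -m resolves it
# to the filesystem the directory itself lives on -- the root filesystem
mkdir -p /mnt/B
fuser -m /mnt/B    # matches virtually every PID on the system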

That's very bad, since this is a misbehaviour: only the cluster resource manager shall take decisions about what to do on the failed stop action (if that is indeed the expected outcome in this very case).

Case in point (thanks @computer045 for the report): https://github.com/ClusterLabs/pcs/issues/197

jnpkrn avatar Mar 14 '19 16:03 jnpkrn

To add, I cannot exclude that this is not specific to bind mounts (other scenarios may be affected as well), but that's the context it was observed in.


Simple yet possibly non-portable or otherwise problematic solutions:

  1. pass -M to fuser as well (so it only acts if the given path really is a mount point)

  2. use stat output to compare "device identifiers" of the given alleged mount point and of / -- and do not proceed if they match (does anybody know of a counterexample where proceeding would actually be desirable? I was thinking about namespaces ~ chrooting, but then, the majority of the agents are likely flawed in some aspect there anyway, so I wouldn't follow this rabbit hole); a sketch follows below
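A minimal sketch of option 2 (the is_root_fs helper name is made up here; assumes GNU stat and the agent's existing conventions):

is_root_fs() {
        # refuse to act when the alleged mount point resolves to the
        # same device as / -- i.e. it is not a separate filesystem
        local dir_dev root_dev
        dir_dev=$(stat -c '%d' "$1") || return 1
        root_dev=$(stat -c '%d' /) || return 1
        [ "$dir_dev" = "$root_dev" ]
}

# to be consulted before any fuser -m based signalling, e.g. in fs_stop:
if is_root_fs "$SUB"; then
        ocf_exit_reason "$SUB resolves to the root filesystem; refusing to signal processes"
        return $OCF_ERR_GENERIC
fi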


Also, for better auditability, the

sending signal TERM to: root      3035     1  0 Mar07

log lines (note, at least, how unwieldy the above looks without squeezing the white space) should also carry the reason, e.g.,

sending signal TERM for holding /net/home filesystem handle to ...

It's a really dangerous thing for a privileged process to decide to go on a killing spree against other processes based on a single yet arbitrarily wide criterion; extreme care + good auditing are desirable.
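Such a line could be produced roughly as follows (a sketch only; the sig/dir/pid variable names follow the agent's conventions but are assumed here):

# squeeze the whitespace in the ps output and name the filesystem held
ocf_log info "sending signal $sig for holding $dir filesystem handle to: $(ps -o user=,pid=,ppid=,stime=,comm= -p "$pid" | tr -s ' ')"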

jnpkrn avatar Mar 14 '19 18:03 jnpkrn

Have you tried with force_unmount=safe? You could also set it to false.

oalbrigt avatar Mar 15 '19 08:03 oalbrigt

This may help; however, the degenerate case of burning everything down shall rather be prevented in any case, since causality precludes it ever being the right thing to do. The only exception that pops up is the Filesystem resource running prior to the cluster stack starting, with the stack (undesirably) started from such a mount point -- and that case could be detected thanks to the parent-process chain going from the agent's process up to the cluster resource manager (hitting pacemaker-execd, then PID 1 in the case of pacemaker 2+, and none of these shall carry handles towards these mounts directly in sane configurations).
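That detection could look roughly like this (an untested sketch; walking the chain via ps and matching on the pacemaker-execd process name are both assumptions):

started_by_cluster_stack() {
        # walk the parent-process chain upwards; reaching pacemaker-execd
        # means the agent was launched by the cluster stack itself
        local pid=$$ comm
        while [ "$pid" -gt 1 ]; do
                comm=$(ps -o comm= -p "$pid") || return 1
                [ "$comm" = "pacemaker-execd" ] && return 0
                pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
        done
        return 1
}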

jnpkrn avatar Mar 15 '19 12:03 jnpkrn

EDIT: the root cause is detailed in the next comment: https://github.com/ClusterLabs/resource-agents/issues/1304#issuecomment-473525495 -- but what follows is the next real threat there, which would likely get triggered at some point too, were it not for the root issue.

~~Actually, the problem here is that the Filesystem agent doesn't satisfy the unpronounced re-entrancy requirement as discussed recently:~~

https://lists.clusterlabs.org/pipermail/users/2019-February/025454.html

To be pointed out, this was a rather naive original proposition (OTOH, the first grounding claim about a matter of this sort at all!):

So in practice, I think that's the guideline: an agent should be re-entrant for distinct sets of parameters, but not necessarily any one set of parameters.

Knowing the problem at hand (see below), let's refine this constraint further:

An agent shall take all precautions to be re-entrant, i.e., parallel execution of the agent, even with non-identical sets of user-configured parameters[*] and regardless of the respective actions being performed, will not cause any adverse effects. It's the agent's responsibility to figure out the semantics-based interference amongst such executions in that case, and to apply suitable measures to prevent it.

For instance, when the resource agent at hand is meant to establish a mount, pinning device/path A to the existing filesystem path B, its instance shall, upon its start or stop action (the operations hazarding an interference), resolve A over the possibly transitive chain of bind mounts down to the very backing device, use that to derive a lock name, and try taking the lock, otherwise resorting to retest-the-situation-and-possibly-proceed-again waiting (a rough sketch follows below). This will prevent possible uncoordinated and destructive interference whereby the race between checking that the mount is active and the actual unmounting in one instance of the agent can be interleaved with the same execution in a parallel instance, making the unmount fail for the former and possibly leading to undesired failover consequences. In the most conservative case, there can be just a single lock shared by all of the agent's instances, but that will, undesirably, amplify timeout-based issues and reduce parallelism needlessly. Finally, the solution could be offloaded onto configuration tools, but that shatters the point-of-authority centralized abstraction around the particular thing being controlled, which the agents happen to be in this arrangement.
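A minimal sketch of that locking, assuming Linux findmnt and flock (the helper names are made up):

# resolve a (possibly bind-mounted) path down to its backing device
backing_device() {
        findmnt -n -o SOURCE --target "$1"
}

# run "$@" under a lock derived from the backing device of $1
with_mount_lock() {
        local dev lockfile
        dev=$(backing_device "$1") || return 1
        lockfile="/run/Filesystem-$(echo "$dev" | tr -c 'a-zA-Z0-9' '_').lock"
        shift
        (
                flock 9 || exit 1   # block until competing instances finish
                "$@"
        ) 9>"$lockfile"
}

# usage within fs_stop could then be: with_mount_lock "$SUB" try_umount "$SUB"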

[*] a resource manager complying with OCF XY is mandated to prevent parallel execution of agent instances with identical sets of user-configured parameters (or ditto, applied exclusively to the unique-group ones)


Full circle back...

https://github.com/ClusterLabs/pcs/issues/197#issuecomment-472046029 is effectively a subtle configuration error (two practically identical mount managers established, stomping on each other's feet when executed in parallel), of the kind we cannot reasonably require users to spot/realize (reliability also means resilience to human error). So the agent (here: Filesystem) needs to gain this broken-config-proofing on behalf of the users, with possible solutions sketched in the example above accompanying the proposal for extending the OCF standard. The partial solutions suggested in the 2nd comment are rather a last resort that, in an ideal case, would never be exercised and hence could be omitted.

Other sorts of agents are eligible for similarly critical review in this regard.

jnpkrn avatar Mar 16 '19 11:03 jnpkrn

Actually, the referenced issue is not necessarily of that "parallel execution" sort, but it's a generally valid concern nonetheless, e.g. for a "self-inflicted failure to start/stop" when two parallel actions go in opposite directions, etc.

The very root cause of the referenced case is a situation like this, with a time progression:

mount -o bind A B:
device               path A <---bind-mount---- path B

---

mount device A

expected situation:
device <---mount---- path A <---bind-mount---- path B

per /etc/mtab:
device <---mount---- path A <---bind-mount---- path B
+
                 +-------------------mount-------+
                 |                               |
device <---------+   path A <---bind-mount---- path B

---

umount B

expected situation:
device <---mount---- path A                    path B

per /etc/mtab:
device               path A <---bind-mount---- path B

--> i.e. B still mounted ~ failure to stop
    (isolated within the agent, which attempts to recover
    by killing all the processes pertaining to the
    filesystem of (likely) A -- which is not mounted,
    hence likely that of the root of the tree)

Since mount -o bind A B ran ahead of mount device A, the net result is that umount B (against the expectations of the resource manager) has the same meaning as umount A. Stopping A will cause a stopped B under the hood, but that's not an immediate concern, since the killing spree has already begun and the resource manager will not be notified about this -- it gets killed as well, unless processes touching the root filesystem are assuredly never killed, as proposed. Said parallelism may only help to circumvent the logic further.
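Roughly, the problematic ordering boils down to this (hypothetical paths and device; the exact umount outcome depends on the kernel and util-linux versions at play):

mkdir -p /mnt/A /mnt/B
mount -o bind /mnt/A /mnt/B   # bind established first ...
mount /dev/sdb1 /mnt/A        # ... the real mount at A arrives second
umount /mnt/B                 # per the diagrams above, the /etc/mtab
                              # bookkeeping and the kernel's view diverge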

The other part of a thinkable configuration sanity enforcement (which would get useful either with force_unmount=false or once the proposed sanity preservation is in place) is that bind mounts shall only be allowed within mount chains managed by the resource manager (as in: with enforced ordering, clear matches between the paths in the chain, cardinality enforcements and such), not in isolation (where their usage would generally be unrestricted).

Rgmanager as a cluster resource manager was superior in this regard, allowing one to model such chains concisely and (IIRC) with exactly such sanity enforcements (implicit validation of resource stackability).

Pacemaker is more universal but lacks all this implicit guidance, and resource agents themselves are the wrong level of abstraction to make decisions based on (resource manager specific) introspection. Now, perhaps some arrangement like that of rgmanager could be retrofitted partially, but without putting the physical configuration format in line with the way of thinking about the resources (as done with rgmanager), this won't get decent adoption, I am afraid.

(sorry if it's all rather chaotic; I don't have weeks to examine this to the tiniest detail. It's enough that something went wrong with an allegedly legitimate configuration and we failed to prevent the arising disaster with enough sanity guarding)

jnpkrn avatar Mar 16 '19 12:03 jnpkrn

The immediate solution here is likely along the lines of https://github.com/ClusterLabs/resource-agents/issues/1304#issuecomment-473005913 since a destroyed world, including its master, is the worst outcome possible in any scenario.

Regarding the wider surface, a locking (journaling-like) approach could possibly help here if it enforced that umounts are silently skipped when there are other bind-mount-inflicted possessions of the same underlying device. Perhaps there are flags to mount to deal with this as well, but that would then be a portability issue.

And finally, I think a systemic solution is needed -- will OCF help to enforce law & order at the inter-agent(-instance) scale? https://github.com/ClusterLabs/OCF-spec/issues/22

jnpkrn avatar Mar 16 '19 15:03 jnpkrn

Perhaps it'd be enough if mounting over an existing bind mount point failed as unsupported, but that could cause a regression in some extreme corner cases. Discretion preventing the agent from becoming the world destroyer is likely in order regardless, as there's hardly a counterexample where that is the intention -- and if there is one, it would be solvable with a non-default force_unmount=strict.

jnpkrn avatar Mar 17 '19 22:03 jnpkrn

Wow, that's a nasty bug.

krig avatar Mar 18 '19 09:03 krig

I guess we could do a findmnt -n --target <mount> | grep -q "\[" to check if it's mounted on a bind mount point.
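In context, that check might look like this (untested; findmnt prints the bound subtree of the source in square brackets, e.g. /dev/sda1[/A], and the MOUNTPOINT variable name is assumed):

# skip the forced cleanup when the target involves a bind mount
if findmnt -n --target "$MOUNTPOINT" | grep -q "\["; then
        ocf_log warn "$MOUNTPOINT involves a bind mount; skipping forced unmount cleanup"
fi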

@jnpkrn Did you mean safe in your last comment force_unmount=strict? Anything but true or safe is treated as false for force_unmount.

oalbrigt avatar Mar 18 '19 10:03 oalbrigt

On 18/03/19 03:54 -0700, Oyvind Albrigtsen wrote:

@jnpkrn Did you mean safe in your last comment force_unmount=strict? Anything but true or safe is treated as false for force_unmount.

Actually, I meant a new value in the enumeration that would preserve the current (allegedly wrong in this very case) semantics, while moving the default (true) towards the current behaviour sans killing accesses related to the root filesystem, should the decision end up like that (it rather shall not, but...).

Since it's a case of "unsafe behaviour can be made explicitly safer", it's not too troubling a direction regarding compatibility, IMHO, but versioned parameters can be used if it's of any importance.

-- Jan (Poki)

jnpkrn avatar Mar 18 '19 13:03 jnpkrn

Sounds like this patch might solve this issue: https://github.com/ClusterLabs/resource-agents/pull/1362

oalbrigt avatar Jun 24 '19 10:06 oalbrigt

Vaguely -- it would only solve the /-rooted scenario, which is just a subset of the general case, I think (I also have some comments on the "how" for said #1362; more discussion there).

jnpkrn avatar Aug 26 '19 19:08 jnpkrn