RA: Filesystem: bind mount point unmounting postcondition check failure (when it is a member of 2+ mounts in a misordered chaining scenario) has the power to unintentionally(!) kill all processes
Consider the following:
```sh
fs_stop() {
    local SUB=$1 timeout=$2 sig cnt
    for sig in TERM KILL; do
        cnt=$((timeout/2)) # try half time with TERM
        while [ $cnt -gt 0 ]; do
            try_umount $SUB &&
                return $OCF_SUCCESS
            ocf_exit_reason "Couldn't unmount $SUB; trying cleanup with $sig"
            signal_processes $SUB $sig
            cnt=$((cnt-1))
            sleep 1
        done
    done
    return $OCF_ERR_GENERIC
}
```
In the case of a bind mount, and in the rare circumstance that `try_umount $SUB`
fails (e.g. because the source of the bind mount was itself
a mount point that hung for some reason), the resource agent will proceed
to kill every process touching the same filesystem as the target
directory of this bind mount (as detected by `fuser -m`). At that
point, however, this boils down to any process touching the root mount
point `/`, which is practically the whole world (if nothing
else, every non-chrooted process under Linux touches that very
filesystem simply by holding a reference to it as its "root").
That's very bad: this is a misbehaviour, and only the cluster
resource manager should take decisions about what to do about the failed
stop action (if that indeed is the expected outcome in this
very case).
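To illustrate the blast radius (the path below is made up, not from the actual report): when the alleged mount point is no longer a separate mount, `fuser -m` happily falls back to the filesystem that merely contains it:

```sh
# Hypothetical illustration: /srv/bindtarget is just a plain directory here --
# nothing is mounted on it any more -- so "-m" resolves to the filesystem
# containing it, typically the root filesystem, and the resulting PID list
# covers nearly every process on the machine.
fuser -m /srv/bindtarget      # effectively the same as: fuser -m /
```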
Case in point (thanks @computer045 for the report): https://github.com/ClusterLabs/pcs/issues/197
To add: I cannot rule out that this also affects non-bind mounts, but
bind mounts are the context it was observed in.
Simple yet possibly non-portable or otherwise problematic solutions:

- pass `-M` to `fuser` as well
- use `stat` output to compare "device identifiers" for the given alleged mount point and that of `/` -- do not proceed if they match (does anybody know of a counterexample where it would actually be desirable? I was thinking about namespaces ~ chrooting, but then, the majority of the agents is likely flawed in some aspect anyway, so I wouldn't follow this rabbit hole); see the sketch after this list
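A minimal sketch of the second idea, not the agent's actual code (the helper name and its placement in the stop path are illustrative): refuse the process cleanup outright when the alleged mount point lives on the same device as `/`:

```sh
# Hypothetical guard, not existing agent code: compare the st_dev of the
# alleged mount point with that of / and refuse to signal anything if they
# match, i.e. the directory no longer is a separate mount and "fuser -m"
# would effectively target the root filesystem.
same_fs_as_root() {
    local dir_dev root_dev
    dir_dev=$(stat -c %d "$1" 2>/dev/null) || return 1
    root_dev=$(stat -c %d /)
    [ "$dir_dev" = "$root_dev" ]
}

# would sit in the stop path before any signalling:
if same_fs_as_root "$MOUNTPOINT"; then
    ocf_exit_reason "$MOUNTPOINT resolves to the root filesystem; refusing to signal its users"
    return $OCF_ERR_GENERIC
fi
```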
Also, for better auditability of the

```
sending signal TERM to: root      3035     1  0 Mar07
```

log lines (note how unwieldy the above looks, at least without squeezing the white space), there should also be the reason stated, e.g.,

```
sending signal TERM for holding /net/home filesystem handle to ...
```

It's really a dangerous thing for a privileged process to decide to go on a killing spree towards other processes based on a single but arbitrarily wide criterion; extreme care + good auditing is desirable.
Have you tried with `force_unmount=safe`? You could also set it to `false`.
This may help, however the degenerate case of burning everything down
should rather be prevented in any case, since causality precludes it
ever being the right thing to do -- the only case that pops out is
that the mount exists prior to the cluster stack starting and the
stack is (undesirably) started from such a mount point, and that case
could be detected thanks to the parent-process chain going from the agent's
process up to the cluster resource manager (hitting `pacemaker-execd`,
then PID 1 in the case of pacemaker 2+, and none of these shall carry
handles towards these mounts directly in sane configurations).
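A rough sketch of that detection (the process names reflect pacemaker 2.x / older-stack naming and are an assumption; this is not something the agent currently does): walk the parent chain and see whether the cluster resource manager is among the ancestors before PID 1 is reached:

```sh
# Rough sketch, not existing agent code: walk the parent-process chain of the
# running agent and report whether it was spawned from under the cluster stack
# (pacemaker-execd on pacemaker 2+, lrmd on older stacks).
spawned_by_cluster_stack() {
    local pid=$$ comm
    while [ "$pid" -gt 1 ]; do
        comm=$(ps -o comm= -p "$pid") || return 1
        case "$comm" in
            pacemaker-execd|lrmd) return 0 ;;
        esac
        pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
        [ -n "$pid" ] || return 1
    done
    return 1
}
```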
EDIT: the root cause is detailed in the next comment: https://github.com/ClusterLabs/resource-agents/issues/1304#issuecomment-473525495 -- but this is the next real threat there, which would likely get triggered at some point too, were it not for the root issue.
~~Actually, the problem here is that the Filesystem agent doesn't satisfy
the unspoken re-entrancy requirement as discussed recently:~~
https://lists.clusterlabs.org/pipermail/users/2019-February/025454.html
To be pointed out: this was a rather naive original proposition (OTOH, the first grounding claim about a matter of this sort at all!):

> So in practice, I think that's the guideline: an agent should be re-entrant for distinct sets of parameters, but not necessarily any one set of parameters.
Knowing the problem at hand (see below), let's refine this constraint further:
An agent shall take all precautions for it to be re-entrant, i.e., parallel execution of the agent, even with non-identical sets of user-configured parameters[*] and regardless of the respective actions being performed, will not cause any adverse effects. It's the agent's responsibility to figure out semantic-based interference amongst such executions in that case, and to apply suitable measures to prevent it.
For instance, when the resource agent at hand is meant to establish a mount, pinning device/path `A` to the existing filesystem path `B`, its instance shall, upon its `start` or `stop` action (operations hazarding an interference), resolve `A` over a possibly transitive chain of `bind` mounts to the very source backing device, use that to derive a lock name, and try taking the lock, resorting to retest-situation-and-possibly-proceed-again waiting otherwise. This will prevent possible uncoordinated and destructive interference whereby the race between checking that the mount is active and the actual unmounting in one instance of the agent can be interleaved with the same execution in a parallel instance, making the unmount fail for the former, possibly leading to undesired failover consequences. In the most conservative case, there can be just a single lock to be used by whatever agent instance, but this will, undesirably, amplify timeout-based issues and reduce parallelism needlessly. Finally, the solution could be offloaded onto configuration tools, but that shatters the point-of-authority centralized abstraction around the particular thing to control, which the agents happen to be in this arrangement.

[*] a resource manager complying with OCF XY is mandated to prevent parallel execution of agent instances with identical sets of user-configured parameters (or ditto applied exclusively on the `unique-group` ones)
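A minimal sketch of that serialization under the stated assumptions (the helper name, lock directory and `LOCK_TIMEOUT` are made up; `findmnt` and `flock` availability is assumed): resolve the configured path to its backing source, derive a lock file from it, and run the hazardous action under an exclusive lock:

```sh
# Hypothetical sketch of the proposed serialization, not existing agent code:
# resolve the target to its backing source (findmnt follows bind mounts and
# reports e.g. "/dev/vg0/data[/subdir]"), hash that into a lock file name and
# run the risky start/stop section under an exclusive flock.
serialize_on_backing_source() {
    local path=$1 src lock
    src=$(findmnt -n -o SOURCE --target "$path") || src=$path
    lock="/run/Filesystem-$(printf '%s' "$src" | md5sum | cut -d' ' -f1).lock"
    shift
    (
        flock -w "${LOCK_TIMEOUT:-20}" 9 || exit 1
        "$@"    # e.g. fs_stop "$MOUNTPOINT" "$timeout"
    ) 9>"$lock"
}

# usage sketch: serialize_on_backing_source "$MOUNTPOINT" fs_stop "$MOUNTPOINT" "$timeout"
```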
Full circle back...
https://github.com/ClusterLabs/pcs/issues/197#issuecomment-472046029
is effectively a subtle configuration error (two in practice identical
mount managers established, stomping on each other's feet when executed
in parallel), one we cannot reasonably require the user to
spot/realize (reliability also means resilience to human error), so
the agent (here Filesystem) needs to gain this broken-config-proofing
on behalf of the users, with possible solutions sketched in the
above example accompanying the proposal for extending the OCF standard.
The partial solutions suggested in the 2nd comment are rather a last
resort that, in an ideal case, would never be exercised
and hence could be omitted.
Other sorts of agents are eligible for similarly critical review in this regard.
Actually, the referenced issue is not necessarily of that "parallel execution" sort, but it's a generally valid concern nonetheless, e.g. a "self-inflicted failure to start/stop" if two parallel actions go in opposite directions, etc.
The very root cause of the referenced case is a situation like this with a time progression:
```
mount -o bind A B:

    device               path A <---bind-mount---- path B

---

mount device A

expected situation:

    device <---mount---- path A <---bind-mount---- path B

per /etc/mtab:

      +---------------------mount----------------------+
      |                                                 |
    device <---mount---- path A <---bind-mount---- path B

---

umount B

expected situation:

    device <---mount---- path A                     path B

per /etc/mtab:

    device               path A <---bind-mount---- path B

    --> i.e. B still mounted ~ failure to stop
        (isolated within the agent, which attempts to recover by
        killing all the processes pertaining to the filesystem
        of (likely) A -- which is not mounted, hence likely
        that of the root of the tree)
```
Since `mount -o bind A B` ran ahead of `mount device A`,
the net result is that `umount B` (against the expectations of
the resource manager) has the same meaning as `umount A`
(stopping A will cause a stopped B under the hood, but
that's not an immediate concern since the killing spree
has already begun, and the resource manager will not even be
notified about this because it gets killed as well -- unless
processes touching the root filesystem are assuredly never
killed, as proposed), and said parallelism may only help to
circumvent the logic further.
The other part of thinkable configuration sanity enforcement
(which would become useful either with `force_unmount=false` or
once the proposed sanity preservation is in place)
is that bind mounts shall only be allowed within resource-manager-managed
mount chains (as in: with enforced ordering, moreover with clear
matches between the paths in the chain, with cardinality enforcements
and such), not in isolation (where their usage would generally be
unrestricted).
Rgmanager as a cluster resource manager was superior in this regard, allowing such chains to be modelled concisely and (IIRC) with exactly such sanity enforcements (implicit validation of resource stackability).
Pacemaker is more universal but lacks all this implicit guidance whatsoever, and resource agents themselves are the wrong level of abstraction to make decisions based on (resource-manager-specific) introspection. Now, perhaps some arrangement like that of rgmanager could be retrofitted partially, but without putting the physical configuration format in line with the way of thinking about the resources (as done with rgmanager), this won't get decent adoption, I am afraid.
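For illustration, a pcs sketch of such an explicitly modelled chain (resource and device names are hypothetical): the bind mount is only ever started after, stopped before, and kept on the same node as the filesystem it binds from:

```sh
# Hypothetical resource/device names; what matters is the explicit ordering
# and colocation between the backing filesystem and the bind mount on top of it.
pcs resource create fs_data ocf:heartbeat:Filesystem \
    device=/dev/vg0/data directory=/srv/data fstype=xfs
pcs resource create fs_bind ocf:heartbeat:Filesystem \
    device=/srv/data directory=/srv/export fstype=none options=bind
pcs constraint order start fs_data then start fs_bind
pcs constraint colocation add fs_bind with fs_data INFINITY
```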
(sorry if it's all rather chaotic; I don't have weeks to examine this to the tiniest detail -- it's enough that something went wrong with an allegedly legitimate configuration and we failed to prevent the arising disaster with enough sanity guarding)
The immediate solution here is likely along the lines of https://github.com/ClusterLabs/resource-agents/issues/1304#issuecomment-473005913 since a destroyed world, including its master, is the worst possible outcome in any scenario.
Regarding the wider surface, a locking (journaling) like approach could
possibly help here if it enforced that umounts are silently skipped
when there are other bind-mount-inflicted possessions of the same
underlying device. Perhaps there are flags to `mount` to deal with
this as well, but that would then be a portability issue.
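A rough sketch of that skip (the helper name is made up and `findmnt` availability is assumed): before unmounting, check whether the backing device still backs other mount points and bail out gracefully instead of escalating:

```sh
# Hypothetical sketch, not existing agent code: resolve the target's backing
# device (stripping the "[/subdir]" decoration findmnt adds for bind mounts)
# and silently skip the umount if that device still backs other mount points.
umount_unless_shared() {
    local target=$1 src others
    src=$(findmnt -n -o SOURCE --target "$target") || return 1
    src=${src%%\[*}
    others=$(findmnt -rn -o TARGET,SOURCE |
        awk -v dev="$src" -v tgt="$target" \
            '{ s = $2; sub(/\[.*/, "", s); if (s == dev && $1 != tgt) print $1 }')
    if [ -n "$others" ]; then
        ocf_log info "$src still mounted at: $others; skipping umount of $target"
        return $OCF_SUCCESS
    fi
    umount "$target"
}
```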
And finally, I think a systemic solution is needed; will OCF help to enforce law & order at the inter-agent(-instance) scale? https://github.com/ClusterLabs/OCF-spec/issues/22
Perhaps it'd be enough if mounting over an existing bind mount point
failed as unsupported, but that could cause a regression in some
extreme corner cases. Discretion preventing the agent from becoming
the world destroyer is likely in order regardless, as there's
hardly an intended counter-example of wanting that -- and if there is,
it would be solvable with a non-default `force_unmount=strict`.
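A rough sketch of what such an extension could look like in the parameter handling (the `strict` value and the extra guard flag are hypothetical; this is not the agent's current code):

```sh
# Hypothetical parameter handling: "strict" keeps today's unconditional
# behaviour, while the other values refuse to signal processes once the
# target turns out to resolve to the root filesystem.
case "${OCF_RESKEY_force_unmount:-true}" in
    strict) FORCE_UNMOUNT=yes; ALLOW_KILL_ON_ROOT_FS=yes ;;
    safe)   FORCE_UNMOUNT=safe; ALLOW_KILL_ON_ROOT_FS=no ;;
    true|yes|1|on) FORCE_UNMOUNT=yes; ALLOW_KILL_ON_ROOT_FS=no ;;
    *)      FORCE_UNMOUNT=no ;;
esac
```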
Wow, that's a nasty bug.
I guess we could do a `findmnt -n --target <mount> | grep -q "\["` to check if it's mounted on a bind mount point.
@jnpkrn Did you mean `safe` in your last comment (`force_unmount=strict`)? Anything but `true` or `safe` is treated as `false` for `force_unmount`.
On 18/03/19 03:54 -0700, Oyvind Albrigtsen wrote:

> @jnpkrn Did you mean `safe` in your last comment (`force_unmount=strict`)? Anything but `true` or `safe` is treated as `false` for `force_unmount`.
Actually, I meant a new value in the enumeration that would preserve
the current (allegedly wrong in this very case) semantics, while moving the
default (`true`) towards the current behaviour minus killing accesses
related to the root filesystem, if the decision ends up like
that (it rather shall not, but...).
Since this is a case of "unsafe behaviour becoming explicitly safer", it's not a too troubling direction regarding compatibility, IMHO, but versioned parameters can be used if it's of any importance.
-- Jan (Poki)
Sounds like this patch might solve this issue: https://github.com/ClusterLabs/resource-agents/pull/1362
Vaguely: it would only solve the `/`-rooted scenario, which is just
a subset of the general case, I think (I also have some comments on
the "how" for said #1362; more discussion there).