solo5
hvt_drop_privileges() behaviour
#276 adds scaffolding for hvt_drop_privileges() and an OpenBSD pledge implementation.
hvt_drop_privileges() is called by the tender just before entering the VCPU loop, i.e.:
- after all host resources have been acquired (all modules set up)
- after the guest ELF has been loaded and memory set up.
We should decide what the behaviour should be on Linux and FreeBSD, to get "secure by default" behaviour. At this time I'm not prepared to add actual seccomp() filtering or Capsicum here, but I think that the behaviour should include at least:
- Dropping privileges to an unprivileged user if the tender is running as root.
- chroot() to a suitable directory.
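The two steps above could be sketched roughly as follows. This is only an illustration, not the Solo5 implementation: drop_to_user() is a hypothetical helper, and the user name and directory are left as parameters because the right defaults are exactly what is being discussed here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* glibc: expose chroot() and setgroups() */
#endif
#include <grp.h>
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Solo5): if running as root, chroot() into
 * `dir` and switch to the unprivileged `user`. Returns 0 on success or if
 * there was nothing to do, -1 on failure.
 */
static int drop_to_user(const char *user, const char *dir)
{
    /* Look the user up first: the passwd database is unreachable after chroot(). */
    struct passwd *pw = getpwnam(user);
    if (pw == NULL)
        return -1;

    if (geteuid() != 0)
        return 0; /* not root: nothing to drop */

    if (chroot(dir) == -1 || chdir("/") == -1)
        return -1;

    /* Order matters: supplementary groups, then gid, then uid. */
    if (setgroups(1, &pw->pw_gid) == -1)
        return -1;
    if (setgid(pw->pw_gid) == -1 || setuid(pw->pw_uid) == -1)
        return -1;

    /* Paranoia: regaining root must now fail (also rejects a uid 0 target). */
    if (setuid(0) != -1)
        return -1;
    return 0;
}
```

A caller would invoke this once, after all host resources are set up, e.g. drop_to_user("nobody", "/var/empty") if those turn out to be acceptable defaults, and abort on a non-zero return.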
@hannesm: Presumably the FreeBSD case is easier, since there will always be a suitable non-privileged UID available and /var/empty could be used for the chroot?
For the Linux case, I'm not sure what the best default behaviour is that is guaranteed to work across distributions -- could nobody be used as the unprivileged user? I don't know of any equivalent of /var/empty, but there might be something in the LSB/FHS.
A second case is when this is run out of a container on Linux -- can we guarantee that a nobody user will always be available there?
/cc @adamsteen
@mato sounds sensible for FreeBSD /cc @sg2342 who may have an opinion here
i think for FreeBSD a call to cap_enter(2) would be the way to go.
Turns out that cap_enter(2) will not work, because ppoll(2) is not permitted in capability mode. However, the poll(2) code in the FreeBSD kernel is Capsicum-enabled and shares the relevant parts with ppoll(2). So the only thing missing in the FreeBSD source tree is an entry for ppoll(2) in sys/kern/capabilities.conf (and make -C sys/kern/ sysent to regenerate the syscall configuration).
I've been thinking about what to do here for the Linux case. There are two cases where the behaviour should be quite different:
- In a non-containerized setup: unlike the BSDs there should be no need to run the hvt tender as root. The only thing the tender needs access to is /dev/kvm, which is easily granted through normal permission bits on the device file.
- If the tender is running in a container (irrespective of container runtime): It is legitimate to run as root as we can assume it's not "real root", and in a minimal, fully deprivileged container where the tender is the only process running there's not much point in doing anything else (e.g. chroot).
Therefore, I think that for the first case (classic system) we should require running as non-root and make it the user's responsibility to ensure that access to /dev/kvm is available (via being a member of the kvm group -- I believe some distros may even enable this by default for everyone). No chroot()-ing should be done, as there is no sensible default.
In the second case, this requires defining what "running in a container" means and then being able to detect it. I'd prefer to avoid various heuristics (see e.g. here: https://github.com/genuinetools/bpfd/blob/master/proc/proc.go) and instead explicitly support only the case where the tender is running as the sole PID 1, which implies it's inside a PID namespace, which to me seems a "good enough" way to detect if it is running in a "container". So, if and only if:
if (getpid() == 1 && getppid() == 0)
returns true, we allow running as root (probably easiest to test just with getuid() for now rather than CAP_SYS_ADMIN) and do not do anything else, as in this setup there's likely no point in (or meaningful default for) a chroot().
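To make the proposed Linux policy concrete, here is a rough sketch. The function names are hypothetical and nothing like this exists in the tree yet; the decision is split into a pure function so that it can be exercised without actually being PID 1.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: true if the process looks like the sole init of a PID namespace. */
static bool is_container_init(pid_t pid, pid_t ppid)
{
    return pid == 1 && ppid == 0;
}

/*
 * Hypothetical policy check: non-root may always run; root may run
 * if and only if the tender is PID 1 with no parent.
 */
static bool tender_may_run(uid_t uid, pid_t pid, pid_t ppid)
{
    if (uid != 0)
        return true;
    return is_container_init(pid, ppid);
}

/* Applies the policy to the calling process. */
static bool tender_may_run_here(void)
{
    return tender_may_run(getuid(), getpid(), getppid());
}
```

The tender would call tender_may_run_here() early in startup and refuse to continue if it returns false. As noted, this uses getuid() rather than a CAP_SYS_ADMIN check.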
Will try to pass this by some Linux container experts for opinions...
On the FreeBSD side, it looks like we might have to revert this entirely (see #312).
Update on the Linux side -- I'm no longer convinced we should do anything along the lines of classic privilege dropping there. Rather, with the introduction of "spt" (#310) we should look into applying a seccomp sandbox to the hvt tender also.
Ok, so, in the light of:
- Discussion in #316, the real fix here is as described in https://github.com/Solo5/solo5/pull/316#issuecomment-464757104 (fixing the FreeBSD vmm APIs).
- The build system refactoring in #326 changes the way hvt modules work, and the HVT_DROP_PRIVILEGES=0 case will effectively be non-functional until I re-design that to not use compile-time flags. However, in the meantime, I need the tests to pass!
I'm going to make an executive decision here, which is: The FreeBSD privilege dropping introduced in #286 will be reverted on master shortly.
Also, I'm going to close all issues/PRs related to this except #282, since the multiple discussions are confusing. We can discuss how to proceed forward there.
Note to self: TODO: The existing VM cleanup code for FreeBSD and OpenBSD should be audited, and warn() should be used in the atexit handler(s) if any syscalls fail.
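As a reminder of the pattern meant here, a minimal sketch follows. The descriptor and function names are hypothetical stand-ins, and close() stands in for whatever syscalls the real FreeBSD/OpenBSD cleanup code makes.

```c
#include <err.h>
#include <stdlib.h>
#include <unistd.h>

static int vm_fd = -1; /* hypothetical descriptor released at exit */

/* Close fd; on failure report via warn() instead of failing silently. */
static int cleanup_fd(int fd)
{
    if (close(fd) == -1) {
        warn("cleanup: close(%d)", fd); /* appends strerror(errno) */
        return -1;
    }
    return 0;
}

/* atexit handler: must not call exit(), so it only warns on failure. */
static void cleanup_at_exit(void)
{
    if (vm_fd != -1)
        (void)cleanup_fd(vm_fd);
}

/* Remember fd and register the handler; returns 0 on success. */
static int register_cleanup(int fd)
{
    vm_fd = fd;
    return atexit(cleanup_at_exit);
}
```

The point is only that a failed cleanup syscall leaves a message on stderr, so a leaked VM can be noticed and removed by hand.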
/cc @sg2342 @hannesm
#366 implements a capsicum(4) sandbox for Solo5/hvt on FreeBSD 12+. Removing the FreeBSD labels from this, as I consider this done there.