Rootless Nomad

Open tgross opened this issue 2 years ago • 9 comments

Nomad client agents must be run as root. The notion of "rootless" containers has worked its way through the container ecosystem. This issue is a bit of a brain-dump to assemble some thoughts and discussion around running Nomad "rootless". Please note this isn't yet a roadmap item or even a promise that Nomad will ever support rootless operation. If we decide to pursue this direction, we'd then engage in a design process (RFC) before we could start work on this.

What is Rootless?

Rootless operation has several criteria:

  1. The container orchestrator (ex. Nomad client agent or k8s kubelet) is not running as root.
  2. The container runtime (ex. dockerd, podman) is not running as root.
  3. The root user inside the container cannot be mapped to the root user on the host.

User-namespace mapping (criterion 3) alone can already be done by Nomad for some task drivers, so this issue is primarily focused on running Nomad itself as an unprivileged user.
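
To make the third criterion concrete, here is a minimal sketch that parses the text of a Linux uid_map file (as found under /proc/<pid>/uid_map) and reports which outside UID the container's root maps to. The function names are hypothetical and this is not how Nomad's task drivers perform the mapping; it only illustrates what "root is not mapped to root" means in terms of the kernel's mapping table.

```python
from typing import Optional


def host_uid_for(container_uid: int, uid_map_text: str) -> Optional[int]:
    """Translate a UID inside a user namespace to the UID outside it.

    Each uid_map line is "<inside-start> <outside-start> <length>".
    Returns None when the UID has no mapping.
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


def root_maps_to_host_root(uid_map_text: str) -> bool:
    """True when UID 0 inside the namespace is also UID 0 outside it."""
    return host_uid_for(0, uid_map_text) == 0


# Identity mapping, as seen in the initial (host) user namespace.
HOST_MAP = "0 0 4294967295\n"
# Illustrative remapped container: UID 0 inside is UID 100000 outside.
REMAPPED = "0 100000 65536\n"
```

With these sample strings, `root_maps_to_host_root(HOST_MAP)` is true and `root_maps_to_host_root(REMAPPED)` is false. The 100000/65536 range is only an illustrative value, not something Nomad or any task driver is known to use.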

Why Rootless?

Container runtimes and orchestrators need to perform privileged operations normally reserved to the root user (or to a user that can escalate via sudo or doas):

  • Resource isolation via cgroups
  • Namespace isolation: mount, user, pid, ipc, and network namespaces
  • Allowing inbound and outbound network traffic to the workload's network namespace.

Therefore running rootless containers has two primary use cases:

  • Running containers as an unprivileged user for interactive use. For single-user machines like your typical developer laptop, being able to run sandboxed software as a normal user is handy. This use case doesn't have much overlap with Nomad, as Nomad tasks aren't interactive.
  • Reducing the attack surface of the container orchestrator and/or runtime. These are complicated pieces of software with network access, and running them as root exposes a large attack surface if the application is compromised. This is the more interesting case for us.

Requirements for Rootless

Given the privileged operations described above, there are some specific requirements for rootless containers:

  • Delegated cgroups: safely setting cgroups as an unprivileged user requires cgroups v2.
  • User namespaces: on some distros this may require setting sysctls like kernel.unprivileged_userns_clone=1
  • Creating veth pairs across network namespaces (currently implemented with CNI)
  • Creating iptables rules for inbound traffic
  • Task driver engine (ex. dockerd, podman, containerd, etc) must be configured for rootless operation. This requires cgroups v2, user namespaces, and either a patched kernel or kernel module (overlay.ko) that allows unprivileged overlayfs, or a FUSE overlay filesystem. While this is all the responsibility of the task driver engine, we'd probably need to document anything we intend to support here.
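
A rough sketch of how the first two requirements might be probed on a Linux host follows. It is not Nomad's fingerprinting code: the function and key names are hypothetical, and the `root` parameter exists only so the check can be pointed at a test directory. It treats the presence of /sys/fs/cgroup/cgroup.controllers as the sign of a unified cgroups v2 hierarchy, which says nothing about whether a subtree has actually been delegated to the unprivileged user. The kernel.unprivileged_userns_clone knob only exists on some distros, so a missing file is read as "not restricted by this knob".

```python
import os


def read_text(path):
    """Return the stripped file contents, or None if missing or unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def check_rootless_prereqs(root="/"):
    """Probe kernel interfaces under `root` for rootless prerequisites."""
    controllers = read_text(
        os.path.join(root, "sys/fs/cgroup/cgroup.controllers"))
    userns_clone = read_text(
        os.path.join(root, "proc/sys/kernel/unprivileged_userns_clone"))
    max_userns = read_text(
        os.path.join(root, "proc/sys/user/max_user_namespaces"))
    return {
        # cgroup.controllers sits at the mount root only on a v2 hierarchy.
        "cgroups_v2": controllers is not None,
        "cgroup_controllers": controllers.split() if controllers else [],
        # Distro-specific knob: absent means this knob imposes no limit.
        "userns_clone_allowed": userns_clone in (None, "1"),
        # A limit of 0 disables user namespace creation.
        "userns_limit_nonzero": (
            max_userns is not None
            and max_userns.isdigit()
            and int(max_userns) > 0
        ),
    }


if __name__ == "__main__":
    for key, value in check_rootless_prereqs().items():
        print(f"{key}: {value}")
```

The networking and task driver engine requirements are deliberately left out, since checking them depends on the CNI plugins and engine in use.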

Nomad-specific quirks

Nomad supports a wide variety of task drivers, which may have their own "runtimes" that may not even be containers (ex. QEMU).

Because Nomad task groups can have mixed task drivers, Nomad has to split duties of setting up the task environment between the task driver and the rest of the client agent. For example, Nomad clients set up network namespaces, perform cpuset cgroup accounting, etc., but delegate bind-mounts to the task driver.

Nomad supports Windows and Mac! (Natively and not by running in a VM!) We definitely want to provide some exec-like isolation for Windows tasks in the future, so whatever we do here should not block off a path to doing so.

But Everyone Else is Doing It!

So how does everyone else do this? All the implementations I've been able to find combine required kernel and OS configuration, user namespaces, and either setuid binaries for networking or user mode networking.

User namespaces are unfortunately a bit half-baked. Even a cursory glance at recent CVEs (ex. CVE-2022-32250, CVE-2022-1055, CVE-2022-24122, CVE-2021-4197, CVE-2022-0185) illustrates the primary problem: any vulnerability in user namespaces allows an attacker to escalate to full root. While administrators should decide for themselves whether user namespaces are appropriate for their threat model, we should approach this with caution in Nomad so that we're not encouraging their use by folks who assume they're perfectly safe.

Likewise, setuid binaries allow an unprivileged user access to root-privileged operations. This also means that if any unprivileged user on the machine is compromised, the attacker can immediately escalate to some set of root privileges. And if the setuid binary itself is compromised, the attacker owns the entire host. Ideally a setuid binary is well-scoped and well-audited, but because it can be run by an unprivileged user it may be easier to attack than an application running as root, especially if that application is written in a memory-safe language. For single-user machines like developer laptops, this may not be an unreasonable tradeoff, but it may not be acceptable for production servers.

Common setuid binaries for rootless containers include the LXC project's lxc-user-nic, and the newuidmap and newgidmap binaries that RootlessKit leans on. The RootlessKit project also uses user-mode networking (via slirp4netns) to bypass the requirement for a setuid binary for networking.
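
For administrators weighing this tradeoff, a small sketch like the one below could list which helper binaries on a host are actually setuid-root. The function names are hypothetical and the paths in the usage section are assumptions about where a distro installs these helpers. Some distros may grant these helpers privileges through file capabilities instead of the setuid bit, which this check would not detect.

```python
import os
import stat


def is_setuid_root(st_mode: int, st_uid: int) -> bool:
    """True for a regular file owned by root with the setuid bit set."""
    return (
        stat.S_ISREG(st_mode)
        and st_uid == 0
        and bool(st_mode & stat.S_ISUID)
    )


def audit_helpers(paths):
    """Map each existing path to whether it is setuid-root.

    Paths that do not exist (or cannot be stat'ed) are skipped.
    """
    report = {}
    for path in paths:
        try:
            st = os.stat(path)
        except OSError:
            continue
        report[path] = is_setuid_root(st.st_mode, st.st_uid)
    return report


if __name__ == "__main__":
    # Assumed install locations; adjust for the distro in question.
    candidates = ["/usr/bin/newuidmap", "/usr/bin/newgidmap"]
    for path, flagged in audit_helpers(candidates).items():
        print(f"{path}: {'setuid-root' if flagged else 'not setuid-root'}")
```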

Options

Here are some options to anchor a discussion around. These are in rough order of complexity, but aren't necessarily mutually exclusive either.

Documentation: Some administrators may want to accept giving up some features in exchange for rootless Nomad. We can document all the known kernel and OS configuration values, and document all the feature gaps that administrators will face with rootless Nomad.

Graceful Degradation: It may be that there are features that break running tasks entirely (ex. cpuset management comes to mind) under rootless Nomad. Identifying these and allowing for graceful degradation would help administrators who are ok with losing those features. One tricky bit with this is ensuring that none of the security-sensitive features end up degrading silently! Another is that we'd probably need additional client fingerprinting to ensure tasks don't get scheduled on clients that can't support rootless operation.
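
One way to picture the two tricky bits is the sketch below. Everything in it is hypothetical: the `Feature` type, the capability strings, and the choice to fail loudly are assumptions made for illustration, not Nomad behavior. Features whose requirements are missing are marked degraded, a security-sensitive feature that would be lost raises an error instead of degrading silently, and a scheduler-side check only places tasks on clients that fingerprinted every feature the task needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    requires: frozenset        # host capabilities the feature depends on
    security_sensitive: bool   # must never be dropped silently


class UnsafeDegradation(Exception):
    """Raised when a security-sensitive feature cannot be provided."""


def plan_features(features, available):
    """Split features into (enabled, degraded) given host capabilities."""
    enabled, degraded = [], []
    for feat in features:
        missing = feat.requires - available
        if not missing:
            enabled.append(feat.name)
        elif feat.security_sensitive:
            # Fail loudly rather than run with weaker isolation.
            raise UnsafeDegradation(f"{feat.name} needs {sorted(missing)}")
        else:
            degraded.append(feat.name)
    return enabled, degraded


def eligible(task_needs, client_features):
    """Scheduler-side check: every needed feature must be fingerprinted."""
    return set(task_needs) <= set(client_features)


# Illustrative feature table; names and requirements are made up.
FEATURES = [
    Feature("cpuset", frozenset({"cgroup_write"}), False),
    Feature("bridge_network", frozenset({"net_admin"}), False),
    Feature("fs_isolation", frozenset({"mount_ns"}), True),
]
```

For example, `plan_features(FEATURES, frozenset({"mount_ns"}))` would enable only fs_isolation and report cpuset and bridge_network as degraded, while an empty capability set would raise `UnsafeDegradation`.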

Setuid Networking: Currently Nomad uses CNI to implement networking on Linux. We could move operations that require privileges to a setuid binary instead, such as lxc-user-nic. We should probably already document the CNI requirement and fail gracefully without it (this needs more fingerprinting on the client), and we'd need to do the same for a setuid binary. We'd almost certainly need to provide some sort of fallback for administrators who don't want it. And none of this works on Windows.

Multi-Process Nomad: The motivation for wanting rootless Nomad is to reduce privileges. Instead of providing "true" rootless operation, we could follow in the long-standing tradition of Unix applications and have Nomad fork itself into multiple processes, only one of which runs as root.

Nomad is already shipped as a single "multi-call" binary; it can run as a Nomad server agent, a Nomad client agent, as the Nomad CLI, as logmon, or as a docker log shim. The client agent could be further split into a process that runs as root and unprivileged child processes that perform "riskier" tasks such as network IO with the server, downloading artifacts, rendering templates, etc.

Unlike setuid binaries, this approach would work equally well for Windows. We'd do something like call AdjustTokenPrivileges() with SE_PRIVILEGE_REMOVED set to drop privileges.
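
On POSIX systems, the multi-process split could look roughly like the sketch below: a parent that may be running as root launches a worker for the "riskier" work and has the child drop to an unprivileged UID/GID before it executes. This is only an illustration of the pattern, not a proposed Nomad design. The function names are hypothetical, UID/GID 65534 is an assumption (it is "nobody" on many Linux systems), and the `user`, `group`, and `extra_groups` arguments to `subprocess.run` require Python 3.9+ on a POSIX platform.

```python
import os
import subprocess
import sys

UNPRIVILEGED_UID = 65534  # assumption: "nobody" on many Linux systems
UNPRIVILEGED_GID = 65534


def drop_privilege_kwargs(euid, uid=UNPRIVILEGED_UID, gid=UNPRIVILEGED_GID):
    """Return subprocess kwargs that make the child run as uid/gid.

    Only root can switch users, so a non-root parent gets no extra kwargs
    and the child simply inherits the parent's identity.
    """
    if euid != 0:
        return {}
    # An empty extra_groups list clears supplementary groups in the child.
    return {"user": uid, "group": gid, "extra_groups": []}


def run_risky_worker(code, euid=None):
    """Run `code` in a separate Python process and return its stdout.

    `euid` defaults to the parent's effective UID; it is a parameter only
    so the non-root path can be exercised in tests.
    """
    if euid is None:
        euid = os.geteuid()
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
        **drop_privilege_kwargs(euid),
    )
    return proc.stdout.strip()


if __name__ == "__main__":
    # The worker reports the effective UID it ended up running with.
    print("worker euid:", run_risky_worker("import os; print(os.geteuid())"))
```

A real split would also need a protocol between the root process and its workers (for example over pipes or a Unix socket) so that workers can request the few operations that genuinely need privileges.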

tgross, Jul 11 '22 13:07