
Support for Windows job containers

Open dcantah opened this issue 3 years ago • 16 comments

See https://github.com/kubernetes/enhancements/pull/2288 for more background. To avoid any confusion: the name chosen for this container type in the CRI API and the user-facing Kubernetes settings is HostProcess containers. Internally we've coined these "job containers", but it's the same type of container; we'd just like to keep the name we use internally at the OCI level and in our code. The CRI HostProcess field being set would be our signal to fill in the WindowsJobContainer field on the runtime spec, for example.

There have been asks for Windows privileged containers, or something analogous to them, for quite some time. While in the Linux world this can be achieved just by loosening some of the security restrictions normally in place for containers, this isn't as easy on Windows for many reasons. There's no such thing as just mounting in /dev, to take the easy example.

The model we've landed on to support something akin to privileged containers on Windows is to keep using the container layer technology we currently use for Windows Server and Hyper-V isolated containers, and to simply have the runtime manage a process, or set of processes, in a job object as the container. The work for job containers is open source and lives here: https://github.com/microsoft/hcsshim/tree/master/internal/jobcontainers

As an example of the behavior when running a job container: if you ran a container using image X and rootfsMountPoint wasn't specified, the full rootfs you'd see in a normal process- or hypervisor-isolated container would be mounted on the host at a path determined by the runtime (hcsshim, really). The entire image becomes a new volume on the host sitting at C:\path\determined\by\runtime, so C:\ in a normal Windows container for that image would now be located at C:\path\determined\by\runtime on the host. If the rootfsMountPoint field IS set to something, all that changes is where the volume is mounted. The init process (and any processes it launches or that the user explicitly execs) will run in a job object, not in a server silo, which is the usual Windows isolation mechanism.

The container will be able to see and do anything a normal process could; whatever user the container runs as determines what it has access to. All of the usual resource limits, environment variables, and the user to run the container as from the runtime spec are still honored, although the user choice is now limited to whichever password-less accounts exist on the host rather than in the container image; otherwise the default is to inherit the token of whatever process launched the container.

This approach covers all of the use cases we've currently heard that privileged containers would be useful for. Some of these include configuring network settings, administrative tasks, viewing/manipulating storage devices, and the ability to simplify running daemons that need host access (kube-proxy) on Windows. Without these changes we'd likely set an annotation to specify that the runtime should create one of these containers, which isn't ideal.

As for the one optional field, this is really the only thing that actually differs/isn't configurable for normal Windows Server Containers. With job containers, the final writable layer (volume) for the container is mounted on the host, so it's accessible and viewable without enumerating the host's volumes and trying to correlate which volume belongs to the container. This is contrary to Windows Server Containers, where the volume is never mounted to a directory anywhere, although it's still accessible from the host for the curious.

Signed-off-by: Daniel Canter [email protected]

dcantah avatar Mar 29 '21 23:03 dcantah

cc @kevpar @katiewasnothere @anmaxvl @ambarve

dcantah avatar Mar 29 '21 23:03 dcantah

Looks like CI is failing because of the Docker Hub rate limit 🙃

"Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit The command "docker pull vbatts/pandoc" failed and exited with 1 during."

dcantah avatar Mar 29 '21 23:03 dcantah

> Looks like CIs failing because of the dockerhub rate limit 🙃
>
> "Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit The command "docker pull vbatts/pandoc" failed and exited with 1 during."

Looks like #1078 will fix this.

kevpar avatar Mar 30 '21 00:03 kevpar

The organizational bits of this seem fine to me, but I feel like I'm missing something on the exact details of what these are, and what the specific difference is between these and process / hyper-v isolation containers.

From what I've been able to understand so far, it sounds like these might not actually be containers at all, but really just host processes managed by the container runtime with an optional container image rootfs mounted to the host somewhere for the process to access?

(https://github.com/kubernetes/enhancements/pull/2288/files#r572380563 was somewhat useful, especially this bit: "Job objects have no process or file system isolation, enabling the privileged payload to view and edit the host file system with the correct permissions, among other host resources.")

Perhaps it would help to have a more concrete example -- if we took one of the Windows variants of https://hub.docker.com/_/golang and wanted to "run" it as a job container, what would that mean? Would that be setting rootfsMountPoint to something like C:\golang-image and then my "container" process is C:\golang-image\go\bin\go.exe ? What happens to the environment variables, etc of the image? :confused:

I'm trying to imagine this in the context of a basic runtime like the experience during docker run or ctr run and I'm having a hard time picturing how this would work (probably because it's not designed for doing that at all), but that's got me wondering whether it even belongs in the OCI runtime specification, since it's not really a container at all. :sweat_smile:

tianon avatar Apr 07 '21 18:04 tianon

@tianon Agreed this doesn't fit the general "container" definition very well; this is just the model we'd landed on for our version of a privileged container. It still utilizes parts of the container stack (the filesystem filter we use for container layers), but I'd be hard-pressed to call it a container either, if we're going by the community's common understanding of what a container is and should be.

So if you took that golang image and rootfsMountPoint wasn't specified, the full rootfs you'd see in a normal process- or hypervisor-isolated container would be mounted on the host at a path determined by the runtime (hcsshim, really). The entire image becomes a new volume on your host sitting at C:\path\determined\by\runtime, meaning C:\ in a normal Windows container for that image would now be located at C:\path\determined\by\runtime on the host. The environment variables in the runtime spec are still set on the init process. If the rootfsMountPoint field IS set to something, all that changes is where that volume is mounted.

We don't have any plans to support this for Docker at the moment, although we do for containerd (I guess if Docker on Windows ever starts going through containerd, we'll indirectly support Docker as well). These are mostly going to be used for managerial tasks, setup that needs to be performed on the host, running daemons on the host that today need to be deployed in some hacky way (generally as Windows services), stats collectors/loggers, etc. There are a lot of places these shine, but I understand the worry of putting something in the spec that doesn't fit the mold of what a container is understood to be today. Hopefully this was helpful 😅

dcantah avatar Apr 07 '21 19:04 dcantah

@tianon Did that answer everything hopefully? Are there any objections that maybe we can work out?

dcantah avatar Apr 14 '21 19:04 dcantah

@tianon Small ping on this 😄

dcantah avatar Apr 20 '21 23:04 dcantah

Anyone able to give this a peek? @crosbymichael @vbatts @tianon

dcantah avatar May 05 '21 20:05 dcantah

Like tianon said, your explanation seems fine, and maybe ought to be included in the text. Can https://github.com/microsoft/hcsshim/tree/master/internal/jobcontainers (or a versioned ref) be a decent perma-link for more info to define this field?

vbatts avatar May 14 '21 18:05 vbatts

@vbatts Thanks for taking a look! I've added a "run a container" example to the description. By a decent perma-link to define the field do you mean add a link in the comment for the WindowsJobContainer struct to somewhere that gives a rundown on the container type? I'm not sure if we'd want to link to the code directly but want to make sure I understand the ask. Ideally we could make a vanity url (aka.ms) to the code for now that we could update to point to the best spot as things evolve, does that sound reasonable?

dcantah avatar May 18 '21 20:05 dcantah

@vbatts Ping on the above

dcantah avatar May 24 '21 17:05 dcantah

@vbatts Small ping again on this 😬

dcantah avatar Jun 08 '21 13:06 dcantah

Sorry for the delay -- I'm honestly still very confused why this is being proposed to https://github.com/opencontainers/runtime-spec instead of being a feature of kubelet or even a standalone tool. Given you don't imagine using it via Docker, I'm guessing you don't imagine it being implemented in runtimes like runc either?

My understanding of this so far is that it's essentially just a codified way to mount a container image filesystem at some given path at the host, potentially only visible to a single process (similar to the way a mount namespace works in Linux)?

Looking at this through the eyes of both a user and an image creator, I'm trying to understand how it would be used, and to that end I'd like to create a simple example scenario to try and help make sure I'm understanding correctly.

As an image author, I create something like the following:

FROM mcr.microsoft.com/windows/servercore:1809
RUN setx /m PATH "C:\container-tool;%PATH%"
ENV CONTAINER_TOOL_CONFIG C:\container-tool\container-tool.config
COPY container-tool.exe container-tool.config C:\container-tool\
CMD ["container-tool.exe"]

When I run this image as-is, it happily runs container-tool.exe from C:\container-tool, and correctly picks up the appropriate configuration file thanks to CONTAINER_TOOL_CONFIG.

If I run this as a "job" container, my understanding is that C:\ inside the "container" will be C:\ from the host, not from the image, and that the image contents will be mounted at either rootfsMountPoint or wherever the runtime feels like mounting it, and container-tool.exe won't necessarily even know where that is.

  • What will the value of PATH be?
  • Will CONTAINER_TOOL_CONFIG still be pointing at the (now wrong) C:\container-tool\container-tool.config path?
  • Will the runtime even be able to find container-tool.exe appropriately?

Maybe you can give an example of how one of these might be created via containerd's ctr tool? I feel like there's some major usability bit that I'm failing to grasp here, and I'm still left wondering why this is being proposed to the runtime-spec when from my understanding so far it's really a host process that happens to have access to a container/image filesystem, not a container/image process that has access to the host.

tianon avatar Jun 08 '21 14:06 tianon

@tianon And now sorry for the delay on my end 😓.

> Given you don't imagine using it via Docker, I'm guessing you don't imagine it being implemented in runtimes like runc either?

If Docker ever ends up going through containerd instead of calling the hcs/hns methods itself, then this would likely be supported. I'm not following the last bit, however; Windows containers in general aren't implemented in runc either (maybe I'm misunderstanding).

> My understanding of this so far is that it's essentially just a codified way to mount a container image filesystem at some given path at the host, potentially only visible to a single process (similar to the way a mount namespace works in Linux)?

Sort of, but it's not visible to only one process; it's viewable by any process on the host. The closest comparison would be making an overlayfs and mounting it somewhere in the root mount namespace. It's essentially a codified, container-workflow way to package and run a process. It's quite a departure from a Linux privileged container, which is still a container in the ways we think of one.

To make another Linux analogy, a job container is like if you skipped all of the namespace aspects and just:

  1. Made an overlayfs with the layers from a container image
  2. Ran a process from the image, added it to a cgroup, and set whatever limits you wanted.

You're correct that the above image would not work as expected, as C:\ will just be the host's C:\. An image author would have to adjust their image, or make a new one, for this type of container in most scenarios. We set an environment variable named CONTAINER_SANDBOX_MOUNT_POINT on every process in each container that points to where the volume is mounted on the host, so processes can use it to figure out where the image contents live (or reference it on the command line).

Anything set with ENV in a container image will get passed through as expected and set on the container process, as it ends up in the OCI image config, but your setx example won't, as we don't read an image's registry for env vars. The reason is that these containers no longer need to carry all of Windows, so parsing the registry out of the image layers isn't a viable way to deal with that. We expect to have some way to build an essentially empty image that an author could package binaries into for these (sort of like FROM scratch), but that's still in the works.

If you passed the container CMD as ["container-tool\container-tool.exe"] or ["%CONTAINER_SANDBOX_MOUNT_POINT%\container-tool\container-tool.exe"], both would find the binary: relative paths are resolved against the mount path, and the second works because, as described, the env var points to where the volume is mounted.

A main draw, as mentioned, is being able to run things that currently need to be maintained/scheduled in a custom way or run as Windows services, kube-proxy being a prime example. Your last comment rings true, though: this really is more of a process that gets access to (and is launched from) a container image than a container that can magically see the host.

dcantah avatar Jun 21 '21 19:06 dcantah

> It's quite a departure from the Linux world of a privileged container where it's still a container in the ways we think of one.

I think this is still my biggest disconnect/struggle here, because I'm still failing to see how this is a container?

Technically the spec currently allows for the mount namespace of a container to be optional, but specifying root.path is not optional, and several things (such as uid and gid) are defined in terms of "the container namespace" which references namespaces generally in the glossary but usually is really referring specifically to the container's mount namespace, which is rooted at root.path.

So what's being proposed here, while interesting, still doesn't seem to match any definition of "container" we currently support, hence my confusion and hesitation.

tianon avatar Jul 08 '21 23:07 tianon

@tianon Your concerns definitely aren't without reason 😬; it kind of abandons a lot of the usual assumptions. The only real similarities job containers carry over are the copy-on-write filesystem semantics and (possibly down the road) Windows' network isolation mechanism (network compartments), which HNS namespaces are an abstraction over. I think the main point here was to make the workflow as "complete" as possible for launching one of these, and for us that means having an OCI field to check so we know we're being asked to make one. Otherwise we'd need to check an annotation or use some other means.

I think Windows is a weird beast already in a lot of ways, as most of the namespace-like aspects are jumbled into the one silo object (except networking), rather than split apart like on Linux.

dcantah avatar Jul 09 '21 23:07 dcantah