swarmkit Be mindful of image architectures when scheduling tasks

Oct 24 '18 20:10 wk8

Based on a conversation with @wsong: our real desire here is to run a service only on Windows nodes with named pipe support, rather than to schedule by image architecture; what we really need here is scheduling based on named pipe support.

Oct 24 '18 20:10 anshulpundir

Thanks for the clarification @anshulpundir

Oct 24 '18 21:10 wk8

> what we really need here is scheduling based on named pipe support.

If we implement it this way, the solution becomes far less general.

Can we implement this based on platform support in the multi-arch image manifest? If we do that, then these scheduling smarts will be generally useful for anyone using multi-arch, and not unique to just this named pipe use case.

Oct 25 '18 00:10 dhiltgen

@dhiltgen @anshulpundir @wsong : do I understand correctly that the goal here is to ensure that we only schedule tasks on nodes with the right architecture & OS according to the image's manifest?

If so, it looks like this is already the case: when creating a service, by default (unless the --no-resolve-image flag was set or platforms were explicitly set), the client already retrieves the manifest from the registry and passes it to swarm (https://github.com/moby/moby/blob/12bba16306dd618883fe2e85ef1730efc572294f/client/service_create.go#L45 and https://github.com/moby/moby/blob/12bba16306dd618883fe2e85ef1730efc572294f/client/service_create.go#L66-L71). We do the same on updates (https://github.com/moby/moby/blob/12bba16306dd618883fe2e85ef1730efc572294f/client/service_update.go#L44-L78).

This is then used by swarm when filtering which nodes are eligible to run the task (https://github.com/docker/swarmkit/blob/bc032e24784ea618044ee438fedec3458abb2ef9/manager/scheduler/filter.go#L253-L320).
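To make the filtering step concrete, here is a minimal Python sketch of the idea, not swarmkit's actual Go code. The `Platform` class, the alias table, and the function names are all hypothetical, and the wildcard and "no platforms means no filtering" rules are assumptions of this sketch rather than a description of the linked filter.

```python
from dataclasses import dataclass

# Hypothetical alias table: treats common spellings of the same CPU
# architecture as equal. A real scheduler may normalize more names.
_ARCH_ALIASES = {"x86_64": "amd64", "aarch64": "arm64"}


@dataclass(frozen=True)
class Platform:
    os: str = ""
    architecture: str = ""


def normalize_arch(arch: str) -> str:
    """Lower-case the architecture and map known aliases to one name."""
    arch = arch.lower()
    return _ARCH_ALIASES.get(arch, arch)


def platform_matches(wanted: Platform, node: Platform) -> bool:
    """Check one image platform against a node.

    An empty field in `wanted` acts as a wildcard (an assumption of
    this sketch).
    """
    arch_ok = wanted.architecture == "" or (
        normalize_arch(wanted.architecture) == normalize_arch(node.architecture)
    )
    os_ok = wanted.os == "" or wanted.os.lower() == node.os.lower()
    return arch_ok and os_ok


def eligible_nodes(image_platforms, nodes):
    """Return the sorted names of nodes that match at least one platform.

    With no platform information at all, no filtering happens and every
    node is considered eligible (again, an assumption of this sketch).
    """
    if not image_platforms:
        return sorted(nodes)
    return sorted(
        name
        for name, node_platform in nodes.items()
        if any(platform_matches(p, node_platform) for p in image_platforms)
    )
```

For example, given a `nodes` dict mapping node names to `Platform` values, `eligible_nodes([Platform("windows", "amd64")], nodes)` keeps only the Windows amd64 nodes.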

So I'm not sure what the problem/goal here is. Could you please clarify?

Thanks!

Nov 06 '18 01:11 wk8

@wk8 The problem today is if Swarm can't get the manifest from the registry (e.g. if it's a private image or if we don't have Internet connectivity). In that case, Swarmkit won't be able to tell what platforms the image supports, and therefore it won't be able to filter the nodes.

One way of solving this would be to add support for loading image manifests from an offline bundle (i.e. a .tar file); that isn't possible today. See https://github.com/docker/docker-core-backlog/issues/24

Nov 06 '18 17:11 wsong

@wsong : thanks for the quick reply! :)

Just to make sure I understand correctly: currently, the client queries the registry for an image's manifest if there's a registry, but for local multi-arch images loaded with docker load, it doesn't inspect them to extract the supported platforms (the way docker run does to decide which bundled image to actually run); and that's what we'd want to change, correct?

Nov 06 '18 20:11 wk8

My understanding is that docker save doesn't actually save the manifest data for an image at all; it just saves the image contents. Thus, you can use docker load to load the image data for your platform, but you won't know what other platforms the image is available for.

Nov 06 '18 20:11 wsong

@wsong this may be a stupid question, but why don't you just focus on getting the docker manifest command marked as non-experimental, so users can store manifests in local registries?

Nov 06 '18 20:11 olljanat

@olljanat That's basically what we're talking about here. What we need is a user-friendly way of shipping manifests to users without Internet connectivity.

Nov 06 '18 20:11 wsong

@wsong : I tried this morning, and it seems that you can docker save/load a manifest; it works just fine. If that's indeed the case, and we add support within swarm to be able to use a local multi-arch image, then we're good, right?

Nov 06 '18 21:11 wk8

(@wsong : the small caveat is that it seems that you do need to push the images in the manifest, and the manifest itself, to a registry before you can save it - once you've saved it you don't need access to the registry on whichever node you then load it on. Not sure why, but I see no reason why this couldn't be changed too)

Nov 06 '18 21:11 wk8

Ugh please ignore my previous comment https://github.com/docker/swarmkit/issues/2770#issuecomment-436415198. Manifests are indeed not saved at all. So what we need is a way to save manifests, I guess, i.e. bundle a manifest and its images in a single archive. Would we want a docker manifest save command? Where load then only imports the images that can run on the host?

Nov 06 '18 22:11 wk8

Actually, what we want is to just save the manifest (i.e. the metadata about which platforms a multi-arch image supports). For our purposes, we want to run a service on Windows nodes that are Windows Server 1709 or newer, but we want to create the service on a Linux node. That means that the Swarmkit manager (which will run on Linux) needs to have the manifest image metadata for the Windows images, but it doesn't need to load the actual image layers.
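As a rough illustration of "just the manifest metadata", here is a minimal Python sketch that reads the supported platforms out of a manifest list document without touching any image layers. It assumes the Docker manifest list (schema 2) JSON layout, where each entry carries a `platform` object with `os`, `architecture` and an optional `os.version`. The function name, the example document, and the placeholder digests are made up for illustration; the Windows version string is the one quoted elsewhere in this thread.

```python
import json


def platforms_from_manifest_list(raw: str):
    """Return (os, architecture, os_version) tuples from a manifest list.

    Only metadata is read; no image layers are needed. Missing fields
    come back as empty strings.
    """
    doc = json.loads(raw)
    platforms = []
    for entry in doc.get("manifests", []):
        plat = entry.get("platform", {})
        platforms.append(
            (
                plat.get("os", ""),
                plat.get("architecture", ""),
                plat.get("os.version", ""),
            )
        )
    return platforms


# Illustrative document only: the digests are placeholders, not real images.
EXAMPLE_MANIFEST_LIST = """
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
  "manifests": [
    {
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "digest": "sha256:placeholder-linux",
      "platform": {"architecture": "amd64", "os": "linux"}
    },
    {
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "digest": "sha256:placeholder-windows",
      "platform": {
        "architecture": "amd64",
        "os": "windows",
        "os.version": "10.0.17134.345"
      }
    }
  ]
}
"""

if __name__ == "__main__":
    for os_name, arch, os_version in platforms_from_manifest_list(EXAMPLE_MANIFEST_LIST):
        print(os_name, arch, os_version or "(any version)")
```

A manager that only had this small JSON document could, in principle, decide which nodes are eligible without ever loading the Windows layers.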

Nov 06 '18 22:11 wsong

@wsong have you seen this one? https://medium.com/@mauridb/docker-multi-architecture-images-365a44c26be6

As far as I understand it, you should be able to store a manifest in a private registry using these commands:

export DOCKER_CLI_EXPERIMENTAL=enabled
docker manifest push registry-host:5000/org/image:tag

Nov 06 '18 23:11 olljanat

That's fine, but we still need an offline solution for air-gapped clusters.

Nov 06 '18 23:11 wsong

@wsong sorry, but now I don't understand. You need some method to distribute images and manifests to swarm nodes anyway, so why not just run a private registry inside that air-gapped cluster?

Nov 06 '18 23:11 olljanat

Not all of our airgapped users have a private registry.

Nov 06 '18 23:11 wsong

> Not all of our airgapped users have a private registry.

In that case, how do the correct nodes get the image? From some external process? Why doesn't that import process check that only nodes with the correct OS version get the image?

> For our purposes, we want to run a service on Windows nodes that are Windows Server 1709 or newer, but we want to create the service on a Linux node.

Btw, one option we use for similar use cases is to add engine/node labels to nodes and use constraints based on them.

PS. Let me know if I'm bothering you too much. This just sounds like an interesting use case, so I'm trying to understand it.

Nov 07 '18 00:11 olljanat

We just use docker load to get images onto airgapped nodes. But, as discussed above, that only loads image data, not manifests.

Nov 07 '18 16:11 wsong

> (@wsong : the small caveat is that it seems that you do need to push the images in the manifest, and the manifest itself, to a registry before you can save it - once you've saved it you don't need access to the registry on whichever node you then load it on. Not sure why, but I see no reason why this couldn't be changed too)

It looks like the docker manifest create command creates the manifest locally, and docker manifest push is then used to push it to Docker Hub or a private registry.

So adding two new commands, docker manifest save and docker manifest load, would probably be the best option.

@wsong would it work for your use case to use docker manifest load on swarm managers and docker load on worker nodes?

And just for reference: the current manifest commands were implemented in https://github.com/docker/cli/pull/138, and once all known issues with them have been fixed we can mark them non-experimental by merging https://github.com/docker/cli/pull/1355

Nov 08 '18 06:11 olljanat

@olljanat Yeah, that's what I was thinking; some sort of docker manifest save/load command.

Nov 08 '18 16:11 wsong

Some more info that may be useful:

When you create a manifest with docker manifest create, it is stored in the .docker/manifests/ folder under the user's profile.
You can also download manifests without images from Docker Hub or a private registry using this script: https://github.com/TomasTomecek/download-manifest-from-dockerhub

But that is a fully client-side implementation, so to be usable here it would need to be reworked as an engine-side solution.

Another thing I noticed is that swarmkit does not currently care about the OS version even if it is specified in the manifest, only the CPU architecture and OS: https://github.com/docker/swarmkit/blob/master/api/types.proto#L82-L88 In process isolation mode, Windows only supports running images that have the same OS version as the node, so if that information exists in the manifest it should be used.

I also tested saving a Windows image to a tar and importing it back with docker load, and it looks like a manifest with the needed information is included in it:

"Architecture": "amd64",
"Os": "windows",
"OsVersion": "10.0.17134.345",

Another option is to add support for a node.platform.os_version constraint, so users can combine it with node.platform.os to make sure that services are scheduled only on valid hosts. The nice thing about that option is that it would work in airgapped environments without needing to import images/manifests to swarm managers.
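To show how such a constraint could behave, here is a small Python sketch of `==` / `!=` constraint matching against node attributes. It is an illustration only, not swarmkit's constraint code: `node.platform.os_version` is the key proposed above (not one that exists at the time of this comment), exact string comparison of the version is an assumption, and the helper names and node attribute dicts are hypothetical.

```python
import re

# key, operator (== or !=), value; surrounding whitespace is tolerated.
_CONSTRAINT = re.compile(r"^\s*([\w.]+)\s*(==|!=)\s*(\S+)\s*$")


def parse_constraint(expr: str):
    """Split 'key==value' or 'key!=value' into (key, operator, value)."""
    match = _CONSTRAINT.match(expr)
    if match is None:
        raise ValueError(f"invalid constraint: {expr!r}")
    return match.group(1), match.group(2), match.group(3)


def node_satisfies(node_attrs: dict, constraints) -> bool:
    """Return True only if the node meets every constraint.

    A missing attribute is treated as an empty string, so an '=='
    constraint on it fails and a '!=' constraint on it passes.
    Values are compared case-insensitively as plain strings.
    """
    for expr in constraints:
        key, op, value = parse_constraint(expr)
        actual = node_attrs.get(key, "")
        matched = actual.lower() == value.lower()
        if (op == "==") != matched:
            return False
    return True


if __name__ == "__main__":
    windows_node = {
        "node.platform.os": "windows",
        "node.platform.os_version": "10.0.17134.345",
    }
    linux_node = {"node.platform.os": "linux"}
    wanted = [
        "node.platform.os==windows",
        "node.platform.os_version==10.0.17134.345",
    ]
    print(node_satisfies(windows_node, wanted))
    print(node_satisfies(linux_node, wanted))
```

Because the check only needs attributes the nodes report about themselves, nothing has to be imported onto the managers, which is the airgapped advantage mentioned above.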

Nov 08 '18 20:11 olljanat

After more discussion, going to add an OS version constraint.

The manifest approach, while fitting the UCP use case, was also deemed clunky and half-baked as a generic feature. In the offline (i.e., no registry) case, it would have required all the nodes in the cluster to locally have the same version of all the manifests, and all the workers to locally have all the images relevant to their arch/OS listed in those manifests, while providing no mechanism to handle this synchronisation or the deployment of new manifest/image versions.

On the other hand, adding OS version constraints is easy for users to understand, and also perfectly addresses the UCP use case.

Nov 09 '18 00:11 wk8

@wk8 Was this ever merged in Moby? It looks like this is a dependency for Kube on Windows.

Apr 25 '19 23:04 david-yu

Looks like it is still pending: https://github.com/moby/moby/pull/38349

Apr 26 '19 00:04 thaJeztah

Will do next week, thanks :)

Apr 29 '19 22:04 wk8

@wk8 Should we close this issue since it looks like the work got merged?

Aug 01 '19 21:08 david-yu

@david-yu : the needed part in moby/moby got merged, but I still need to do the swarmkit part, sorry. Will do shortly.

Aug 05 '19 10:08 wk8
