Proposal: Device Support

Open dperny opened this issue 6 years ago • 56 comments

This is a rough overview of a proposed design for device support in Swarm. This is a possible implementation of #1244. The objective is to implement, in a way sensible to the cluster, support for devices. Please note that this is not yet on the road map; this is an early-stage proposal.

For community members, even if you don't or haven't contributed directly to swarmkit:

Does this meet or exceed the community's needs for device support? Is the UI flexible, ergonomic, and easy to use? Feel free to leave a comment explaining what is good and bad about this proposal.

Overview

Devices will be added as a first-class feature of swarm. The user will be able to define device classes, to which devices belong. The user will be able to register devices on specific nodes, indicating which class the device belongs to and the path at which the device is located. The user can then specify the device classes that a task needs in order to execute, and the swarmkit scheduler will assign a device to the task and place the task on the node with that device.

Goals

The goal of this proposal is to add the most basic device-aware scheduling system to swarmkit, so that it fully supports devices in a clustered environment.

Non-Goals

Non-goals of this proposal are to support things like security profiles or permissions. Additionally, though the device management workflow presented in this PR is a bit onerous and requires manual registration of devices, implementing automatic device detection and registration is out of scope.

Detailed Design

Data Model

The basic data model of devices is as follows:

  1. Device Classes represent a set of interchangeable and equivalent devices equally suited for scheduling. All devices belong to exactly one device class.
  2. Individual devices will be registered belonging to a class on a per-node basis. Once registered, a task may be assigned to use them.
  3. Task Specs will be updated to include the desired device classes and attachment options.

Devices are host-specific resources, but different devices on the same or different hosts may possibly be treated as interchangeable or equivalent. For example, many nodes in the cluster could possibly be attached to some GPU. Though the actual GPU on different nodes may be different, and there may even be more than one GPU per node, their functionality is equivalent, and any of these nodes is an equally suitable candidate for scheduling. Further, some devices should only be used by one task in the cluster, whereas others can be shared between as many tasks as needed.

Device classes are the object that represents the top-level concept of a device. Tasks can only specify devices in terms of device classes they desire. The specific device chosen is the prerogative of the swarmkit scheduler.

The individual devices available are a property of the node. A node may have as many devices specified as necessary. In keeping with the security pattern of not trusting workers, devices are always registered through the swarmkit manager, never self-reported or self-discovered.

Task Specs will include a list of device classes and options desired, including where in the task’s file system to place the device. Tasks must be prepared to accept any device in the class as equivalent. When a task is created, it will have the full run-time device parameter included in the object.
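As a rough illustration of the data model above, here is a minimal Python sketch (not swarmkit code). The class and field names are assumptions chosen for readability; the actual proposed types are the Protocol Buffers later in this document.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DeviceClass:
    """A set of interchangeable devices; `shared` mirrors the --shared flag."""
    name: str
    shared: bool = False


@dataclass(frozen=True)
class Device:
    """One device registered on a node, belonging to exactly one class."""
    device_class: str
    path_on_host: str


@dataclass
class Node:
    """A node and the devices the manager has registered for it."""
    name: str
    devices: list[Device] = field(default_factory=list)


@dataclass(frozen=True)
class DeviceRequest:
    """What a task spec asks for: a class, plus where to place the device."""
    device_class: str
    path_in_task: str


gpu = DeviceClass("gpu")
node = Node("worker-1", [Device("gpu", "/dev/nvidia0"), Device("gpu", "/dev/nvidia1")])
request = DeviceRequest("gpu", "/dev/nvidia0")

# Any device of the requested class on the node is an acceptable candidate;
# the task never names a specific host device.
candidates = [d for d in node.devices if d.device_class == request.device_class]
```

The task only ever refers to the class name, so either GPU on `worker-1` could be chosen.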

User Interface

Adding devices will introduce a new command and subcommands for the management of devices. The first command, and the biggest change, will be to add new subcommands to manage device classes:

Usage: docker swarm device COMMAND

Manage Swarm devices

Commands:
  add      Add a new device class to the swarm
  ls       List device classes on this swarm
  inspect  Show information about a device class and its devices
  rm       Remove a device class from the swarm

The add command adds a new device class to the swarm:

Usage: docker swarm device add [OPTIONS] CLASS

Add a new device class to the swarm

Options:
     --shared      Allow this class to be shared between tasks
     --label list  Set metadata on this device class

The ls command will allow listing all available device classes:

Usage: docker swarm device ls [OPTIONS]

List device classes on this swarm

Options:
  -q, --quiet   Only display IDs
  -f, --filter  Filter output based on conditions provided

The inspect command will allow showing full information about device classes, as well as allowing the user to include all devices currently registered belonging to a device class.

Usage: docker swarm device inspect [OPTIONS] CLASS [CLASS]

Display detailed information on one or more device classes

Options:
  -f, --format string   Format the output
      --pretty          Print the information human friendly
      --devices         Include devices belonging to this class

The rm command is similar to all other rm commands, and its usage is obvious, with the caveat that removal of a device class will be disallowed if a device in that class is in use by a task. There is no update command, as device classes will not be treated as updateable.

To manage particular devices on nodes, the existing node update command will receive new flags:

--device-add device  Register a device on a node with the swarm
--device-rm device   Deregister a device with the swarm

Similar to other options like ports and volumes, devices will accept both short- and long-form versions.

The short form will take the format target:class, where target is the path of the device on the host, and class is the device class to register it with. For example:

--device-add /dev/nvidia0:gpu

The long form of the command allows specifying these options independently, and allows future expansion of options for devices (such as host-specific cgroup options):

--device-add target="/dev/nvidia0",class="gpu"

The device rm option for node update acts as expected, but will disallow removing a device that is in use.
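To make the two forms of the node flag concrete, here is a hedged Python sketch of how a value might be parsed. The function name `parse_node_device` and the error messages are hypothetical; only the `target:class` and `target=...,class=...` shapes come from the proposal.

```python
def parse_node_device(value: str) -> dict[str, str]:
    """Parse a --device-add value into a dict with 'target' and 'class' keys."""
    if "=" in value:
        # Long form: comma-separated key=value pairs, values optionally quoted.
        # Unknown keys are kept, leaving room for future options.
        fields = {}
        for pair in value.split(","):
            key, _, val = pair.partition("=")
            fields[key.strip()] = val.strip().strip('"')
    else:
        # Short form: target:class, split on the last colon.
        target, sep, device_class = value.rpartition(":")
        if not sep:
            raise ValueError(f"expected target:class, got {value!r}")
        fields = {"target": target, "class": device_class}
    for key in ("target", "class"):
        if not fields.get(key):
            raise ValueError(f"missing {key!r} in {value!r}")
    return fields


short = parse_node_device("/dev/nvidia0:gpu")
long_form = parse_node_device('target="/dev/nvidia0",class="gpu"')
```

Both calls produce the same result, which is the point of offering the two forms.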

Services would also support new flags. Service create will have a new option, --device, with both a long form and a short form. The short form will be the reciprocal of the --device-add flag on the node, taking the form class:path. It will also optionally support a third rwm field, mirroring the --device flag on docker run. The long form will take discrete arguments, and allow the user to specify cgroup options as supported in th

The short form, for mounting a GPU:

--device gpu:/dev/nvidia0

Services would also support a long form of the command:

--device class="gpu",path="/dev/nvidia0"

Note: the long form of the command could possibly support further cgroup options, as allowed in the docker REST API for container creation.

Service update would include --device-add and --device-rm flags. --device-add syntax will be equivalent to the --device flag of create. Because a task may have more than one device of a class mounted into its running container, --device-rm would require both the class and path of the device to disambiguate the specific device that is to be removed.

--device-rm class="gpu",path="/dev/nvidia0"
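The following Python sketch illustrates the service-side short form and why --device-rm needs both class and path. The names `ServiceDevice`, `parse_service_device` and `remove_device` are hypothetical, and defaulting the third field to rwm is an assumption borrowed from docker run.

```python
from typing import NamedTuple


class ServiceDevice(NamedTuple):
    device_class: str
    path: str
    permissions: str = "rwm"  # assumed default, as with `docker run --device`


def parse_service_device(value: str) -> ServiceDevice:
    """Parse the short form class:path[:rwm]."""
    parts = value.split(":")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"expected class:path[:rwm], got {value!r}")
    return ServiceDevice(*parts)


def remove_device(devices, device_class, path):
    """Drop the entry matching both class and path, as --device-rm would."""
    kept = [d for d in devices if (d.device_class, d.path) != (device_class, path)]
    if len(kept) == len(devices):
        raise ValueError(f"no device {device_class}:{path} on this service")
    return kept


devices = [
    parse_service_device("gpu:/dev/nvidia0"),
    parse_service_device("gpu:/dev/nvidia1:r"),
]
remaining = remove_device(devices, "gpu", "/dev/nvidia0")
```

With two devices of class gpu on the same service, the class alone would be ambiguous; the path in the task picks out the one to remove.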

REST API

The Docker engine REST API would require a new set of endpoints to accommodate the concept of device classes. These endpoints would return the JSON representation of the objects described in the example Protocol Buffers. These endpoints would be as follows:

GET    /devices             List device classes
POST   /devices/create      Create a new device class
GET    /devices/{id}        Inspect a device class
POST   /devices/{id}/update Update a device class
DELETE /devices/{id}        Delete a device class

Protocol Buffers

In swarm, protocol buffers define the internal API and object structure.

The DeviceClass proto will form a new top-level type, like a Network or a Service. It will have an ID and a name.

// DeviceClass is a specification for a particular device, zero or more of
// which may be available on the cluster. It refers to the general class of
// devices that the user wishes to be assumed as interchangeably usable. For
// example, a cluster may have many possible block devices on many nodes, but
// any of them are valid. The specific implementation of a specific device on a
// node is provided by the node. A particular device may only belong to one
// device class.
message DeviceClass {
  string id = 1;

  Meta meta = 2 [(gogoproto.nullable) = false];

  // Shared represents whether this device can be shared between many tasks, or
  // whether it should be uniquely mapped to a particular task. Shared devices
  // may have any number of tasks assigned to them.
  //
  // Note that Shared has strong security risks; shared devices may be used by
  // tasks to communicate with one another.
  bool shared = 3;
}

The Device proto is included as a repeated field on Node specs. It defines a particular available device belonging to a class.

// Device represents a particular available device on a node. It is one
// particular instance of a DeviceClass, and is interchangeable with other
// devices in the DeviceClass
message Device {
  // DeviceClassID is the ID of the device class that this device belongs to.
  string device_class_id = 1;

  // PathOnHost is the path in the host's filesystem that this device should be
  // mounted from.  For example, a block device may have this value as
  // "/dev/sda". A particular device may belong to only 1 device class;
  // assigning a device to more than one class may cause it to be conflictingly
  // scheduled.
  string path_on_host = 2;
}

The DeviceAttachmentSpec is a repeated field found in the TaskSpec proto, and defines the devices that a task should be attached to.

// DeviceAttachmentSpec represents the spec for a device attachment
message DeviceAttachmentSpec {
  // DeviceClass is the ID or name of the device class that is to be used for
  // this spec. The actual device may be any device of this class on any node.
  string device_class = 1;

  // Path represents the path in the task's filesystem that this device should
  // be mounted at.
  string path = 2;
 
  // DeviceCgroupRules represents the cgroup rules that should be applied to
  // this device.
  repeated string device_cgroup_rules = 16;
}

The DeviceAttachment is a repeated field on Tasks which defines specifically the run-time parameters of a device attachment for a particular task.

// DeviceAttachment represents the run-time configuration of a device in use.
// It includes both the path on the host and the path in the Task of the
// device, because a Task may have many devices of the same class reserved, and
// those reservations would be otherwise indistinguishable.
message DeviceAttachment {
  // DeviceClassID is the ID of the device class used for this device
  string device_class_id = 1;

  // PathOnHost is the path on the host's filesystem of the device to be used
  // by the task.
  string path_on_host = 2;

  // PathInTask is the path in the task's filesystem that the device will be
  // mounted at.
  string path_in_task = 3;

  // DeviceCgroupRules represents the Cgroup rules that should be applied to
  // this device.
  repeated string device_cgroup_rules = 16;
}
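The relationship between the three messages can be sketched in Python as follows. This is an illustration only: the dataclasses mirror the proto fields above, while the `resolve` function is hypothetical and assumes the spec's device class has already been resolved to an ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    device_class_id: str
    path_on_host: str


@dataclass(frozen=True)
class DeviceAttachmentSpec:
    device_class: str
    path: str
    device_cgroup_rules: tuple[str, ...] = ()


@dataclass(frozen=True)
class DeviceAttachment:
    device_class_id: str
    path_on_host: str
    path_in_task: str
    device_cgroup_rules: tuple[str, ...] = ()


def resolve(spec: DeviceAttachmentSpec, device: Device) -> DeviceAttachment:
    """Combine what the task asked for with the device the scheduler picked."""
    if device.device_class_id != spec.device_class:
        raise ValueError("device does not belong to the requested class")
    return DeviceAttachment(
        device_class_id=device.device_class_id,
        path_on_host=device.path_on_host,   # comes from the node's Device
        path_in_task=spec.path,             # comes from the task's spec
        device_cgroup_rules=spec.device_cgroup_rules,
    )


attachment = resolve(
    DeviceAttachmentSpec("gpu-class-id", "/dev/gpu0"),
    Device("gpu-class-id", "/dev/nvidia1"),
)
```

The spec says only where the device should appear in the task; the host path is filled in once a concrete device is chosen.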

Swarmkit Implementation

The device allocator will be implemented as a sub-component of the Scheduler. It will be created when a scheduler is created, and keep track of the available devices in the cluster. Scheduling for available devices forms part of the constraint-solving portion of the scheduler.
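A minimal Python sketch of what such an allocator might track is shown below. It is not the swarmkit implementation; the class `DeviceAllocator` and its methods are hypothetical, and it assumes a task asking for a class several times wants that many distinct devices.

```python
from collections import Counter


class DeviceAllocator:
    """Tracks which registered devices are free on each node."""

    def __init__(self, shared_classes=()):
        self.shared = set(shared_classes)  # classes created with --shared
        self.devices = {}                  # node -> [(device_class, path)]
        self.in_use = set()                # (node, path) pairs held by a task

    def register(self, node, device_class, path):
        self.devices.setdefault(node, []).append((device_class, path))

    def free(self, node, device_class):
        """Paths on `node` in `device_class` that a new task could be given."""
        return [
            path
            for cls, path in self.devices.get(node, [])
            if cls == device_class
            and (cls in self.shared or (node, path) not in self.in_use)
        ]

    def fits(self, node, wanted):
        """Constraint check: can `node` satisfy every requested class?"""
        return all(
            len(self.free(node, cls)) >= count
            for cls, count in Counter(wanted).items()
        )

    def assign(self, node, wanted):
        """Pick one distinct device per request; reserve the exclusive ones."""
        if not self.fits(node, wanted):
            raise RuntimeError(f"{node} cannot satisfy {wanted}")
        chosen = []
        for cls in wanted:
            path = next(p for p in self.free(node, cls) if p not in chosen)
            if cls not in self.shared:
                self.in_use.add((node, path))
            chosen.append(path)
        return chosen


alloc = DeviceAllocator()
alloc.register("node-a", "gpu", "/dev/nvidia0")
alloc.register("node-b", "gpu", "/dev/nvidia0")
alloc.register("node-b", "gpu", "/dev/nvidia1")

# Only nodes that pass the device constraint remain scheduling candidates.
eligible = [n for n in ("node-a", "node-b") if alloc.fits(n, ["gpu", "gpu"])]
```

Here `fits` plays the role of one more scheduling constraint, and `assign` is the step that would populate the task's device attachments.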

Task updates present a difficulty for devices. If devices in the class can be shared between tasks (marked --shared), then there is no problem. However, the start-first update strategy would fail if there were not at least one device in a class available, such that the new task could start with a fresh device, allowing the old task to shut down and free its in-use device. There is no easy solution for this, I think. We should instead document thoroughly that using start-first with devices may cause trouble.
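The start-first problem can be shown with a small Python simulation. This is a toy model under stated assumptions: one task holding one exclusive device, with `replace_task` a hypothetical helper rather than anything in swarmkit.

```python
def replace_task(pool: set[str], held: str, order: str) -> str:
    """Replace a task holding exclusive device `held`; return the new task's device.

    `pool` is the set of devices of the same class that are currently free.
    """
    free = set(pool)
    if order == "stop-first":
        # The old task exits first, so its device is free for the new task.
        free.add(held)
    elif order != "start-first":
        raise ValueError(f"unknown update order: {order!r}")
    if not free:
        # start-first with no spare device: the old task still holds the only one.
        raise RuntimeError("no free device: new task cannot start")
    return sorted(free)[0]


# With a single device in the class, stop-first succeeds by reusing it.
reused = replace_task(set(), "/dev/ttyUSB0", "stop-first")

# start-first only works when a spare device exists.
spare = replace_task({"/dev/ttyUSB1"}, "/dev/ttyUSB0", "start-first")
```

Calling `replace_task(set(), "/dev/ttyUSB0", "start-first")` raises, which is the failure mode that would need to be documented.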

Error Handling

Because of the nature of distrusting the workers, it is difficult or impossible for swarm to “prove” that a given device exists on a node, or performs as the user expects. Swarm will therefore make no attempts to verify the correctness of provided user data. If a device is mistakenly assigned to the wrong class, or if it does not exist at all, the task is expected to fail to start. It should enter a terminal state of FAILED and should include an error message explaining that the errant device is at fault.
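As a hedged illustration of the failure path, the sketch below shows a check that an executor could run on the worker just before starting a task. The function `check_devices` and the `TaskStatus` type are hypothetical; the proposal only specifies that the task ends in FAILED with a message naming the device.

```python
import os
import stat
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskStatus:
    state: str
    err: str = ""


def check_devices(paths_on_host) -> Optional[TaskStatus]:
    """Return a FAILED status naming the first bad device, or None if all look fine."""
    for path in paths_on_host:
        try:
            mode = os.stat(path).st_mode
        except OSError as exc:
            return TaskStatus("FAILED", f"device {path}: {exc.strerror}")
        # A registered path that is not a character or block device is
        # treated as a misregistration.
        if not (stat.S_ISCHR(mode) or stat.S_ISBLK(mode)):
            return TaskStatus("FAILED", f"device {path}: not a device node")
    return None


status = check_devices(["/nonexistent/device"])
```

The manager never sees this check; it only observes the resulting task state and error message, consistent with not trusting worker-reported device data for scheduling.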

Notably, in this proposal, there will be no attempt to “downweight” or otherwise attempt to avoid a node with a failing device. This functionality may come later, but not as part of this proposal.

Security

It must be understood that once on the host, swarmkit has no control over how a task uses devices. If improperly used, devices can be an extreme security hole for swarm tasks. For example, mounting block devices may allow read or write access to all of their contents. If the host’s primary block device were mounted into a task, that task could have full access to the host filesystem.

About Generic Resources

Swarmkit currently includes a feature called “Generic Resources”, which serves to allow scheduling based on kinds of resources. The design doc for Generic Resources [2] outlines their use, which overlaps with the use case of this proposal. Specifically, Generic Resource already keeps track of resources which are available and in use on a cluster.

However, GenericResource has a notable deficiency: it lacks context about the runtime usage of a particular reserved resource. Essentially, a task is only informed of a resource at runtime, and the swarmkit worker has no way to know how to make use of a particular resource, which makes the feature quite useless.

The obvious solution would be to include in the TaskSpec instructions for how to make use of a resource. However, this puts the information about how to use a resource separate from the information about what resource is required. A TaskSpec might, for example, request in its ResourceReservations 3 GPUs, but in its ContainerSpec in a hypothetical Devices field, only use 2 of them, leaving 1 wasted. Or, alternatively, a TaskSpec might include instructions for mounting an audio device, but not include a reservation for one. This means that run time checks would be needed to make sure that the requested resources match the runtime instructions for using resources. Instead, this proposal uses the type system to make this kind of mismatch impossible to express.
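To show the kind of run-time check the split design would force, here is a small Python sketch. The function `mismatches` and its input shapes are hypothetical; it simply compares reservations against usage instructions, using the 3-GPU and audio examples from the paragraph above.

```python
from collections import Counter


def mismatches(reserved: dict[str, int], used: list[str]) -> list[str]:
    """Compare reserved resource counts against the resources a task mounts."""
    problems = []
    use_counts = Counter(used)
    for kind in sorted(set(reserved) | set(use_counts)):
        r, u = reserved.get(kind, 0), use_counts[kind]
        if r > u:
            problems.append(f"{kind}: {r - u} reserved but unused")
        elif u > r:
            problems.append(f"{kind}: {u - r} used without reservation")
    return problems


# Reserves 3 GPUs but mounts only 2, and mounts an audio device it never reserved.
problems = mismatches({"gpu": 3}, ["gpu", "gpu", "audio"])
```

When the request and the usage live in one DeviceAttachmentSpec, neither kind of mismatch can be written down, so no such check is needed.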

Additionally, we cannot simply annotate or augment the GenericResource type in the task resource reservations, because the same type is shared between the TaskSpec (requested resources), the Task itself (assigned resources), and the Node (available resources). The same type is used to express which resources are available, which resources are assigned, and which resources are requested. However, these types all serve different purposes. Available resources don’t need to be aware of how they should be used by a task and requested resources can’t be aware of what resource will be assigned. This means that fields on the GenericResource would either mean different things in different places, or there would only be a subset of fields in use on any given object.

[2] https://github.com/docker/swarmkit/blob/de950a7/design/generic_resources.md

dperny avatar Jul 02 '18 22:07 dperny

While I cannot claim to know anything about the implementation or protocols, I can say that this is a desperately needed feature for any sort of IoT development for which current solutions (however clever) are insufficient. +1 due to that.

The user interface that's proposed also seems fairly intuitive. My question is, would this then support docker-compose files?

connormcmk avatar Jul 02 '18 22:07 connormcmk

I don't have a design for compose support, but I imagine it would be straightforward. You would just include devices in a service definition, like you do networks or ports. Something like this (very rough, not part of the proposal):

version: '3'
services:
  iot:
    ports:
     - "5000:5000"
    volumes:
     - .:/datastore
    devices:
    - target: sensor
      path: /dev/sensor

The only open question is whether a compose file should also be able to define device classes and devices per node. That's a better question for the compose team, after we've passed this phase of design.

dperny avatar Jul 02 '18 22:07 dperny

@dperny I like the plan, would be great to see this!

connormcmk avatar Jul 05 '18 18:07 connormcmk

@dperny This would cover our needs for using hardware security modules in containers. I cannot find anything wrong in the proposal.

apollo13 avatar Jul 07 '18 10:07 apollo13

I'm... kind of a doofus? And totally forgot that swarm supports Generic Resource constraints, design doc here: https://github.com/docker/swarmkit/blob/master/design/generic_resources.md

This work, which everyone seems to have forgotten even happened, handles the difficulty of managing which resources are in use on which nodes and by which tasks, which is the more complicated part of this proposal.

However, there is a big problem with the generic resources: the resource availability is decoupled at the data model from the way the resource is used. Essentially, you can keep track of which and how many resources a node has, but not how to actually make use of those resources. This is an explicit non-goal of the Generic Resource design. Quote,

As swarmkit is not responsible for exposing the resources to the container (or acquiring them), it needs a way to communicate how many generic resources were assigned (in the case of discrete resources) or / and what resources were selected (in the case of sets).

The reference implementation of the executor exposes the resource value to software running in containers through environment variables. The exposed environment variable is prefixed with DOCKER_RESOURCE_ and it's key uppercased.

This implies that tasks should be responsible for requisitioning their own resources at run time. However, this is impossible for devices. A task, from within a container, cannot attach devices after it has started. So the task has an awareness of what resources are available to it, but no actual way to make use of them. This basically explains why nobody uses this feature; the only way to do so would be to create tasks mounting the docker socket that spawn new containers.
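For illustration, this is roughly all a task can do with Generic Resources from inside its container. The Python sketch below reads the DOCKER_RESOURCE_ variables mentioned in the quote; the helper name `assigned_resources` is hypothetical, and treating the value as a comma-separated list is an assumption.

```python
import os


def assigned_resources(environ=os.environ) -> dict[str, list[str]]:
    """Collect generic resources exposed as DOCKER_RESOURCE_<KIND> variables."""
    prefix = "DOCKER_RESOURCE_"
    return {
        name[len(prefix):]: value.split(",")
        for name, value in environ.items()
        if name.startswith(prefix)
    }


# Simulated container environment; the values are made up for the example.
resources = assigned_resources({"DOCKER_RESOURCE_GPU": "uuid1,uuid2", "PATH": "/bin"})
```

The task learns which resources it was assigned, but nothing here gives it a device node to open.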

The executor will have to be aware of how devices are accessed for devices to work. The responsibility for putting those devices into the task will have to live entirely within the agent.

I'll need to rewrite this proposal to accommodate this existing GenericResource feature, so we don't have two overlapping features with different but similar purposes.

dperny avatar Jul 09 '18 21:07 dperny

I'm poking at how to leverage the existing GenericResource code, and it's honestly not that sensible. The use case is too different. The amount of mogrification to the GenericResource concept that one would have to do is untenable.

Honestly... GenericResource isn't a super sensible implementation anyway. It totally decouples a task's resource demands from the actual use of resources, which is a serious problem. If a Task reserves a resource, but does not have any way to use it, the resource is wasted. However, if a Task specifies how to use a resource, but no such reservation was made, then the Task will fail in strange ways.

I think, despite the slight duplication of efforts, the use case for actually using devices is sufficiently different to warrant a separate design.

dperny avatar Jul 09 '18 22:07 dperny

Updated the design document to include section on GenericResource

dperny avatar Jul 09 '18 23:07 dperny

@dperny I would love to see this implemented! This would allow us to properly use hardware security modules (HSM), which are required by our application, in swarm mode.

mbonato avatar Jul 12 '18 06:07 mbonato

@dperny Any update on progress for those of us who are eagerly waiting?

connormcmk avatar Aug 22 '18 18:08 connormcmk

Yes, I'm gonna do it, I just keep getting pulled away on other things internally. But it's gonna happen. Soon™.

dperny avatar Aug 22 '18 19:08 dperny

I Swoon for Soon™

connormcmk avatar Aug 22 '18 22:08 connormcmk

@dperny Any updates or timeline? Thanks!

connormcmk avatar Sep 20 '18 16:09 connormcmk

Is there any progress on this issue?

swift1911 avatar Nov 17 '18 11:11 swift1911

Seems not ^^

flopon avatar Feb 13 '19 16:02 flopon

i had a bunch of free time for a little while, and then it rapidly became not a bunch of free time, and now i'm doing other things. i'm really sorry, i started promising this a year ago and i feel The Guilt over not delivering on it.

dperny avatar Feb 13 '19 17:02 dperny

@dperny Please do not feel any guilt. No matter how snarky the comments from people like @flopon are (and I am sure he didn't mean to put any pressure on you), without throwing loads of money towards you there is no right to expect any progress.

Please do not ever feel bad for not delivering on a ticket on an (mostly) OSS project. Your work is highly appreciated and please do not let any comments get your motivation down!

apollo13 avatar Feb 13 '19 18:02 apollo13

i mean, i am having loads of money thrown at me, it's just being thrown at me to work on other features.

dperny avatar Feb 13 '19 19:02 dperny

No problem @dperny.

I would really like this feature to be developed, which in my opinion would make swarm the orchestrator of choice for IoT, but I'm not going to put the pressure on you. If it is not possible we will do otherwise, as always.

But I admit that I'm a little worried about the lack of attention on swarm: as I tend, for example, to validate the choice not to implement the "depends_on" function in swarm, here I have more the impression that swarm is no longer a priority project.

I hope that I am wrong, and that we will have news of this feature in a while!

By then, good luck for your current projects!

flopon avatar Feb 14 '19 13:02 flopon

@flopon I moved my stuff to the Balena platform and it has been awesome for me. It’s still all the docker goodness but with proper support for devices, multicontainer orchestration, and it even makes flashing the device easy. It’s worth checking out.

connormcmk avatar Feb 16 '19 17:02 connormcmk

Do you have a device support solution (for new devices like GPUs, FPGAs, etc.) for a single machine, without a cluster? That is the simple scenario for device support.

chenglin-li avatar Feb 20 '19 08:02 chenglin-li

I would also like to see this feature. I would like one of my containers in my service stack to have access to i2c. Still researching whether I can leverage k8s to achieve similar results. At most I have found https://github.com/kubernetes-sigs/cri-o/pull/1882 which was in V1.13.0 of cri-o which seems to support "additional devices", but I'm not familiar with cri-o or how it interfaces with k8s.

slic77 avatar Mar 15 '19 19:03 slic77

Since #1030 will soon be available does it make sense to tackle this next? Having services that may need special access to devices makes scheduling a lot easier

prologic avatar Jun 13 '19 04:06 prologic

I know too little about swarm mode for now to know if this has been covered in this thread.

Let's say a node has 10 serial ports (/dev/ttyO{0..9}).

Would this proposal let me spawn 10 containers that would each grab 1 device from this device pool and start working?

fgervais avatar Sep 27 '19 20:09 fgervais

Do we have any Docker-EE customers that require this? If so, can you raise it through your special channels?

Unfortunately, my only use case is Plex using the /dev/dri for hardware encoding.

jnovack avatar Oct 27 '19 12:10 jnovack

Not an EE customer, but I'd also like to see this. I'd like to run my home automation platform in swarm mode, but currently it can't get at the zwave dongle.

smokes2345 avatar Nov 08 '19 00:11 smokes2345

I would definitely want this feature. Currently this blocks docker stack deployment for most use cases that require device mapping :(

harrier-lcc avatar Nov 17 '19 03:11 harrier-lcc

To all:

Balena is a great solution and is open source.

connormcmk avatar Nov 18 '19 00:11 connormcmk

Why are we talking about Balena? Balena doesn't even have swarm support.

unixfox avatar Nov 18 '19 06:11 unixfox

@dperny I'm afraid to ask, but any updates?

deepio avatar Mar 17 '20 13:03 deepio

it's not on the roadmap but i haven't forgotten about it. volume support is at the forefront, and if that proceeds much faster than expected, then i'm planning to lobby to do this next

that said, you could build it, if you wanted to!

dperny avatar Mar 17 '20 13:03 dperny