
Allow to assign a task to a specific node type via placement hints

Open serut opened this issue 4 years ago • 7 comments

We need the ability to schedule a task that runs only on agents matching some placement hints. I am opening this issue after a nice chat on Gitter.

You can think of it as marking some nodes as "blue" and others as "red". I want to create some tasks that must run on "red" nodes, and others that may run on either "blue" or "red" nodes. This feature would help CNES build software on top of Toil that is completely agnostic of placement hints.

Rationale

We have a single Mesos cluster that aggregates several types of agents, two in fact. Almost all nodes connected to Mesos are pretty basic, but a few of them have a special binary installed that the basic servers cannot have. All these servers are dedicated to the Mesos cluster, but a task must be routed to the right agent, otherwise it won't find the binary. I can't just use a Docker image to provide the right environment for the task, because the server (Mesos agent) is not located in the same "datacenter". We do not want to register two Toil instances with specific roles, because we want to talk to a single Toil that can distribute tasks to the right Mesos resources depending on their placement hints. Currently, we cannot express an affinity between a task submitted through Toil and the agent that will receive it, whereas Kubernetes and Mesos both provide a simple API to do so.

Cluster implementation

What's next?

My company can work on this feature, but we first need to identify:

  • the list of impacts on the code: does it break any optimisation?
  • can we use CWL to store this information?

Any help would be much appreciated!

┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-806

serut avatar Feb 17 '21 11:02 serut

* can we use CWL to store this information?

For your use-case ("some nodes have a special binary, others don't"), I would model that in CWL as an entry with the name of the special binary in a SoftwareRequirement, and then provide that list to Toil.Job when we initialize the CWLJobs.
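
A minimal sketch of the first half of that idea, assuming the CWL tool has already been parsed into plain Python dicts and that `packages` uses the list form. The helper name `required_packages` is hypothetical and is not part of Toil or cwltool; only the `SoftwareRequirement`, `packages` and `package` field names come from CWL itself.

```python
from typing import Any, Dict, List


def required_packages(hints: List[Dict[str, Any]]) -> List[str]:
    """Collect package names from SoftwareRequirement entries.

    Hypothetical helper: `hints` is assumed to be the list form of a CWL
    `hints` (or `requirements`) section, already parsed into dicts.
    """
    names: List[str] = []
    for hint in hints:
        # Ignore every other hint class (ResourceRequirement, ...).
        if hint.get("class") != "SoftwareRequirement":
            continue
        for entry in hint.get("packages", []):
            name = entry.get("package")
            # Keep first-seen order and drop duplicates.
            if name and name not in names:
                names.append(name)
    return names


hints = [
    {"class": "ResourceRequirement", "coresMin": 2},
    {"class": "SoftwareRequirement", "packages": [{"package": "special-binary"}]},
]
print(required_packages(hints))
```

How that list would then reach the job object is left open here, since the thread does not settle it.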

New logic should be written elsewhere in Toil to route jobs based upon these hints. Eventually CWL will have a hint for GPU and other hardware requirements, so this will be very useful!

I would recommend "hacking" the logic in at first, and later we all can decide if there should be a plugin system, or a textual configuration system, or both!

mr-c avatar Feb 18 '21 08:02 mr-c

It would be nice on the Toil side, but I doubt that a software requirement can be expressed in terms of Kubernetes or Mesos assignment...

@ArtRand kindly proposed to make a POC of that design, which will be a very good start to show us the impact of that change.

serut avatar Feb 18 '21 10:02 serut

I doubt that a software requirement can be expressed in terms of Kubernetes or Mesos assignment...

It would be a site-specific mapping of the software requirement name or identifier to a k8s label or the Mesos equivalent, yes?
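
Such a mapping could be sketched roughly as below. Everything here is an assumption for illustration: the `SITE_LABELS` table, the label key `example.org/special-binary` and the function `node_selector` are made up and do not exist in Toil. The returned dict has the shape of a Kubernetes `nodeSelector`; on Mesos the same pairs would have to be compared against agent attributes.

```python
from typing import Dict, Iterable

# Hypothetical site configuration: software name -> node labels that
# guarantee the software is installed. Keys and values are made up.
SITE_LABELS: Dict[str, Dict[str, str]] = {
    "special-binary": {"example.org/special-binary": "true"},
}


def node_selector(packages: Iterable[str],
                  site_labels: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Merge the labels for every required package into one selector.

    Raises KeyError for a package the site has not mapped, so a job is
    never silently scheduled on a node that lacks it.
    """
    selector: Dict[str, str] = {}
    for package in packages:
        selector.update(site_labels[package])
    return selector


print(node_selector(["special-binary"], SITE_LABELS))
```

Keeping the table in site configuration rather than in the workflow is what would let the same CWL document run unchanged on another cluster.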

mr-c avatar Feb 19 '21 10:02 mr-c

I'm not a big fan of "software requirement"; it's more about providing criteria that the orchestrator uses to send the task to the right place. It would be more powerful if we could provide a list of labels that must be respected, as placement hints required by the task. The design won't be as simple on AWS (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html) and Azure, but I don't think this is related to software.
Taking the AWS example, I would create a template for an EC2 m2.large with a specific setup (disk...), and when I launch the task I would provide the ID of the instance template to use to spawn the task.

serut avatar Feb 22 '21 10:02 serut

The idea behind using SoftwareRequirement and the ResourceRequirement is to make the information as portable as possible, so that the workflow is not too specific to a particular setup.

If those don't provide enough information to do the resource match making, then you can make your own hint extension. Here are some examples: https://github.com/common-workflow-language/common-workflow-language/issues/323

@serut Can you join the CWL video chat tomorrow to discuss this further? 16:30 Central European Time @ https://meet.jit.si/cwl

mr-c avatar Feb 22 '21 11:02 mr-c

I may be misunderstanding these labels, but as an outside user, I would expect that:

  • SoftwareRequirement provides a way to force the node to contain a specific executable before the task is run, more like an apt install call before running the task itself
  • ResourceRequirement would specify the node size in terms of CPU, RAM, GPU...

I would add to this:

  • ClusterSpecificRequirement: a list of key=value pairs that ensures the node receiving the task has some specific characteristics.
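
To make the proposed ClusterSpecificRequirement concrete, here is a minimal sketch of the matching rule, assuming a node qualifies only when it carries every requested key with exactly the requested value. `parse_requirements` and `node_matches` are hypothetical names, not existing Toil, Mesos or Kubernetes calls, and the labels are invented.

```python
from typing import Dict, List


def parse_requirements(pairs: List[str]) -> Dict[str, str]:
    """Turn ["color=red", "zone=a"] into {"color": "red", "zone": "a"}."""
    parsed: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        parsed[key] = value.strip()
    return parsed


def node_matches(required: Dict[str, str], node_labels: Dict[str, str]) -> bool:
    """True when the node carries every required key with the same value."""
    return all(node_labels.get(k) == v for k, v in required.items())


required = parse_requirements(["color=red"])
print(node_matches(required, {"color": "red", "rack": "12"}))  # True
print(node_matches(required, {"color": "blue"}))               # False
```

An empty requirement list matches every node, which keeps today's behaviour for tasks that declare nothing.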

All right! See you tomorrow.

serut avatar Feb 22 '21 12:02 serut

@mr-c Has CWL added/will it soon add the sort of escape hatch for runner-specific node label hints that is being asked for here?

adamnovak avatar Jan 24 '25 18:01 adamnovak