
[REP][Core]New scheduling feature: taints and tolerations

Open larrylian opened this issue 1 year ago • 0 comments

We plan to introduce a Taints & Tolerations feature to isolate scarce resources, especially GPU nodes. Preventing ordinary CPU tasks from being scheduled on GPU nodes is a key requirement that many Ray users expect.

Key concepts:

Ray scheduling framework

Taints & Tolerations concepts

  1. If you don't want normal CPU tasks/actors to be scheduled on a GPU node, you can add a taint to the GPU node (node1) using the Ray CLI. For example:
# node1
ray start --taints='{"gpu_node":"true"}' --num-gpus=1
  1. Normal CPU tasks/actors/placement groups will then not be scheduled on the GPU node.

The actor/pg will not be scheduled onto node1:

actor = Actor.options(num_cpus=1).remote()
pg = ray.util.placement_group(bundles=[{"CPU": 1}])
  1. To schedule a GPU task onto the GPU node (node1), specify a toleration for the task.

The actor/pg can be scheduled onto node1:

actor = Actor.options(num_gpus=1, tolerations={"gpu_node": Exists()}).remote()
pg = ray.util.placement_group(bundles=[{"GPU": 1}], tolerations={"gpu_node": Exists()})
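The two examples above imply a matching rule similar to Kubernetes taints and tolerations: a node accepts a task only if every taint on the node is tolerated by that task. The following is a minimal sketch of that rule in plain Python, not Ray's implementation. The proposal only shows `Exists()`; the `Exists` class defined here and the `is_schedulable` helper are hypothetical stand-ins, and the "every taint must be tolerated" rule is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Exists:
    """Hypothetical operator: tolerates a taint key regardless of its value."""


def is_schedulable(taints: Dict[str, str], tolerations: Dict[str, Exists]) -> bool:
    """Assumed rule: every taint key on the node needs a matching toleration."""
    return all(key in tolerations for key in taints)


# node1 was started with the taint {"gpu_node": "true"}.
gpu_node_taints = {"gpu_node": "true"}

# A plain CPU task carries no tolerations, so node1 rejects it.
cpu_task_ok = is_schedulable(gpu_node_taints, {})

# A GPU task that tolerates the "gpu_node" key is accepted.
gpu_task_ok = is_schedulable(gpu_node_taints, {"gpu_node": Exists()})

# An untainted node accepts any task.
untainted_ok = is_schedulable({}, {})
```

Under this assumed rule a toleration only permits placement on a tainted node; it does not by itself steer the task there, which is why the GPU examples also request `num_gpus=1`.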

You can also use taints to achieve node isolation.

  1. If you want to isolate a node under memory pressure so that tasks are not scheduled onto it, you can use ray taint:
ray taint --node-id {node_id_1} --append '{"memory-pressure":"high"}'

New tasks/actors/pgs will then not be scheduled onto node1.

  1. You can restore the node once its memory pressure has dropped to a low level.
ray taint --node-id {node_id_1} --delete '{"memory-pressure":"high"}'

New tasks/actors/pgs can then be scheduled onto node1 again.
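The isolate-and-restore workflow can be modelled as a small sketch in plain Python. It assumes that taints are stored per node, that they can be changed at runtime, and that they only affect new placements. `NodeTaints` and its methods are hypothetical names for illustration and are not part of Ray.

```python
from typing import Dict, Set


class NodeTaints:
    """Hypothetical in-memory model of per-node taints; not a Ray API."""

    def __init__(self) -> None:
        self._taints: Dict[str, Dict[str, str]] = {}

    def append(self, node_id: str, key: str, value: str) -> None:
        # Models `ray taint --node-id <id> --append ...`.
        self._taints.setdefault(node_id, {})[key] = value

    def delete(self, node_id: str, key: str) -> None:
        # Models `ray taint --node-id <id> --delete ...`; a missing key is ignored.
        self._taints.get(node_id, {}).pop(key, None)

    def accepts(self, node_id: str, tolerated_keys: Set[str]) -> bool:
        # Assumed rule: a new task is accepted only if it tolerates every taint key.
        return set(self._taints.get(node_id, {})) <= tolerated_keys


registry = NodeTaints()

# Isolate node1 while it is under memory pressure.
registry.append("node1", "memory-pressure", "high")
accepted_while_tainted = registry.accepts("node1", set())

# Restore node1 once the pressure has dropped.
registry.delete("node1", "memory-pressure")
accepted_after_restore = registry.accepts("node1", set())
```

The sketch only decides where new work may be placed; what happens to tasks already running on a node when it is tainted is not covered by the text above.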

larrylian, Aug 28 '23 09:08