[REP][Core] New scheduling feature: taints and tolerations
We plan to introduce a Taints & Tolerations feature to isolate scarce resources. The main use case is GPU nodes: many Ray users expect to be able to keep ordinary CPU tasks from being scheduled on them.
Key concepts:
- If you don't want ordinary CPU tasks/actors to be scheduled on a GPU node, you can add a taint to the GPU node (node1) using the Ray CLI. For example:
# node1
ray start --taints='{"gpu_node":"true"}' --num-gpus=1
- Ordinary CPU tasks/actors/placement groups will then not be scheduled on the GPU node.
The following actor/pg will not be scheduled onto node1:
actor = Actor.options(num_cpus=1).remote()
pg = ray.util.placement_group(bundles=[{"CPU": 1}])
- To schedule a GPU task onto the GPU node (node1), specify a toleration for the task.
The following actor/pg can be scheduled onto node1:
actor = Actor.options(num_gpus=1, tolerations={"gpu_node": Exists()}).remote()
pg = ray.util.placement_group(bundles=[{"GPU": 1}], tolerations={"gpu_node": Exists()})
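The matching rule implied by the examples above can be sketched in plain Python. This is only an illustration, not Ray's implementation: the `Exists` class, the `tolerates` helper, and the rule that every taint on a node must be covered by a toleration are assumptions modeled on the examples in this proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exists:
    """Hypothetical operator: tolerate a taint key regardless of its value."""


def tolerates(taints: dict, tolerations: dict) -> bool:
    """Return True if every taint on the node is covered by a toleration.

    Assumed rule (not confirmed by the proposal): a toleration covers a taint
    when the keys match and the toleration is either Exists() or a value
    equal to the taint's value.
    """
    for key, value in taints.items():
        if key not in tolerations:
            return False
        toleration = tolerations[key]
        if not isinstance(toleration, Exists) and toleration != value:
            return False
    return True


gpu_node_taints = {"gpu_node": "true"}

print(tolerates(gpu_node_taints, {}))                      # False: plain CPU actor
print(tolerates(gpu_node_taints, {"gpu_node": Exists()}))  # True: GPU actor with toleration
print(tolerates({}, {}))                                   # True: untainted node
```

Under this assumed rule, a node with no taints accepts any task, which keeps existing workloads unaffected unless a taint is added.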
You can also use taints to achieve node isolation.
- If you want to isolate a node that is under memory pressure so that no new tasks are scheduled onto it, you can use ray taint:
ray taint --node-id {node_id_1} --append '{"memory-pressure":"high"}'
New tasks/actors/pgs will then not be scheduled onto node1.
- You can restore the node once its memory pressure has dropped to a low level:
ray taint --node-id {node_id_1} --delete '{"memory-pressure":"high"}'
New tasks/actors/pgs can then be scheduled onto node1 again.
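The append/delete lifecycle above can be modeled with a small in-memory sketch. The `NodeTaints` class and its methods are hypothetical stand-ins for whatever state `ray taint` would change. It also assumes that a task without tolerations can only land on a node that has no taints, and that a delete only removes a taint whose key and value both match.

```python
class NodeTaints:
    """Hypothetical per-node taint store (illustration only, not a Ray API)."""

    def __init__(self):
        self._taints = {}

    def append(self, taints: dict) -> None:
        # Mirrors the proposed `ray taint --append`: add or overwrite taints.
        self._taints.update(taints)

    def delete(self, taints: dict) -> None:
        # Mirrors the proposed `ray taint --delete`: remove only exact key/value matches.
        for key, value in taints.items():
            if self._taints.get(key) == value:
                del self._taints[key]

    def schedulable_without_tolerations(self) -> bool:
        # Assumption: a task with no tolerations needs a taint-free node.
        return not self._taints


node1 = NodeTaints()
node1.append({"memory-pressure": "high"})
print(node1.schedulable_without_tolerations())  # False: node1 is isolated

node1.delete({"memory-pressure": "high"})
print(node1.schedulable_without_tolerations())  # True: node1 is restored
```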