[Core] Enhancing node affinity scheduling feature through node labels
Description
This feature improves node affinity scheduling by allowing the addition of static labels to nodes, which are then used to determine affinity.
Ray Enhancement Proposal (REP): https://github.com/ray-project/enhancements/pull/22
To track progress on this feature, I have divided the work into the items below. The items may change as development proceeds, and suggestions for improvement are welcome.
1. API for Node Affinity Scheduling with Labels
API for setting node labels
- [x] (P1) Finalize the command-line API (ray start) for setting node labels. #35433
- [x] (P1) Finalize the format for setting node labels in the ray.init() interface of the Python worker. #36007
- [ ] (P3) Finalize the format for setting node labels in the ray.init() interface of the Java and C++ workers.
- [ ] (P4) Finalize the ray up API for setting node labels.
- [ ] (P4) Finalize the KubeRay API for setting node labels.
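
As a rough illustration of what parsing and validating label input could involve, here is a minimal sketch. It assumes labels arrive as a JSON object of string keys and string values; that format is an assumption made for the example, and the actual format is whatever #35433 and #36007 define. `parse_node_labels` is a hypothetical helper, not a Ray API.

```python
import json
from typing import Dict


def parse_node_labels(raw: str) -> Dict[str, str]:
    """Parse a JSON object into a label dict (hypothetical helper, not a Ray API)."""
    try:
        labels = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"labels must be valid JSON: {exc}") from exc

    # Assumed format: a flat JSON object, not a list or a scalar.
    if not isinstance(labels, dict):
        raise ValueError("labels must be a JSON object")

    for key, value in labels.items():
        # JSON object keys are always strings, so only emptiness is checked.
        if key == "":
            raise ValueError("label keys must be non-empty")
        # Assumed rule: label values are plain strings.
        if not isinstance(value, str):
            raise ValueError(f"value for label {key!r} must be a string")
    return labels


labels = parse_node_labels('{"gpu_type": "A100", "region": "us-west"}')
```

Rejecting non-string values early keeps later comparisons between label values simple, since every value has the same type.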
API for using node labels
- [ ] (P2) Finalize the new API for node affinity scheduling with node labels in the Python worker.
- [ ] (P3) Finalize the new API for node affinity scheduling with node labels in the Java and C++ workers.
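
To make the discussion of this API more concrete, the sketch below shows one way label constraints could be evaluated against a node's labels. The operator names (`In`, `NotIn`, `Exists`, `DoesNotExist`) and the `matches` function are assumptions for illustration, loosely modeled on Kubernetes-style label selectors; they are not the finalized Ray API.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class In:
    values: Tuple[str, ...]  # label value must be one of these


@dataclass(frozen=True)
class NotIn:
    values: Tuple[str, ...]  # label value must not be one of these


@dataclass(frozen=True)
class Exists:
    pass  # label key must be present, with any value


@dataclass(frozen=True)
class DoesNotExist:
    pass  # label key must be absent


def matches(node_labels: Mapping[str, str], constraints: Mapping[str, object]) -> bool:
    """Return True if the node's labels satisfy every constraint."""
    for key, op in constraints.items():
        value = node_labels.get(key)  # None when the key is missing
        if isinstance(op, In):
            ok = value in op.values
        elif isinstance(op, NotIn):
            # Assumed semantics: a node without the key satisfies NotIn.
            ok = value not in op.values
        elif isinstance(op, Exists):
            ok = key in node_labels
        elif isinstance(op, DoesNotExist):
            ok = key not in node_labels
        else:
            raise TypeError(f"unsupported operator for {key!r}: {op!r}")
        if not ok:
            return False
    return True


node = {"gpu_type": "A100", "region": "us-west"}
wants_a100 = matches(node, {"gpu_type": In(("A100", "H100")), "region": Exists()})
```

Whether a missing key should satisfy `NotIn` is a design choice; the sketch follows the Kubernetes convention, and the final API may decide differently.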
API for getting node labels
- [ ] (P1) Finalize the API for getting node labels in Python.
- [ ] (P3) Finalize the API for getting node labels in the Ray Dashboard.
- [ ] (P4) Finalize the API for getting node labels from the Ray command line (ray status).
2. Internal Implementation
- [x] (P1) Parse the configuration parameters for node labels and save them in the NodeInfo data structure. #35433
- [ ] (P1) Finalize default node labels.
- [x] (P1) Synchronize node label information to the resources of all nodes. #36009
- [ ] (P2) Build an index table from the label information of all nodes to improve scheduling performance.
- [ ] (P2) Implement the node affinity with labels interface in Python and pass it through transparently to the CoreWorker.
- [ ] (P2) Implement the node affinity with labels scheduling policy.
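
The index table mentioned above could take the form of an inverted index from (key, value) pairs to node IDs, so that a lookup intersects a few sets instead of scanning every node. The following is a minimal sketch under that assumption; `LabelIndex` and its methods are hypothetical names, and the real implementation would live in Ray's C++ scheduler rather than in Python.

```python
from collections import defaultdict
from typing import Dict, Mapping, Set, Tuple


class LabelIndex:
    """Inverted index: (label key, label value) -> set of node IDs."""

    def __init__(self) -> None:
        self._nodes_by_label: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        self._labels_by_node: Dict[str, Dict[str, str]] = {}

    def add_node(self, node_id: str, labels: Mapping[str, str]) -> None:
        # Re-adding a node replaces its previous labels.
        self.remove_node(node_id)
        self._labels_by_node[node_id] = dict(labels)
        for item in labels.items():
            self._nodes_by_label[item].add(node_id)

    def remove_node(self, node_id: str) -> None:
        for item in self._labels_by_node.pop(node_id, {}).items():
            nodes = self._nodes_by_label[item]
            nodes.discard(node_id)
            if not nodes:
                del self._nodes_by_label[item]  # drop empty buckets

    def find(self, required: Mapping[str, str]) -> Set[str]:
        """Return IDs of nodes that carry every required key/value pair."""
        if not required:
            return set(self._labels_by_node)
        buckets = [self._nodes_by_label.get(item, set()) for item in required.items()]
        buckets.sort(key=len)  # start from the smallest bucket
        result = set(buckets[0])
        for bucket in buckets[1:]:
            result &= bucket
        return result


index = LabelIndex()
index.add_node("node-1", {"gpu_type": "A100", "region": "us-west"})
index.add_node("node-2", {"gpu_type": "T4", "region": "us-west"})
west_a100 = index.find({"gpu_type": "A100", "region": "us-west"})
```

This sketch only covers exact-match lookups; operators such as set membership or key absence would need additional index structures or a fallback scan.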
3. Tests
- [ ] Implement basic test cases for Python.
- [ ] Add test cases for edge scenarios.
- [ ] Add test cases for various failover/abnormal scenarios.
- [ ] Add test cases for cross-language calls.
4. Adapting Java and C++ workers
- [ ] (P3) Implement the node affinity with labels interface in Java and pass it through transparently to the CoreWorker.
- [ ] (P3) Add test cases for the Java worker implementation.
- [ ] (P4) Implement the node affinity with labels interface in C++ and pass it through transparently to the CoreWorker.
- [ ] (P4) Add test cases for the C++ worker implementation.
5. Adapting Auto Scaling
- [ ] (P4) Add node label information and node affinity with labels scheduling information to the API used for Autoscaler and GCS interactions.
- [ ] (P4) Adapt the simulated scheduling module in the Autoscaler to support node affinity scheduling with labels.
6. Visualization/Observability
- [ ] (P3) Display node label information in the Ray Dashboard.
7. Documentation
- [ ] (P5) Write documentation for using node affinity scheduling with labels.
The P1 item for finalizing the format for setting node labels in the ray.init() interface of the Python worker should be checked off, because https://github.com/ray-project/ray/pull/36007 was merged.