[LFX 2025 Term2] Implement Volcano Scheduler Simulator
What is the problem you're trying to solve
For users of Kubernetes and Volcano schedulers, the scheduling process is often a black box. Understanding how scheduling decisions are made and evaluating the functionality and performance of the scheduler, especially when introducing new scheduling features, can be challenging. Setting up a full-fledged Kubernetes cluster and generating realistic workloads to observe scheduling behavior can be resource-intensive and time-consuming. Users need a lightweight and efficient way to verify the correctness and performance implications of scheduler changes without the overhead of a real cluster. The kube-scheduler community has addressed this need with kube-scheduler-simulator, and the Volcano community would greatly benefit from a similar tool.
Describe the solution you'd like
Input Configuration: The simulator should accept configurations defining a simulated Kubernetes cluster state, including:
- Nodes (with their resources and labels/annotations).
- Pods (with their resource requests, scheduling constraints, labels/annotations, and queue assignments).
- Volcano-specific configurations (e.g., scheduler plugins, queue configurations, policies).
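As a rough illustration of the input described above, the cluster state could be modeled with a few plain data classes. This is only a sketch: every class and field name here (`Node`, `Pod`, `Queue`, `SimulationInput`, `validate`) is hypothetical and does not mirror Volcano's actual API types, and the plugin names are used purely as example strings.

```python
from dataclasses import dataclass, field

# Hypothetical input model for a simulator; not Volcano's real types.

@dataclass
class Node:
    name: str
    cpu_milli: int
    memory_mib: int
    labels: dict = field(default_factory=dict)

@dataclass
class Pod:
    name: str
    queue: str
    cpu_milli: int
    memory_mib: int
    node_selector: dict = field(default_factory=dict)

@dataclass
class Queue:
    name: str
    weight: int = 1

@dataclass
class SimulationInput:
    nodes: list
    pods: list
    queues: list
    plugins: list  # names of enabled scheduler plugins (example strings)

    def validate(self):
        """Return a list of human-readable problems; empty means OK."""
        known = {q.name for q in self.queues}
        problems = []
        for pod in self.pods:
            if pod.queue not in known:
                problems.append(f"pod {pod.name}: unknown queue {pod.queue}")
        return problems

sim_input = SimulationInput(
    nodes=[Node("node-a", 4000, 8192, {"zone": "z1"})],
    pods=[Pod("train-0", "research", 2000, 4096),
          Pod("etl-0", "missing", 500, 512)],
    queues=[Queue("research", weight=2)],
    plugins=["gang", "binpack"],
)
print(sim_input.validate())
```

Validating the input before simulation starts would let a tool report configuration mistakes, such as a pod that references a nonexistent queue, separately from genuine scheduling failures.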
Simulation Engine: The core of the simulator should replicate the key scheduling stages of the Volcano scheduler, such as:
- Queue selection.
- Filtering of eligible nodes for a given pod.
- Scoring of eligible nodes based on configured scoring plugins.
- Binding of pods to the selected nodes.
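The four stages listed above can be sketched as a small loop. This is a toy model under stated assumptions, not Volcano's implementation: only CPU is considered, queues are visited in a fixed order, and the single scoring function is a simple bin-packing heuristic. All function names and dictionary keys are hypothetical.

```python
def filter_nodes(pod, nodes):
    """Split nodes into feasible ones and a {node: reason} map of rejections."""
    feasible, rejected = [], {}
    for node in nodes:
        if node["free_cpu"] < pod["cpu"]:
            rejected[node["name"]] = "insufficient cpu"
        else:
            feasible.append(node)
    return feasible, rejected

def score_least_free(pod, node):
    """Toy bin-packing score: prefer the node left with the least free CPU."""
    return -(node["free_cpu"] - pod["cpu"])

def schedule(pods, nodes, queue_order):
    """Run queue selection, filtering, scoring and binding for every pod."""
    bindings = {}
    # Queue selection: visit queues in the given order, pods in list order.
    for queue in queue_order:
        for pod in (p for p in pods if p["queue"] == queue):
            feasible, _ = filter_nodes(pod, nodes)
            if not feasible:
                bindings[pod["name"]] = None  # unschedulable
                continue
            best = max(feasible, key=lambda n: score_least_free(pod, n))
            best["free_cpu"] -= pod["cpu"]  # binding consumes capacity
            bindings[pod["name"]] = best["name"]
    return bindings

nodes = [{"name": "n1", "free_cpu": 4}, {"name": "n2", "free_cpu": 2}]
pods = [
    {"name": "a", "queue": "q1", "cpu": 2},
    {"name": "b", "queue": "q2", "cpu": 3},
    {"name": "c", "queue": "q1", "cpu": 3},
]
result = schedule(pods, nodes, queue_order=["q1", "q2"])
print(result)
```

In this run, pod `a` packs onto `n2`, pod `c` takes `n1`, and pod `b` from the later queue finds no node with enough CPU left, which shows how queue order alone can change the outcome.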
Output and Reporting: The simulator should provide clear and informative output, including:
- The final scheduling decision for each simulated pod (i.e., the node it was assigned to).
- Detailed logs of the scheduling process for individual pods, potentially including:
  - The list of nodes considered.
  - The reasons why certain nodes were filtered out.
  - The scores assigned to eligible nodes by different scoring plugins.
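One possible shape for such a per-pod log is a small trace record rendered as text. The sketch below assumes a hypothetical trace layout (`filtered`, `scores`, `selected`) and made-up score values; it is not a format that Volcano or kube-scheduler-simulator emits.

```python
def explain(pod_name, trace):
    """Render one pod's scheduling trace as readable text lines."""
    lines = [f"pod {pod_name}:"]
    # Nodes removed during filtering, with the recorded reason.
    for node, reason in sorted(trace["filtered"].items()):
        lines.append(f"  filtered {node}: {reason}")
    # Per-plugin scores, accumulated into a per-node total.
    totals = {}
    for plugin, scores in sorted(trace["scores"].items()):
        for node, score in sorted(scores.items()):
            lines.append(f"  score {plugin} {node}: {score}")
            totals[node] = totals.get(node, 0) + score
    for node in sorted(totals):
        lines.append(f"  total {node}: {totals[node]}")
    lines.append(f"  selected: {trace['selected']}")
    return lines

# Hypothetical trace for a single pod; the numbers are illustrative only.
trace = {
    "filtered": {"n3": "insufficient memory"},
    "scores": {
        "binpack": {"n1": 40, "n2": 70},
        "nodeorder": {"n1": 30, "n2": 10},
    },
    "selected": "n2",
}
report = explain("train-0", trace)
print("\n".join(report))
```

Keeping the trace as structured data and rendering it separately would let the same record feed a CLI log, a JSON export, or a UI.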
(Optional) Performance Evaluation: The simulator could optionally provide basic performance metrics, such as the time taken to schedule a set of pods under specific conditions. This could help in evaluating the performance impact of scheduler changes.
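A basic timing harness for this kind of metric might look like the following sketch. The `measure` helper and the `dummy_schedule` stand-in are hypothetical; a real simulator would pass its own scheduling cycle instead.

```python
import time

def measure(schedule_fn, pods, repeats=3):
    """Time schedule_fn over several runs; return best latency and pods/second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        schedule_fn(list(pods))  # copy so every run sees the same input
        best = min(best, time.perf_counter() - start)
    throughput = len(pods) / best if best > 0 else float("inf")
    return {"pods": len(pods), "best_seconds": best, "pods_per_second": throughput}

def dummy_schedule(pods):
    """Stand-in for a real scheduling cycle (hypothetical)."""
    return {p: f"node-{i % 4}" for i, p in enumerate(pods)}

stats = measure(dummy_schedule, [f"pod-{i}" for i in range(1000)])
print(stats)
```

Reporting the best of several runs reduces noise from the host machine, which matters when comparing two scheduler versions whose difference is small.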
Usability: The tool should be easy to use, with a clear command-line interface (CLI) or API, and should include comprehensive documentation and illustrative examples demonstrating how to simulate various scheduling scenarios and verify different scheduling policies.
Additional context
- Integration with existing Volcano testing frameworks.
- Support for simulating specific Volcano features (e.g., gang scheduling, task topology).
- Visualization of the simulated scheduling process (e.g., through logs or a simple UI).
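To make the gang scheduling item concrete, here is a minimal all-or-nothing sketch. It assumes a `min_member` threshold modeled loosely on a PodGroup's minimum member count, considers only CPU, and uses first-fit placement; the function name and data layout are hypothetical and much simpler than Volcano's gang plugin.

```python
def gang_schedule(tasks, min_member, free_cpu):
    """Place tasks only if at least min_member of them fit; otherwise place none.

    tasks: {task_name: cpu_request}; free_cpu: {node_name: free CPU}.
    Returns {task_name: node_name}, empty when the gang cannot start.
    """
    remaining = dict(free_cpu)  # tentative copy; caller's state is untouched
    placement = {}
    for task, cpu in tasks.items():
        # First-fit: take the first node with enough free CPU.
        node = next((n for n, free in remaining.items() if free >= cpu), None)
        if node is not None:
            remaining[node] -= cpu
            placement[task] = node
    if len(placement) < min_member:
        return {}  # discard the partial placement
    return placement

free = {"n1": 4, "n2": 2}
tasks = {"w0": 2, "w1": 2, "w2": 2}
ok = gang_schedule(tasks, min_member=3, free_cpu=free)
too_big = gang_schedule({**tasks, "w3": 2}, min_member=4, free_cpu=free)
print(ok, too_big)
```

The second call fits three of four workers but returns an empty placement, which is the behavior a simulator would need to reproduce to show why a gang job stays pending.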
The volcano-scheduler-simulator would be an invaluable tool for developers to test and debug scheduler changes, for users to understand and verify Volcano's scheduling behavior in different scenarios, and for the community to evaluate the performance characteristics of the scheduler.
Hey, this seems really interesting. Are there any pre-LFX discussions about the projects happening in community calls or elsewhere? I would love to be a part of it.
Hi, thanks for your interest! We did have an earlier discussion, though it is not exactly the same as what is proposed here; see https://github.com/volcano-sh/volcano/pull/3822. For this LFX project, we want to provide a simple binary that takes inputs such as pods, nodes, podgroups, and queues. The binary itself just calls the main Scheduler command to simulate the scheduling process, so that the scheduling process and its results, including the scheduling outcome and performance, can be observed.
Is there any progress on this? For example, I have 10,000 historical jobs, and I want to extract their data and simulate the scheduling process under different scheduling strategy components to find the combination that gives the best results.