volcano
[LFX 2025 Term2] Enhance JobFlow Functionality
What is the problem you're trying to solve
The current JobFlow implementation in Volcano provides a powerful way to orchestrate multiple dependent batch jobs using JobTemplates and control flow primitives like sequential and parallel execution. However, there are several limitations that hinder its flexibility and ability to handle more complex real-world scenarios:
- Limited JobTemplate Customization: Currently, when a JobTemplate is referenced within a JobFlow, it is used as is. Users lack the ability to make minor modifications to the JobTemplate's parameters for specific instances within the flow. This forces users to create multiple similar JobTemplates for slight variations, reducing reusability and increasing management overhead.
- Lack of Robust Error Handling: When a job within a JobFlow fails, the entire flow might halt or require manual intervention. There is no built-in mechanism for automatically retrying failed jobs based on specific policies (e.g., maximum retries, backoff strategies). This reduces the robustness of JobFlow and can lead to longer overall execution times for complex workflows.
- Insufficient Control Flow Options: The existing control flow primitives are basic. Supporting more advanced control structures like if statements, switch statements, and for loops would significantly enhance the expressiveness and capability of JobFlow, allowing users to define more intricate and dynamic job execution plans.
Describe the solution you'd like
- Allow Modification of JobTemplate Parameters on Reference: Introduce a mechanism within the JobFlow specification that allows users to override or modify specific parameters of a JobTemplate when it is referenced within a JobFlow step. This could be achieved through a dedicated section in the JobFlow step definition where users can specify parameter changes (e.g., resource requirements, command-line arguments, environment variables) that apply only to that particular instance of the JobTemplate execution.
- Implement Job Failure Retry Mechanism: Implement a configurable retry policy for jobs within a JobFlow. This would involve adding fields to the JobFlow step specification to define:
  - maxRetries: The maximum number of times a failed job should be retried.
  - retryPolicy: The strategy for retrying (e.g., Always, OnFailure, Never).
  - backoffPolicy: (Optional) A strategy for delaying subsequent retries (e.g., constant, exponential).
  The JobFlow controller would need to monitor job status and automatically trigger retries based on the defined policy.
- Introduce Advanced Control Flow Statements: Extend the JobFlow specification to include support for common control flow statements:
  - if statement: Allow conditional execution of JobFlow steps based on the outcome (success/failure, completion status) of previous steps or potentially external conditions.
  - switch statement: Enable branching execution based on the value of a variable or the outcome of a previous step.
  - for statement: Facilitate iterative execution of a JobFlow step or a set of steps a specified number of times or based on a collection of items.
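To make the first item (parameter overrides on reference) more concrete, here is a rough Go sketch of how a per-step override could be merged onto a template's values. The types TemplateParams and Override and the function Apply are hypothetical and heavily simplified; they are not part of the Volcano API and only illustrate the "copy the template, then apply the step's changes" idea.

```go
package main

import "fmt"

// TemplateParams is a hypothetical, simplified view of the fields a
// JobTemplate might expose for overriding. It is not a Volcano type.
type TemplateParams struct {
	Args []string
	Env  map[string]string
	CPU  string
}

// Override holds the per-step changes (also hypothetical).
// Empty fields mean "keep the template value".
type Override struct {
	Args []string
	Env  map[string]string
	CPU  string
}

// Apply returns a copy of base with the override applied. base is left
// untouched so the same JobTemplate can still be reused by other steps.
func Apply(base TemplateParams, o Override) TemplateParams {
	out := TemplateParams{
		Args: append([]string(nil), base.Args...),
		Env:  make(map[string]string, len(base.Env)+len(o.Env)),
		CPU:  base.CPU,
	}
	for k, v := range base.Env {
		out.Env[k] = v
	}
	// Arguments are replaced as a whole; environment variables are merged per key.
	if len(o.Args) > 0 {
		out.Args = append([]string(nil), o.Args...)
	}
	for k, v := range o.Env {
		out.Env[k] = v
	}
	if o.CPU != "" {
		out.CPU = o.CPU
	}
	return out
}

func main() {
	base := TemplateParams{
		Args: []string{"--epochs=10"},
		Env:  map[string]string{"MODE": "train"},
		CPU:  "1",
	}
	// One step reuses the template but switches the mode and asks for more CPU.
	step := Apply(base, Override{Env: map[string]string{"MODE": "eval"}, CPU: "2"})
	fmt.Println(step.Args, step.Env["MODE"], step.CPU)
	fmt.Println(base.Env["MODE"], base.CPU) // the template itself is unchanged
}
```

Whether list-valued fields such as arguments should be replaced or appended is a design choice the proposal would need to settle; the sketch replaces them.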
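For the third item (control flow), one possible way to think about if, switch, and for is as a function that expands a step into the template instances to launch, given the outcomes of earlier steps. The Step type, its fields, and Expand below are hypothetical names for this sketch and do not reflect an existing or agreed JobFlow schema.

```go
package main

import "fmt"

// Outcome is a hypothetical summary of how an earlier step finished.
type Outcome string

const (
	Succeeded Outcome = "Succeeded"
	Failed    Outcome = "Failed"
)

// Step is a hypothetical flow node combining the three proposed statements.
type Step struct {
	Template string // JobTemplate to run when no switch is used

	// if: run only when the named earlier step ended with IfOutcome.
	IfStep    string
	IfOutcome Outcome

	// switch: pick the template from Cases by the outcome of SwitchOn.
	SwitchOn string
	Cases    map[Outcome]string

	// for: run the chosen template once per item.
	Items []string
}

// Expand returns the template instances a step would launch, given the
// outcomes of steps that have already finished. A nil result means "skip".
func Expand(s Step, done map[string]Outcome) []string {
	if s.IfStep != "" && done[s.IfStep] != s.IfOutcome {
		return nil // condition not met
	}
	tmpl := s.Template
	if s.SwitchOn != "" {
		t, ok := s.Cases[done[s.SwitchOn]]
		if !ok {
			return nil // no matching case
		}
		tmpl = t
	}
	if len(s.Items) == 0 {
		return []string{tmpl}
	}
	runs := make([]string, 0, len(s.Items))
	for _, item := range s.Items {
		runs = append(runs, tmpl+"/"+item)
	}
	return runs
}

func main() {
	done := map[string]Outcome{"train": Succeeded}

	// if: only report when training failed, so this step is skipped.
	fmt.Println(Expand(Step{Template: "report", IfStep: "train", IfOutcome: Failed}, done))

	// switch: branch on the outcome of the training step.
	cases := map[Outcome]string{Succeeded: "evaluate", Failed: "cleanup"}
	fmt.Println(Expand(Step{SwitchOn: "train", Cases: cases}, done))

	// for: one instance per item in a collection.
	fmt.Println(Expand(Step{Template: "shard", Items: []string{"a", "b"}}, done))
}
```

Conditions on external inputs or on arbitrary variables, which the proposal also mentions, are left out of this sketch.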
Additional context
No response
I'm interested in working on this.
Hi @Monokaix
I'd like to tackle this feature to enhance the JobFlow functionality in Volcano. Here’s a breakdown of both the current limitations and the proposed solution:
The Problem: Right now, JobFlow orchestrates multiple dependent batch jobs well using JobTemplates and simple primitives for sequential or parallel execution. However, several limitations hinder its use in real-world scenarios. By addressing them, we can significantly improve the robustness and flexibility of JobFlow, making it more suitable for complex and dynamic batch processing scenarios.
Could you please assign this issue to me? I’m eager to start drafting a PR and collaborating on these improvements.
Best regards, Abhijit
Hello @Monokaix, I have a question about the first enhancement: isn't "Allow Modification of JobTemplate Parameters on Reference" the same functionality as the patch field in JobFlow?
Hello @Monokaix, I am currently working on the issue and would like some clarification. Could you help explain the difference between the Job Failure Retry Mechanism in JobFlow and the built-in retry mechanism in VCJob itself?
/assign
@Monokaix add JobFlowTemplate?