
[LFX 2025 Term2] Enhance JobFlow Functionality

Open Monokaix opened this issue 6 months ago • 5 comments

What is the problem you're trying to solve

The current JobFlow implementation in Volcano provides a powerful way to orchestrate multiple dependent batch jobs using JobTemplates and control flow primitives like sequential and parallel execution. However, there are several limitations that hinder its flexibility and ability to handle more complex real-world scenarios:

  1. Limited JobTemplate Customization: Currently, when a JobTemplate is referenced within a JobFlow, it is used as is. Users lack the ability to make minor modifications to the JobTemplate's parameters for specific instances within the flow. This forces users to create multiple similar JobTemplates for slight variations, reducing reusability and increasing management overhead.

  2. Lack of Robust Error Handling: When a job within a JobFlow fails, the entire flow might halt or require manual intervention. There is no built-in mechanism for automatically retrying failed jobs based on specific policies (e.g., maximum retries, backoff strategies). This reduces the robustness of JobFlow and can lead to longer overall execution times for complex workflows.

  3. Insufficient Control Flow Options: The existing control flow primitives are basic. Supporting more advanced control structures like if statements, switch statements, and for loops would significantly enhance the expressiveness and capability of JobFlow, allowing users to define more intricate and dynamic job execution plans.

Describe the solution you'd like

  1. Allow Modification of JobTemplate Parameters on Reference: Introduce a mechanism within the JobFlow specification that allows users to override or modify specific parameters of a JobTemplate when it is referenced within a JobFlow step. This could be achieved through a dedicated section in the JobFlow step definition where users can specify parameter changes (e.g., resource requirements, command-line arguments, environment variables) that apply only to that particular instance of the JobTemplate execution.

  2. Implement Job Failure Retry Mechanism: Implement a configurable retry policy for jobs within a JobFlow. This would involve adding fields to the JobFlow step specification to define:

    • maxRetries: The maximum number of times a failed job should be retried.
    • retryPolicy: The strategy for retrying (e.g., Always, OnFailure, Never).
    • backoffPolicy: (Optional) A strategy for delaying subsequent retries (e.g., constant, exponential).

     The JobFlow controller would need to monitor job status and automatically trigger retries based on the defined policy.

  3. Introduce Advanced Control Flow Statements: Extend the JobFlow specification to include support for common control flow statements:

    • if statement: Allow conditional execution of JobFlow steps based on the outcome (success/failure, completion status) of previous steps or potentially external conditions.
    • switch statement: Enable branching execution based on the value of a variable or the outcome of a previous step.
    • for statement: Facilitate iterative execution of a JobFlow step or a set of steps a specified number of times or based on a collection of items.
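
To make the three proposals above more concrete, here is a minimal Python sketch of how they might fit together. It is not Volcano code and does not reflect the existing JobFlow API: `deep_merge`, `RetryPolicy`, `Step`, `run_flow`, and the `base_delay_seconds` field are hypothetical names chosen for illustration. Only `maxRetries` and the constant/exponential backoff idea come from the proposal; the `retryPolicy` values (Always, OnFailure, Never) are left out for brevity, and only the `if` case of the control flow is modeled.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


def deep_merge(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `overrides` applied.

    Nested dicts are merged key by key; any other value replaces the original.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


@dataclass
class RetryPolicy:
    # Hypothetical fields, loosely mirroring maxRetries and backoffPolicy.
    max_retries: int = 0
    backoff: str = "constant"  # "constant" or "exponential"
    base_delay_seconds: float = 10.0

    def delay_before_retry(self, retry_number: int) -> float:
        """Delay before the given retry; retry_number starts at 1."""
        if self.backoff == "exponential":
            return self.base_delay_seconds * 2 ** (retry_number - 1)
        return self.base_delay_seconds


@dataclass
class Step:
    name: str
    template: Dict[str, Any]  # the referenced JobTemplate, as a plain dict
    overrides: Dict[str, Any] = field(default_factory=dict)
    retry: RetryPolicy = field(default_factory=RetryPolicy)
    # Optional `if` condition evaluated against the outcomes of earlier steps.
    when: Optional[Callable[[Dict[str, str]], bool]] = None


def run_flow(
    steps: List[Step],
    run_job: Callable[[Dict[str, Any]], bool],
    sleep: Callable[[float], None] = lambda seconds: None,
) -> Dict[str, str]:
    """Run steps in order and return an outcome per step name.

    `run_job` stands in for submitting a job and waiting for it to finish;
    it returns True on success. `sleep` is injected so the backoff can be
    observed without waiting; a real controller would requeue instead.
    """
    outcomes: Dict[str, str] = {}
    for step in steps:
        if step.when is not None and not step.when(outcomes):
            outcomes[step.name] = "Skipped"
            continue
        # Overrides apply to this step only; the template is left untouched.
        spec = deep_merge(step.template, step.overrides)
        retries = 0
        succeeded = run_job(spec)
        while not succeeded and retries < step.retry.max_retries:
            retries += 1
            sleep(step.retry.delay_before_retry(retries))
            succeeded = run_job(spec)
        outcomes[step.name] = "Succeeded" if succeeded else "Failed"
    return outcomes
```

In this sketch a cleanup step could be declared as `Step("cleanup", template, when=lambda o: o.get("train") == "Failed")`, so it runs only if an earlier step named `train` failed. `switch` and `for` could be layered on the same outcome map, but that design is left open here.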

Additional context

No response

Monokaix avatar May 12 '25 12:05 Monokaix

I'm interested in working on this.

Shivansh-yadav13 avatar May 17 '25 05:05 Shivansh-yadav13

Hi @Monokaix

I'd like to tackle this feature to enhance the JobFlow functionality in Volcano. Here’s a breakdown of both the current limitations and the proposed solution:

The Problem: Right now, JobFlow orchestrates multiple dependent batch jobs well using JobTemplates and simple primitives for sequential or parallel execution. However, significant limitations hinder its scalability in real-world scenarios. By addressing them, we can make JobFlow considerably more robust and flexible, and better suited to complex and dynamic batch processing scenarios.

Could you please assign this issue to me? I’m eager to start drafting a PR and collaborating on these improvements.

Best regards, Abhijit

Sukuna0007Abhi avatar May 18 '25 13:05 Sukuna0007Abhi

Hello @Monokaix, I have a question about the first enhancement: isn't "Allow Modification of JobTemplate Parameters on Reference" the same functionality as the patch field in JobFlow?

owenowenisme avatar May 22 '25 09:05 owenowenisme

Hello @Monokaix, I am currently working on the issue and would like some clarification. Could you help explain the difference between the Job Failure Retry Mechanism in JobFlow and the built-in retry mechanism in VCJob itself?

JackyTYang avatar May 25 '25 14:05 JackyTYang

/assign

mahdikhashan avatar Jun 07 '25 12:06 mahdikhashan

@Monokaix add JobFlowTemplate?

dongjiang1989 avatar Jun 27 '25 06:06 dongjiang1989