alluxio
Failed load job is rescheduled after a leadership switch
Alluxio Version: v2.9.3
Describe the bug When using LoadV2 to load data (alluxio fs load xxxx --submit), if the job ultimately fails, its final status is not persisted in the journal. If a master-slave switch happens afterwards, the new master reschedules the stale failed load job. In many production environments, rescheduling old failed jobs loads a batch of unnecessary data and can seriously affect cluster stability. Furthermore, the original design does not appear to expect failed jobs to be rescheduled (see "Additional context" for details), so this should be treated as a bug.
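The mechanism described above can be illustrated with a toy model of journal replay. This is only a sketch under stated assumptions, not Alluxio code: the class JournalReplaySketch, the Entry type, the State enum and the "last entry wins" replay rule are all hypothetical stand-ins for whatever the real journal does. It shows why a failure that is never journaled makes the job look unfinished to a new leader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of journal replay after a leadership switch; not Alluxio code. */
final class JournalReplaySketch {
  /** Hypothetical, simplified job states. */
  enum State { RUNNING, FAILED }

  /** Hypothetical journal entry: the state of one job at the time of writing. */
  static final class Entry {
    final String jobId;
    final State state;

    Entry(String jobId, State state) {
      this.jobId = jobId;
      this.state = state;
    }
  }

  /** Replays entries in order; the last entry for a job determines its state. */
  static Map<String, State> replay(List<Entry> journal) {
    Map<String, State> jobs = new LinkedHashMap<>();
    for (Entry e : journal) {
      jobs.put(e.jobId, e.state);
    }
    return jobs;
  }

  /** Jobs a new leader would schedule: everything it still sees as RUNNING. */
  static List<String> toSchedule(Map<String, State> jobs) {
    List<String> ids = new ArrayList<>();
    for (Map.Entry<String, State> e : jobs.entrySet()) {
      if (e.getValue() == State.RUNNING) {
        ids.add(e.getKey());
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    List<Entry> journal = new ArrayList<>();
    journal.add(new Entry("load-1", State.RUNNING)); // written when the job is submitted

    // Reported behavior: the failure exists only in the old leader's memory,
    // so the new leader replays a journal that still says RUNNING.
    System.out.println(toSchedule(replay(journal))); // prints [load-1]

    // Expected behavior: the terminal state is journaled before the switch.
    journal.add(new Entry("load-1", State.FAILED));
    System.out.println(toSchedule(replay(journal))); // prints []
  }
}
```

In this model the fix is on the write side: once the FAILED entry is in the journal, replay on the new leader has nothing left to schedule.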
To Reproduce
First, submit a load job through LoadV2 and let the load job fail.
Second, switch the master and check the status of the load job: the job has been rescheduled.
Expected behavior After a master-slave switch, failed jobs should not be rescheduled.
Urgency Affects cluster stability after a master-slave switch.
Are you planning to fix it Yes
Additional context
To be clear, the original design of this feature intends that a job with a definite success or failure status is not scheduled again after a leadership switch.
We have encountered a similar issue: jobs that were in the JobState.STOPPED state were automatically rescheduled after a leader switch. For large jobs that have been explicitly cancelled, this can lead to unnecessary resource consumption and stability risks. To mitigate it, we explicitly prohibit jobs in the JobState.STOPPED state from being automatically rescheduled after a leader switch.
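A minimal sketch of that kind of guard on the recovery path might look like the following. It is an illustration only: FailoverRecoverySketch, shouldReschedule and jobsToResume are hypothetical names, and the JobState enum here is a local stand-in that borrows the state names discussed in this issue rather than the actual Alluxio class. It also treats SUCCEEDED and FAILED as non-reschedulable, in line with the intent stated under "Additional context".

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of a guard applied when a new leader recovers load jobs; not Alluxio code. */
final class FailoverRecoverySketch {
  /** Local stand-in for the job states mentioned in this issue (assumed set). */
  enum JobState { RUNNING, STOPPED, SUCCEEDED, FAILED }

  /** States that must never trigger automatic rescheduling after a leader switch. */
  static final Set<JobState> NOT_RESCHEDULED =
      EnumSet.of(JobState.STOPPED, JobState.SUCCEEDED, JobState.FAILED);

  /** Returns true only for jobs that were still in progress when leadership changed. */
  static boolean shouldReschedule(JobState state) {
    return !NOT_RESCHEDULED.contains(state);
  }

  /** Filters the recovered jobs down to the ones the new leader should resume. */
  static List<String> jobsToResume(Map<String, JobState> recovered) {
    List<String> ids = new ArrayList<>();
    for (Map.Entry<String, JobState> e : recovered.entrySet()) {
      if (shouldReschedule(e.getValue())) {
        ids.add(e.getKey());
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    Map<String, JobState> recovered = new LinkedHashMap<>();
    recovered.put("load-a", JobState.RUNNING);
    recovered.put("load-b", JobState.STOPPED); // explicitly cancelled by a user
    recovered.put("load-c", JobState.FAILED);  // failed before the switch
    System.out.println(jobsToResume(recovered)); // prints [load-a]
  }
}
```

A guard like this only helps if the recovered state is accurate, so it complements persisting the terminal state to the journal rather than replacing it.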