alluxio Failed load job is rescheduled when Leadership switch

Open liiuzq-xiaobai opened this issue 10 months ago • 1 comments

Alluxio Version: v2.9.3

Describe the bug When data is loaded with LoadV2 (alluxio fs load xxxx --submit) and the job ultimately fails, the failed job status is not persisted in the journal. If a leadership switch (master failover) happens afterwards, the new leading master reschedules the stale failed load job. In many production environments, rescheduling old failed jobs loads a batch of unnecessary data and can seriously affect cluster stability. Furthermore, the original design does not appear to expect failed jobs to be rescheduled (see "Additional context" for details), so this should be a bug.
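To make the failure mode concrete, here is a minimal toy model of the behavior described above. It is only a sketch: `JournalReplaySketch`, `JournalEntry`, `replay`, and the local `JobState` enum are hypothetical names invented for illustration and are not Alluxio classes. It assumes a new leader rebuilds its job table from the journal, with the last entry per path winning.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the reported behavior. None of these types are Alluxio classes. */
public class JournalReplaySketch {
  /** Local stand-in for a job state, defined only for this sketch. */
  public enum JobState { RUNNING, SUCCEEDED, FAILED }

  /** Hypothetical journal entry: a load path plus the state recorded for it. */
  public static final class JournalEntry {
    final String path;
    final JobState state;

    public JournalEntry(String path, JobState state) {
      this.path = path;
      this.state = state;
    }
  }

  /** Rebuilds the job table as a new leader might: the last entry per path wins. */
  public static Map<String, JobState> replay(List<JournalEntry> journal) {
    Map<String, JobState> jobs = new LinkedHashMap<>();
    for (JournalEntry entry : journal) {
      jobs.put(entry.path, entry.state);
    }
    return jobs;
  }

  public static void main(String[] args) {
    List<JournalEntry> journal = new ArrayList<>();
    // Submission is journaled.
    journal.add(new JournalEntry("/data/a", JobState.RUNNING));

    // The job fails on the old leader, but the FAILED transition is never journaled,
    // so replay still reports RUNNING and the new leader would schedule it again.
    System.out.println(replay(journal)); // {/data/a=RUNNING}

    // If the terminal state had been journaled, replay would recover it.
    journal.add(new JournalEntry("/data/a", JobState.FAILED));
    System.out.println(replay(journal)); // {/data/a=FAILED}
  }
}
```

In this model the stale job reappears purely because the terminal transition is missing from the journal, which matches the symptom reported here.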

To Reproduce First, submit a load job with LoadV2 and let the job fail. Second, switch the leading master and check the status of the load job: the job has been rescheduled.

Expected behavior After a leadership switch, failed jobs should not be rescheduled.

Urgency Affects cluster stability after a leadership switch.

Are you planning to fix it Yes

Additional context [Screenshot: WeCom screenshot 97e28ba5-51fb-49f2-8584-3c994ffb3c01] [Screenshot: WeCom screenshot bc1de378-908a-4315-b514-e8d6b0dd5a6d] First of all, to be clear: the original design of this feature intends that a job with a definite success or failure status is not scheduled again after a leadership switch.

Apr 18 '24 06:04 liiuzq-xiaobai

We have encountered a similar issue: jobs that were in the JobState.STOPPED state were automatically rescheduled after a leader switch. For large jobs that have been explicitly cancelled, this can lead to unnecessary resource consumption and stability risks. To mitigate this, we have explicitly prohibited jobs in JobState.STOPPED from being automatically rescheduled after a leader switch.
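A guard of this kind could look roughly like the sketch below. This is not the actual patch: `FailoverRescheduleGuard`, `jobsToReschedule`, and the local `JobState` enum are hypothetical stand-ins, and treating SUCCEEDED and FAILED as non-reschedulable alongside STOPPED is an assumption based on the expectation stated in the issue.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of a failover-time filter. Not Alluxio code; all names are hypothetical. */
public class FailoverRescheduleGuard {
  /** Local stand-in for a job state, defined only for this sketch. */
  public enum JobState { RUNNING, STOPPED, SUCCEEDED, FAILED }

  /** States assumed here to be final, so they are never rescheduled after failover. */
  private static final Set<JobState> NOT_RESCHEDULABLE =
      EnumSet.of(JobState.STOPPED, JobState.SUCCEEDED, JobState.FAILED);

  /** Returns the paths of recovered jobs that the new leader should schedule again. */
  public static List<String> jobsToReschedule(Map<String, JobState> recovered) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, JobState> entry : recovered.entrySet()) {
      if (!NOT_RESCHEDULABLE.contains(entry.getValue())) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, JobState> recovered = new LinkedHashMap<>();
    recovered.put("/data/running", JobState.RUNNING);
    recovered.put("/data/stopped", JobState.STOPPED);
    recovered.put("/data/failed", JobState.FAILED);

    // Only the job that was still in progress is picked up again.
    System.out.println(jobsToReschedule(recovered)); // [/data/running]
  }
}
```

A filter like this only helps if the stopped or failed state was actually persisted before the switch. If the terminal state never reaches the journal, the recovered job still looks like it is in progress and passes the filter.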

Aug 21 '24 02:08 hawthorn2025
