Jobs are scheduled in wrong order if one job failed once

Open wndhydrnt opened this issue 9 years ago • 7 comments

Consider the following job graph, where Parent is a scheduled job. Child1 takes longer to execute than Child2. After both jobs have finished, End is triggered to run.

         +--------+         
         |        |         
    +----+ Parent +----+    
    |    |        |    |    
    |    +--------+    |    
    |                  |    
    |                  |    
    |                  |    
+---+----+        +----+---+
|        |        |        |
| Child1 |        | Child2 |
|        |        |        |
+---+----+        +----+---+
    |                  |    
    |                  |    
    |                  |    
    |    +--------+    |    
    |    |        |    |    
    +----+  End   +----+    
         |        |         
         +--------+         

When Parent is executed, the order of execution looks like this:

1: Parent (success)
2: Child1 (success) - Child2 (success)
3: End (success)

On the next execution of Parent, imagine that Child2 fails and the order of execution now looks like this:

1: Parent (success)
2: Child1 (success) - Child2 (fails)
3: End (not scheduled)

On the following execution, Child2 recovers, but End is executed before Child1 has finished (remember that Child1 takes longer than Child2):

1: Parent (success)
2: Child2 (success)
3: End (success)
4: Child1 (success)

Expected: End should only run after Child1 has finished. This last order of execution persists until either Child1 also fails once or Chronos is restarted.
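
The behaviour described above can be reproduced with a small model. The sketch below is an assumption, not Chronos's actual code: it supposes that each dependency edge keeps a counter of parent successes, and that a child is triggered as soon as every incoming edge has a count above zero. The class `EdgeCounterModel` and the helper `replay` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Hypothetical, simplified model; not Chronos's actual implementation.
PARENTS = {
    "Child1": ["Parent"],
    "Child2": ["Parent"],
    "End": ["Child1", "Child2"],
}


class EdgeCounterModel:
    def __init__(self, parents):
        self.parents = parents
        # (parent, child) -> successes of parent not yet consumed by child
        self.counts = defaultdict(int)

    def on_success(self, job):
        """Record a success of `job` and return the children it triggers."""
        triggered = []
        for child, child_parents in self.parents.items():
            if job not in child_parents:
                continue
            self.counts[(job, child)] += 1
            edges = [(p, child) for p in child_parents]
            if all(self.counts[e] > 0 for e in edges):
                for e in edges:
                    self.counts[e] -= 1  # consume one success per parent
                triggered.append(child)
        return triggered


def replay(model, events):
    """events: (job, succeeded) pairs in the order the jobs finish."""
    trace = []
    for job, succeeded in events:
        trace.append(f"{job} {'ok' if succeeded else 'failed'}")
        if succeeded:
            trace.extend(f"start {child}" for child in model.on_success(job))
    return trace


model = EdgeCounterModel(PARENTS)
run1 = replay(model, [("Parent", True), ("Child2", True), ("Child1", True)])
run2 = replay(model, [("Parent", True), ("Child2", False), ("Child1", True)])
run3 = replay(model, [("Parent", True), ("Child2", True), ("Child1", True)])

for name, trace in (("run 1", run1), ("run 2", run2), ("run 3", run3)):
    print(name, trace)
```

In this model, the success of Child1 recorded during the second run is never consumed because End was not scheduled. In the third run that stale count lets Child2 alone trigger End, and the same pattern repeats on every later run because Child1 always leaves a fresh count behind.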

wndhydrnt, Sep 01 '15 08:09

@wndhydrnt thanks for reporting this.

cc @gkleiman

pierluigi, Sep 01 '15 09:09

@pierlo-upitup @gkleiman I've been trying to fix this myself, but without success. Everything I tried introduced new bugs, such as jobs being scheduled too often or, in certain scenarios, not at all. Most of my solutions did not work because Chronos does not constrain the parents of a job: a job can depend on a parent job and on that parent's own parent at the same time.

Any ideas?

wndhydrnt, Sep 09 '15 13:09

@wndhydrnt We could add the rule that a dependent job only runs once each of its parent jobs has at least one success and none of the parent jobs is running. If this is acceptable, it can be done by updating the edgeInvocationCount map in JobGraph.scala at the start of every task, resetting the number of successful invocations for that job to 0. This way Child2 will not kick off End on completion if Child1 has started or is still running.
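
As a rough illustration of this proposal, the sketch below models per-edge success counters that are reset to 0 whenever a job starts. It is a simplified assumption about how such counters might behave, not the code in JobGraph.scala; `ResettingModel` and `run_once` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical, simplified model; not Chronos's actual implementation.
PARENTS = {
    "Child1": ["Parent"],
    "Child2": ["Parent"],
    "End": ["Child1", "Child2"],
}


class ResettingModel:
    def __init__(self, parents):
        self.parents = parents
        self.counts = defaultdict(int)  # (parent, child) -> unconsumed successes

    def _outgoing(self, job):
        return [(job, c) for c, ps in self.parents.items() if job in ps]

    def on_start(self, job):
        # Proposed change: a job that starts forgets its earlier successes.
        for edge in self._outgoing(job):
            self.counts[edge] = 0

    def on_success(self, job):
        triggered = []
        for _, child in self._outgoing(job):
            self.counts[(job, child)] += 1
            edges = [(p, child) for p in self.parents[child]]
            if all(self.counts[e] > 0 for e in edges):
                for e in edges:
                    self.counts[e] -= 1
                triggered.append(child)
        return triggered


def run_once(model, child2_ok):
    """One execution of Parent in which Child2 finishes before Child1."""
    trace = []

    def finish(job, ok):
        trace.append(f"{job} {'ok' if ok else 'failed'}")
        if ok:
            for child in model.on_success(job):
                model.on_start(child)
                trace.append(f"start {child}")

    model.on_start("Parent")
    finish("Parent", True)
    finish("Child2", child2_ok)
    finish("Child1", True)
    return trace


model = ResettingModel(PARENTS)
traces = [run_once(model, ok) for ok in (True, False, True)]
for trace in traces:
    print(trace)
```

Under these assumptions, the run after the Child2 failure starts End only after Child1 has finished, because starting Child1 clears the count left over from the failed run.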

I've run into this bug as well. It can also be caused by updating a child job in the middle of a dependency chain, when one parent has already finished but another parent job is still running at the time of the update.

A potential problem is that a child would never be triggered if at least one of its parent jobs is always running. I find this unlikely, however.
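
To make this corner case concrete, here is a minimal sketch under the same assumed reset-on-start rule. The jobs A and B are hypothetical parents of End whose executions always overlap; none of the names come from Chronos.

```python
from collections import defaultdict

# Hypothetical graph: End depends on two jobs, A and B, that overlap in time.
PARENTS_OF_END = ["A", "B"]
counts = defaultdict(int)  # parent -> unconsumed successes on its edge to End
end_starts = []            # one entry per time End would be triggered


def on_start(job):
    counts[job] = 0  # assumed reset-on-start rule


def on_success(job):
    counts[job] += 1
    if all(counts[p] > 0 for p in PARENTS_OF_END):
        for p in PARENTS_OF_END:
            counts[p] -= 1
        end_starts.append(job)


on_start("A")
on_start("B")
for _ in range(5):
    on_success("A")  # A finishes while B is still running
    on_start("A")    # A is started again before B finishes
    on_success("B")  # B finishes while the new A is still running
    on_start("B")    # B is started again before A finishes

print(len(end_starts))
```

Each restart wipes the success that the other parent was waiting for, so End is never triggered in this interleaving.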

Califax, Dec 01 '15 03:12

@wndhydrnt did you ever end up (eventually) getting a fix? (like with https://github.com/Jimdo/chronos/commit/6015035ce085aec7323722da3b22327f80e27eb4)

If so, how about a PR?

solarkennedy, Apr 27 '16 17:04

@solarkennedy No, I never was able to fix it without introducing new bugs. Sadly, to me, it looks like the whole logic that handles dependencies between jobs needs to be rewritten in order to fix it. Instead of doing that, I would suggest you use Airflow.

wndhydrnt, Apr 27 '16 18:04

Is this issue unresolved?

zxcv551133, Jan 21 '20 08:01

@zxcv551133 Yes, still unresolved as of v3.0.2 (which doesn't appear in the releases but does appear in the tags and is available as a Docker image, see https://hub.docker.com/r/mesosphere/chronos/tags). As outlined in https://github.com/mesos/chronos/issues/543#issuecomment-215184318, potential fixes appear to be non-trivial.

moertel, Jan 21 '20 10:01