[FEATURE] Guarantee that jobs will not miss execution when moving shards

Open downsrob opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? The job scheduler uses a consistent hash function to build a hash ring that determines which node each job is assigned to. On cluster events such as a node being added or removed, or a shard relocating (a routing table update), all data nodes are notified and refresh the hash ring, scheduling or descheduling jobs on the local node accordingly. During this refresh there is a window in which a job has been descheduled on the old node but not yet rescheduled on the new one: the job scheduler on the new node must first sweep the shard and then reschedule each job for its next execution. If a job was due to execute inside that window, the execution is skipped. Because jobs are rescheduled by sweeping the entire shard again, the more jobs a shard holds, the longer the window and the greater the chance of a missed execution.
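To illustrate which jobs are exposed to this gap, here is a minimal sketch of a consistent hash ring in Python. It is a simplified model and not the plugin's actual code: `HashRing`, `moved_jobs`, the MD5-based hash, the virtual-node count, and the node and job names are all assumptions made for the example. It shows that when a node joins, only the jobs whose owner changes are descheduled and rescheduled, and those are the ones that can miss an execution.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash; the real scheduler's hash function may differ.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, job_id: str) -> str:
        # A job belongs to the first ring point clockwise from its hash.
        idx = bisect.bisect(self._points, _hash(job_id)) % len(self._points)
        return self._ring[idx][1]


def moved_jobs(job_ids, old_nodes, new_nodes):
    """Return {job_id: (old_owner, new_owner)} for jobs that change node."""
    old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
    moved = {}
    for job_id in job_ids:
        before, after = old_ring.owner(job_id), new_ring.owner(job_id)
        if before != after:
            moved[job_id] = (before, after)
    return moved


jobs = [f"job-{i}" for i in range(1000)]
moved = moved_jobs(jobs, ["node-a", "node-b"], ["node-a", "node-b", "node-c"])
print(f"{len(moved)} of {len(jobs)} jobs move and are exposed to the gap")
```

In this model every moved job lands on the newly added node, and the remaining jobs stay where they were. Each moved job still has to wait for the new node's shard sweep before it is scheduled again.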

What solution would you like?

  • [ ] Document, for future developers, the ways in which a job can skip an execution
  • [ ] Look into providing a stricter guarantee that a job will not “miss” an execution, i.e. the execution might be delayed, but it will still run to make up for the missed one as long as the next scheduled execution has not yet come due
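One possible shape for the stricter guarantee is a catch-up check at reschedule time. The sketch below is an assumption-laden illustration, not a proposal for the actual implementation: it assumes fixed-interval jobs and that the last execution time is persisted somewhere the new node can read it. `Plan` and `plan_after_reschedule` are hypothetical names. It also makes one design choice the issue does not settle: if more than one slot was missed, only a single make-up execution is run.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Plan(NamedTuple):
    run_now: bool        # True if a make-up execution should fire immediately
    next_run: datetime   # when the following regular execution is due


def plan_after_reschedule(last_run: datetime, interval: timedelta, now: datetime) -> Plan:
    """Decide what to do with a job that was just rescheduled on a new node.

    Assumes a fixed-interval job and a persisted `last_run` timestamp
    (both are assumptions for this sketch).
    """
    expected = last_run + interval
    if now < expected:
        # Nothing was missed: keep the original slot.
        return Plan(run_now=False, next_run=expected)

    # At least one slot fell into the deschedule/reschedule gap.
    # Run once now (delayed) and realign to the next regular slot.
    slots_behind = (now - expected) // interval
    latest_missed_slot = expected + slots_behind * interval
    return Plan(run_now=True, next_run=latest_missed_slot + interval)


last = datetime(2022, 4, 21, 12, 0)
every = timedelta(minutes=5)

# Rescheduled one minute after the 12:05 slot: run late, then resume at 12:10.
print(plan_after_reschedule(last, every, datetime(2022, 4, 21, 12, 6)))
# Rescheduled before the 12:05 slot: nothing missed.
print(plan_after_reschedule(last, every, datetime(2022, 4, 21, 12, 3)))
```

Running the make-up execution immediately means it may start close to the next regular slot, so a real implementation would likely also need to guard against two executions of the same job overlapping.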

What alternatives have you considered? This isn't a new issue, and it is very rare for most use cases. It may not be necessary to guarantee that no execution is ever missed.

downsrob, Apr 21 '22 23:04