cylc-flow
Rethink runahead limiting
Runahead limiting was originally designed to limit the performance impact of task pool bloat caused by spread over cycles, in the old spawn-on-submit system where the intermediate cycles would fill up with succeeded tasks.
In other words, it is really just a crude way of limiting task pool size - which we couldn't do properly under spawn-on-submit (the flow would stall if the wrong tasks were held back).
But it's a bad way of limiting task pool size because:
- it's no help within a cycle point (many of our worst flows have a large number of tasks per cycle)
- any default value is an unhappy compromise because:
  - flows with many tasks per cycle need a very low runahead limit (say, 1)
  - flows with few tasks per cycle can have a very high runahead limit (say, 200)
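
To make the compromise concrete, here is a rough back-of-envelope sketch. The function name, the task counts, and the assumption that the limit permits `runahead_cycles` cycles beyond the oldest active one are all illustrative, not taken from the Cylc code base:

```python
def max_pool_size(tasks_per_cycle: int, runahead_cycles: int) -> int:
    """Rough upper bound on pool size under a cycle-based limit.

    Assumes (for illustration only) that every task in every permitted
    cycle can be in the pool at once, and that the limit allows
    `runahead_cycles` cycles beyond the oldest active cycle.
    """
    return tasks_per_cycle * (runahead_cycles + 1)


# Hypothetical flows: one with many tasks per cycle, one with few.
big_flow = max_pool_size(tasks_per_cycle=5000, runahead_cycles=1)
small_flow = max_pool_size(tasks_per_cycle=10, runahead_cycles=200)
print(big_flow, small_flow)
```

Under these made-up numbers the "big" flow with a limit of 1 still allows a far larger pool than the "small" flow with a limit of 200, which is why no single default suits both.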
In the spawn-on-demand scheduler:
- spread over cycles has no impact, except for individual tasks with no prerequisites and no other constraints (xtriggers)
- tasks are spawned "on demand" when ready to run according to the graph. This brings runahead limiting into line with xtriggers, task hold, and queues: it's just another way to temporarily hold back tasks that are otherwise ready to run
- so, runahead limiting under SoD is functionally equivalent to a queue (one that releases according to the current runahead limit rather than the number of active tasks)
- (it doesn't even preferentially release older cycle points first; the ordering that exists is due to spawning in cycle point order)
- (also, the global "default" queue is equivalent to a task pool size limit, because queues limit the number of active tasks, and in SoD the task pool is the active tasks - see (x) below)
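
The equivalence claimed in the points above can be sketched as two release gates over the same list of ready tasks. This is a minimal illustration with hypothetical names (`Task`, `queue_release`, `runahead_release`) and integer cycle points; it is not how the scheduler is actually implemented:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    point: int  # integer cycle point, for simplicity


def queue_release(ready, n_active, limit):
    """Queue gate: release ready tasks only while active slots are free."""
    free_slots = max(limit - n_active, 0)
    return ready[:free_slots]


def runahead_release(ready, oldest_active_point, runahead_limit):
    """Runahead gate: release ready tasks within the limit of the oldest
    active cycle point. Input order is preserved; no extra priority is
    given to older cycle points."""
    return [
        t for t in ready
        if t.point - oldest_active_point <= runahead_limit
    ]


ready = [Task("a", 1), Task("a", 2), Task("a", 5)]
by_queue = queue_release(ready, n_active=1, limit=2)
by_runahead = runahead_release(ready, oldest_active_point=1, runahead_limit=2)
```

Both functions take tasks that are already ready to run and return the subset allowed to proceed; they differ only in the release criterion.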
~The one thing that is different from queues etc. is our use of the hidden runahead pool.~ Instead of a visible waiting/ready task held back by a queue, we have a ~hidden~ [visible] waiting/ready task held back by the runahead limit.
The points above suggest that we no longer need runahead limiting because we now have the ability to limit pool size properly even within a cycle. However, we will keep it for backward compatibility (many existing flows set a value, and some may rely on it as a proxy for intercycle dependence). But we should ~get rid of the hidden runahead pool and~ [done] unify runahead limiting with the other limiting mechanisms: queues, xtriggers (now unified with clock triggers and retries), task hold. (And later we may re-implement all of these as xtriggers).
This was motivated by thinking about how to represent limited tasks (ready but held back as waiting) in the Cylc 8 UI n-distance window.
(x) Pool size limiting does not cause race condition stalls in SoD if by pool we mean just the active tasks. If too many children are spawned by the active tasks, some will be held back from becoming active, but a stall won't result regardless of subsequent release order because all of them are ready to run (there are no longer any unsatisfied tasks waiting to be satisfied by dependency matching with active tasks - this was the source of the SoS race condition). Note that the number of spawned tasks is not limited, but the smaller the number of active tasks, the smaller the number of children they will spawn.
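
The no-stall argument in (x) can be checked with a toy simulation. This is a simplified model, not Cylc code: the `run` function and the example graph are hypothetical, and a child is assumed to be spawned only once all of its parents have succeeded, so every held-back task is ready to run.

```python
import random


def run(graph, limit, seed):
    """Simulate a flow with at most `limit` active tasks.

    graph maps each task to the set of its parents. Ready tasks are
    released in random order, and a random active task finishes at each
    step. Returns the order in which tasks completed.
    """
    rng = random.Random(seed)
    succeeded, done, active = set(), [], []
    # Parentless tasks are ready immediately.
    ready = [t for t, parents in graph.items() if not parents]
    while ready or active:
        rng.shuffle(ready)  # arbitrary release order
        while ready and len(active) < limit:
            active.append(ready.pop())
        finished = active.pop(rng.randrange(len(active)))
        succeeded.add(finished)
        done.append(finished)
        # Spawn on demand: a child appears when its last parent succeeds.
        for task, parents in graph.items():
            if finished in parents and parents <= succeeded:
                ready.append(task)
    return done


graph = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}, "e": {"d"}}
results = [run(graph, limit, seed) for limit in (1, 2, 3) for seed in range(20)]
```

In this model every run completes all tasks whatever the limit or release order, because nothing in the held-back set depends on another held-back task.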