
Remove scheduler plugin machinery, make Sch programmable

Open jpsamaroo opened this issue 5 years ago • 1 comments

The scheduler plugin system is (to my knowledge) unused by any current users of Dagger. However, because external schedulers could exist, any change to the scheduler API is technically a breaking change, which would slow Dagger's development if anyone actually depended on it. Of course, we still want to support different kinds of scheduling algorithms and optimizations that best suit a user's use case and DAG structure, so we should make the Sch scheduler user-programmable by making the current worker pressure algorithm optional and having schedule! call a user-defined callable.
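
To make the idea concrete, here is a minimal, self-contained sketch of a scheduling loop that defers placement to a user-supplied callable and falls back to a pressure-based default. It does not use Dagger's actual internals: `PendingThunk`, `least_pressure`, `round_robin`, and `schedule_all` are all hypothetical names chosen for illustration, and the "pressure" here is just a running sum of estimated costs.

```julia
# Hypothetical sketch: none of these names are Dagger APIs.

# A thunk waiting to be placed, with an estimated cost in arbitrary units.
struct PendingThunk
    id::Int
    cost::Float64
end

# Default policy (assumed stand-in for a "worker pressure" algorithm):
# pick the worker with the lowest accumulated cost. Ties go to the
# lowest worker id so that results are deterministic.
function least_pressure(thunk, pressure)
    workers = sort!(collect(keys(pressure)))
    best = workers[1]
    for w in workers
        if pressure[w] < pressure[best]
            best = w
        end
    end
    return best
end

# An alternative user-defined policy: ignore pressure and cycle through workers.
function round_robin()
    next = 0
    return function (thunk, pressure)
        workers = sort!(collect(keys(pressure)))
        next = mod(next, length(workers)) + 1
        return workers[next]
    end
end

# The scheduling loop only knows that `policy` is callable as
# policy(thunk, pressure) and returns a worker id.
function schedule_all(thunks::Vector{PendingThunk}, workers::Vector{Int};
                      policy = least_pressure)
    pressure = Dict(w => 0.0 for w in workers)
    assignment = Dict{Int,Int}()
    for t in thunks
        w = policy(t, pressure)
        haskey(pressure, w) || error("policy returned unknown worker $w")
        pressure[w] += t.cost
        assignment[t.id] = w
    end
    return assignment
end
```

With this shape, swapping algorithms is a keyword argument, for example `schedule_all(thunks, [1, 2]; policy = round_robin())`, and the loop can still validate what the callable returns.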

The option to change the internal scheduler algorithm is intended to only be used by adventurous users who understand that deadlocks/livelocks/hangs/etc. are all possible when changing the default scheduler, although we should expose "safe" semi-internal APIs that can perform common tasks correctly (which the default scheduler should also use whenever possible). Hopefully this change can spur developers and adventurous users to experiment with (and contribute) new scheduling algorithms which are better than the default scheduler for certain classes of workloads, with Dagger itself becoming the foundation for user-defined scheduling of distributed Julia code.

jpsamaroo avatar Nov 17 '20 17:11 jpsamaroo

Some proposed degrees of freedom that could benefit from user-defined implementations:

  • Stager - Chooses whether to stage input objects into Thunks, or to execute them with a special implementation (#173)
  • DAG splitter - Splits the DAG into pieces and chooses which worker to send each piece to (#165)
  • Sharder - Shards data to workers to improve performance/fault tolerance (#189)
  • DAG walker - Walks the unfinished regions of the local DAG and chooses when and where to schedule thunks. Can also choose to split the DAG further by invoking the DAG splitter.
  • Monitor - Profiles executing thunks and data transfers, as well as monitors its local node, producing data that other components can use to tune their decisions.
  • Checkpointer - Chooses whether to perform checkpointing, restore, and cleanup actions (when such actions are available). Being able to skip checkpointing could be useful if persistent storage space is tight, or when checkpointing might be too slow to be worth the effort.
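
As one illustration of how such a component might be expressed, the sketch below shows a very simple DAG walker: it repeatedly finds thunks whose dependencies have all finished and asks a user-defined callable where to place each one. The DAG representation (a `Dict` from thunk id to dependency ids) and the names `ready_thunks` and `walk` are assumptions made for this example, not Dagger's data structures, and execution is simulated by marking a thunk finished as soon as it is placed.

```julia
# Hypothetical sketch: the DAG representation and function names are
# illustrative only and are not part of Dagger.

# Return the unfinished thunk ids whose dependencies have all finished,
# in ascending id order.
function ready_thunks(deps::Dict{Int,Vector{Int}}, finished::Set{Int})
    ready = Int[]
    for id in sort!(collect(keys(deps)))
        id in finished && continue
        if all(d -> d in finished, deps[id])
            push!(ready, id)
        end
    end
    return ready
end

# Walk the unfinished part of the DAG. `place` is the user-defined piece:
# a callable mapping a thunk id to a worker id.
function walk(deps::Dict{Int,Vector{Int}}, place)
    finished = Set{Int}()
    order = Tuple{Int,Int}[]   # (thunk id, worker id) in scheduling order
    while length(finished) < length(deps)
        ready = ready_thunks(deps, finished)
        # Without this check, a cyclic or incomplete DAG would loop forever.
        isempty(ready) && error("no runnable thunk: cycle or missing dependency")
        for id in ready
            push!(order, (id, place(id)))
            push!(finished, id)   # simulated: treat the thunk as done once placed
        end
    end
    return order
end
```

For example, `walk(Dict(1 => Int[], 2 => [1], 3 => [1], 4 => [2, 3]), id -> isodd(id) ? 1 : 2)` places thunk 1 first, then 2 and 3, then 4. The explicit check for an empty ready set is the kind of guard a "safe" helper could provide so that a custom walker fails loudly instead of hanging.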

jpsamaroo avatar Jan 15 '21 21:01 jpsamaroo