Dagger.jl icon indicating copy to clipboard operation
Dagger.jl copied to clipboard

Distribute the scheduler

Open jpsamaroo opened this issue 5 years ago • 0 comments

We're already pushing some extra work onto the worker nodes (mainly argument fetching and processor selection/load balancing), and it would be beneficial for large DAGs to move more work onto each worker. The main blocker is providing a way to split the DAG into multiple domains, where each domain is handled by a given thread on a given worker. With efficient Thunk serialization, we can then send a subgraph to each worker and let them process their own DAG without conflicts. We'll need to add a mechanism by which thunks automatically wait on their input thunks to complete before they attempt to download the output data; if possible, we can also have workers broadcast and shard their Chunks onto dependent workers as soon as the data is made available.

jpsamaroo avatar Nov 17 '20 17:11 jpsamaroo