Dagger.jl
Distribute the scheduler!
This PR allows the scheduler to execute itself on all workers in the cluster. We first expand the notion of "thunk ID" to be per-worker, so that we can locally allocate unique IDs (and later locate where a thunk was created), and then allow the eager scheduler code to execute on all workers instead of remotecall'ing to worker 1. We then allow thunks to be registered and scheduled locally, which should make recursive runtime-generated graphs vastly more efficient, since they no longer require a round-trip over the network. Finally, we implement local (and optionally remote?) work stealing (strictly for already-scheduled tasks, for the time being) to keep work balanced across workers. The newly-available scheduler metrics on each worker will make it possible to optimize the choice of processor to steal from, although this can be left for later work.
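To make the intended usage concrete, here is a minimal sketch (assuming the existing eager `Dagger.@spawn`/`fetch` API is unchanged; `nested_sum` is just an illustrative helper): with the scheduler running on every worker, a task spawned from worker 2, including the nested tasks it generates at runtime, can be registered with that worker's local scheduler instead of remotecall'ing back to worker 1.

```julia
using Distributed
addprocs(2)
@everywhere using Dagger

# Illustrative helper: a task that generates more work at runtime.
@everywhere function nested_sum(x, y)
    inner = Dagger.@spawn x + y   # registered with the scheduler local to this worker
    return fetch(inner) * 10
end

# With the scheduler distributed, registering these thunks no longer needs
# to round-trip through worker 1.
result = remotecall_fetch(2) do
    fetch(Dagger.@spawn nested_sum(1, 2))
end
@assert result == 30
```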
Todo:
- [x] Execute the eager scheduler on all workers
- [ ] Implement local work stealing with ConcurrentCollections' `WorkStealingDeque` (a rough sketch of the intended deque semantics follows this list)
- [ ] Tests for `@spawn`/`add_thunk!` with thunks owned by other schedulers
- [ ] Document new behavior
- [ ] Validate the web dashboard shows remote scheduler data
- [ ] Benchmarks
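For the work-stealing item above, here is a lock-based stand-in that only illustrates the intended semantics: the owning worker pushes and pops already-scheduled tasks at one end, idle workers steal from the other. The real implementation would use ConcurrentCollections.jl's `WorkStealingDeque`; the names `LocalDeque`, `push_local!`, `take_local!`, and `steal!` here are hypothetical.

```julia
# Simplified, lock-based stand-in for a work-stealing deque (illustration only).
struct LocalDeque{T}
    lock::ReentrantLock
    items::Vector{T}
end
LocalDeque{T}() where {T} = LocalDeque{T}(ReentrantLock(), T[])

# Owner enqueues an already-scheduled task at the back.
push_local!(dq::LocalDeque, x) = lock(dq.lock) do
    push!(dq.items, x)
    nothing
end

# Owner takes its most recently pushed task (LIFO: good locality).
take_local!(dq::LocalDeque) = lock(dq.lock) do
    isempty(dq.items) ? nothing : pop!(dq.items)
end

# A thief (another thread/worker) steals the oldest task (FIFO end).
steal!(dq::LocalDeque) = lock(dq.lock) do
    isempty(dq.items) ? nothing : popfirst!(dq.items)
end

dq = LocalDeque{Int}()
push_local!(dq, 1); push_local!(dq, 2)
@assert take_local!(dq) == 2   # owner pops LIFO
@assert steal!(dq) == 1        # thief steals FIFO
```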
Closes https://github.com/JuliaParallel/Dagger.jl/issues/165
I'm considering lower-bounding Julia to 1.7 going forward, so that we can easily work with atomics and use packages like ConcurrentCollections.jl and AtomicArrays.jl. If we do this, we'll do a minor version bump to 0.15 and keep 0.14.x as the last series of versions supporting Julia 1.6. We'll backport bug fixes and critical performance work to that branch when the underlying issues are reproducible on Julia 1.6, and I'm also happy to backport any features that can be used safely on Julia 1.6 (assuming they don't rely on more recent changes, such as those in this PR).
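For context on why 1.7 is the cutoff: that release added first-class atomic struct fields via `@atomic`, which things like per-worker scheduler metrics could build on. A minimal sketch (the `StealCounter` type and its field are purely illustrative, not part of Dagger):

```julia
# Julia 1.7+ only: per-field atomics via the @atomic macro.
mutable struct StealCounter
    @atomic steals::Int   # hypothetical metric: number of successful steals
end

c = StealCounter(0)
@atomic c.steals += 1     # atomic read-modify-write
n = @atomic c.steals      # atomic read
@assert n == 1
```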
This is being developed on jps/dev, and will be re-posted when ready.