
Allow the scheduler to dynamically add/remove workers

Open jpsamaroo opened this issue 5 years ago • 8 comments

As discussed in #147, it may benefit certain use cases to know when a worker is entirely unused by Dagger (specifically, when no data is cached on the worker) so that the worker can be removed from the Distributed pool.

jpsamaroo · Oct 05 '20

Expanding on this, it would be great if the scheduler could dynamically add new workers via Distributed whenever it believes that having extra workers would help decrease total runtime of the currently-loaded DAG. The scheduler would call a user-defined function to add workers, which could call into a custom ClusterManager. We would want to be able to specify what kinds of nodes are available (what kinds of processors and how many per node) so that, for example, GPU-only tasks would always have GPUs available. This would be interesting for interactive uses on HPC clusters or for accessing cloud platforms.
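For concreteness, here is a rough sketch of what such a user-defined callback might look like. The name request_workers and its arguments are made up for illustration (no such hook exists in Dagger today), and the plain addprocs call stands in for whatever a custom ClusterManager would do:

```julia
using Distributed

# Hypothetical callback the scheduler could invoke when it estimates that
# extra workers would shorten the runtime of the currently-loaded DAG.
function request_workers(count::Integer; gpus_per_node::Integer=0)
    # A real deployment could dispatch to a custom ClusterManager here
    # (SLURM, a cloud autoscaler, ...) and request nodes with the right
    # processors attached (`gpus_per_node` is unused in this local stand-in).
    new_ids = addprocs(count; exeflags="--project=$(Base.active_project())")
    return new_ids
end
```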

@DrChainsaw I see from Discourse that this is probably something you'd be interested in.

jpsamaroo · Jun 21 '21

This is something that would certainly come in handy for me! Let me know if you want me to test something out.

I do fear that I might have planted some kind of seed of chaos with #147, though. The day after #147 was merged there was a discussion in ClusterManagers, and it seems like Distributed.jl was not designed for this type of dynamic usage (I felt a bit like the intern who just pushed Integration Test Email #1 into production).

Or perhaps your proposed method will be more Distributed-friendly?

DrChainsaw · Jun 21 '21

I think @vchuravy was pointing out that because Distributed was originally designed for HPC clusters where startup is all at once, not all cluster managers will handle this well, and that's to be expected. But that doesn't preclude Distributed from handling this properly for cluster managers that do support dynamic worker changes (such as the LocalManager and probably SSHManager). I don't see why Dagger shouldn't be able to rely on Distributed to support this in at least some cases.
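As a quick sanity check of the kind of dynamic membership this relies on, the built-in LocalManager already allows adding and removing workers mid-session; a minimal sketch:

```julia
using Distributed

ids = addprocs(2)   # spin up two local workers mid-session
# ... run Dagger work on them ...
rmprocs(ids)        # remove them again once no Dagger data is cached there
```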

jpsamaroo · Jun 21 '21

I don't see why Dagger shouldn't be able to rely on Distributed to support this in at least some cases.

Alright, just wanted to point it out.

Oh, and in case the above was a polite request for a contribution, I'd be happy to help, but I feel a bit insecure about how to implement "it believes that having extra workers would help decrease total runtime".

Is there a straightforward way to do this? I suppose one could just trigger when the scheduler hits the hook with scheduled tasks and no workers, or even just outsource everything to the user (e.g. here is the current state, do whatever you want).

DrChainsaw · Jun 22 '21

Oh, and in case the above was a polite request for a contribution I'd be happy to help

Not necessarily; I'm happy to do it as well (and the logic for starting/stopping workers is pretty trivial, since you already added the handling for that in the scheduler).

but I feel a bit insecure about how to implement "it believes that having extra workers would help decrease total runtime". Is there a straightforward way to do this? I suppose one could just trigger when the scheduler hits the hook with scheduled tasks and no workers, or even just outsource everything to the user (e.g. here is the current state, do whatever you want).

Yeah, that's the key thing to be determined. This is one of those features where it's probably best to let the user define when this should happen, but we could provide some default code for this (say, trigger when it's been X seconds without any scheduling progress, or if the estimated time to DAG completion is greater than X minutes).
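A minimal sketch of what such a default could look like, assuming a hypothetical periodic hook into the scheduler; none of these names exist in Dagger today:

```julia
# Hypothetical default trigger: ask for one more worker if the scheduler has
# made no progress for STALL_TIMEOUT seconds while tasks are still pending.
const STALL_TIMEOUT = 30.0  # seconds

function maybe_scale_up(last_progress_time::Float64, pending_tasks::Int,
                        add_workers::Function)
    stalled = (time() - last_progress_time) > STALL_TIMEOUT
    if stalled && pending_tasks > 0
        add_workers(1)  # delegate the actual spawning to the user's callback
    end
end
```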

jpsamaroo · Jun 22 '21

This is one of those features where it's probably best to let the user define when this should happen, but we could provide some default code for this

Sounds like a reasonable approach to me. Don't hesitate to ping if there is anything added in #147 which is confusing or if there is something to try out.

DrChainsaw · Jun 23 '21

What is the story around initial loading of code on newly spun-up workers? Do you pass in a quote with all your using Package commands to be evaled in the worker's Main?

kolia · Aug 05 '21

Generally I use @everywhere using Package1, Package2, ..., which works fine. Distributed's code-loading story isn't great right now, but it's what we've got.
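One wrinkle with dynamically added workers: @everywhere only targets the workers that exist when it runs, so it needs to be re-run (or scoped to the new ids) after addprocs. A small sketch, using LinearAlgebra as a placeholder package:

```julia
using Distributed

new_ids = addprocs(2)
# `@everywhere` accepts a collection of worker ids as its first argument,
# so code can be loaded on just the newly added workers.
@everywhere new_ids using LinearAlgebra
```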

jpsamaroo · Aug 05 '21