Flux.jl
data parallel distributed training
I'm a PyTorch and MXNet user, and Flux looks promising to me. I have 8 GPUs on the server and I want to train my model faster. Unfortunately, I see no documentation about parallel training on multiple GPUs. Is it possible to copy the model, initialize the copies on multiple GPUs, and split the input data between them?
I found PR #154, which suggests that there may be difficulties with deep copying. Has there been any progress since?
I don't think it's too difficult to do if you're willing to get your hands dirty; you'll want to look at NCCL, make a copy of the model for each GPU, and run a training loop over all of them. If you get anywhere, it'd be great to have additions to the docs etc.
Unfortunately we don't have anything out of the box yet, though.
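For reference, a very rough single-process sketch of that idea (no NCCL, so gradient averaging is staged through the CPU, and the replicas run sequentially rather than concurrently) could look like the following. `make_model` and the per-GPU slices `xs`/`ys` are placeholders, not anything Flux ships:

```julia
# One model replica per GPU; each replica sees its own slice of the batch.
using Flux, CUDA
using Functors: fmap

model_cpu = make_model()                  # hypothetical constructor for the model
devs = collect(CUDA.devices())
replicas = map(devs) do d
    CUDA.device!(d)                       # allocate this copy on device d
    gpu(deepcopy(model_cpu))
end

# Inside the training loop, for a batch split into xs[i], ys[i] per GPU:
grads = map(devs, replicas, xs, ys) do d, m, x, y
    CUDA.device!(d)
    xg, yg = gpu(x), gpu(y)
    g, = Flux.gradient(mm -> sum(abs2, mm(xg) .- yg), m)
    cpu(g)                                # bring the gradients back to the host for averaging
end
avg = fmap((gs...) -> gs[1] === nothing ? nothing : sum(gs) ./ length(grads), grads...)
# ...apply `avg` to a CPU master copy, then re-upload the updated weights to every replica.
```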
One can now choose from https://github.com/avik-pal/FluxMPI.jl, https://github.com/FluxML/DaggerFlux.jl/, and https://github.com/DhairyaLGandhi/ResNetImageNet.jl. The general consensus seems to be that this is not something that should be handled in core (at least for now), so further discussion should happen outside of the issue tracker.
I'd like to reopen this, since all the packages mentioned are very experimental and not ready for prime time, and since I think that a distributed training solution should be a concern of Flux.jl. Maybe the solution will be implemented in a separate package, but I think we should at the very least document and give some examples of the "recommended" way to do data-parallel distributed training. We can discuss here what that recommended way should be.
@tkf can we have an easy win with transducers here?
@avik-pal I like the simplicity of https://github.com/avik-pal/FluxMPI.jl interface. What is the package currently missing?
A (minor?) annoyance is having to run a training script with mpiexecjl -n <np> julia --project=. <filename>.jl; I would prefer to be able to run it in a standard julia session.
For one, we still lack a CUDA-aware MPI JLL: https://github.com/JuliaPackaging/Yggdrasil/issues/2063. I agree with you about the niceness of the interface though; I think it's also the only solution that handles efficient multi-GPU reductions right now.
A (minor?) annoyance is having to run a training script with mpiexecjl -n <np> julia --project=. <filename>.jl; I would prefer to be able to run it in a standard julia session.
If we want to use MPI, we need to start the julia session like that. The alternative would be to use Distributed.jl for performing the reduction (this would allow us to use a standard julia session)
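Something like this rough sketch, for the record (the loss, `model`, and `shards` here are placeholders, not FluxMPI API): each worker computes gradients on its shard via pmap, and the main process averages them.

```julia
using Distributed
addprocs(4)                        # regular julia session, no mpiexecjl needed
@everywhere using Flux
using Functors: fmap

# `model` is a function argument (a local), so pmap serializes it to each worker
# along with the closure.
function distributed_gradients(model, shards)
    pmap(shards) do (x, y)
        g, = Flux.gradient(m -> sum(abs2, m(x) .- y), model)
        g
    end
end

grads = distributed_gradients(model, shards)   # shards = one (x, y) pair per worker
avg = fmap((gs...) -> gs[1] === nothing ? nothing : sum(gs) ./ length(grads), grads...)
```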
The alternative would be to use Distributed.jl for performing the reduction (this would allow us to use a standard julia session)
Ok, but we can think about that later; it's not too important. Anything important to iron out before advertising it and working a bit on the docs/examples?
Nothing particular that I am working on at the moment. I have already tested it out for a paper where we scale training on 6 processes.
I think Dagger.jl would be better than Distributed. It seems more robust to things like worker failure and different topologies, and has smarter scheduling. Any thoughts, @jpsamaroo?
For one we still lack a CUDA-aware MPI JLL
Just to put it out there: in case the compiled MPI is not CUDA-aware, FluxMPI will do the reductions on the CPU. Though yeah, it would be great to have a compiled JLL.
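For what it's worth, that fallback can be sketched with plain MPI.jl and CUDA.jl (this is not FluxMPI's actual implementation, just the idea): stage the buffer through host memory whenever the MPI build isn't CUDA-aware.

```julia
using MPI, CUDA

MPI.Init()
const comm = MPI.COMM_WORLD

# Allreduce a device buffer, staging through the CPU if MPI is not CUDA-aware.
function allreduce_device!(buf::CuArray, op)
    if MPI.has_cuda()                  # CUDA-aware build: hand MPI the device buffer directly
        MPI.Allreduce!(buf, op, comm)
    else
        host = Array(buf)              # device -> host
        MPI.Allreduce!(host, op, comm) # reduce on the CPU
        copyto!(buf, host)             # host -> device
    end
    return buf
end
```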
Parallel/distributing training is a large design space, and MPI covers a big portion of it, but also does so in a rigid manner that makes interactive development more cumbersome, and has certain assumptions about how programs will be written, which makes it difficult to do anything that's inherently task-parallel (which is often desired for logging, visualization, or integration with other Julia libraries which use multitasking).
I'm all for pushing FluxMPI.jl as one possible means to parallelize training (and the currently recommended approach), but I wouldn't advertise it (MPI) as the best way to do so generally, and would point out that there are other potential alternatives in incubation.
Agreed on all points, but with the current state of things FluxMPI is by far the closest to a Torch DDP equivalent when it comes to performance and ease of use. Therefore I think it's the best, most mature option at present for filling the DDP-sized hole we've observed in the ecosystem for the past few years, one which I know has deterred at least a few people from using Julia ML altogether.
Agree with Brian, not having distributed training is a huge pain point. Unless there is something else that could be made available to the layman in the very short term, I think we should focus on developing and documenting FluxMPI. I guess creating a CUDA-aware MPI artifact is a fundamental step we have to take.
Ok, this is fair, we do want to ensure we can scale training by any means necessary. I guess the questions I would then pose are: Are we willing to support FluxMPI for the foreseeable future, even after we have an alternative competing solution (because users will have built infrastructure around it)? Will it impose a large burden on the Flux maintainers to keep it working, even as Flux and the ecosystem evolve? And do the other maintainers (other than @avik-pal) know how to use MPI, are they willing to learn how FluxMPI works at a deep level, and are they willing to dive into the code if it needs maintenance?
DDP is a key feature, which is great to have, but that also means that if it doesn't work amazingly well for someone's supported use case, then the maintainers now have to fix that. I'm not trying to steer anyone away from this approach, but I do want to make sure the maintainers (not just @avik-pal) are willing to deal with long-term maintenance. The same question should also be posed when considering advertising https://github.com/DhairyaLGandhi/ResNetImageNet.jl/ or any other approach, of course.
And do the other maintainers (other than @avik-pal) know how to use MPI, are they willing to learn how FluxMPI works at a deep level, and are they willing to dive into the code if it needs maintenance?
For better or worse, Flux maintainers will have to learn all of these at some point if we want to support any kind of data parallel training.
The meat of FluxMPI is https://github.com/avik-pal/FluxMPI.jl/blob/main/src/mpi_extensions.jl. Really what we need for DDP is a way to use these collective primitives. Given that MPI.jl is the only game in town for this at present (unless we find a way to clone the JuliaGPU membership :wink:), the idea would be to put some more time into both dev + marketing for FluxMPI. I personally don't think this is a case of "ride or die" when it comes to support either. Data parallel training of DNNs in Julia is a niche within a niche within a niche, and it's more likely right now that we'll attract users who know what they're doing rather than who are completely new to ML.
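To make that concrete, the per-step collective we care about is essentially one gradient allreduce between the backward pass and the optimizer update. A hedged sketch using plain MPI.jl (not FluxMPI's actual API; `sync_gradients!` is a made-up name):

```julia
using MPI
using Functors: fmap

MPI.Init()
const comm = MPI.COMM_WORLD
const nworkers = MPI.Comm_size(comm)

# Average Zygote-style gradient trees across ranks, in place.
function sync_gradients!(grads)
    fmap(grads) do g
        g isa AbstractArray || return g   # skip `nothing` and scalar leaves
        MPI.Allreduce!(g, +, comm)        # sum across ranks (GPU arrays need CUDA-aware MPI)
        g ./= nworkers                    # turn the sum into a mean
        g
    end
    return grads
end
```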
Data parallel training of DNNs in Julia is a niche within a niche within a niche, and it's more likely right now that we'll attract users who know what they're doing rather than who are completely new to ML.
If you put the "Julia" part of this aside (because this point is only relevant for Julia users anyway), data-parallel training of DNNs is basically a requirement for doing anything realistic with DNNs (because of their training cost). DNNs are also hardly a small niche, comprising (anecdotally) a large proportion of ML funding and research. I suspect we will, in fact, attract many users who are relatively new to ML, but still want/need to do DDP DNN training.
Nitpicking aside, I'm just pointing out that the already-busy Flux maintainers will now have yet another very large project to advertise, monitor, and maintain. So what's the solution, if we want to push FluxMPI? Well, to start with, FluxMPI does not appear to have any tests, and there are only 2 examples (one for Flux, one for FastAI). Maybe those should be addressed before we start advertising FluxMPI?
data-parallel training of DNNs is basically a requirement for doing anything realistic with DNNs (because of their training cost).
This can be argued both ways. My anecdotal experience with both applied research (i.e. not about ML, but using ML) and industry is that multi-GPU (even on the same machine) shows up very, very infrequently. In fact I can't think of a single instance where someone tried distributed training for a project. Obviously it is happening quite frequently in absolute terms and is quite important, but I don't see it being close to a majority of all DL model usage because the vast majority of users just don't have access to the quantity of data and/or compute required to do it.
I suspect we will, in fact, attract many users who are relatively new to ML, but still want/need to do DDP DNN training.
I don't disagree, but my experience seeing people new to ML try to navigate this is that we would need to provide something turnkey. Like gpus=2 turnkey. And that is a far more ambitious goal than trying to get something out the door so that our response to "how do I do distributed training at ~PyTorch/TF speeds" isn't :shrug:.
Nitpicking aside, I'm just pointing out that the already-busy Flux maintainers will now have yet another very large project to advertise, monitor, and maintain. So what's the solution, if we want to push FluxMPI? Well, to start with, FluxMPI does not appear to have any tests, and there are only 2 examples (one for Flux, one for FastAI). Maybe those should be addressed before we start advertising FluxMPI?
No worries, this discussion has been long overdue! About bringing FluxMPI up to spec, yes absolutely. Honestly I'm fine with advertising any of the three current distributed training libraries if they can reach this threshold, but as you noted there's the perennial problem of finding time to get them there. This whole discussion started up again because we wanted to brainstorm ways out of that impasse, so any ideas on this would be very much appreciated.
Not sure if it's desirable to throw in yet another yak shave, but my impression is that Ray's actor abstraction is a very good building block for this. I suspect (and hope) that Dagger's mutability support (and the implied more-or-less pinned tasks) can have equivalent expressive power. Though I guess we can safely say it's not ready until we have AllReduce on top of Dagger + GPU. @jpsamaroo, do we have AllReduce on Dagger?
@tkf we're planning to try out Dagger on https://github.com/DhairyaLGandhi/ResNetImageNet.jl/ once we've gotten the semantics and synchronization right. I haven't implemented AllReduce, but anyone is welcome to try. There shouldn't be anything about Dagger's API that prevents implementing an efficient AllReduce.
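For anyone who wants to take a stab at it, a naive (definitely not bandwidth-optimal) version can be built from Dagger.@spawn as a tree reduction whose single result every participant fetches. A sketch, with `naive_allreduce` being just a hypothetical name:

```julia
using Dagger

# Pairwise tree reduction over Dagger tasks, then every "rank" fetches the same result.
function naive_allreduce(op, xs::Vector)
    tasks = Any[(Dagger.@spawn identity(x)) for x in xs]        # lift inputs into tasks
    while length(tasks) > 1
        next = Any[(Dagger.@spawn op(tasks[i], tasks[i + 1])) for i in 1:2:length(tasks) - 1]
        isodd(length(tasks)) && push!(next, tasks[end])         # carry an unpaired task forward
        tasks = next
    end
    return [fetch(tasks[1]) for _ in xs]                        # one copy of the reduction per input
end

naive_allreduce(+, [rand(4) for _ in 1:8])   # eight gradient-like chunks, summed elementwise
```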