FluxMPI.jl

Distributed Training Examples & Scalability Benchmarks

Open avik-pal opened this issue 3 years ago • 3 comments

Currently, FluxMPI has only one example. It would be good to showcase training of more image models from Metalhead, such as ViT (https://github.com/FluxML/Metalhead.jl/pull/105) and ResNets, and to benchmark how they scale across multiple GPUs.

avik-pal • Feb 02 '22 16:02

I am not sure whether this is the right place to raise this, and it is only loosely related to this issue, but for benchmarks, could I suggest something like the mlpack benchmarks? I really like how they use valgrind for memory benchmarks and profiling, sqlite to store results, etc. A comparison against other ML libraries would give a clearer picture of Flux and of why to use it over other libraries.

dnabanita7 • Feb 02 '22 17:02

I think that might be more relevant for FluxBench. Here I mainly want to test scalability across GPUs, something like https://horovod.readthedocs.io/en/stable/benchmarks.html

avik-pal • Feb 02 '22 17:02

Can I ask for a minimal example without FastAI.jl? For example, I'd like to see how this script should be changed for distributed training: https://github.com/FluxML/model-zoo/blob/master/vision/vgg_cifar10/vgg_cifar10.jl

CarloLucibello • Mar 14 '22 16:03