FluxMPI.jl
Distributed Training Examples & Scalability Benchmarks
Currently, FluxMPI has only one example. It would be good to showcase training of more image models from Metalhead -- ViT (https://github.com/FluxML/Metalhead.jl/pull/105), ResNets, etc. -- and also benchmark their scaling across multiple GPUs.
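A rough, untested sketch of how one of these examples could start: `FluxMPI.Init()` and `local_rank()` are the names from the FluxMPI README and may differ between versions, `ResNet(18)` is just a placeholder for whichever Metalhead model gets showcased, and the raw `MPI.Bcast!` loop stands in for whatever parameter-synchronisation helper (e.g. `FluxMPI.synchronize!`) the installed version provides.

```julia
# Sketch (untested): one MPI process per GPU, identical Metalhead model on every rank.
using CUDA, Flux, FluxMPI, MPI, Metalhead

FluxMPI.Init()
CUDA.device!(local_rank() % length(CUDA.devices()))   # pin this rank to a GPU

model = ResNet(18) |> gpu          # placeholder; similarly ViT, VGG, etc.
for p in Flux.params(model)
    # Start every replica from rank 0's weights. FluxMPI's synchronize! helper
    # wraps this; a raw broadcast is shown to keep the sketch version-agnostic.
    # (Bcast!/Allreduce! on CuArrays needs a CUDA-aware MPI build.)
    MPI.Bcast!(p, 0, MPI.COMM_WORLD)
end
```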
I am not sure if it's appropriate to raise the concern here, and it is only vaguely related to this issue, but for benchmarks, can I suggest something along the lines of the mlpack benchmarks? I really like how they use valgrind for memory benchmarks and profiling, SQLite to store results, etc. A comparison against other ML libraries would give a better picture of Flux and of why to use it over other libraries.
I think that might be more relevant for FluxBench. I mainly want to test scalability across GPUs, something like https://horovod.readthedocs.io/en/stable/benchmarks.html.
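Roughly what I have in mind, as an untested sketch: synthetic ImageNet-sized batches, a fixed number of synchronised steps, and images/sec reported per worker and in aggregate. `FluxMPI.Init()`, `local_rank()` and `total_workers()` are the names from the FluxMPI README and may differ between versions; the explicit `MPI.Allreduce!` loop just stands in for FluxMPI's own gradient synchronisation, and it assumes a CUDA-aware MPI build. `ResNet(50)` is a placeholder Metalhead model.

```julia
# Horovod-style synthetic-data throughput benchmark (images/sec) -- sketch only.
using CUDA, Flux, FluxMPI, MPI, Metalhead

FluxMPI.Init()
CUDA.device!(local_rank() % length(CUDA.devices()))

const batchsize = 32
model = ResNet(50) |> gpu
ps = Flux.params(model)
foreach(p -> MPI.Bcast!(p, 0, MPI.COMM_WORLD), ps)   # identical initial weights

x = CUDA.randn(Float32, 224, 224, 3, batchsize)                # synthetic batch
y = Flux.onehotbatch(rand(1:1000, batchsize), 1:1000) |> gpu   # synthetic labels
opt = Flux.ADAM(3f-4)
loss() = Flux.Losses.logitcrossentropy(model(x), y)

function step!()
    gs = gradient(loss, ps)
    for p in ps                               # average gradients across workers
        g = gs[p]
        g === nothing && continue
        MPI.Allreduce!(g, +, MPI.COMM_WORLD)  # needs CUDA-aware MPI for CuArrays
        g ./= total_workers()
    end
    Flux.Optimise.update!(opt, ps, gs)
end

step!()                                       # warm-up / compilation
nsteps = 50
t = @elapsed for _ in 1:nsteps
    step!()
end
ips = nsteps * batchsize / t
local_rank() == 0 &&
    println("~$(round(Int, ips)) img/s per worker, ~$(round(Int, ips * total_workers())) img/s aggregate")
```

Sweeping this over 1, 2, 4, ... GPUs and plotting aggregate img/s against worker count would give exactly the kind of scaling curve the Horovod page shows.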
Can I ask for a minimal example without FastAI.jl? E.g., I'd like to see how this script should be changed for distributed training: https://github.com/FluxML/model-zoo/blob/master/vision/vgg_cifar10/vgg_cifar10.jl
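To make the ask concrete, here is my guess at the required changes, based only on the FluxMPI README (untested, corrections welcome). `vgg16()`, `train_x` and `train_y` refer to the definitions already in that script (I am guessing at the exact names); `FluxMPI.Init()`, `local_rank()` and `total_workers()` come from the README and may have changed; and the manual `MPI.Allreduce!` loop stands in for whatever gradient-averaging wrapper FluxMPI provides.

```julia
# Guess at a data-parallel vgg_cifar10.jl (untested sketch, not a drop-in patch).
using CUDA, Flux, FluxMPI, MPI

FluxMPI.Init()
rank, nworkers = local_rank(), total_workers()
CUDA.device!(rank % length(CUDA.devices()))          # one process per GPU

# Build the model exactly as in the script, then make all replicas identical.
model = vgg16() |> gpu
ps = Flux.params(model)
foreach(p -> MPI.Bcast!(p, 0, MPI.COMM_WORLD), ps)

# Give each rank a disjoint shard of the training data (train_x as a WHCN array,
# train_y as onehot labels, as loaded in the script).
idx = (rank + 1):nworkers:size(train_x, 4)
train_loader = Flux.Data.DataLoader((train_x[:, :, :, idx], train_y[:, idx]);
                                    batchsize = 128, shuffle = true)

opt = Flux.ADAM(3f-4)
loss(x, y) = Flux.Losses.logitcrossentropy(model(x), y)

for (x, y) in train_loader           # one epoch; keep the script's outer epoch loop
    x, y = gpu(x), gpu(y)
    gs = gradient(() -> loss(x, y), ps)
    for p in ps                      # average gradients across ranks each step
        g = gs[p]                    # (FluxMPI's distributed optimiser wrapper
        g === nothing && continue    #  should do this part automatically)
        MPI.Allreduce!(g, +, MPI.COMM_WORLD)
        g ./= nworkers
    end
    Flux.Optimise.update!(opt, ps, gs)
end
```

Is that roughly the intended usage, or does FluxMPI expect the optimiser/parameter synchronisation to be set up differently?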