Distributed training (across multiple machines)
According to https://mxnet.incubator.apache.org/faq/multi_devices.html, MXNet supports training on a distributed cluster across several machines (with multiple GPUs per machine).
I'm looking to train using this repo in a distributed setup with the following assumptions:
- Each machine runs Ubuntu and has 4 K80 GPUs (2 physical cards)
- The cluster consists of 8 or 10 such machines
Before I start, I wanted to check if anyone has tried training this repo in such a distributed setup. If so, could you share any setup requirements, settings, etc.?
If you haven't tried this repo specifically but have tried MXNet distributed training in general, could you share any "heads-up" issues, settings, or requirements needed to get training working in a distributed environment?
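For reference, MXNet's general recipe for this is to start one worker per machine with the bundled `tools/launch.py` script and pass `dist_sync` (or `dist_async`) as the kvstore type to the training script. A minimal sketch, assuming 8 hosts reachable over passwordless SSH, that the repo is synced to the same path on every machine, and that the training entry point is called `train_end2end.py` and accepts `--kv-store` and `--gpus` flags (the script name, flags, and IPs here are assumptions — check the repo's actual training script):

```shell
# Hostfile listing the machines in the cluster, one per line
# (hypothetical IPs; replace with your 8-10 machines)
cat > hosts <<EOF
192.168.1.101
192.168.1.102
EOF

# Launch 8 workers and 8 parameter servers over SSH.
# -n: number of workers, -s: number of servers, -H: hostfile.
# Each worker uses all 4 local GPUs; dist_sync aggregates
# gradients across machines every batch.
python ~/mxnet/tools/launch.py -n 8 -s 8 -H hosts --launcher ssh \
    python train_end2end.py --kv-store dist_sync --gpus 0,1,2,3
```

One common heads-up: with `dist_sync` the effective batch size scales with the number of workers, so the learning rate schedule usually needs to be adjusted accordingly.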
I'm working on this too. Have you gotten it working? Any advice? Thanks.
The MXNet team has officially reproduced several R-CNN models with SOTA results in https://github.com/dmlc/gluon-cv. I think you can migrate to GluonCV.