Distributed training (across multiple machines)
According to https://mxnet.incubator.apache.org/faq/multi_devices.html, MXNet supports training on a distributed cluster across several machines (with multiple GPUs per machine).
I'm looking to train using this repo in a distributed setup with the following assumptions:
- Each machine runs Ubuntu and has 4 K80 GPUs (2 physical cards)
- The cluster consists of 8 or 10 such machines
Before I start, I wanted to check if anyone has tried training this repo in such a distributed setup. If so, could you share any setup requirements, settings, etc.?
If you haven't tried this repo specifically but have tried MXNet distributed training in general, could you share any "heads-up" issues, settings, or requirements needed to get training working in a distributed environment?
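For reference, MXNet's general recipe for this is to start one worker per machine with the bundled `tools/launch.py` script and pass `dist_sync` (or `dist_async`) as the kvstore type to the training script. A minimal sketch, assuming 8 hosts reachable over passwordless SSH, that the repo is synced to the same path on every machine, and that the training entry point is called `train_end2end.py` and accepts `--kv-store` and `--gpus` flags (the script name, flags, and IPs here are assumptions — check the repo's actual training script):

```shell
# Hostfile listing the machines in the cluster, one per line
# (hypothetical IPs; replace with your 8-10 machines)
cat > hosts <<EOF
192.168.1.101
192.168.1.102
EOF

# Launch 8 workers and 8 parameter servers over SSH.
# -n: number of workers, -s: number of servers, -H: hostfile.
# Each worker uses all 4 local GPUs; dist_sync aggregates
# gradients across machines every batch.
python ~/mxnet/tools/launch.py -n 8 -s 8 -H hosts --launcher ssh \
    python train_end2end.py --kv-store dist_sync --gpus 0,1,2,3
```

One common heads-up: with `dist_sync` the effective batch size scales with the number of workers, so the learning rate schedule usually needs to be adjusted accordingly.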
I'm working on this too. Have you gotten it working? Any advice? Thanks.
The MXNet team has officially reproduced several R-CNN models with SOTA results in https://github.com/dmlc/gluon-cv. I think you can migrate to GluonCV.