Deformable-ConvNets

Distributed training (across multiple machines)

Open arunbuduri opened this issue 7 years ago • 2 comments

According to https://mxnet.incubator.apache.org/faq/multi_devices.html, MXNet supports training on a distributed cluster across several machines (with multiple GPUs per machine).

I'd like to train with this repo in a distributed setup, under the following assumptions:

  1. Each machine (with Ubuntu) has 4 K80 GPUs (2 physical cards)
  2. The cluster consists of 8 or 10 such machines

Before I start, I wanted to check whether anyone has tried training this repo in such a distributed setup. If so, could you share any setup requirements, settings, etc.?

If you haven't tried this repo specifically but have tried MXNet distributed training in general, could you share any "heads-up" issues, settings, requirements, etc. needed to get training working in a distributed environment?
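
For reference, here is a minimal sketch of how a launch across such a cluster might be assembled with MXNet's `tools/launch.py` helper. The host names, the `hosts.txt` file, the `train.py` script name, and the `--kv-store dist_sync` argument are all assumptions for illustration; this repo's own training entry points may take different arguments.

```python
import shlex


def build_launch_command(hosts, train_cmd, launcher="ssh", hostfile="hosts.txt"):
    """Build an argv list for MXNet's tools/launch.py.

    Assumes one worker and one parameter server per host; `hostfile` is a
    hypothetical file listing one host name per line.
    """
    n = len(hosts)
    return [
        "python", "tools/launch.py",
        "-n", str(n),            # number of worker processes
        "-s", str(n),            # number of server processes
        "-H", hostfile,          # file with the cluster host names
        "--launcher", launcher,  # how remote processes are started
    ] + shlex.split(train_cmd)


# Hypothetical 8-machine cluster; each machine would use its 4 local GPUs.
hosts = [f"node{i}" for i in range(8)]
cmd = build_launch_command(hosts, "python train.py --kv-store dist_sync")
print(" ".join(cmd))
```

With 8 machines of 4 GPUs each, this would span 32 GPUs in total; whether the training script actually accepts a distributed kvstore setting has to be checked against the repo's config.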

arunbuduri avatar Apr 20 '18 01:04 arunbuduri

I'm working on this too. Have you got it working? Any advice? Thanks.

YoWhatever avatar Sep 01 '18 07:09 YoWhatever

The official MXNet team has reproduced several RCNN models with SOTA results in https://github.com/dmlc/gluon-cv. I think you can migrate to Gluon-CV.

chinakook avatar Sep 01 '18 11:09 chinakook