
multi-node support

Open astooke opened this issue 6 years ago • 4 comments

Starting a new issue in reference to this question: https://github.com/astooke/Synkhronos/issues/11#issuecomment-326628646

I have not experimented with running Synkhronos multi-node. Currently it's only built for single-node. Running multi-node would require another layer to coordinate and communicate among nodes. That certainly sounds possible, with a separate instance of the current Synkhronos running on each node. I haven't put a lot of thought into this yet, because my current research is well-suited to running single-node.
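To make the "another layer" idea a bit more concrete, here is a rough sketch in plain Python of a two-level reduction: each node first combines gradients from its own GPUs, then a hypothetical inter-node step combines the per-node results. None of this uses the Synkhronos API; the function names and the `cluster` layout are made up for illustration, and a real version would use actual collectives instead of Python lists.

```python
def reduce_within_node(gpu_grads):
    """Sum gradients across the GPUs of one node (the single-node part)."""
    return [sum(vals) for vals in zip(*gpu_grads)]

def reduce_across_nodes(node_sums):
    """Hypothetical extra layer: combine the per-node sums across nodes."""
    return [sum(vals) for vals in zip(*node_sums)]

def hierarchical_mean(cluster):
    """cluster[node][gpu] is a gradient vector; returns the global mean."""
    total_gpus = sum(len(node) for node in cluster)
    node_sums = [reduce_within_node(node) for node in cluster]
    global_sum = reduce_across_nodes(node_sums)
    return [v / total_gpus for v in global_sum]

# Toy data: 2 nodes, 2 GPUs each, gradient vectors of length 2.
cluster = [
    [[1.0, 2.0], [3.0, 4.0]],  # node 0
    [[5.0, 6.0], [7.0, 8.0]],  # node 1
]
global_mean = hierarchical_mean(cluster)
```

The point is only that the per-node step can stay as it is today, with the inter-node step added on top.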

Apparently the new version of NCCL, version 2, supports inter-node communication. I have not tried it yet (Synkhronos is currently built on version 1). Synkhronos uses NCCL through libgpuarray and pygpu, and I'm not sure what the compatibility status is through that chain.

Note that a key to scaling well to 256 GPUs in the large minibatch ResNet paper is to start communicating gradients as soon as they are computed, layer by layer, while the rest of the backpropagation is still running.
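As a toy illustration of that overlap (not the paper's implementation, and not anything Synkhronos does today), the sketch below hands each layer's gradients to a background thread for reduction and immediately moves on to the next layer. `backward_layer` and `all_reduce_mean` are made-up stand-ins for the real gradient computation and the real all-reduce.

```python
from concurrent.futures import ThreadPoolExecutor

def backward_layer(layer_idx, worker_idx):
    """Stand-in for computing one layer's gradient on one worker (toy values)."""
    return [float(layer_idx + worker_idx)] * 4

def all_reduce_mean(per_worker_grads):
    """Stand-in for an all-reduce: element-wise mean across workers."""
    n = len(per_worker_grads)
    return [sum(vals) / n for vals in zip(*per_worker_grads)]

def backprop_with_overlap(num_layers, num_workers):
    pending = {}
    # A single background thread plays the role of the communication stream.
    with ThreadPoolExecutor(max_workers=1) as comm:
        # Backprop visits layers from last to first.
        for layer in reversed(range(num_layers)):
            grads = [backward_layer(layer, w) for w in range(num_workers)]
            # Queue this layer's reduction and continue with the next layer
            # without waiting for the communication to finish.
            pending[layer] = comm.submit(all_reduce_mean, grads)
        # Block only after every layer has been queued.
        return {layer: fut.result() for layer, fut in pending.items()}

reduced = backprop_with_overlap(num_layers=3, num_workers=2)
```

In a real setup the reductions would be asynchronous collective calls on the GPU, so the last layers' gradients are already being exchanged while the earlier layers are still being differentiated.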

I'd be curious to hear if you try anything!

Have you tried any other packages or libraries for running multi-GPU, e.g. TensorFlow, PyTorch, or Chainer? If so, how does using them compare to Synkhronos?

astooke, Sep 01 '17 17:09