chainermn

ChainerMN: Scalable distributed deep learning with Chainer

Results 13 chainermn issues

In the current code, if a user runs `pip install chainermn`, it installs Chainer 4 and forcibly uninstalls any newer Chainer (such as v5 or v6). It is...

You might know this already, but I recently tried ChainerMN on [Sakura Koukaryoku Computing](https://www.sakura.ad.jp/koukaryoku/). I measured processing throughput with the ImageNet example and compared [ChainerMN's train_imagenet.py](https://github.com/chainer/chainermn/tree/master/examples/imagenet/train_imagenet.py) to [Chainer's train_imagenet_data_parallel.py](https://github.com/chainer/chainer/blob/master/examples/imagenet/train_imagenet_data_parallel.py) ``` //...

At least we know that the communicator's `bcast_data` does not work with an FP16 model. ```diff diff --git a/tests/chainermn_tests/communicator_tests/test_communicator.py b/tests/chainermn_tests/communicator_tests/test_communicator.py index a0ff350..f03fa5d 100644 --- a/tests/chainermn_tests/communicator_tests/test_communicator.py +++ b/tests/chainermn_tests/communicator_tests/test_communicator.py @@ -242,6 +242,12 @@ def test_communicator_cpu(param):...

feature

The current `scatter_dataset` creates sub-datasets of strictly equal lengths by duplicating some examples when necessary. This ensures that epoch triggers work correctly. However, it is generally unnecessary for validator...
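
To illustrate the equal-length behavior described above, here is a minimal sketch in plain Python. It is not ChainerMN's actual implementation: the function name `scatter_indices_equal` and the wrap-around padding strategy are assumptions made only for illustration. It shows how some examples end up duplicated so that every worker receives the same number of indices.

```python
def scatter_indices_equal(n_examples, n_workers):
    """Split range(n_examples) into n_workers index lists of equal length.

    Hypothetical helper for illustration only; examples are duplicated
    (by wrapping around to the start) when n_examples is not divisible
    by n_workers.
    """
    if n_examples <= 0 or n_workers <= 0:
        raise ValueError("n_examples and n_workers must be positive")

    # Ceiling division: the length every sub-dataset must have.
    sub_len = -(-n_examples // n_workers)

    # Pad the index sequence by wrapping around, which duplicates examples.
    padded = [i % n_examples for i in range(sub_len * n_workers)]

    # Cut the padded sequence into contiguous, equally sized chunks.
    return [padded[r * sub_len:(r + 1) * sub_len] for r in range(n_workers)]


parts = scatter_indices_equal(10, 4)
for rank, indices in enumerate(parts):
    print(rank, indices)
```

With 10 examples and 4 workers, each worker gets 3 indices and two examples appear twice. For evaluation, such duplicates would be counted more than once in an averaged metric, which is presumably why equal lengths are less desirable there.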

feature

https://github.com/chainer/chainermn/blob/master/docs/source/reference/index.rst#communicators

document

ChainerMN's BatchNormalization code is mostly copied from Chainer (with several AllReduce calls added), which means that bugs in Chainer's version could also have been carried over. https://github.com/chainer/chainer/pull/4191 could be one of them; porting it to ChainerMN seems...

bug
question

This is a backlog item that came from #234 and #237.

feature

This issue is not specific to chainermn, so I was unsure where to submit it. In the [training example of ImageNet](https://github.com/chainer/chainermn/blob/master/examples/imagenet/train_imagenet.py), I cannot run the test without removing the `multiprocessing.set_start_method('forkserver')`...

This issue is the central place to discuss future plans. Any suggestions and contributions are appreciated. We only discuss relatively large tasks here, and smaller tasks are managed in...

roadmap