ChainerMN's ImageNet example is slower than Chainer's data parallel
You might already know this: I recently tried ChainerMN on Sakura Koukaryoku Computing.
I measured training throughput with the ImageNet example, comparing ChainerMN's train_imagenet.py against Chainer's train_imagenet_data_parallel.py:
# Chainer
$ python train_imagenet_data_parallel.py /opt/traindata/ILSVRC2012/train.ssv /opt/traindata/ILSVRC2012/val.ssv -a resnet50
# ChainerMN
$ mpiexec -n 4 python train_imagenet.py /opt/traindata/ILSVRC2012/train.ssv /opt/traindata/ILSVRC2012/val.ssv -a resnet50
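To compare the two runs on equal footing, it helps to convert iteration timings into an effective images/sec figure. Below is a minimal sketch of that calculation; the function name and the numbers in the example are hypothetical, not measured values from either framework.

```python
# Hypothetical helper for comparing runs: effective images processed
# per second across all workers (assumes each worker uses the same
# per-worker batch size and iterations are synchronized).
def throughput(batch_size, n_workers, seconds_per_iteration):
    """Images/sec aggregated over all workers."""
    return batch_size * n_workers / seconds_per_iteration

# Illustrative example: batch size 32 per worker, 4 MPI processes,
# 0.5 s per iteration -> 256.0 images/sec
print(throughput(32, 4, 0.5))
```

For a data-parallel Chainer run on one machine, `n_workers` would be the number of GPUs; for the ChainerMN run above, it would be the number of MPI processes (4 in this case).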
Other detailed environment settings are written in my blog post (sorry, it is in Japanese).
The result showed that ChainerMN was slower than Chainer.
What happened, and how can I improve ChainerMN's performance?
Please ask if you have any questions, and let me know if you want the same ImageNet images to reproduce this problem.
Thank you for reporting this! I personally don't think this is generally the case. For example, in our recent experiments (https://arxiv.org/abs/1711.04325), our throughput on ResNet50 with ChainerMN was close to state-of-the-art in comparison with other efficient frameworks such as Caffe2. I assume that your result is due to the environment or configuration. Anyway, @shu65 will investigate it soon.
@iwiwi Has this problem been solved?