ChainerMN's ImageNet example is slower than Chainer's data parallel
You might already know this: I recently tried ChainerMN on Sakura Koukaryoku Computing.
I measured training throughput with the ImageNet example, comparing ChainerMN's train_imagenet.py against Chainer's train_imagenet_data_parallel.py:
# Chainer
$ python train_imagenet_data_parallel.py /opt/traindata/ILSVRC2012/train.ssv /opt/traindata/ILSVRC2012/val.ssv -a resnet50
# ChainerMN
$ mpiexec -n 4 python train_imagenet.py /opt/traindata/ILSVRC2012/train.ssv /opt/traindata/ILSVRC2012/val.ssv -a resnet50
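To compare the two runs on equal footing, it helps to convert iteration timings into an effective images/sec figure. Below is a minimal sketch of that calculation; the function name and the numbers in the example are hypothetical, not measured values from either framework.

```python
# Hypothetical helper for comparing runs: effective images processed
# per second across all workers (assumes each worker uses the same
# per-worker batch size and iterations are synchronized).
def throughput(batch_size, n_workers, seconds_per_iteration):
    """Images/sec aggregated over all workers."""
    return batch_size * n_workers / seconds_per_iteration

# Illustrative example: batch size 32 per worker, 4 MPI processes,
# 0.5 s per iteration -> 256.0 images/sec
print(throughput(32, 4, 0.5))
```

For a data-parallel Chainer run on one machine, `n_workers` would be the number of GPUs; for the ChainerMN run above, it would be the number of MPI processes (4 in this case).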
Other detailed environment settings are written in my blog post (sorry, it is in Japanese).
The result showed that ChainerMN was slower than Chainer.
What happened, and how can I improve ChainerMN's performance?
Please ask if you have any questions, and let me know if you want the same ImageNet images to reproduce this problem.
Thank you for reporting this! I personally don't think this is generally the case. For example, in our recent experiments (https://arxiv.org/abs/1711.04325), our throughput on ResNet50 with ChainerMN was close to state-of-the-art in comparison with other efficient frameworks such as Caffe2. I assume that your result is due to the environment or configuration. Anyway, @shu65 will investigate it soon.
@iwiwi Has this problem been solved?