
resnet 50

Open olimastro opened this issue 7 years ago • 15 comments

Regarding issue https://github.com/mila-udem/platoon/issues/94: for now I have only run a trial on synthetic data. I am reporting the time for one epoch of training, which is 1000 data points:

1 GPU, batch size 80 (all that could fit on the DGX): 19.715 s
2 GPUs, batch size 40: 13.452 s
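
For reference, a quick back-of-the-envelope conversion of those epoch times into images/sec and a speedup factor, using the 1000-image epoch mentioned above (numbers taken from this comment, nothing new measured):

```python
# Throughput and speedup from the reported epoch times (1000 images per epoch).
epoch_images = 1000
one_gpu_s, two_gpu_s = 19.715, 13.452

print("1 GPU: %.1f images/s" % (epoch_images / one_gpu_s))   # ~50.7
print("2 GPU: %.1f images/s" % (epoch_images / two_gpu_s))   # ~74.3
print("speedup: %.2fx" % (one_gpu_s / two_gpu_s))            # ~1.47x
```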

olimastro avatar Jun 07 '17 05:06 olimastro

When you say timing after one computation pass, do you want me to take into account the overhead of calling ASGD()? If yes, I can just make one epoch contain the same amount of data as the batch size and reuse the same code; if not, I can simply time the difference between a single call with batch size 80 and a single call with batch size 40.

olimastro avatar Jun 07 '17 19:06 olimastro

By the way, I should mention that I used THEANO_FLAGS=device=cuda,floatX=float32,dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once,gpuarray.preallocate=0.95 and I updated Theano on the 5th of June.
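
(Side note, a sketch only: the same flags can also be set from inside Python, as long as it happens before theano is imported, since THEANO_FLAGS is read at import time.)

```python
import os

# THEANO_FLAGS is read when theano is first imported, so set it before the import.
os.environ['THEANO_FLAGS'] = (
    'device=cuda,floatX=float32,'
    'dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,'
    'dnn.conv.algo_bwd_data=time_once,gpuarray.preallocate=0.95'
)

import theano  # the flags above are picked up here
```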

olimastro avatar Jun 07 '17 19:06 olimastro

Remove the dnn.conv.algo_* flags from the timing. Since they benchmark the algorithms during the first call, they could bias the timing. We don't care if we don't have the fastest configuration, but we should not bias the measurement.

For the timing, we need to break down the overhead vs. the compute time vs. the ASGD step.
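
A minimal sketch of how that breakdown could be collected; compute_grads, sync_params, and the minibatch iterator are placeholders for illustration, not Platoon's actual API:

```python
import time

# Hypothetical stand-ins for the real pieces:
# compute_grads = forward/backward pass only, sync_params = the ASGD /
# all-reduce step, minibatches = the synthetic-data iterator.
def compute_grads(x, y):
    time.sleep(0.001)

def sync_params():
    time.sleep(0.0005)

minibatches = [(None, None)] * 25   # e.g. 1000 images / batch size 40

compute_time = sync_time = 0.0
t_epoch = time.time()
for x, y in minibatches:
    t0 = time.time()
    compute_grads(x, y)
    compute_time += time.time() - t0

    t0 = time.time()
    sync_params()
    sync_time += time.time() - t0

total = time.time() - t_epoch
other = total - compute_time - sync_time
print("compute %.3fs  asgd %.3fs  other overhead %.3fs"
      % (compute_time, sync_time, other))
```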

We had in mind to use a synchronous update. Mostly, it just splits the minibatch across the GPUs and syncs the gradients. So you should also change that.
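
For illustration only, a toy NumPy version of that synchronous scheme on a linear model (split the minibatch across two workers, sum the per-worker gradients as an all-reduce would, apply one identical update); this is not Platoon code:

```python
import numpy as np

rng = np.random.RandomState(0)
w = np.zeros(10)                             # shared model parameters
X, y = rng.randn(80, 10), rng.randn(80)      # one minibatch of 80 examples
lr = 0.01

shards = np.array_split(np.arange(80), 2)    # 40 examples per "GPU"
grads = []
for idx in shards:
    err = X[idx].dot(w) - y[idx]             # per-worker forward pass
    grads.append(X[idx].T.dot(err))          # per-worker gradient

total_grad = sum(grads)                      # the all-reduce (sum) step
w -= lr * total_grad / len(X)                # same update applied on every replica
```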

nouiz avatar Jun 07 '17 19:06 nouiz

The overall metric we are competing for is images / sec. While it is useful to time the different parts to see where we can improve things, this is the final metric we should report.

Regarding cuDNN options, again, if we want to correctly identify bottlenecks, I think we should use time_once. Just do one dummy call before starting training (and profiling) so that the right algorithm gets selected.
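
Something like the following, where train_fn and get_batch are hypothetical stand-ins for the compiled training function and the data loader:

```python
import time
import numpy as np

# Hypothetical stand-ins: train_fn = compiled Theano training function,
# get_batch = one minibatch of synthetic data.
def train_fn(x, y):
    return float((x ** 2).sum())

def get_batch(i, batch_size=80):
    rng = np.random.RandomState(i)
    return (rng.randn(batch_size, 3, 224, 224).astype('float32'),
            rng.randint(0, 1000, size=batch_size).astype('int32'))

# Warm-up: with dnn.conv.algo_*=time_once the first call pays the cuDNN
# algorithm-selection cost, so keep it outside the timed region.
x0, y0 = get_batch(0)
train_fn(x0, y0)

t0 = time.time()
for i in range(1, 13):                        # ~1000 images / batch size 80
    x, y = get_batch(i)
    train_fn(x, y)
print("epoch time: %.3fs" % (time.time() - t0))
```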

When we want to check for correctness, it is OK to split a batch between GPUs, and check if the synchronous update is consistent with what happens on only 1 GPU. However, for the final result, we want to have the full batch size on all GPUs, and report that.

lamblin avatar Jun 07 '17 22:06 lamblin

Can you please post the exact command lines you have used? I keep getting OOM errors on startup if I preallocate 0.95 .. 0.5. If I do not preallocate, the errors are:

WARNING! Failed to register in a local GPU comm world. Reason: 'utf8' codec can't decode byte 0xa5 in position 2: invalid start byte
WARNING! Platoon all_reduce interface will not be functional.
Traceback (most recent call last):
  File "resnet_worker.py", line 566, in <module>
    train_resnet()
  File "resnet_worker.py", line 475, in train_resnet
    asgd.make_rule(params)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/training/global_dynamics.py", line 161, in make_rule
    gup = AllReduceSum(update, inplace=True)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/ops.py", line 155, in AllReduceSum
    return AllReduce(theano.scalar.add, inplace, worker)(src, dest)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/ops.py", line 61, in __init__
    self._f16_ok = not self.worker._multinode
AttributeError: 'Worker' object has no attribute '_multinode'

borisfom avatar Jun 27 '17 01:06 borisfom

Hi @borisfom, it seems to be an installation error. You could try the solution under this issue to see if it works.

cshanbo avatar Jun 27 '17 07:06 cshanbo

@cshanbo: I have recompiled and reinstalled; still the same issue. Could it be an NCCL 2.0 incompatibility? How exactly did you run the benchmark?

borisfom avatar Jun 28 '17 00:06 borisfom

Hi @borisfom, I don't think it's caused by an NCCL incompatibility. What I did was follow the issue I mentioned above.

You could attach your running script, installation information, etc., so that I might be able to help. Just to make sure, did you set your environment variables, such as $PATH and $LD_LIBRARY_PATH?

cshanbo avatar Jun 28 '17 01:06 cshanbo

The command I use is THEANO_FLAGS=device=cpu python resnet_controller.py --single resnet /PATH/TO/OUT/FILES

olimastro avatar Jun 28 '17 02:06 olimastro

@cshanbo: What exactly should be added to PATH/LD_LIBRARY_PATH? What happens is that this call returns garbage:

self._local_id = gpucoll.GpuCommCliqueId(context=self.gpuctx)

and then the utf8 decode fails on it:

response = self.send_req("platoon-get_platoon_info", info={'device': self.device, 'local_id': self._local_id.comm_id.decode('utf-8')})
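
That fits the error in the traceback: the clique id is an opaque byte string, and a byte like 0xa5 is simply not valid UTF-8. A quick illustration in plain Python (nothing Platoon-specific; base64 is just one byte-safe alternative):

```python
import base64

comm_id = b'\x01\x02\xa5\x03'          # stand-in for GpuCommCliqueId(...).comm_id
try:
    comm_id.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)                           # can't decode byte 0xa5 in position 2 ...

# A byte-safe encoding round-trips regardless of content:
encoded = base64.b64encode(comm_id).decode('ascii')
assert base64.b64decode(encoded) == comm_id
```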

borisfom avatar Jun 28 '17 02:06 borisfom

Hi @borisfom, I think you should add the corresponding lib path to LD_LIBRARY_PATH. For example:

export LD_LIBRARY_PATH=/path/to/nccl/lib:$LD_LIBRARY_PATH

You can try something like this.
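
A quick sanity check that the dynamic linker can actually find NCCL from the current environment; 'libnccl.so' below is the usual library name but may differ depending on how the .deb installed it:

```python
import ctypes
import os

print(os.environ.get('LD_LIBRARY_PATH', ''))   # should include the NCCL lib dir

# OSError here means the linker cannot find NCCL with the current settings.
ctypes.CDLL('libnccl.so')
print('libnccl.so loaded OK')
```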

cshanbo avatar Jun 28 '17 11:06 cshanbo

NCCL is in the standard system path (installed via .deb); I tried adding it to LD_LIBRARY_PATH with no effect. GpuCommCliqueId would have failed if it was not found, right? Instead it returns garbage. Is there some initialization that could have been missed? Are you running Python 2 or 3?

borisfom avatar Jun 28 '17 15:06 borisfom

I'm running Python 2.7 with Anaconda.

Can you make sure your nccl and pygpu installations are correct?

  1. for nccl: https://github.com/NVIDIA/nccl
  2. for pygpu: http://deeplearning.net/software/libgpuarray/installation.html#running-tests

cshanbo avatar Jun 29 '17 08:06 cshanbo

We didn't try NCCL 2 with Platoon. This could be the cause of the problem if its interface changed.

nouiz avatar Jun 29 '17 12:06 nouiz

Here are the results of my benchmarking. The first number in each results cell is the time (in seconds) for epoch 0, and the second is the cumulative time through the end of epoch 1. I calculated the improvement by dividing the epoch-0 time on one GPU by the epoch-0 time on two GPUs.

I think the base number is the one we want to report. If that's the case, I will rerun it 3 times to take an average.

| Configuration | One GPU (Quadro K6000) | Two GPUs (Quadro K6000s) | Improvement |
| --- | --- | --- | --- |
| Base | 91.63506, 209.2868 | 49.1683, 110.79238 | 1.86 |
| With Dnn flags | 85.8325, 190.6249 | 47.4801, 103.6666 | 1.80 |
| With pre allocate flag | 96.2377, 216.74893 | 51.97668, 115.8446 | 1.85 |
| Base with profiling | 130.5099, 291.01650 | 112.8854, 247.0402 | 1.56 |
| With Dnn and profiling flags | 125.09018, 272.49762 | 114.2143, 248.42933 | 1.09 |
| With pre allocate and profiling flags | 134.58385, 297.5514 | 111.07611, 242.2090 | 1.21 |

pre allocate flag = gpuarray.preallocate=0.95
profiling flags = profile=True,profile_optimizer=True,profile_memory=True
Dnn flags = dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once
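
For reference, the improvement column is just the epoch-0 ratio described above, e.g. for the base row:

```python
one_gpu_epoch0 = 91.63506    # seconds, one Quadro K6000, base configuration
two_gpu_epoch0 = 49.1683     # seconds, two Quadro K6000s, base configuration
print("improvement: %.2fx" % (one_gpu_epoch0 / two_gpu_epoch0))   # 1.86x
```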

ReyhaneAskari avatar Sep 06 '17 15:09 ReyhaneAskari