
resnet 50

Open olimastro opened this issue 7 years ago • 15 comments

Regarding issue https://github.com/mila-udem/platoon/issues/94: for now I have only run a trial on synthetic data. I am reporting the time for one epoch of training, which is 1000 data points:

1 GPU, batch size 80 (all that could fit on the DGX): 19.715 s
2 GPUs, batch size 40: 13.452 s
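
For reference, a quick back-of-the-envelope conversion of those epoch times into images/sec and a speedup factor, using the 1000-image epoch mentioned above (numbers taken from this comment, nothing new measured):

```python
# Throughput and speedup from the reported epoch times (1000 images per epoch).
epoch_images = 1000
one_gpu_s, two_gpu_s = 19.715, 13.452

print("1 GPU: %.1f images/s" % (epoch_images / one_gpu_s))   # ~50.7
print("2 GPU: %.1f images/s" % (epoch_images / two_gpu_s))   # ~74.3
print("speedup: %.2fx" % (one_gpu_s / two_gpu_s))            # ~1.47x
```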

olimastro avatar Jun 07 '17 05:06 olimastro

When you say timing after one computation pass, do you want me to take into account the overhead of calling ASGD()? If yes, I can just make one epoch contain the same amount of data as the batch size and reuse the same code; if not, I can simply time the difference between a single call with batch size 80 and a single call with batch size 40.

olimastro avatar Jun 07 '17 19:06 olimastro

By the way, I should mention that I used THEANO_FLAGS=device=cuda,floatX=float32,dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once,gpuarray.preallocate=0.95 and I updated Theano on the 5th of June.
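
(Side note, a sketch only: the same flags can also be set from inside Python, as long as it happens before theano is imported, since THEANO_FLAGS is read at import time.)

```python
import os

# THEANO_FLAGS is read when theano is first imported, so set it before the import.
os.environ['THEANO_FLAGS'] = (
    'device=cuda,floatX=float32,'
    'dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,'
    'dnn.conv.algo_bwd_data=time_once,gpuarray.preallocate=0.95'
)

import theano  # the flags above are picked up here
```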

olimastro avatar Jun 07 '17 19:06 olimastro

Remove the dnn.conv.algo_* flags from the timing. Since they benchmark the algorithms during the first call, they could bias the timing. We don't care if we don't have the fastest configuration, but we should not bias the measurement.

For the timing, we need to break down the overhead vs. the compute time vs. the ASGD step.
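
A minimal sketch of how that breakdown could be collected; compute_grads, sync_params, and the minibatch iterator are placeholders for illustration, not Platoon's actual API:

```python
import time

# Hypothetical stand-ins for the real pieces:
# compute_grads = forward/backward pass only, sync_params = the ASGD /
# all-reduce step, minibatches = the synthetic-data iterator.
def compute_grads(x, y):
    time.sleep(0.001)

def sync_params():
    time.sleep(0.0005)

minibatches = [(None, None)] * 25   # e.g. 1000 images / batch size 40

compute_time = sync_time = 0.0
t_epoch = time.time()
for x, y in minibatches:
    t0 = time.time()
    compute_grads(x, y)
    compute_time += time.time() - t0

    t0 = time.time()
    sync_params()
    sync_time += time.time() - t0

total = time.time() - t_epoch
other = total - compute_time - sync_time
print("compute %.3fs  asgd %.3fs  other overhead %.3fs"
      % (compute_time, sync_time, other))
```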

We had in mind to use a synchronous update. Mostly, it just splits the minibatch across the GPUs and syncs the gradients. So you should also change that.
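
For illustration only, a toy NumPy version of that synchronous scheme on a linear model (split the minibatch across two workers, sum the per-worker gradients as an all-reduce would, apply one identical update); this is not Platoon code:

```python
import numpy as np

rng = np.random.RandomState(0)
w = np.zeros(10)                             # shared model parameters
X, y = rng.randn(80, 10), rng.randn(80)      # one minibatch of 80 examples
lr = 0.01

shards = np.array_split(np.arange(80), 2)    # 40 examples per "GPU"
grads = []
for idx in shards:
    err = X[idx].dot(w) - y[idx]             # per-worker forward pass
    grads.append(X[idx].T.dot(err))          # per-worker gradient

total_grad = sum(grads)                      # the all-reduce (sum) step
w -= lr * total_grad / len(X)                # same update applied on every replica
```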

nouiz avatar Jun 07 '17 19:06 nouiz

The overall metric we are competing for is images / sec. While it is useful to time the different parts to see where we can improve things, this is the final metric we should report.

Regarding cuDNN options, again, if we want to correctly identify bottlenecks, I think we should use time_once. Just do one dummy call before starting training (and profiling) so that the right algorithm gets selected.
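
Something like the following, where train_fn and get_batch are hypothetical stand-ins for the compiled training function and the data loader:

```python
import time
import numpy as np

# Hypothetical stand-ins: train_fn = compiled Theano training function,
# get_batch = one minibatch of synthetic data.
def train_fn(x, y):
    return float((x ** 2).sum())

def get_batch(i, batch_size=80):
    rng = np.random.RandomState(i)
    return (rng.randn(batch_size, 3, 224, 224).astype('float32'),
            rng.randint(0, 1000, size=batch_size).astype('int32'))

# Warm-up: with dnn.conv.algo_*=time_once the first call pays the cuDNN
# algorithm-selection cost, so keep it outside the timed region.
x0, y0 = get_batch(0)
train_fn(x0, y0)

t0 = time.time()
for i in range(1, 13):                        # ~1000 images / batch size 80
    x, y = get_batch(i)
    train_fn(x, y)
print("epoch time: %.3fs" % (time.time() - t0))
```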

When we want to check for correctness, it is OK to split a batch between GPUs, and check if the synchronous update is consistent with what happens on only 1 GPU. However, for the final result, we want to have the full batch size on all GPUs, and report that.

lamblin avatar Jun 07 '17 22:06 lamblin

Can you please post the exact command lines you have used? I keep getting OOM errors on startup if I preallocate 0.95 .. 0.5. If I do not preallocate, the errors are:

WARNING! Failed to register in a local GPU comm world. Reason: 'utf8' codec can't decode byte 0xa5 in position 2: invalid start byte
WARNING! Platoon all_reduce interface will not be functional.
Traceback (most recent call last):
  File "resnet_worker.py", line 566, in <module>
    train_resnet()
  File "resnet_worker.py", line 475, in train_resnet
    asgd.make_rule(params)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/training/global_dynamics.py", line 161, in make_rule
    gup = AllReduceSum(update, inplace=True)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/ops.py", line 155, in AllReduceSum
    return AllReduce(theano.scalar.add, inplace, worker)(src, dest)
  File "/home/bfomitchev/git/theano/third_party/platoon/platoon/ops.py", line 61, in __init__
    self._f16_ok = not self.worker._multinode
AttributeError: 'Worker' object has no attribute '_multinode'

borisfom avatar Jun 27 '17 01:06 borisfom

Hi @borisfom, it seems to be an installation error. You could try the solution under this issue to see if it works.

cshanbo avatar Jun 27 '17 07:06 cshanbo

@cshanbo: I have recompiled and reinstalled; still the same issue. Could it be an NCCL 2.0 incompatibility? How exactly did you run the benchmark?

borisfom avatar Jun 28 '17 00:06 borisfom

Hi @borisfom, I don't think it's caused by an NCCL incompatibility. What I did was follow the issue I mentioned above.

You could attach your running script, installation information, etc., so that I might be able to help. Just to make sure, did you set your environment variables, such as $PATH and $LD_LIBRARY_PATH?

cshanbo avatar Jun 28 '17 01:06 cshanbo

The command I use is THEANO_FLAGS=device=cpu python resnet_controller.py --single resnet /PATH/TO/OUT/FILES

olimastro avatar Jun 28 '17 02:06 olimastro

@cshanbo: What exactly should be added to PATH/LD_LIBRARY_PATH? What happens is that this call returns garbage:

self._local_id = gpucoll.GpuCommCliqueId(context=self.gpuctx)

and then the utf8 decode fails on it:

response = self.send_req("platoon-get_platoon_info", info={'device': self.device, 'local_id': self._local_id.comm_id.decode('utf-8')})
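
That fits the error in the traceback: the clique id is an opaque byte string, and a byte like 0xa5 is simply not valid UTF-8. A quick illustration in plain Python (nothing Platoon-specific; base64 is just one byte-safe alternative):

```python
import base64

comm_id = b'\x01\x02\xa5\x03'          # stand-in for GpuCommCliqueId(...).comm_id
try:
    comm_id.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)                           # can't decode byte 0xa5 in position 2 ...

# A byte-safe encoding round-trips regardless of content:
encoded = base64.b64encode(comm_id).decode('ascii')
assert base64.b64decode(encoded) == comm_id
```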

borisfom avatar Jun 28 '17 02:06 borisfom

Hi @borisfom, I think you should add the corresponding lib path to LD_LIBRARY_PATH. For example:

export LD_LIBRARY_PATH=/path/to/nccl/lib:$LD_LIBRARY_PATH

You can try something like this.
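
A quick sanity check that the dynamic linker can actually find NCCL from the current environment; 'libnccl.so' below is the usual library name but may differ depending on how the .deb installed it:

```python
import ctypes
import os

print(os.environ.get('LD_LIBRARY_PATH', ''))   # should include the NCCL lib dir

# OSError here means the linker cannot find NCCL with the current settings.
ctypes.CDLL('libnccl.so')
print('libnccl.so loaded OK')
```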

cshanbo avatar Jun 28 '17 11:06 cshanbo

NCCL is in the standard system path (installed via .deb); I tried adding it to LD_LIBRARY_PATH with no effect. GpuCommCliqueId would have failed if it was not found, right? Instead it returns garbage. Is there some initialization that could have been missed? Are you running Python 2 or 3?

borisfom avatar Jun 28 '17 15:06 borisfom

I'm running Python 2.7 with Anaconda.

Can you make sure your nccl and pygpu installations are correct?

  1. for nccl: https://github.com/NVIDIA/nccl
  2. for pygpu: http://deeplearning.net/software/libgpuarray/installation.html#running-tests

cshanbo avatar Jun 29 '17 08:06 cshanbo

We didn't try NCCL 2 with Platoon. This could be the cause of the problem if its interface changed.

nouiz avatar Jun 29 '17 12:06 nouiz

Here are the results of my benchmarking. The first number in each results cell is the time (in seconds) for epoch 0, and the second is the cumulative time through the end of epoch 1. I calculated the improvement by dividing the epoch-0 time on one GPU by the epoch-0 time on two GPUs.

I think the base number is the one we want to report. If that's the case, I will rerun it 3 times to take an average.

| Configuration | One GPU (Quadro K6000) | Two GPUs (Quadro K6000s) | Improvement |
| --- | --- | --- | --- |
| Base | 91.63506, 209.2868 | 49.1683, 110.79238 | 1.86 |
| With Dnn flags | 85.8325, 190.6249 | 47.4801, 103.6666 | 1.80 |
| With pre allocate flag | 96.2377, 216.74893 | 51.97668, 115.8446 | 1.85 |
| Base with profiling | 130.5099, 291.01650 | 112.8854, 247.0402 | 1.56 |
| With Dnn and profiling flags | 125.09018, 272.49762 | 114.2143, 248.42933 | 1.09 |
| With pre allocate and profiling flags | 134.58385, 297.5514 | 111.07611, 242.2090 | 1.21 |

pre allocate flag = gpuarray.preallocate=0.95
profiling flags = profile=True,profile_optimizer=True,profile_memory=True
Dnn flags = dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once
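
For reference, the improvement column is just the epoch-0 ratio described above, e.g. for the base row:

```python
one_gpu_epoch0 = 91.63506    # seconds, one Quadro K6000, base configuration
two_gpu_epoch0 = 49.1683     # seconds, two Quadro K6000s, base configuration
print("improvement: %.2fx" % (one_gpu_epoch0 / two_gpu_epoch0))   # 1.86x
```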

ReyhaneAskari avatar Sep 06 '17 15:09 ReyhaneAskari