platoon
resnet 50
Regarding issue https://github.com/mila-udem/platoon/issues/94: for now I have only done a trial on synthetic data. I am reporting the time for one epoch of training, which is 1000 data points:
- 1 GPU, batch size 80 (all that could fit on the DGX): 19.715 s
- 2 GPUs, batch size 40: 13.452 s
When you say timing after one computation pass, do you want me to take into account the overhead of calling ASGD()? If yes, I can make one epoch contain the same amount of data as the batch size and reuse the same code; if not, I can just time the difference between a single pass with batch size 80 and a single pass with batch size 40.
By the way, I should mention that I used THEANO_FLAGS=device=cuda,floatX=float32,dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once,gpuarray.preallocate=0.95
and I updated Theano on the 5th of June.
Remove the dnn.conv.algo_* flags from the timing. As these do their own algorithm timing on the first call, they could bias the measurement. We don't care if we don't have the fastest timing, but we should not bias the timing.
For the timing, we need to know the overhead time vs. the compute time vs. the ASGD time.
We had in mind to use a synchronous update. Mostly, it just splits the minibatch across the GPUs and syncs the gradients. So you should also change that.
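To make that concrete, here is a NumPy-only sketch of one synchronous step (an illustration of the idea only; it does not use platoon's actual all_reduce interface and all names are made up):

```python
import numpy as np

# One synchronous step, simulated on CPU: each "GPU" gets a slice of the
# minibatch, computes its gradient, the gradients are averaged (the job of
# the all-reduce), and every worker applies the identical update.
rng = np.random.RandomState(0)
n_gpus, batch_size, lr = 2, 80, 0.1
params = np.zeros(10)

x = rng.randn(batch_size, 10)                         # synthetic minibatch
y = x.dot(np.ones(10)) + 0.1 * rng.randn(batch_size)

local_grads = []
for idx in np.array_split(np.arange(batch_size), n_gpus):
    err = x[idx].dot(params) - y[idx]                 # residual on this worker's shard
    local_grads.append(x[idx].T.dot(err) / len(idx))  # least-squares gradient

synced_grad = np.mean(local_grads, axis=0)            # what an averaging all-reduce gives
params -= lr * synced_grad                            # same update applied on every worker
```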
The overall metric we are competing for is images / sec. While it is useful to time the different parts to see where we can improve things, this is the final metric we should report.
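For reference, converting the synthetic-data epoch times reported earlier (1000 images per epoch) into that metric:

```python
# Images/sec from the epoch times reported above (1000 images per epoch).
images_per_epoch = 1000
print(images_per_epoch / 19.715)   # 1 GPU, batch size 80  -> ~50.7 images/sec
print(images_per_epoch / 13.452)   # 2 GPUs, batch size 40 -> ~74.3 images/sec
```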
Regarding the cuDNN options, again, if we want to correctly identify bottlenecks, I think we should use time_once. Just do one dummy call before starting training (and profiling) so that the right algorithm gets selected.
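A minimal sketch of that ordering, with train_fn standing in as a placeholder for the compiled Theano training function (the real call in resnet_worker.py may differ):

```python
import time
import numpy as np

def train_fn(x, y):
    """Placeholder for the compiled Theano training function; the real one
    runs the cuDNN convolutions whose algorithms time_once selects."""
    return float(x.sum() + y.sum())

batch_size = 80
dummy_x = np.zeros((batch_size, 3, 224, 224), dtype='float32')
dummy_y = np.zeros((batch_size,), dtype='int64')

# One dummy call first: with dnn.conv.algo_*=time_once, this is where the
# algorithm benchmarking happens, so it stays outside the timed region.
train_fn(dummy_x, dummy_y)

start = time.time()
for _ in range(10):                # timed training iterations
    train_fn(dummy_x, dummy_y)
print('%.3f s' % (time.time() - start))
```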
When we want to check for correctness, it is OK to split a batch between GPUs, and check if the synchronous update is consistent with what happens on only 1 GPU. However, for the final result, we want to have the full batch size on all GPUs, and report that.
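As a toy illustration of that consistency check (pure NumPy, no GPUs involved):

```python
import numpy as np

# One synchronous update computed from two half-batches of 40 must match a
# single-GPU update on the full batch of 80, because averaging the
# half-batch gradients reproduces the full-batch gradient.
rng = np.random.RandomState(0)
grads = rng.randn(80, 10)                  # per-example gradients, 10 parameters
params, lr = np.zeros(10), 0.1

single_gpu = params - lr * grads.mean(axis=0)
halves = np.array_split(grads, 2)          # "GPU 0" and "GPU 1" halves
two_gpu = params - lr * np.mean([h.mean(axis=0) for h in halves], axis=0)

assert np.allclose(single_gpu, two_gpu)
```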
Can you please post the exact command lines you used? I keep getting OOM errors on startup if I preallocate 0.95 .. 0.5. If I do not preallocate, the errors are:
WARNING! Failed to register in a local GPU comm world.
Reason: 'utf8' codec can't decode byte 0xa5 in position 2: invalid start byte
WARNING! Platoon all_reduce interface will not be functional.
Traceback (most recent call last):
File "resnet_worker.py", line 566, in
Hi @borisfom, it seems to be an installation error. You could try the solution under this issue to see if it works.
@cshanbo: I have recompiled and reinstalled; still the same issue. Could it be an NCCL 2.0 incompatibility? How exactly did you run the benchmark?
Hi @borisfom, I don't think it's caused by an NCCL incompatibility. What I did was follow the issue I mentioned above.
You could attach your running script, installation information, etc., so that I might be able to help. Just to make sure, did you set your environment variables, such as $PATH and $LD_LIBRARY_PATH?
The command I use is THEANO_FLAGS=device=cpu python resnet_controller.py --single resnet /PATH/TO/OUT/FILES
@cshanbo: What exactly should be added to PATH/LD_LIBRARY_PATH? What happens is that this call returns garbage:
self._local_id = gpucoll.GpuCommCliqueId(context=self.gpuctx)
and then the UTF-8 decode fails on it:
response = self.send_req("platoon-get_platoon_info", info={'device': self.device, 'local_id': self._local_id.comm_id.decode('utf-8')})
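For what it's worth, that is exactly what happens when .decode('utf-8') is called on raw bytes that are not valid UTF-8 (the byte values below are made up, but the 0xa5 in position 2 matches the traceback):

```python
# The NCCL clique id is an opaque byte string; decoding arbitrary bytes as
# UTF-8 fails because 0xa5 is never a valid UTF-8 start byte.
raw_comm_id = b'nc\xa5\x01'
try:
    raw_comm_id.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)   # ... can't decode byte 0xa5 in position 2: invalid start byte
```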
Hi @borisfom,
I think you should add the corresponding NCCL lib path to LD_LIBRARY_PATH.
For example: export LD_LIBRARY_PATH=/path/to/nccl/lib:$LD_LIBRARY_PATH
You can try something like this.
NCCL is in the standard system path (installed via .deb), and I tried adding it to LD_LIBRARY_PATH with no effect. GpuCommCliqueId would have failed if it was not found, right? Instead it returns garbage. Is there some initialization that could have been missed? Are you running Python 2 or 3?
We didn't try NCCL 2 with platoon. This could be the cause of the problem if its interface changed.
On Thu., Jun. 29, 2017, 04:38, Justin Chan [email protected] wrote:
I'm running Python 2.7 with anaconda https://www.continuum.io/downloads.
Can you make sure your nccl and pygpu installations are correct?
- for nccl https://github.com/NVIDIA/nccl
- for pygpu http://deeplearning.net/software/libgpuarray/installation.html#running-tests
Here are the results of my benchmarking. The first number in each results cell is the time for epoch 0 and the second number is the cumulative time until the end of epoch 1. I calculated the improvement by dividing the epoch-0 time on one GPU by the epoch-0 time on two GPUs (a quick sanity check follows the table).
I think the Base number is the one we want to report. If that's the case, I will rerun it 3 times and take an average.
| | One GPU (Quadro K6000), s | Two GPUs (Quadro K6000s), s | Improvement |
|---|---|---|---|
| Base | 91.63506, 209.2868 | 49.1683, 110.79238 | 1.86 |
| With Dnn flags | 85.8325, 190.6249 | 47.4801, 103.6666 | 1.80 |
| With pre allocate flag | 96.2377, 216.74893 | 51.97668, 115.8446 | 1.85 |
| Base with profiling | 130.5099, 291.01650 | 112.8854, 247.0402 | 1.56 |
| With Dnn and profiling flags | 125.09018, 272.49762 | 114.2143, 248.42933 | 1.09 |
| With pre allocate and profiling flags | 134.58385, 297.5514 | 111.07611, 242.2090 | 1.21 |
- pre allocate flag = gpuarray.preallocate=0.95
- profiling flags = profile=True,profile_optimizer=True,profile_memory=True
- Dnn flags = dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once
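A quick sanity check of the Improvement column (epoch-0 time on one GPU divided by epoch-0 time on two GPUs):

```python
# Improvement = epoch-0 time on 1 GPU / epoch-0 time on 2 GPUs.
print(91.63506 / 49.1683)      # Base                   -> ~1.86
print(96.2377 / 51.97668)      # With pre allocate flag -> ~1.85
```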