convnet-benchmarks
[August 2014] Discussion of results
This issue shall serve as a place to announce and discuss new results, to avoid the discussion being spread over several pull requests that just happened to be there (#11, #12).
So with 33f212238d, caffe seems to be really fast in the backward pass. Looking at the code, are you 100% sure that propagate_down is set to true? Otherwise it would time the gradient w.r.t. the weights only.
propagate_down is indeed true! This is the benchmark code, and the place where it is set to true.
Are you sure caffe also updates the parameters (or accumulates the parameter gradients) in the code?
yes, it does that here: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu#L79
If we want double confirmation, we can ask @rbgirshick, who is watching this repo. Ross, what do you say?
Yeah, your link seems to confirm it. Yet the difference between forward and backward is so much smaller for caffe. I wonder what their secret is. Isn't SpatialConvolutionMM supposed to be using the same tricks anyhow?
Yes, SpatialConvolutionMM is borrowed from Caffe (thanks guys), but we are using the cublas v1 interface (whereas Caffe uses cublas v2); we should fix that soon.
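For reference, the shared trick is to lower the convolution to a matrix multiplication: an im2col step unrolls every input patch into a column, and a single sgemm then does all the work in one cuBLAS call. A minimal NumPy sketch of the idea (the function and the shapes are illustrative, not Caffe's actual code):

import numpy as np

def im2col(x, kh, kw):
    """Unroll every kh-by-kw patch of a (C, H, W) array into one column."""
    C, H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, oh * ow), dtype=x.dtype)
    row = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[row] = x[c, i:i + oh, j:j + ow].ravel()
                row += 1
    return cols

# valid-mode convolution (as cross-correlation) becomes one matrix product
x = np.random.randn(3, 8, 8).astype(np.float32)      # input: C x H x W
w = np.random.randn(4, 3, 5, 5).astype(np.float32)   # filters: F x C x kh x kw
y = w.reshape(4, -1).dot(im2col(x, 5, 5)).reshape(4, 4, 4)  # output: F x oH x oW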
Another question: when you time, do you synchronize with the device somewhere? (Otherwise it would synchronize on the next cudaMemset call or similar, which may be hidden in the next caffe_net call.) I don't see it in your benchmark, but maybe layers[i]->Backward does more than just calling Backward_gpu, or I missed something else.
@f0k that does seem to be a good point! The utility I use is Caffe's own, and looking through https://github.com/BVLC/caffe/blob/master/include/caffe/layer.hpp, I don't see an explicit synchronize either. I'll have to look into this further.
I see, the code is here: https://github.com/BVLC/caffe/blob/master/src/caffe/util/benchmark.cpp
So basically you insert a start and a stop event into the stream (not sure if that's the correct terminology) with cudaEventRecord, then wait for the stop event to be reached, and finally return the time difference between the events with cudaEventElapsedTime. Seems proper, but maybe the other benchmarks should do the same to ensure only the GPU time is measured, or the caffe benchmark should time "from the outside" just like the others.
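For comparison with the other timing approaches, the same event-based pattern is available from Python through pycuda (the snippet below is a sketch, not Caffe's code; the array and the multiply just stand in for the work being timed):

import numpy as np
import pycuda.autoinit            # initializes a CUDA context
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

start, stop = drv.Event(), drv.Event()
a = gpuarray.to_gpu(np.random.randn(4096, 4096).astype(np.float32))

start.record()                    # enqueue the start event into the stream
b = a * 2.0                       # the GPU work being timed
stop.record()                     # enqueue the stop event
stop.synchronize()                # block until the stop event is reached
print('%.3f ms' % start.time_till(stop))  # GPU time between the two events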
but maybe the other benchmarks should do the same to ensure only the GPU time is measured, or the caffe benchmark should time "from the outside" just like the others.
I could write a caffe benchmark to time "from the outside", but I doubt the speed is going to change by much, if at all.
@f0k I think a call to cudaDeviceSynchronize after the calls to forward/backward would do. Simpler than events, and it conforms to the other benchmarks.
@nicholas-leonard I was just explaining what the currently used caffe benchmark code does, to answer my own question of whether it's properly synchronized. So yes, it is properly synchronized, but it does things differently than the other benchmarks.
The Theano benchmark has a sync before the timing too. I don't know if that is useful, but I think consistency between the software packages is better:
theano.sandbox.cuda.synchronize()
start = time.time()
for i in range(steps):
    fprop()
theano.sandbox.cuda.synchronize()
tm = (time.time() - start) / steps
Perhaps the gradient w.r.t. inputs should be left out of the L1 benchmark, because it would not actually be computed in the first layer of a real network. Theano results seem particularly skewed by this factor.
I think it should be available as information somewhere, since it is computed for the middle layers. But maybe we could provide two summaries: (fprop + grad weight) and (fprop + both grads).
@nouiz L1 is clearly not a middle layer, so the only necessary change is to remove the gradInput calculation from the L1 results.
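To make the two proposed summaries concrete, they could be computed from the per-operation timings roughly like this (the layer names and numbers below are made up for illustration):

# hypothetical per-layer timings in ms: {layer: (fprop, grad_weight, grad_input)}
timings = {'L1': (40.0, 50.0, 60.0),
           'L2': (30.0, 35.0, 45.0),
           'L3': (20.0, 25.0, 30.0)}

# summary 1: fprop + gradient w.r.t. weights only
s1 = sum(f + gw for f, gw, gi in timings.values())

# summary 2: fprop + both gradients, but without grad_input for the first
# layer, since a real network never backpropagates into the input images
s2 = sum(f + gw + (gi if name != 'L1' else 0.0)
         for name, (f, gw, gi) in timings.items())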
I benchmarked Theano, Torch, ccn2, and Caffe with CUDA 6.5. Alex also updated ccn2 with some more performance improvements. I also fixed the Theano benchmark to average over 10 iterations (like the others) rather than taking the minimum of 10 iterations, and added Theano's corrMM to the table.
ccn2 is now a very close second to Caffe (Caffe is cheating a bit in the benchmark by timing just around the CUDA kernel and not the CPU cleanup around it).
Hey @soumith, the *** next to corrMM is not correct (it uses a lot of extra memory).
right, fixed.
Thanks! And thanks for updating the table.
I agree with @andravin about omitting the L1 grad-wrt-inputs from the overall summary (at least in a special column). That really hurts the Theano legacy numbers, and it isn't even relevant for how these routines would actually be used.
Thanks to @nouiz, who fixed an issue with the Theano FFT gradInput, the Theano FFT implementation is now 1.5x faster than Caffe (which is in 2nd place).
Now is the time to also start a conversation about 3x3 convolutions, as this year's ImageNet challenge had the VGG team, led by Karen Simonyan, use just 3x3 convolutions across the whole network, starting from the 224x224 image! http://arxiv.org/abs/1409.1556/
Should I add a few layer benchmarks that are relevant?
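For reference, the 3x3 stages of the paper's deepest models would translate into benchmark configs along these lines (feature map sizes and channel counts are from the paper; the batch size and tuple layout are only illustrative):

# (batch, in_channels, input_size, out_channels, kernel_size),
# all 3x3 convolutions with stride 1 and 1-pixel padding
vgg_layers = [
    (64,   3, 224,  64, 3),
    (64,  64, 112, 128, 3),
    (64, 128,  56, 256, 3),
    (64, 256,  28, 512, 3),
    (64, 512,  14, 512, 3),
]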
Thanks a lot for sharing this paper!
Wow, 16–19 weight layers?! Perhaps we are really moving towards networks like those in the brain. Too bad recurrent nets are still hard to train. We definitely also need to be able to train a model on multiple GPUs in the future. @soumith, does Torch plan to do anything along these lines (multiple GPU support) in the near future?
@stencilman it depends on how transparent you want it to be (multi-GPU can already be done in Torch if you are careful with a few things), but this is not the thread to discuss it; let's open a thread in cutorch.
@soumith Yes, sorry, let's open a thread in cutorch.
@ebattenberg I was waiting for this year's ImageNet challenge to finish, but now you see why the GradInput and GradWeight numbers for L1 are starting to get relevant!
Thanks to @nouiz, who fixed an issue with the Theano FFT gradInput, the Theano FFT implementation is now 1.5x faster than Caffe (which is in 2nd place).
Oh cool! Do you know what exactly was fixed?
https://github.com/nouiz/Theano/commit/092e81e009cd531776ef804633f19be1d4f8787a
About the next convolution benchmark: what about 3D convolution for video?
@nouiz I am totally up for it; Theano and Caffe have volumetric convolutions!
Added NVIDIA CuDNN results to the table.
Although the CuDNN results in this benchmark are not too spectacular, I tried it out myself today with Theano, and compared it with some other available implementations (legacy conv2d, fftconv, GpuCorrMM).
CuDNN mainly seems to have an edge for convolutions with lots of small filters (i.e. 3x3). I saw speedups of almost 2x compared to GpuCorrMM, and it was often faster than fftconv as well (I didn't compare to cuda-convnet because the call signature is different).
Considering that the 2nd-place entry in this year's ImageNet competition was a very deep convnet with 3x3 filters everywhere, this might actually turn out to be an interesting configuration to benchmark, imho.
If you look in the torch7 folder, I benchmarked all the VGG layers here: https://github.com/soumith/convnet-benchmarks/blob/master/torch7/vgg.bench
Interesting, from those results the conclusion would be pretty different, if I'm reading them correctly. I wonder why it makes such a big difference for what I tried (very simple network on MNIST, C3-C3-P2-C3-C3-P2-F-F-F). I guess there could be other parameters that play a role, or the fact that I'm using Theano, or maybe the GTX 680 I ran it on.
@benanne: if you look at https://github.com/soumith/convnet-benchmarks/commit/4210e80c37da9d0d550317c433ee02d80f286a64, it seems that in Theano only the fprop has been optimized for CuDNN; the other operations are not.
@benanne the input and output image sizes also matter quite a bit, so MNIST wouldn't be representative of the performance of the VGG layers even though the filter sizes might be the same.
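A quick back-of-the-envelope count shows why: the work per layer scales with output area, input channels, output channels, and kernel area. A sketch with made-up shapes for an MNIST-scale layer versus an early VGG-scale layer:

def conv_macs(batch, in_c, size, out_c, k):
    """Multiply-accumulates for one conv layer, assuming 'same' output size."""
    return batch * out_c * size * size * in_c * k * k

mnist = conv_macs(128, 32, 24, 32, 3)   # small feature maps, few channels
vgg = conv_macs(128, 64, 224, 64, 3)    # large feature maps, more channels
print(vgg / float(mnist))               # roughly 350x more work per layer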
@stencilman I see, so it's just using the forward pass implementation for all three operations (forward, backward w.r.t. weights, backward w.r.t. input). But that doesn't explain why it seems to come out on top when I test it.
@soumith That, however, would explain some things :)
Just added benchmarks for Maxime Oquab's (@qassemoquab) BHWD-layout kernels (https://github.com/qassemoquab/nnbhwd). They perform pretty decently on the current benchmark layers, but this benchmark doesn't really do them justice. His module really shines when there are lots and lots of feature maps (applicable to certain domains). He does banded unrolling + sgemm.
@benanne: if you look at 4210e80, it seems that in Theano only the fprop has been optimized for CuDNN; the other operations are not.
@stencilman I see, so it's just using the forward pass implementation for all three operations (forward, backward w.r.t. weights, backward w.r.t. input).
Hmm, that's interesting. @abergeron convinced me that cuDNN shouldn't perform worse when just using cudnnConvolutionForward for everything, because it should select the optimal kernel to use under the hood: e.g., asking it to do a forward pass with padding matching the kernel size should be as fast as asking it to do a backward pass w.r.t. the input (a full convolution). Maybe that's not the case. We should investigate and file a bug with Nvidia.
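The identity behind that claim (backward w.r.t. the input is just a full convolution with the forward kernel) is easy to check numerically; a 1-D NumPy sanity check, unrelated to cuDNN's internals:

import numpy as np

x = np.random.randn(8)
w = np.random.randn(3)
g = np.random.randn(6)   # gradient w.r.t. the valid-mode output (8 - 3 + 1 = 6)

# forward pass: valid cross-correlation, y[i] = sum_k x[i+k] * w[k]
y = np.correlate(x, w, mode='valid')

# gradient w.r.t. the input: a full convolution of the output gradient with w
gx = np.convolve(g, w, mode='full')

# compare against the explicit sum dL/dx[i+k] += g[i] * w[k]
gx_ref = np.zeros_like(x)
for i in range(len(g)):
    for k in range(len(w)):
        gx_ref[i + k] += g[i] * w[k]
print(np.allclose(gx, gx_ref))   # True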
To see if that is really a problem, you can try the new version of the op in https://github.com/Theano/Theano/pull/2117. If used directly, the grad will use the Backward calls.
Also, I don't know what sort of timings you have, but I know that if you rely on the graph optimization to use cudnn for the gradient, you will end up with a bunch of potentially useless copies (because Theano flips the kernels manually before the convolution, and I have to do a call to gpu_contiguous). This might be the source of the slowdown in the gradient part.
It might be possible to avoid the gpu_contiguous call in most cases, but cuDNN officially does not support negative strides, which tend to happen somewhat often in the convolution case. More work (but not too much) would be needed to make it work without it.
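For context, the copy described here arises from something like the following (a sketch against the old theano.sandbox.cuda API; the variable names are mine):

import theano.tensor as T
from theano.sandbox.cuda.basic_ops import gpu_contiguous

w = T.tensor4('w')
# flipping the kernels produces a strided view (negative strides) ...
w_flipped = w[:, :, ::-1, ::-1]
# ... which cuDNN does not support, so an explicit contiguous copy is made:
w_usable = gpu_contiguous(w_flipped)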
I re-ran the test with this patch:
diff --git a/theano/pylearn2_benchmark.py b/theano/pylearn2_benchmark.py
index a110628..be4c75d 100644
--- a/theano/pylearn2_benchmark.py
+++ b/theano/pylearn2_benchmark.py
@@ -203,7 +203,17 @@ for run in runs:
     mode = theano.compile.get_default_mode()
     mode = mode.including('cudnn')
     benchmark_three_ways('(experimental, auto) theano.sandbox.cuda.dnn.GpuDnnConv',
-                         sharedX, sharedY, sharedW, X, Y, gW, gX, mode)
+                         sharedX, sharedY, sharedW, X, Y, gW, gX, mode)
+
+    mode = theano.compile.get_default_mode()
+    mode = mode.including('gpu')
+    dnnX = theano.sandbox.cuda.CudaNdarrayType((False, False, False, False))(name='dnnX')
+    dnnY = theano.sandbox.cuda.dnn.dnn_conv(sharedX, sharedW, border_mode='valid', subsample=(dw,dh))
+    dnngW = theano.grad(None, wrt=sharedW, known_grads={dnnY: sharedY})
+    dnngX = theano.grad(None, wrt=sharedX, known_grads={dnnY: sharedY})
+    benchmark_three_ways('(experimental, manual) '
+                         'theano.sandbox.cuda.dnn.GpuDnnConv',
+                         sharedX, sharedY, sharedW, dnnX, dnnY, dnngW, dnngX, mode)
     # benchmark caffe-like gemm convolution
     # Mimic Theano flag THEANO_FLAGS=optimizer_including=conv_gemm
And got these results:
Using gpu device 0: GeForce GTX 750 Ti
Note: pycuda not available, no timing via CUDA events possible
CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
theano.tensor.nnet.conv.conv2d ==> fprop ==> 1885.0
theano.tensor.nnet.conv.conv2d ==> bprop inputs ==> 54169.0
theano.tensor.nnet.conv.conv2d ==> bprop weights ==> 1836.0
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprop: FAILED
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: FAILED
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: FAILED
(experimental, auto) theano.sandbox.cuda.dnn.GpuDnnConv ==> fprop ==> 268.0
(experimental, auto) theano.sandbox.cuda.dnn.GpuDnnConv ==> bprop inputs ==> 3262.0
(experimental, auto) theano.sandbox.cuda.dnn.GpuDnnConv ==> bprop weights ==> 596.0
(experimental, manual) theano.sandbox.cuda.dnn.GpuDnnConv ==> fprop ==> 269.0
(experimental, manual) theano.sandbox.cuda.dnn.GpuDnnConv ==> bprop inputs ==> 288.0
(experimental, manual) theano.sandbox.cuda.dnn.GpuDnnConv ==> bprop weights ==> 326.0
(experimental, auto) theano.sandbox.cuda.blas.GpuCorrMM ==> fprop ==> 393.0
(experimental, auto) theano.sandbox.cuda.blas.GpuCorrMM ==> bprop inputs ==> 415.0
(experimental, auto) theano.sandbox.cuda.blas.GpuCorrMM ==> bprop weights ==> 580.0
(experimental, manual) theano.sandbox.cuda.blas.GpuCorrMM ==> fprop ==> 393.0
(experimental, manual) theano.sandbox.cuda.blas.GpuCorrMM ==> bprop inputs ==> 415.0
(experimental, manual) theano.sandbox.cuda.blas.GpuCorrMM ==> bprop weights ==> 582.0
I know these might seem slow compared to what is in the repo, but I don't have a Titan Black, just a 750 Ti.
It seems that the grad_input case is not really optimized, or there is just too much copying, when relying on the graph optimization. But when using the op manually, it's fine.
Do you happen to have some kind of evaluation of the error rates across each platform? I'm curious to see how widely they vary from one approach to the next.
Thanks!