
[August 2014] Discussion of results

f0k opened this issue in August 2014 • 42 comments

This issue shall serve as a place to announce and discuss new results, to avoid the discussion being spread over several pull requests that just happened to be there (#11, #12).

So with 33f212238d, caffe seems to be really fast in the backward pass. Looking at the code, are you 100% sure that propagate_down is set to True? Otherwise it would time the gradient wrt. weights only.

f0k avatar Aug 05 '14 09:08 f0k

Are you sure caffe also updates the parameters (or accumulates the parameter gradients) in the code?

nicholas-leonard avatar Aug 05 '14 14:08 nicholas-leonard

yes, it does that here: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu#L79

soumith avatar Aug 05 '14 14:08 soumith

if we want a double confirmation, we can ask @rbgirshick who is watching this repo. Ross, what do you say?

soumith avatar Aug 05 '14 14:08 soumith

Yeah your link seems to confirm it. Yet the difference between forward and backward is so much smaller for caffe. I wonder what their secret is. Isn't SpatialConvolutionMM supposed to be using the same tricks anyhow?

nicholas-leonard avatar Aug 05 '14 14:08 nicholas-leonard

Isn't SpatialConvolutionMM supposed to be using the same tricks anyhow?

Yes, SpatialConvolutionMM is borrowed from Caffe (thanks, guys), but we are using the cuBLAS v1 interface (whereas Caffe uses cuBLAS v2); we should fix that soon.
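
For context, a minimal sketch of the interface difference (illustrative only, not the actual SpatialConvolutionMM code): the legacy cuBLAS API takes char transpose flags and scalars by value, while the v2 API requires an explicit handle (which can also be bound to a stream) and takes scalars by pointer.

    // Hedged sketch of a cuBLAS v2 SGEMM call; A, B, C are device pointers,
    // column-major, C (m x n) = A (m x k) * B (k x n).
    // Legacy v1 equivalent (cublas.h, no handle, scalars by value) would be:
    //   cublasSgemm('n', 'n', m, n, k, 1.0f, A, m, B, k, 0.0f, C, m);
    #include <cublas_v2.h>

    void sgemm_v2(int m, int n, int k, const float* A, const float* B, float* C) {
        cublasHandle_t handle;
        cublasCreate(&handle);                  // v2 needs an explicit handle
        const float alpha = 1.0f, beta = 0.0f;  // scalars are passed by pointer in v2
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, A, m, B, k, &beta, C, m);
        cublasDestroy(handle);
    }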

soumith avatar Aug 05 '14 14:08 soumith

Another question: When you time, do you synchronize with the device somewhere? (Otherwise it would synchronize on the next cudaMemset call or similarly, which may be hidden in the next caffe_net call.) I don't see it in your benchmark, but maybe layers[i]->Backward does more than just calling Backward_gpu or I missed something else.

f0k avatar Aug 05 '14 14:08 f0k

@f0k that does seem to be a good point! The utility I use is Caffe's own, and looking through https://github.com/BVLC/caffe/blob/master/include/caffe/layer.hpp, I don't see an explicit synchronize either. I'll have to look into this further.

soumith avatar Aug 05 '14 14:08 soumith

I see, the code is here: https://github.com/BVLC/caffe/blob/master/src/caffe/util/benchmark.cpp So basically you record a start and a stop event on the stream (not sure if that's the correct terminology) with cudaEventRecord, then wait for the stop event to be processed, and finally return the time difference between the two events with cudaEventElapsedTime. Seems proper, but maybe the other benchmarks should do the same to ensure only the GPU time is measured, or the caffe benchmark should time "from the outside" just like the others.
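
For anyone unfamiliar with the pattern, a rough sketch of such event-based timing is below (launch_kernels is a hypothetical stand-in for the layer's forward/backward work); it measures only the time the work spends on the GPU stream, not the host-side overhead around the launches.

    #include <cuda_runtime.h>

    void launch_kernels();  // hypothetical stand-in for the GPU work being timed

    float time_on_gpu_ms() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);   // enqueue start marker on the default stream
        launch_kernels();            // kernels are enqueued asynchronously
        cudaEventRecord(stop, 0);    // enqueue stop marker after them

        cudaEventSynchronize(stop);  // wait until the stop event has been reached
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }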

f0k avatar Aug 05 '14 15:08 f0k

but maybe the other benchmarks should do the same to ensure only the GPU time is measured, or the caffe benchmark should time "from the outside" just like the others.

I could write a caffe benchmark to time "from the outside", but I doubt the speed is going to change by much, if at all.

soumith avatar Aug 05 '14 15:08 soumith

@f0k I think a call to cudaDeviceSynchronize after the forward/backward calls would do. It's simpler than events, and it conforms to the other benchmarks.
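
Roughly, that suggestion amounts to the sketch below (again with a hypothetical launch_kernels stand-in): wall-clock timing around the launches, with a cudaDeviceSynchronize before and after, so the measured interval covers all GPU work plus the host-side overhead.

    #include <cuda_runtime.h>
    #include <chrono>

    void launch_kernels();  // hypothetical stand-in for forward/backward

    double time_wallclock_ms(int steps) {
        cudaDeviceSynchronize();  // drain any previously queued work
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < steps; ++i)
            launch_kernels();
        cudaDeviceSynchronize();  // wait for everything launched above to finish
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count() / steps;
    }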

nicholas-leonard avatar Aug 05 '14 15:08 nicholas-leonard

@f0k I think a call to cudaDeviceSynchronize after the forward/backward calls would do. It's simpler than events, and it conforms to the other benchmarks.

@nicholas-leonard I was just explaining what the currently used caffe benchmark code does, to answer my own question of whether it's properly synchronized. So yes, it is properly synchronized, but it does things differently than the other benchmarks.

f0k avatar Aug 05 '14 15:08 f0k

The Theano benchmark has a sync before the timing too. I don't know if that is useful, but I think consistency between the software packages is better:

    theano.sandbox.cuda.synchronize()
    start = time.time()
    for i in range(steps):
        fprop()
    theano.sandbox.cuda.synchronize()
    tm = (time.time()-start)/steps


nouiz avatar Aug 05 '14 15:08 nouiz

Perhaps the gradient w.r.t. inputs should be left out of the L1 benchmark, because it would not actually be computed in the first layer of a real network. Theano results seem particularly skewed by this factor.

andravin avatar Aug 20 '14 19:08 andravin

I think it should be available as information somewhere, as for the middle layers it is computed. But maybe we could do two summaries: (fprop + grad weight) and (fprop + both grads).


nouiz avatar Aug 20 '14 19:08 nouiz

@nouiz L1 is clearly not a middle layer, so the only necessary change is to remove the gradInput calculation from the L1 results.

andravin avatar Aug 20 '14 21:08 andravin

I benchmarked theano, torch, ccn2, and caffe with CUDA 6.5, and Alex updated ccn2 with some more performance improvements. I also fixed the theano benchmark to average over 10 iterations (like the others) rather than taking the minimum of 10 iterations, and added theano's corrMM to the table.

ccn2 is now a very close second to Caffe (Caffe is cheating a bit in this benchmark because it times just around the CUDA kernel, and not the rest of the CPU cleanup around it).

soumith avatar Aug 26 '14 04:08 soumith

Hey @soumith, the *** next to corrMM is not correct (it uses a lot of extra memory).

stencilman avatar Aug 26 '14 04:08 stencilman

right, fixed.

soumith avatar Aug 26 '14 04:08 soumith

Thanks! And thanks for updating the table.

stencilman avatar Aug 26 '14 04:08 stencilman

I agree with @andravin about omitting the L1 grad-wrt-inputs from the overall summary (at least in a special column). That really hurts the Theano legacy numbers, and it isn't even relevant for how these routines would actually be used.

ebattenberg avatar Sep 05 '14 17:09 ebattenberg

Thanks to @nouiz, who fixed an issue with the theano fft gradInput; the Theano FFT implementation is now 1.5x faster than Caffe (which is in 2nd place).

Now is the time to also start a conversation about 3x3 convolutions, as this year's ImageNet challenge had the VGG team led by Karen Simonyan use just 3x3 convolutions across the whole network, starting from a 224x224 image! http://arxiv.org/abs/1409.1556/

Should I add a few layer benchmarks that are relevant?

soumith avatar Sep 06 '14 19:09 soumith

Thanks a lot for sharing this paper!

Wow, 16–19 weight layers?! Perhaps we are really moving towards networks like those in the brain. Too bad recurrent nets are still hard to train. We definitely also need to be able to train a model on multiple GPUs in the future. @soumith, does torch plan to do anything along these lines in the near future (multiple GPU support)?

stencilman avatar Sep 06 '14 19:09 stencilman

@stencilman it depends on how transparent you want it to be (multi-GPU can already be done in torch if you are careful with a few things), but this is not the thread to discuss it; let's open a thread in cutorch.

soumith avatar Sep 06 '14 19:09 soumith

@soumith Yes, sorry, let's open a thread in cutorch.

stencilman avatar Sep 06 '14 19:09 stencilman

@ebattenberg I was waiting for this year's imagenet challenge to finish, but now you see why the GradInput and GradWeight numbers for L1 are starting to get relevant!

soumith avatar Sep 06 '14 19:09 soumith

Thanks to @nouiz, who fixed an issue with the theano fft gradInput; the Theano FFT implementation is now 1.5x faster than Caffe (which is in 2nd place).

Oh cool! Do you know what exactly was fixed?

f0k avatar Sep 08 '14 09:09 f0k

https://github.com/nouiz/Theano/commit/092e81e009cd531776ef804633f19be1d4f8787a

soumith avatar Sep 08 '14 09:09 soumith

About the next convolution benchmark, what about 3d convolution for video?


nouiz avatar Sep 08 '14 14:09 nouiz

@nouiz I am totally up for it, Theano and Caffe have Volumetric convolutions!

soumith avatar Sep 08 '14 15:09 soumith

Added NVIDIA cuDNN results to the table.

soumith avatar Sep 10 '14 22:09 soumith

Although the CuDNN results in this benchmark are not too spectacular, I tried it out myself today with Theano, and compared it with some other available implementations (legacy conv2d, fftconv, GpuCorrMM).

CuDNN mainly seems to have an edge for convolutions with lots of small filters (i.e. 3x3). I saw speedups of almost 2x compared to GpuCorrMM, and it was often faster than fftconv as well (I didn't compare to cuda-convnet because the call signature is different).

Considering that the 2nd-place entry in this year's ImageNet competition was a very deep convnet with 3x3 filters everywhere, this might actually turn out to be an interesting configuration to benchmark, imho.

benanne avatar Sep 17 '14 19:09 benanne

If you look at the torch7 folder, I benchmarked all the VGG layers here: https://github.com/soumith/convnet-benchmarks/blob/master/torch7/vgg.bench

soumith avatar Sep 17 '14 19:09 soumith

Interesting, from those results the conclusion would be pretty different, if I'm reading them correctly. I wonder why it makes such a big difference for what I tried (very simple network on MNIST, C3-C3-P2-C3-C3-P2-F-F-F). I guess there could be other parameters that play a role, or the fact that I'm using Theano, or maybe the GTX 680 I ran it on.

benanne avatar Sep 17 '14 19:09 benanne

@benanne: if you look at https://github.com/soumith/convnet-benchmarks/commit/4210e80c37da9d0d550317c433ee02d80f286a64, it seems that in Theano only the fprop has been optimized for cuDNN; the others have not.

stencilman avatar Sep 17 '14 19:09 stencilman

@benanne the input and output image sizes also matter quite a bit, so MNIST wouldn't be representative of the performance of the VGG layers even though the filter sizes might be the same.

soumith avatar Sep 17 '14 19:09 soumith

@stencilman I see, so it's just using the forward pass implementation for all three operations (forward, backward w.r.t. weights, backward w.r.t. input). But that doesn't explain why it seems to come out on top when I test it.

@soumith That, however, would explain some things :)

benanne avatar Sep 17 '14 19:09 benanne

Just added benchmarks for Maxime Oquab's (@qassemoquab) BHWD layout kernels (https://github.com/qassemoquab/nnbhwd). They perform pretty decently on the current benchmark layers, but this benchmark doesn't really do them justice. His module really shines when there are lots and lots of feature maps (applicable to certain domains). He does banded unrolling + sgemm.

soumith avatar Sep 18 '14 04:09 soumith

@benanne: if you look at 4210e80, it seems that in Theano only the fprop has been optimized for cuDNN.

@stencilman I see, so it's just using the forward pass implementation for all three operations (forward, backward w.r.t. weights, backward w.r.t. input).

Hmm, that's interesting. @abergeron convinced me that cuDNN shouldn't perform worse when just using cudnnConvolutionForward for everything, because it should select the optimal kernel to use under the hood -- e.g., asking it to do a forward pass with padding matching the kernel size should be as fast as asking it to do a backward pass wrt. input (full convolution). Maybe that's not the case. We should investigate and file a bug with Nvidia.
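
To spell out the equivalence (just a sketch of the math, not of cuDNN's internals): for a 1-D valid cross-correlation y[i] = sum_j x[i+j] w[j] with kernel size k, the gradient w.r.t. the input is the zero-padded output gradient correlated with the flipped kernel, i.e. exactly a "forward" call with padding k-1:

    \frac{\partial L}{\partial x}[p]
        = \sum_{j=0}^{k-1} \frac{\partial L}{\partial y}[p - j]\, w[j]
        = \Big( \operatorname{pad}_{k-1}\!\Big(\frac{\partial L}{\partial y}\Big) \star \operatorname{flip}(w) \Big)[p]

(out-of-range entries of the output gradient are taken as zero), so in principle the backward pass w.r.t. the input could indeed reuse the forward kernel.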

f0k avatar Sep 18 '14 13:09 f0k

To see if that is really a problem, you can try the new version of the op in https://github.com/Theano/Theano/pull/2117. If used directly, the grad will use the Backward calls.

Also, I don't know what sort of timings you have, but I know that if you rely on the optimization to use cudnn for the gradient, you will end up with a bunch of potentially useless copies (because theano flips the kernels manually before the convolution and I have to do a call to gpu_contiguous). This might be the source of the slowdown on the gradient part.

It might be possible to avoid the gpu_contiguous call in most cases, but cudnn officially does not support negative strides, which tend to happen somewhat often in the convolution case. More work (but not too much) would be needed to make it work without it.

abergeron avatar Sep 18 '14 16:09 abergeron

I re-ran the test with this patch:

diff --git a/theano/pylearn2_benchmark.py b/theano/pylearn2_benchmark.py
index a110628..be4c75d 100644
--- a/theano/pylearn2_benchmark.py
+++ b/theano/pylearn2_benchmark.py
@@ -203,7 +203,17 @@ for run in runs:
         mode = theano.compile.get_default_mode()
         mode = mode.including('cudnn')
         benchmark_three_ways('(experimental, auto) theano.sandbox.cuda.dnn.GpuDnnConv',
-                                sharedX, sharedY, sharedW, X, Y, gW, gX, mode)
+                             sharedX, sharedY, sharedW, X, Y, gW, gX, mode)
+
+        mode = theano.compile.get_default_mode()
+        mode = mode.including('gpu')
+        dnnX = theano.sandbox.cuda.CudaNdarrayType((False, False, False, False))(name='dnnX')
+        dnnY = theano.sandbox.cuda.dnn.dnn_conv(sharedX, sharedW, border_mode='valid', subsample=(dw,dh))
+        dnngW = theano.grad(None, wrt=sharedW, known_grads={dnnY: sharedY})
+        dnngX = theano.grad(None, wrt=sharedX, known_grads={dnnY: sharedY})
+        benchmark_three_ways('(experimental, manual) '
+                             'theano.sandbox.cuda.dnn.GpuDnnConv',
+                             sharedX, sharedY, sharedW, dnnX, dnnY, dnngW, dnngX, mode)

     # benchmark caffe-like gemm convolution
     # Mimic Theano flag THEANO_FLAGS=optimizer_including=conv_gemm

And got these results:

Using gpu device 0: GeForce GTX 750 Ti
Note: pycuda not available, no timing via CUDA events possible

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
theano.tensor.nnet.conv.conv2d                                         ==> fprop           ==>     1885.0
theano.tensor.nnet.conv.conv2d                                         ==> bprop inputs    ==>    54169.0
theano.tensor.nnet.conv.conv2d                                         ==> bprop weights   ==>     1836.0

(experimental) theano.sandbox.cuda.fftconv.conv2d_fft fprop: FAILED
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop inputs: FAILED
(experimental) theano.sandbox.cuda.fftconv.conv2d_fft bprop weights: FAILED

(experimental, auto) theano.sandbox.cuda.dnn.GpuDnnConv                ==> fprop           ==>      268.0
(experimental, auto) theano.sandbox.cuda.dnn.GpuDnnConv                ==> bprop inputs    ==>     3262.0
(experimental, auto) theano.sandbox.cuda.dnn.GpuDnnConv                ==> bprop weights   ==>      596.0

(experimental, manual) theano.sandbox.cuda.dnn.GpuDnnConv              ==> fprop           ==>      269.0
(experimental, manual) theano.sandbox.cuda.dnn.GpuDnnConv              ==> bprop inputs    ==>      288.0
(experimental, manual) theano.sandbox.cuda.dnn.GpuDnnConv              ==> bprop weights   ==>      326.0

(experimental, auto) theano.sandbox.cuda.blas.GpuCorrMM                ==> fprop           ==>      393.0
(experimental, auto) theano.sandbox.cuda.blas.GpuCorrMM                ==> bprop inputs    ==>      415.0
(experimental, auto) theano.sandbox.cuda.blas.GpuCorrMM                ==> bprop weights   ==>      580.0

(experimental, manual) theano.sandbox.cuda.blas.GpuCorrMM              ==> fprop           ==>      393.0
(experimental, manual) theano.sandbox.cuda.blas.GpuCorrMM              ==> bprop inputs    ==>      415.0
(experimental, manual) theano.sandbox.cuda.blas.GpuCorrMM              ==> bprop weights   ==>      582.0

I know these might seem slow compared to what is in the repo, but I don't have a Titan Black, just a 750 Ti.

It seems that the grad_input case is not really optimized, or there is just too much copying, when relying on the optimization. But when using the op manually, it's fine.

abergeron avatar Sep 19 '14 21:09 abergeron

Do you happen to have some kind of evaluation of the error rates across each platform? I'm curious to see how widely they vary from one approach to the next.

Thanks!

tranlm avatar Jun 29 '16 17:06 tranlm