
[April 2015] Revamp Benchmarks, move to Titan-X (Digits box)

Open · soumith opened this issue 9 years ago • 23 comments

List of libraries to rerun for Titan-X: Layer-wise benchmarks

  • [x] - Caffe
  • [x] - CuDNN
  • [x] - Torch
  • [x] - FBFFT
  • [x] - Theano
  • [x] - DeepCL
  • [x] - cuda-convnet2
  • [x] - NervanaGPU (limited release beta)
  • [ ] - cxxnet (maybe, last I tried, building and benchmarking was quite hard)
  • [ ] - ccv
  • [ ] - nnbhwd

Imagenet models benchmarks

  • [x] - Caffe (@dadaism, you said you'd help with this; any progress?)
  • [x] - CuDNN
  • [x] - Torch
  • [x] - FBFFT
  • [x] - cuda-convnet2
  • [x] - NervanaGPU (limited release beta)
  • [ ] - Theano
  • [ ] - nnbhwd

Target date: April 15th, 2015

Initial multi-gpu benchmark (4-GPU).

  • [ ] - Torch
  • [ ] - Purine
  • [ ] - Caffe's parallel branch (maybe, if it all builds well; there are lots of issues on GitHub about build problems with the parallel branch)

Target date: April 24th, 2015

soumith avatar Apr 06 '15 18:04 soumith

Sweet, looking forward to this! Also slightly jealous ;)

benanne avatar Apr 06 '15 20:04 benanne

Second the Jealousy!

liuliu avatar Apr 06 '15 20:04 liuliu

@soumith, thanks for the reminder. alexnet, vgg_a, and overfeat are done. I will get googlenet done tonight and test it a bit. I will open a PR on Tuesday (April 7) night. Looking forward to seeing the results. :)

dadaism avatar Apr 07 '15 00:04 dadaism

I just finished up benchmarking on Titan-X.

What's extremely exciting is how fast GPUs are getting, and even more exciting is the fact that people are pushing the limits of GPUs every day.

Starting with the imagenet-winners benchmarks: AlexNet, Overfeat, VGG-A.

The BIG winner of this round is NervanaSys's Maxwell-only convolution kernels, written by @scott-gray et al. A big shout-out to them. They gave me an early release of their kernels because they're a small company and don't have the bandwidth to support a public release. If you're pretty serious about your convnets, drop them an email. They have both 16-bit and 32-bit kernels, both with impressive performance.

The FB-FFT kernels by Facebook come a close second. They've improved quite a bit since their open-source release in December 2014; Nicolas Vasilache and Jeff Johnson (@wickedfoo) have been improving them constantly.

cuda-convnet2 is still the leader of the old bunch, even on Maxwell. NVIDIA's CuDNN comes 4th, but it is much more flexible than cuda-convnet2 (no batch-size or feature-map restrictions) and than FBFFT (zero-padding comes for free).

On the layer-wise benchmarks, FB-FFT is ahead of the pack by almost 4x. These are still somewhat useful because not all libraries have imagenet benchmarks. I could not benchmark Theano-FFT because of an error that's showing up, but with @benanne's help I should get those numbers out pretty soon as well.

Conclusion:

  • If you run imagenet-style networks, go bug the NervanaSys guys, their kernels are pulling numbers that make me do backflips!
  • If you have convnets with large kernel sizes (anything above 7x7), just go with FB-FFT. It's constantly being improved as well. (A rough sketch of why FFT wins at large kernel sizes is included right after this list.)
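
As a back-of-the-envelope illustration of that last point (my own sketch with illustrative numbers, not taken from the benchmark scripts): direct convolution work grows with the filter area R*S, while the FFT path pays for transforms plus a pointwise complex multiply and barely notices the filter size, so the gap widens as kernels grow.

    import math

    def direct_conv_flops(N, C, K, H, W, R, S):
        """Approximate FLOPs for direct convolution ('same' padding, stride 1)."""
        return 2.0 * N * K * C * H * W * R * S

    def fft_conv_flops(N, C, K, H, W):
        """Very rough FFT-convolution estimate: 2D FFTs of the input, filter and
        output planes plus a pointwise complex multiply-accumulate.
        Constants are hand-waved; the point is the missing R*S factor."""
        fft_per_plane = 5.0 * H * W * math.log2(H * W)
        transforms = (N * C + C * K + N * K) * fft_per_plane
        pointwise = 8.0 * N * C * K * H * W
        return transforms + pointwise

    # Illustrative layer: batch 128, 64 -> 64 feature maps, 64x64 spatial planes.
    for R in (3, 7, 13):
        ratio = direct_conv_flops(128, 64, 64, 64, 64, R, R) / fft_conv_flops(128, 64, 64, 64, 64)
        print("%2dx%-2d kernel: direct/FFT FLOP ratio ~ %.1f" % (R, R, ratio))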

Cheers, S

soumith avatar Apr 17 '15 05:04 soumith

Wow, nice work!

liuliu avatar Apr 17 '15 06:04 liuliu

So where are the benchmarks? :-)

I'm trying to decide if I should upgrade from my 6GB GTX 780s.

Thanks, Sean

skelleher avatar Apr 17 '15 06:04 skelleher

@skelleher right on the front-page readme: https://github.com/soumith/convnet-benchmarks/blob/master/README.md

It's interesting to compare just the GPUs themselves. For example, Titan-X vs Titan-Black (which is slightly faster than the 780 and about as fast as the 780 Ti): Titan-Black: https://github.com/soumith/convnet-benchmarks/tree/de7cfc32a93f5e14863666e3d3636aca662824fa#imagenet-winners-benchmarking

Titan-X: https://github.com/soumith/convnet-benchmarks#imagenet-winners-benchmarking

Titan-X is quite a bit faster. Just comparing CuDNN R2 on both cards, AlexNet is 1.5x faster on Titan-X. If you also take Nervana's kernels into account, Titan-X becomes 3.7x faster than Titan-Black.
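
To be explicit about the bookkeeping: the 3.7x compares Titan-Black running cuDNN R2 against Titan-X running Nervana's kernels, i.e. the hardware gain and the kernel gain combined. A tiny sketch with placeholder timings (not the README figures; plug in the real ones from the two links above):

    # Placeholder AlexNet timings in ms per mini-batch (forward + backward).
    # NOT the published numbers; substitute the values from the two README links above.
    timings_ms = {
        "Titan-Black + cuDNN R2": 300.0,
        "Titan-X + cuDNN R2": 200.0,
        "Titan-X + Nervana": 80.0,
    }

    baseline = timings_ms["Titan-Black + cuDNN R2"]
    for setup, ms in timings_ms.items():
        print("%-24s %6.1f ms   %.2fx vs Titan-Black + cuDNN R2" % (setup, ms, baseline / ms))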

soumith avatar Apr 17 '15 06:04 soumith

Awesome work @soumith, thanks for doing this :)

Those are some incredible numbers! I knew we had some leeway on Maxwell since almost all the code out there right now seems to be Kepler-optimized, but I was expecting the gains to be quite a bit smaller. I didn't know fp16 was possible yet, either! Really impressive stuff. Kudos to @scott-gray!

How 'flexible' are the Nervana kernels in terms of sizes/strides etc.? Since that seems to be the main benefit of cuDNN right now. I'm really looking forward to cuDNN v3 because I believe that's supposed to come Maxwell-optimized as well (and won't require any extensive wrapping).

Regarding Theano's FFT convolution - I don't think it's worth spending time on that anymore, to be quite honest. It was a "lazy" wrapper in pure Python (using scikits.cuda), so it was never intended to be competitive and was supposed to be replaced by a C-based wrapper anyway (there were plans for this at some point, but I don't know what the current situation is). I only wrote it because there was nothing else at the time.

benanne avatar Apr 17 '15 07:04 benanne

My kernels are at least as flexible as cuDNN (though I left out upscaling and cross-correlation mode, as I don't think they're used much). The kernels are wrapped in a fairly easy-to-use Python wrapper, complete with automatically compounding elementwise operations. Here is the interface to the conv and pooling layers:

    def conv_layer(self, dtype,
            N, C, K, 
            D=1, H=1, W=1,
            T=1, R=1, S=1,
            pad_d=0, pad_h=0, pad_w=0,
            str_d=1, str_h=1, str_w=1):
        """
        Create a new ConvLayer parameter object.

        N: Number of images in mini-batch
        C: Number of input feature maps
        K: Number of output feature maps

        D: Depth  of input image
        H: Height of input image
        W: Width  of input image

        T: Depth  of filter kernel
        R: Height of filter kernel
        S: Width  of filter kernel

        padding: amount of zero-padding around the given edge
        strides: factor to step the filters by in a given direction
        """

    def pool_layer(self, dtype,
            op, N, C, 
            D=1, H=1, W=1,
            J=1, T=1, R=1, S=1,
            pad_j=0, pad_d=0, pad_h=0, pad_w=0,
            str_j=None, str_d=None, str_h=None, str_w=None):
        """
        Create a new PoolLayer parameter object.
        This then is passed as an argument to all pooling kernels.

        op: max, avg, l2 pooling
        N: Number of images in mini-batch

        C: Number of input feature maps
        D: Depth  of input image
        H: Height of input image
        W: Width  of input image

        J: Size of feature map pooling window (maxout n_pieces)
        T: Depth  of pooling window
        R: Height of pooling window
        S: Width  of pooling window

        padding: amount of zero-padding around the given image or feature map edge
        strides: factor to step the window by in a given direction (overlap allowed)

        Leave spatial dimensions at 1 to allow feature map pooling in the fc layers.
        """

Also included in the lib is a full set of gemm kernels for fp16 and fp32 that generally far outperform cuBLAS, particularly at small minibatch sizes.

When using fp16, stochastic rounding is also available for all kernels. Additionally all gemm and conv kernels have the option to apply a relu prior to output, saving you some elementwise overhead.

For fp16 we're currently developing techniques to train large networks with generally the same performance as fp32. We should be publishing some of those results in the near future.

Oh and all of this (except for the element-wise stuff and python wrapper) was written in shader assembly using maxas (https://github.com/NervanaSystems/maxas).

scott-gray avatar Apr 17 '15 07:04 scott-gray

That is awesome news! That Python wrapper looks really sweet as well, it should make Theano integration a breeze! I'd be very interested to play around with it :)

benanne avatar Apr 17 '15 08:04 benanne

Very cool indeed!

it should make Theano integration a breeze!

The license prevents incorporating it directly in Theano, but it should be possible to do a separate module providing Theano Ops and registering the necessary optimizers. However, having a closer look at the code, the provided wrappers seem to assume the data is provided in fp16 format on the CPU. For Theano, we'd instead need Ops for converting float32 to fp16 and back on the device (using code from flexpt_ew.py, I guess), as well as Ops performing the different convolution (and possibly pooling) passes, plus gemm. So there's still a bit of work to do. I'm not really sure where to discuss this further; the license ties everything closely to Nervana Systems (requiring prior approval of resulting publications), so maybe this should be discussed on their repository rather than Theano's. I'll keep that in mind until some upcoming paper deadlines have passed and some Maxwell GPUs have arrived :)

f0k avatar Apr 20 '15 16:04 f0k

@f0k The repo I benchmarked is: https://github.com/NervanaSystems/nervanagpu . It is private and you need to request access from the Nervana guys. In this new repo they have ops that take in float32 as well.

soumith avatar Apr 20 '15 17:04 soumith

@f0k, the current license for the nervanagpu library is a place-holder. When we release, it will be an open source variant allowing you to integrate and publish freely. We have a big unrelated project in-house we need to complete in the next few weeks. We should have time afterwards to clean up and release this library. In the interim, I'd appreciate feedback under Issues in the nervanagpu repo.

khosra avatar Apr 20 '15 17:04 khosra

Just to note, we also have access to it and have part of it already wrapped in Theano. If you have access to the Nervana code, I can give access to the wrapper.

nouiz avatar Apr 20 '15 17:04 nouiz

@soumith, hope I don't hijack your thread, but if you want access to the nervanagpu repo, please post here and I will add your github id. (Also please consider joining Nervana -- we are looking for engineers to help @scott-gray with these kernels.)

khosra avatar Apr 20 '15 18:04 khosra

Now that's a wealth of great news! :) I don't have a Maxwell GPU to work with yet, but it can't hurt to have a look and follow along with the Theano integration. @khosra, I'd be glad to get access to the nervanagpu repository, thanks!

f0k avatar Apr 20 '15 20:04 f0k

Sorry @soumith for being off-topic, but @khosra - is it possible for me to get access to the nervanagpu repository as well? I would love to play around with the kernels. I was a postdoc at NYU, and have helped with porting Caffe convolutions to Theano at some point (along with @f0k and others) and with 3D convolution for Torch. Thanks!

stencilman avatar Apr 21 '15 03:04 stencilman

Completely forgot: @hughperkins's DeepCL numbers have now been added. DeepCL is unique in the sense that it's the only real contender at this point for OpenCL-based deep learning. Huge props to Hugh for taking it upon himself to make all the AMD card owners happy. The numbers don't look great compared to the CUDA-based stuff, but the CUDA stuff started pretty slow as well, so in due time I'm sure OpenCL will catch up to reasonable numbers.

soumith avatar Apr 21 '15 03:04 soumith

@khosra I would be interested as well! I have a Maxwell GPU and some free time in the next few weeks.

@soumith Interesting to see some OpenCL numbers as well. Thanks!

JeffreyDF avatar Apr 21 '15 08:04 JeffreyDF

Great! Thank you, Soumith. Really great to have a place on your benchmarking board :-)

hughperkins avatar Apr 21 '15 10:04 hughperkins

Good job! Thanks

wepe avatar Apr 22 '15 09:04 wepe

@soumith, we made our library public under the Apache license. Again, it's here if anyone wants to try it: https://github.com/NervanaSystems/nervanagpu.

khosra avatar May 04 '15 05:05 khosra

Good job! I want to know how cxxnet would rank in these benchmarks. And how about training the same VGG-like model on the CIFAR dataset? The total training time would be worth comparing.

dzhwinter avatar Jun 15 '15 06:06 dzhwinter