
DeepMark

Open soumith opened this issue 8 years ago • 125 comments

Hi all,

The reason I've been slow on convnet-benchmarks these days is because I've been working on the side on DeepMark.

I initially wrote convnet-benchmarks to increase competition among frameworks so that we can work towards faster ConvNets, and they served their purpose well. After the release of convnet-benchmarks, multiple frameworks pulled up their socks to speed up convnets, with a deep sense of prestige for being on top of these benchmarks. In these two years, we as a community accelerated GPU ConvNets across all frameworks by 4x to 10x, efficiently implementing tricks such as FFT and Winograd convolutions, powered by faster hardware. Alex Krizhevsky, Yangqing Jia, Scott Gray, Nicolas Vasilache, Sander Dieleman, Michael Mathieu, Julien Demouth and many other human compilers helped make this a reality -- looking at the diversity in terms of where each of us work(ed) professionally shows that this kind of acceleration was truly a community effort with a ton of openness, something that is plain awesome! :) I've also enjoyed reading the deeply technical discussions that take place on convnet-benchmarks (my favorites in recent times: https://github.com/soumith/convnet-benchmarks/issues/93 , https://github.com/soumith/convnet-benchmarks/issues/66 , https://github.com/soumith/convnet-benchmarks/issues/59 ).

Moving on, convnet-benchmarks do not accurately capture everything we think of when we say deep learning. We don't have Recurrent Nets, we don't have video use-cases, speech, NLP etc. There is a need for such comprehensive benchmarks, especially as the space is getting ready for dedicated hardware chips, multi-GPU and multi-machine frameworks and more complex use-cases.

I've sat down with a few of you at NIPS and GTC to discuss and freeze the initial round of benchmarks for what I am calling DeepMark. My initial plan was to write the initial set of benchmark scripts myself, covering the most popular frameworks, and then let the direction and maintenance of the benchmarks be community-driven. But the breadth of this effort has been overwhelming to say the least. After careful thought, I've decided to ask everyone to pitch in for their part of the benchmarks by writing scripts etc., especially as many of you were very receptive to the idea offline.

Here are the initial set of use-cases we want to cover:

Networks

Images

  • InceptionV3-batchnorm (http://arxiv.org/abs/1512.00567 , https://github.com/Moodstocks/inception-v3.torch)
  • Alexnet-OWT
  • VGG
  • ResNet-50 ( http://arxiv.org/abs/1512.03385 , https://github.com/facebook/fb.resnet.torch )

Video

  • C3D - A vgg-style 3D net ( http://vlg.cs.dartmouth.edu/c3d/ )

Audio

  • DeepSpeech2 - Convnet + RNN + FC ( http://arxiv.org/abs/1512.02595 )
  • MSR's 5 layer FC net ( https://github.com/Alexey-Kamenev/Benchmarks )

Text

  • Small RNN LSTM ( https://github.com/karpathy/char-rnn/blob/master/train.lua#L38-L48 )
  • Large RNN LSTM ( BIG-LSTM in http://arxiv.org/abs/1602.02410 )

Platform

  • Initially multi-GPU (1 to 4 Titan X cards)
  • However, multi-machine setups, custom hardware, other GPU cards such as AMD, CPUs etc. can and should be accommodated; we will work this out after the initial push.

Metrics

  • Round-trip time for 1 epoch of training (will define an epoch size separately for each network)
  • Maximum batch-size that fits (to show and focus on the extra memory consumption that the framework uses)

Frameworks

Everyone who wants to join in is welcome, but I thought an initial set that is important to cover would be:

  • Caffe
  • Chainer
  • MXNet
  • Neon
  • Theano
  • TensorFlow
  • Torch

Scripts format

  • Emit JSON output (so that the README or Jekyll website can be auto-generated, similar to http://autumnai.com/deep-learning-benchmarks ); see the illustrative example below
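The exact schema hasn't been fixed; as a purely illustrative example (all field names and values below are made up), a benchmark script might emit something like:

```python
import json

# Purely illustrative output record; the real schema is still to be decided.
result = {
    "framework": "torch",
    "network": "resnet-50",
    "num_gpus": 1,
    "epoch_time_seconds": 612.4,   # metric 1: round-trip time for one epoch
    "max_batch_size": 112,         # metric 2: largest batch size that fits in memory
}
print(json.dumps(result))
```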

Guarantees

  • I will personally, to the best of my abilities, make sure that the benchmarking is fair and unbiased. The hope is that the community at large will watch these and point out / fix mistakes.

Governance

  • The benchmarks will be placed at https://github.com/DeepMark/deepmark and other key community members / organizations who want ownership are welcome to join in proposing new benchmarks that become relevant as the field progresses.

Timeframe

  • Initial Release: June 15th, 2016

My hope is that this new set of benchmarks will not only increase competition but will also benefit the community in other ways, e.g. by serving as common examples to get started.

Let me know what you think :) Soumith

cc: @hughperkins @f0k @scott-gray @rajatmonga @vrv @benanne @nouiz @Yangqing @tqchen @unnonouno

soumith avatar Apr 14 '16 23:04 soumith

Oh lastly, a good timeline for this would be to get an initial round of benchmarks by June 15th (since I only gave some of you a heads-up right now)

soumith avatar Apr 14 '16 23:04 soumith

So awesome and useful. What are the data sets one should benchmark on? ImageNet, CIFAR10? It would also be nice to compare the accuracy of current implementations for each framework (although that would probably be a lot of work).

daviddao avatar Apr 14 '16 23:04 daviddao

For text, I'd hope to expand beyond just RNN character generation. It doesn't capture many of the complexities of other models, such as variable sequence lengths or bidirectional RNNs.

The Attention Sum Reader is a simple architecture (bidirectional GRU + dot product) that currently holds the state of the art and could allow for optimizing over sequences of different lengths, a major issue in RNNs. The model also comes with datasets of three different sizes -- small (Children's Book Test), medium (CNN), and large (Daily Mail) -- which are publicly available.

Smerity avatar Apr 15 '16 01:04 Smerity

This is great, thanks for organizing this! One thing I've also been thinking about like @daviddao is how to validate that the models are actually computing the same thing -- I've seen some benchmarks elsewhere that have raised personal doubts that the benchmark code written in different frameworks are computing the same function. As part of the benchmark framework, maybe the API could include a way to validate that given specified initialization and input, the outputs (forward and backward) are approximately equal. Open to thoughts :). Cheers!
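A rough sketch of what such a validation helper could look like, assuming each framework dumps its forward output and input gradient for a fixed initialization and input to an .npz file (file names, keys, and tolerances below are placeholders):

```python
import numpy as np

def outputs_match(reference_npz, candidate_npz, rtol=1e-4, atol=1e-5):
    """Compare forward/backward tensors dumped by two framework implementations."""
    ref, cand = np.load(reference_npz), np.load(candidate_npz)
    ok = True
    for key in ("forward_output", "input_gradient"):
        same = np.allclose(ref[key], cand[key], rtol=rtol, atol=atol)
        print(f"{key}: {'OK' if same else 'MISMATCH'}")
        ok = ok and bool(same)
    return ok

# e.g. outputs_match("torch_resnet50.npz", "tensorflow_resnet50.npz")
```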

vrv avatar Apr 15 '16 02:04 vrv

Nice! I am very excited for this.

I have https://github.com/craffel/lstm_benchmarks, which is an out-of-date benchmark of Theano vs. rnnlib vs. currennt (which, at the time I wrote the benchmarks, were essentially the only options for LSTM). The task was CHiME noisy speech recognition, which has pretty limited adoption, so I would not strongly advocate for it being added as a task. And I assume that rnnlib and currennt shouldn't be included in these benchmarks as they are RNN-only, right?

I'll be happy to contribute to some of the Theano RNN benchmarks once it becomes appropriate to do so.

This is great, thanks for organizing this! One thing I've also been thinking about like @daviddao is how to validate that the models are actually computing the same thing -- I've seen some benchmarks elsewhere that have raised personal doubts that the frameworks are computing the same function.

This would be very cool, but from my own experience with the LSTM benchmark it can be very difficult - you have to make sure literally every hyperparameter is identical, and you effectively can't use any RNGs. Not to say it's impossible, but it would add a lot of overhead to implementing new benchmarks.

craffel avatar Apr 15 '16 03:04 craffel

Caveat: per Paul Graham, it's better to go deep and do something very well than to blur one's focus over many things. I worry gently that with too many benchmarks:

  • each benchmark less well maintained
  • more confusing to read

hughperkins avatar Apr 15 '16 08:04 hughperkins

:+1:

What are the data sets one should benchmark on? ImageNet, CIFAR10?

Training on something like ImageNet would move the focus away from pure computation towards fast dataset iteration -- this would be interesting as well, but should probably become a separate benchmark, since not all frameworks actually provide any tools for this. The other extreme would be training on random dummy data (like samples from a Gaussian), but this makes sense only if we can guarantee the running time does not depend on the input data. So probably we should have some realistic set of inputs for each task, just large enough to fill two batches or so?

As part of the benchmark framework, maybe the API could include a way to validate that given specified initialization and input, the outputs (forward and backward) are approximately equal.

This seems useful. It requires initial model parameters to be dumped in some format and loaded into each framework, but it would help to ensure that all implementations are the same.

f0k avatar Apr 15 '16 10:04 f0k

This seems useful. It requires initial model parameters to be dumped in some format and loaded into each framework, but it would help to ensure that all implementations are the same.

In the strongest case, weight initialization could be defined precisely as:

  • a precise order of initialization, e.g. by layer, then by input feature plane, then by output feature plane, then by height, then by width (for example)
  • a precise random function to use (e.g. mt19937)
  • the exact seed to use
  • (edit: and of course the exact initialization function to use, e.g. sqrt(num_inputs) * 0.1, or similar -- see the sketch after this list)
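A minimal numpy sketch of such a fully pinned-down initialization (the ordering, seed, and scale below are arbitrary illustrative choices; numpy's RandomState is MT19937-based, and the resulting tensors could also simply be saved to a file and loaded by each framework):

```python
import numpy as np

def init_conv_weights(out_planes, in_planes, height, width, seed=1234):
    """Deterministic weight init: fixed RNG (MT19937), fixed seed, fixed fill order."""
    rng = np.random.RandomState(seed)   # MT19937 under the hood
    w = np.empty((out_planes, in_planes, height, width), dtype=np.float32)
    scale = 0.1 / np.sqrt(in_planes * height * width)   # illustrative scale only
    # fill order: input feature plane, then output feature plane, then height, then width
    for i in range(in_planes):
        for o in range(out_planes):
            for h in range(height):
                for x in range(width):
                    w[o, i, h, x] = rng.randn() * scale
    return w
```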

hughperkins avatar Apr 15 '16 13:04 hughperkins

In the strongest case, weight initialization could be defined precisely as [...]

I guess getting the same stream of pseudo-random values in all different frameworks is more difficult than importing a set of tensors into all different frameworks. We wouldn't want to exclude candidates from being benchmarked because they fail to implement the same RNG.

f0k avatar Apr 15 '16 13:04 f0k

I guess getting the same stream of pseudo-random values in all different frameworks is more difficult than importing a set of tensors into all different frameworks. We wouldn't want to exclude candidates from being benchmarked because they fail to implement the same RNG.

Having gone through the exact same process, to compare DeepCL with convnetjs, I found it significantly easier to make convnetjs use the exact same weight generator as DeepCL than to load weights from a file https://github.com/hughperkins/DeepCL/blob/master/prototyping/convnetjs-reference/testconvnet2.js#L143-L201 . It was a long time ago, so I don't remember why. I do remember I initially tried writing weights to a file, though, and I couldn't get it to work as easily as syncing weight generators, for some reason.

hughperkins avatar Apr 15 '16 14:04 hughperkins

I found it significantly easier to make convnetjs use the exact same weight generator as DeepCL, than to load weights from a file

If that's the case, one could of course create the initial weights in a reproducible way and save them to files, so implementers for the different benchmarked frameworks can choose whatever is easiest. (Note that loading from files has the additional benefit of documenting how to load foreign model parameters into each framework.) Umm... are we supposed to discuss such details here or over at the deepmark repository?

f0k avatar Apr 15 '16 16:04 f0k

@daviddao @vrv @hughperkins @f0k for V1, I thought we should just go with synthetic data. It's very hard to set up to-convergence benchmarks, as there are very fine details wrt convergence being guaranteed; for example, some of the models (like googlenetv3) have taken a year to reproduce outside of the paper.

@Smerity In terms of evaluating perf, we can add a bidirectional RNN too. In fact, DeepSpeech2 has bidirectional-RNNs, so that should be sufficient?

@vrv definitely a great idea, but very, very hard and takes a ton of resources. I feel like at least for V1, we should just go with code review + synthetic data.

@craffel Awesome! At the moment I don't have a point of contact for Theano; maybe a combination of you, @f0k and @benanne could work (especially if they're implemented in Lasagne).

soumith avatar Apr 15 '16 16:04 soumith

(especially if they're implemented in Lasagne).

That would be nice :) though I am personally interested in which of the Theano-based libraries manages to eke out the most performance, since their implementations are not identical.

craffel avatar Apr 15 '16 16:04 craffel

Thanks @soumith for organizing this effort! I think this would definitely help us advance the field to the next level.

I am also very interested in benchmarking not only the training pipeline, but a wider range of evaluation criteria. The reason is as follows: if I may make a bold claim, I believe that all frameworks will again very quickly converge to the same performance, because there is no fundamental difference between them. What we saw at convnet-benchmarks is that almost everyone is using the same underlying library, so we are effectively benchmarking framework overheads -- something that is good to know, of course, but which seems to be overwhelmed by other factors, such as ease of use etc.

Given the wide attention this benchmark gets, I think it would be great if we could draw attention to some of the more practical issues, such as small batch sizes at deployment time -- several frameworks (including some non-open-source production systems I've worked on) have historically ignored this, and I think it is worthwhile to invite people to invest more in this direction.

I don't have a perfect idea on this, of course. One thing we could do is simply benchmark different batch sizes, but a more complex, and potentially more useful, way is probably to set up a harness that simulates requests generated from a Poisson distribution with latency requirements, and see whether frameworks can address that in an optimal fashion -- this might be too application-specific, though. Just my 2 cents.
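A serial toy sketch of such a harness (serve_fn stands for a hypothetical single-request inference call; the rate and SLO values are made up, and a real harness would overlap requests rather than issue them one at a time):

```python
import random
import time

def poisson_load_test(serve_fn, rate_qps=100.0, duration_s=10.0, latency_slo_ms=10.0):
    """Fire requests with exponential inter-arrival times (a Poisson process)
    and report what fraction meet a latency SLO."""
    rng = random.Random(0)
    latencies = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        time.sleep(rng.expovariate(rate_qps))      # exponential inter-arrival gaps
        t0 = time.time()
        serve_fn()                                 # batch size 1, as in deployment
        latencies.append((time.time() - t0) * 1000.0)
    met = sum(l <= latency_slo_ms for l in latencies) / max(len(latencies), 1)
    return {"requests": len(latencies), "fraction_within_slo": met}
```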

(Also adding @ajtulloch to the conversation. Andrew first raised this point when we were discussing offline.)

Yangqing avatar Apr 15 '16 16:04 Yangqing

How about per-pixel scene labelling and optical flow?

Regards

-David

moloned avatar Apr 15 '16 17:04 moloned

I agree that we should evaluate more aspects. Some of them are already covered in this proposal, for example:

  • Memory consumption
  • Parallelization and Scheduling overhead

One of the most important factors, the tradeoff between ease of use and optimization, is unfortunately not easy to benchmark, as everyone has their own taste.

What @Yangqing suggested is more about measuring perf for production and serving pipelines, which could be a whole new direction, as this benchmark is primarily about training. One alternative could be a deep-serving benchmark under the DeepMark organization dedicated to this topic.

tqchen avatar Apr 15 '16 17:04 tqchen

It would be interesting to log the power dissipation in each testcase, as well as fps, memory BW, FLOPS etc.

moloned avatar Apr 15 '16 17:04 moloned

It would be interesting to log the power dissipation in each testcase

I like this idea. A Titan X draws 250 watts peak (I think?). Running 24 hours a day for a year, 250 watts comes to about ~600usd, which is in the same order of magnitude as the purchase price.

And power dissipation is going to become the main bottleneck plausibly in years to come. ("And over here we have our farm of 1000 Titan 2026s, and over there is the 20MW pebble bed we are using to power them" :-) )

hughperkins avatar Apr 15 '16 17:04 hughperkins

@Yangqing

"if I may make a bold claim, I believe that all frameworks will again very quickly converge to the same performance, because there is no fundamental difference between them."

Agreed. Soumith's current benchmarks are useful, but they mainly evaluate "who can make the thinnest wrapper around cuDNN, Neon, or similar?"

It would be useful to benchmark implementations of emerging algorithms for which tuned libraries may not yet exist -- certain versions of LSTMs and RNNs, for instance.

forresti avatar Apr 16 '16 00:04 forresti

@forresti yea, for historical context it used to be not like that, but it is like that now. I think for LSTMs and RNNs, a lot of perf is still up for grabs.

soumith avatar Apr 16 '16 00:04 soumith

Agreed. Soumith's current benchmarks are useful, but they mainly evaluate "who can make the thinnest wrapper around cuDNN, Neon, or similar?"

To be fair, cuDNN and Neon are competing with each other. The OpenCL implementations mostly copy the Caffe CUDA implementation of im2col as far as I know :-D but have different performance from cuDNN. There is also 16-bit vs 32-bit.

(Edit: by the way, the cuDNN vs Neon comparison is exactly what comes to mind about power consumption. I don't know if it's still the case, but as far as I know it used to be the case that cuDNN ran cooler than Neon, and it'd be useful to be able to see this in the results.)

hughperkins avatar Apr 16 '16 00:04 hughperkins

@hughperkins Good point. I didn't mean to imply that there isn't a diverse array of low-level computational libraries for DNNs.

To tune up my comment a bit: "When doing speed/efficiency benchmarks, it's hard to avoid conflating low-level computational libraries (cuDNN, Neon, various OpenCL efforts, ...) and higher-level frameworks (Caffe, Torch, Tensorflow, Theano, ...)."

forresti avatar Apr 16 '16 01:04 forresti

I would say that convolution is far from a solved problem. I still have a long list of optimizations I want to make. The biggest area to explore is how to best leverage lower precision without sacrificing accuracy. The obvious target there would be xnor nets but maybe a bit more precision is best for the highest levels of accuracy. The 4x int8 performance that Pascal will soon have (unfortunately not in P100 though) is a tantalizing format to target. And also obviously the native fp16 support.

Another area is better efficiency at smaller batch sizes. I have some brand new work there that I'd like to show off. This is important for both inference and scaling to many nodes.

Power comparisons are useful but only when looking at implementations that have the same computational throughput. Or just use some kind of flops/watt metric. With my newer kernels I'm getting very good at squeezing the most out of cache utilization and hence I'm hitting and maintaining higher boost clocks (while using smaller and more versatile tiles).

As for the frameworks, the big area to focus on is graph based optimizations. Maximizing data locality (compounding), memory allocation vs compute trade-offs, auto-parallelizing independent work across streams and gpus, and lots of other creative things computational graphs greatly simplify.

As for synthetic vs real data and parameters.. In fp32 I think only the distribution matters for performance comparisons. But in lower precision like fp16 it's very easy to saturate or underflow with synthetic data which leads to far higher performance than is warranted. At the very least you want to account for the fan-in when setting weight magnitudes (Kaiming, Xavier, etc). Batch norm helps a lot here too. Basically you should be able to prove that you can train with the params you benchmark with.

At the end of the day we care about speed and usability. I think these benchmarks should make both pretty clear. For usability you'll be able to inspect the script to see who has the cleanest syntax and solves the most problems for you without extra steps.

scott-gray avatar Apr 16 '16 01:04 scott-gray

@scott-gray That all sounds great. :)

forresti avatar Apr 16 '16 03:04 forresti

just use some kind of flops/watt metric

Well, the ideal would be joules per batch. But I think this will be tricky to measure. Might need some specialized hardware device, that sits on the power bus?

hughperkins avatar Apr 16 '16 09:04 hughperkins

Maybe it wouldn't be quite so tricky. You'd just need to collect some running average of the on chip power stats during the execution of the epoch. Something like this would give you realtime stats:

nvidia-smi -i 1 --loop-ms=333 --format=csv,noheader --query-gpu=power.draw,clocks.gr,temperature.gpu,fan.speed,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown

Or even better tie your benchmark script directly into NVML queries: https://developer.nvidia.com/nvidia-management-library-nvml

But I guess you'd want to be running these queries continuously so maybe as a separate process would be better. You'd just need to synchronize the collection with the execution of the network. Just a small bit of shell scripting should achieve this.

scott-gray avatar Apr 16 '16 09:04 scott-gray

Or even better tie your benchmark script directly into NVML queries: https://developer.nvidia.com/nvidia-management-library-nvml

Interesting. Seems it's just a c-interface, so accessible using ffi etc.

nvmlDeviceGetPowerUsage(nvmlDevice_t device, unsigned int *power);

hughperkins avatar Apr 16 '16 09:04 hughperkins

And python bindings can be found here: https://pypi.python.org/pypi/nvidia-ml-py
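For example, with those bindings a benchmark script could poll power draw in a background thread and report average watts and joules for a run (a rough sketch; the 100 ms sampling interval is arbitrary):

```python
import threading
import time
import pynvml  # from the nvidia-ml-py package

def measure_power(run_benchmark, device_index=0, interval_s=0.1):
    """Run run_benchmark() while sampling GPU power draw via NVML."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples, done = [], threading.Event()

    def poll():
        while not done.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
            time.sleep(interval_s)

    t = threading.Thread(target=poll)
    t.start()
    start = time.time()
    run_benchmark()
    elapsed = time.time() - start
    done.set()
    t.join()
    pynvml.nvmlShutdown()

    avg_watts = sum(samples) / max(len(samples), 1)
    return {"avg_watts": avg_watts, "joules": avg_watts * elapsed}
```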

scott-gray avatar Apr 16 '16 09:04 scott-gray

But, it's worth pointing out that the boost clock is already tightly coupled with these real-time power and temperature measurements so the overall timings should be reflective of this. So perhaps it's not worth the effort.

scott-gray avatar Apr 16 '16 10:04 scott-gray

/cc @naibaf7

bhack avatar Apr 17 '16 17:04 bhack

@soumith Great! I'll make sure to finish my OpenCL cuDNN replacement before initial benchmarking. Can I ask that AMD and Intel GPU (+CPU) benchmarks also be included (where applicable and possible)?

It is also important to me to benchmark not only networks that saturate the GPU load with a big minibatch size but also those that manage to fill GPU memory and exhaust them with only a batch size of N=1. A good example to use here is the U-Net.

Here is a GFLOPs profile of a typical U-Net implementation: [figure: per-layer GFLOPs efficiency profile of a U-Net]

Implementation here is Caffe with im2col (no cuDNN). CUDA card: GTX 980 (similar to a Titan X); AMD card: W9100; Intel CPU: i7-4790K.

It currently only works on Caffe, but TensorFlow will also support it once Olaf Ronneberger makes it available.

naibaf7 avatar Apr 17 '16 17:04 naibaf7

@naibaf7 some of the benchmarks won't be able to run on the AMD cards, which right now seem to be limited to 4GB variants. I have a Fury Nano that I've started playing with. Are Intel GPU benchmarks even relevant? I can run Intel GPU benchmarks on something like a MacBook Pro. Intel CPU, definitely can run.

soumith avatar Apr 17 '16 17:04 soumith

/cc @michaellarabel

bhack avatar Apr 17 '16 17:04 bhack

@soumith A single buffer can't be bigger than 1/4 of the GPU's memory on current AMD GPUs. I think the Fury Nano is not the right candidate to benchmark with.

I have a set of W9100 cards with 16 GB memory that would be more suitable to benchmark AMD.

@soumith please see my updated post above for additional information :)

naibaf7 avatar Apr 17 '16 17:04 naibaf7

My idea to include a U-Net like architecture also ties in with @moloned 's suggestion.

naibaf7 avatar Apr 17 '16 17:04 naibaf7

@naibaf7

W9100s are 2500usd each. You'd need performance to be ~2.5 times faster than Titan X for the price/performance ratio to be attractive. R9 Nanos are 500usd a pop, so you could get two for the price of one Titan X, giving you 8GB total memory, and then use alexnet-style parallelism over these?

hughperkins avatar Apr 17 '16 17:04 hughperkins

@hughperkins If we limit the memory usage to 8 GB the W9100 should perform similarly to R9 390X, which have even higher clock speeds than a W9100 and are clocking in at only 450usd each, roughly 5x less than a W9100. That would be fine at even 2-3x slower than a Titan X (not regarding power consumption).

naibaf7 avatar Apr 17 '16 17:04 naibaf7

Yes. r9-390x are nice :-)

(Edit: it would be really nice if someone could make cloud-available r9-390x's. Since every other cloud offering right now is CUDA, they would at least be unique.)

Edit2: I think it'd be totally fair to use two r9-390x gpus, in a comparison against one Titan X. By the way, Amazon shows r9-390x at 450usd, not 250usd http://www.amazon.com/s?ie=UTF8&page=1&rh=i%3Aaps%2Ck%3Ar9%20390x , but that still means two of them are 100usd cheaper than one Titan X. (or even 3 r9-390x for that matter, Amazon is showing Titan X for around ~1500usd http://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=titan+X&rh=i%3Aaps%2Ck%3Atitan+X ) (edited edit2 a bunch for typos)

hughperkins avatar Apr 17 '16 17:04 hughperkins

@soumith Oh right. My fault on the price conversion. Corrected it above.

naibaf7 avatar Apr 17 '16 18:04 naibaf7

re: power consumption

Ah, hmm, that's a good point. Looks like one r9-390x uses about the same power as a Titan X http://www.guru3d.com/articles-pages/msi-radeon-r9-390x-gaming-8g-oc-review,8.html Anyway, it seems like a nice card to benchmark on, as long as the benchmarks are tweaked to fit in memory (most of them will run in 8GB I think, with slightly smaller batch sizes?) (edit: corrected link address)

hughperkins avatar Apr 17 '16 18:04 hughperkins

Do you have a reference to the paper for MSR's 5 layer FC net?

nervanasys avatar Apr 18 '16 22:04 nervanasys

AFAIK it's just a model for benchmarking, probably derived from some of their fully-connected speech recognition nets. It's similar to the net MSR used in the 1-bit SGD paper for speech recognition (http://research.microsoft.com/pubs/230137/IS140694.PDF), but not identical.

ajtulloch avatar Apr 19 '16 02:04 ajtulloch

Observations:

  • I reckon it'd be cool if the benchmarks are automated, so we can submit a github repo, and a branch name, and push changes to that, and the results will either be generated automatically, or eg once a week.
  • Personally, I doubt I'm going to be able to support so many benchmarks (at least initially), so I think it'd be nice if eg a framework could submit initially only for eg image recognition benchmarks, and then could elect at some later date to submit to one or more additional domains.

hughperkins avatar Apr 26 '16 14:04 hughperkins

but from my own experience with the LSTM benchmark it can be very difficult - you have to make sure literally every hyperparameter is identical, and you effectively can't use any RNGs.

@craffel, I couldn't get what you meant by this. If we are not worried about convergence and only look at speed, I don't see a real issue with hyperparameters as long as all nets are implemented based on the same pseudocode. Also, I think over time many frameworks will shift to using cuDNN RNN primitives.

I hope there can be some way of comparing the extent of support for recurrence offered by each framework -- like the presence of scan(...), handling of variable-length sequences, etc.

About comparing usability: I think it's better not to compare frameworks wrt usability, as there is no simple way to quantify the variations among frameworks. Syntax could look clean because of a lot of built-in APIs for standard things, but for "research" these may not help much. The mental overhead while writing these scripts may also have to be considered. So it is not easy to capture and convey all of these aspects in a meaningful and unbiased way. In the list of frameworks above, all but Caffe have a numpy-like API for manipulating tensors in the most common ways, and a mention of that somewhere should suffice.

pranv avatar Apr 26 '16 14:04 pranv

@craffel, I couldn't get what you meant by this

This was in response to

One thing I've also been thinking about like @daviddao is how to validate that the models are actually computing the same thing -- I've seen some benchmarks elsewhere that have raised personal doubts that the frameworks are computing the same function.

I was saying that in order to determine that the networks are computing exactly the same thing, the initialization must be exactly the same, etc., which was taking this point to the extreme.

craffel avatar Apr 26 '16 15:04 craffel

@craffel, I thought you were saying something specific about LSTM, sorry, my bad.

pranv avatar Apr 26 '16 15:04 pranv

@soumith thank you for continuing to lead this valuable benchmark project, which benefits the entire machine learning community.

+1 for Intel GPUs. They are potentially very relevant for inference. In a few weeks you will be able to buy an Intel Skull Canyon NUC system for $999 that has 2.3 TFLOPS fp16 on the GPU and integration with a 4-core CPU, at 45 Watts, with 6MB on-chip coherent cache, 128MB on-package last level cache, 16 GB DDR4, and 256GB SSD [1][2][3]. I would guess the total system power is under 150 Watts, as it is intended to compete with game consoles. This platform could in theory compare favorably with a system equipped with a Tesla M4 GPU [4], both in terms of inference throughput and total system power. Of course Skull Canyon was not designed for the data center, but neither was the GTX 980. We can look past this to the next generation of AMD Zen APUs which according to PS4K and Xbox rumors should raise the performance bar quite a bit higher.

[1] http://www.anandtech.com/show/10152/intels-skull-canyon-nuc-is-officia [2] http://ark.intel.com/products/93341/Intel-Core-i7-6770HQ-Processor-6M-Cache-up-to-3_50-GHz [3] http://wccftech.com/intel-iris-pro-graphics-gamers/ [4] http://images.nvidia.com/content/tesla/pdf/nvidia-tesla-m4-datasheet.pdf

andravin avatar Apr 27 '16 23:04 andravin

After talking to Rajat from the TensorFlow team, we thought that an automated Jenkins server would go a long way in maintaining these benchmarks, as well as benchmarking on new hardware. Rajat and Martin Wicke are helping set this up here :) After the initial setup gets going on a DIGITS box, we can figure out how to add donated machines with different configurations from the community as slaves to the Jenkins server.

soumith avatar Apr 28 '16 05:04 soumith

@hughperkins : @martinwicke is working with @soumith to set something like this up through our Jenkins setup. The idea would be to run these tests automatically every X days or so.

@andravin @naibaf7 I like the idea of running on other relevant hardware e.g. Intel and AMD GPUs that make sense. Not all libraries may support every vendor, but will be useful to have baselines from even a few.

rajatmonga avatar Apr 28 '16 05:04 rajatmonga

@soumith Are accuracy / convergence tests back on the table? Per-layer numeric accuracy and iterations-to / accuracy-of convergence are both informative measures. We do not really understand in detail how the former affects the latter, so both would be interesting.

All floating point calculations are approximations. fp16 quantization is an explicit approximation. Fast algorithms have different accuracy than naive algorithms. Bugs are very unstable approximations. ;-)

I have considered submitting a closed source "fp0" kernel to convnet-benchmarks. Only after results are released will I reveal that it just fills the output buffer with zeros. Then maybe everybody would appreciate the importance of accuracy tests.

andravin avatar May 08 '16 16:05 andravin

@andravin accuracy / convergence tests are off the table at least for the first DeepMark release. There's simply not enough bandwidth / manpower to pull it off.

I have considered submitting a closed source "fp0" kernel to convnet-benchmarks. Only after results are released will I reveal that it just fills the output buffer with zeros. Then maybe everybody would appreciate the importance of accuracy tests.

Convnet-benchmarks and DeepMark are done in good faith and really are a tool for the community, built by the community. There are several things that could be done in an adversarial setting, but that would be pointless. Each of these frameworks is also used widely in other settings, which gives them an implicit accuracy test.

soumith avatar May 08 '16 16:05 soumith

@andravin

If DeepMark works like convnet-benchmarks worked, and I have no reason to think that it won't, then all the scripts will be publicly available, in a relatively standard-ish format. There is no reason why you couldn't set up a parallel repository, which runs accuracy tests against each library, and reports the results.

hughperkins avatar May 08 '16 17:05 hughperkins

(actually, in general, in my opinion, it would be better to factorize deepmark into per-domain repos, like images, nlp, video, etc, and get them working one by one. Otherwise it risks becoming a bit of a Doom3 :-D Just my opinion though.... ).

(Edit: do I mean Doom3? Do I mean Duke Nukem Forever? Anyway, either way...)

hughperkins avatar May 08 '16 17:05 hughperkins

@soumith I have been a software engineer for too long to believe that good intentions are enough to ensure accuracy. ;-) I do believe that rewarding frameworks for speed without regard for accuracy creates a moral hazard.

In order to believe that the community is taking care of convergence testing, I would want to see links to some learning curves for recent Imagenet winners (not Alexnet, a recent very deep model from the last year) using fp16, FFT, and Winograd kernels.

andravin avatar May 08 '16 19:05 andravin

I think that a standard definition of a network specification and its weights -- even if it would take a miracle -- could really help the whole ecosystem. TensorFlow is in a position where it could take on the responsibility of leading a versioned standard, but probably everybody wants to eat their own food.

bhack avatar May 08 '16 19:05 bhack

I think that a standard definition of a network specification and its weights -- even if it would take a miracle -- could really help the whole ecosystem.

Just as with andravin's request, this could also be done by an individual contributor, orthogonally to all other efforts: simply pick one format, or make up a new one (obligatory xkcd :-D https://m.xkcd.com/927/ ), and create converters to/from the most important frameworks. But... isn't the Caffe model zoo kind of already a de facto standard for this?

hughperkins avatar May 08 '16 19:05 hughperkins

So protobuf is the de facto standard? I believe only in something more formal, where all major frameworks are involved by design, without to/from converters (or with them only as optional backwards compatibility). The cuDNN API is in some way the only de facto standard.

bhack avatar May 08 '16 19:05 bhack

For training with existing fp16 kernels you'll likely need a few tricks. To allow weight updates to proceed, there needs to be enough mantissa overlap between the update and the weight for the summation to have any effect. Batch norm helps a lot here. You can also use stochastic rounding, which allows weight updates even when there is no effective overlap. But probably the best way is to leverage the fact that the filter tensors are relatively small and just keep them in fp32. You can quantize just prior to the conv operation. If you're already doing some kind of transform or dimshuffle, you can just bake that quantization into it at no cost.

Batch norm also helps a lot with keeping the activations and deltas bounded within the range that fp16 can represent. Weight initialization that factors in the fan-in also helps here. Winograd transforms tend to widen the dynamic range of the activations/deltas, so that's something to look out for too.

On upcoming pascal hardware we'll really need to fully explore the limits of fp16 accumulate. I'm guessing some of the faster winograd transforms might break down here. It's a shame that nvidia didn't make that fp16x2 instruction an inner product accumulate like the new int8x4 instruction will be. With that you could multiply two sets of fp16 values and accumulate to a single fp32 register.

Anyway, I think with a bit of care you can generally achieve exactly the same accuracy with fp16 as you can with fp32. I've even seen the reduced precision act as a regularizer to increase accuracy.
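A tiny numpy sketch of the fp32 master-weights idea described above (shapes, scale, and learning rate are arbitrary; this is an illustration of the trick, not any framework's actual implementation):

```python
import numpy as np

rng = np.random.RandomState(0)

# fp32 "master" copy of a conv filter tensor; only the copy fed to the conv is fp16.
fan_in = 3 * 3 * 64
w_master = (rng.randn(128, 64, 3, 3) * np.sqrt(2.0 / fan_in)).astype(np.float32)

def conv_weights(w_master):
    # Quantize just prior to the conv operation; this cast could be folded into
    # an existing transform/dimshuffle step at no extra cost.
    return w_master.astype(np.float16)

def sgd_update(w_master, grad_fp16, lr=0.01):
    # Accumulate the update in fp32 so small gradients are not lost to the
    # limited mantissa overlap of fp16.
    w_master -= lr * grad_fp16.astype(np.float32)
    return w_master
```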

scott-gray avatar May 08 '16 19:05 scott-gray

So protobuf is the de facto standard? I believe only in something more formal, where all major frameworks are involved by design, without to/from converters

Well, I'm not sure about the 'without to/from converters' part, but Torch already supports protobuf, as does Caffe; MXNet: https://github.com/dmlc/mxnet/blob/master/tools/caffe_converter/README.md ; TensorFlow: https://github.com/ethereon/caffe-tensorflow

hughperkins avatar May 08 '16 19:05 hughperkins

Yes, but every framework writes whatever it wants with protobuf. Converters don't prevent framework feature misalignment. Actually, in production it seems that cuDNN versions define what is really important to introduce in an API. It would be much better if a common format were defined by design by the frameworks, i.e. like in the containers domain.

bhack avatar May 08 '16 20:05 bhack

Actually, in production it seems that cuDNN versions define what is really important to introduce in an API. It would be much better if a common format were defined by design by the frameworks, i.e. like in the containers domain

Without having looked at it (because I don't want to be exposed to the IP, if I can avoid it), I would think that the cuDNN interfaces are very good. However, note that to see their API, as far as I know, you have to click through a licensing agreement saying you understand it is their IP and so on, so I'm not sure that one could use the cuDNN API as a standard (unless you could get agreement from NVIDIA, of course). By the way, you are hiding behind a pseudonym; I am curious what your real name is :-)

hughperkins avatar May 08 '16 20:05 hughperkins

I would like a common roundtable, like the one for the container format hosted by the Linux Foundation. I'm sure such a format would always need to stay a little bit behind the state of the art, but it would still be very helpful in production. Probably @openai has the "neutral status" to host a multi-stakeholder design and maintenance effort.

P.S. I'm just a GSoC 2016 mentor for the OpenCV Foundation :)

bhack avatar May 08 '16 20:05 bhack

Probably @openai has the "neutral status" to host a multi-stakeholder design and maintenance effort.

Interesting

hughperkins avatar May 08 '16 20:05 hughperkins

I am doubtful that the sourceforge (sic, really?) thing is related to openai (https://openai.com).

martinwicke avatar May 08 '16 20:05 martinwicke

I am doubtful that the sourceforge (sic, really?) thing is related to

yeah, hence why I deleted my original comment :-D

hughperkins avatar May 08 '16 20:05 hughperkins

Sorry. Working off emails, didn't see. :)

martinwicke avatar May 08 '16 20:05 martinwicke

@scott-gray I am sure you know the difference between an idea that seems like it should work and an experimental result that is publicly available.

In order to believe that the community is taking care of convergence testing, I would want to see links to some learning curves for recent Imagenet winners (not Alexnet, a recent very deep model from the last year) using fp16, FFT, and Winograd kernels.

I am a little bit surprised that nobody addressed this point.

andravin avatar May 08 '16 20:05 andravin

@andravin I agree that we need this. It's just hard to make it a priority over other things. But I think with pascal coming out, there will be a real push to make fp16 work since the potential speedups will be so much more dramatic than they are now.

scott-gray avatar May 08 '16 21:05 scott-gray

In terms of a standard format for representing DNN architectures and weights:

There is a tension between research and standardization. As mentioned above, there is some native support and/or conversion scripts for protobuf representations among caffe, torch, tensorflow, and mxnet. But, as research charges ahead, people in each framework/community are trying different types of new layers, solvers, weight initialization schemes, etc. Convolutions and SGD are very popular, but the finer points are still in flux. It may be too early to be worth trying to rigidly standardize all the frameworks.

forresti avatar May 08 '16 21:05 forresti

@forresti Exactly what I was saying. A standard needs to stay behind the state of the art. Is cuDNN really cutting edge (excluding the performance point of view)? No, arXiv is cutting edge; cuDNN is behind that, and probably reflects production. See also OpenVX at Khronos for CV.

bhack avatar May 08 '16 21:05 bhack

Echoing @bhack, I have some thoughts in terms of standardizing the frameworks. It is fundamentally hard for all frameworks to be forced into one.

However, we might have enough of an idea of what the common interfaces are in a system that could be shared across frameworks; cuDNN is a perfect example of these. The possible things I can think of include:

  • Operator interface for registering kernels on different types of devices.
  • Dependency scheduler to schedule things across devices.
  • Data loading module to keep up with pre-processing pipeline.
  • Representation of computation graph.
  • Storage format of tensors.

The finer we can break these down, the better the chance of sharing some modules between frameworks and moving the field forward as a whole. I would love to see such a discussion happen, and possibly in a neutral venue.

tqchen avatar May 08 '16 22:05 tqchen

However, we might have enough idea on what are the common interface in a system that could be shared across frameworks, CuDNN is a perfect example of these.

On a somewhat related note, I was thinking about how we could get portable DNN libraries across multiple kinds of hardware, not just CUDA hardware. We already have portable implementations of Caffe, Torch, and maybe some others, by writing the kernels in OpenCL. However, the convolutions are arguably not as fast as one might achieve by using the hardware-level optimizations in Winograd kernels, and perhaps cuDNN.

I'm thinking that it might be useful to have an API at exactly the cuDNN level, so that libraries will automatically call cuDNN on CUDA platforms, or e.g. Greentea on AMD platforms, or some other convolution library on e.g. Intel platforms. I made a pretty slide of this here :-)

[slide: frameworks calling a shared cuDNN-level convolution API that dispatches to the backend for the current platform]

And note that this architecture would say nothing about who writes the hw-level implementation, eg one could imagine that the end-user could install winograd kernels on their machine, and caffe, mxnet, torch etc would automatically make use of them, without the dnn library authors even needing to know about this, or compile-time link with them:

[slide: end users installing their own hardware-level kernels (e.g. Winograd) behind the same API, used by Caffe, MXNet, Torch etc. without recompilation]
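As a toy illustration of the dispatch idea (the real proposal is a C-level ABI at roughly the cuDNN layer; the backend and function names here are made up):

```python
# Registry of convolution backends, keyed by platform name (illustrative only).
_CONV_BACKENDS = {}

def register_conv_backend(platform, fn):
    """Hardware vendors (or end users) register a convolution implementation."""
    _CONV_BACKENDS[platform] = fn

def convolution(inputs, filters, platform):
    """Frameworks call this and never link against a specific kernel library."""
    try:
        impl = _CONV_BACKENDS[platform]
    except KeyError:
        raise RuntimeError(f"no convolution backend registered for {platform!r}")
    return impl(inputs, filters)

# e.g. register_conv_backend("cuda", cudnn_conv) on an NVIDIA box,
# or register_conv_backend("opencl", greentea_conv) on an AMD box.
```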

hughperkins avatar May 08 '16 23:05 hughperkins

@hughperkins Thanks for including my efforts :) not quite there yet with performance (writing the autotuning code at this instant).

I think this is a very nice idea. At least the convolutions should be somewhat standardized as they are the most performance critical part and about the same in every framework.

naibaf7 avatar May 08 '16 23:05 naibaf7

Well, I'm not sure about the 'without to/from converters' part, but Torch already supports protobuf, as does Caffe; MXNet: https://github.com/dmlc/mxnet/blob/master/tools/caffe_converter/README.md ; TensorFlow: https://github.com/ethereon/caffe-tensorflow

sklearn-theano also has a converter.

Pascal

lamblin avatar May 08 '16 23:05 lamblin

While I find the discussion of standardization interesting and important, I really do not need any of that to do basic convergence testing.

For example, if I had Neon configuration files for Resnet or Inception that were known to give state of the art results when trained with fp32 direct convolution kernels, and access to a multi-GPU machine, then I could easily substitute the Winograd fp32 F(2x2,3x3) kernels for all the convolutions and compare the learning curve to the original. As simple as this experiment is, nobody has ever reported these results. Then you run the experiment again with F(4x4,3x3) kernels.

As @scott-gray mentioned, switching to fp16 weights might need some extra tricks. But all of this could be done within Neon.

Likewise the comparison of cuDNN direct convolution fp32, direct convolution fp16, and FFT should be possible entirely within TensorFlow.

Comparison of results between different frameworks might be a bit of an apples to oranges comparison, but within-framework comparison of different algorithms gives us a measure of the usefulness of state of the art kernels. Also it would force us to nail down the tricks that make them perform best.

I would think it is in the best interest of framework maintainers to facilitate these experiments.

andravin avatar May 09 '16 18:05 andravin

This @vrv comment is quite self-explanatory with respect to some of the standardization posts in this thread.

bhack avatar May 12 '16 15:05 bhack

  1. eh, software is tricky, nothing to do with standardization
  2. this conversation has ratholed from its original intent to discuss benchmarks, though the discussion is interesting.

vrv avatar May 12 '16 15:05 vrv

  1. Yes, but why do all the frameworks need to do it on top of cuDNN? :)
  2. Sure, but there is no space for this because the frameworks still haven't started a discussion on this topic.
  3. One hardware vendor is creating a de facto standard at some level.

bhack avatar May 12 '16 17:05 bhack

@vrv Convergence speed vs. accuracy tests are benchmarks. But the idea has no traction here so I will do it somewhere else.

andravin avatar May 13 '16 02:05 andravin

@andravin they ARE benchmarks, and actually my early versions of DeepMark iterated on convergence speed. But this requires way more effort than is available, so I converged on something a bit cheaper overall...

soumith avatar May 13 '16 02:05 soumith

My comment on ratholing was about standardization of APIs and frameworks -- I do think convergence speed and accuracy tests are important and we should eventually have them.

vrv avatar May 13 '16 02:05 vrv

Yes, we did divert the discussion at some point. I hope that the framework with the most stars on GitHub can one day restart that topic in a more appropriate context than this one.

bhack avatar May 13 '16 08:05 bhack

@soumith Yes, convergence tests of course require a lot of compute time. Can we at least incorporate numeric accuracy tests with synthetic data, as I think @vrv mentioned earlier?

andravin avatar May 13 '16 20:05 andravin

@andravin no. not in V1.

soumith avatar May 13 '16 20:05 soumith

One point about synthetic accuracy tests is that they don't necessarily correspond to final test accuracy. As I was saying earlier, low precision can sometimes produce better results. I'll quote the conclusion of this paper here:

> Low precision prevents the optimizer from finding solutions that require a lot of precision, which correspond to very thin (high curvature) critical points, and these minima are more likely to correspond to overfitted solutions than broad minima (there are more functions that are compatible with such solutions, corresponding to a smaller description length and thus better generalization).

scott-gray avatar May 13 '16 20:05 scott-gray

Step 1: define important models / benchmarks
Step 2: make sure people's definitions of the models match, so the numbers are actually comparable
Step 3: care about and compare multi-step convergence

I don't think you can have 2 without 1, and 3 without 2. So the order @soumith is pursuing this seems fine.

vrv avatar May 13 '16 20:05 vrv

The approach I would take is:

A) kernel efficiency versus numeric accuracy testing
B) whole-network convergence A/B testing (vary one of: 1. kernel, 2. precision, 3. batch size, etc.)
C) framework profiling (measure time spent in low-level kernels and compare with total run time)

I think that will give us a more detailed picture of the factors that contribute to neural network performance.

andravin avatar May 13 '16 22:05 andravin

So, getting back to the discussion: I have volunteers from a few frameworks, and am looking for some more.

Volunteers available

- Torch: Soumith, @SeanNaren, Shubo Sengupta (Baidu)
- TensorFlow: @vrv and @martinwicke
- Neon: Evren Tumer (@nervetumer)
- Chainer: @delta2323 and team
- MXNet: Edit: @antinucleon will do it
- Theano / Lasagne / Keras: @f0k and @pranv (and maybe @craffel)
- Caffe / NVIDIA-Caffe: [Need volunteers] @thatguymike and team at NVIDIA

Volunteers needed

N/A

If I don't find volunteers for some frameworks, I'll try (without guarantees) to do them myself.

soumith avatar May 13 '16 22:05 soumith

@soumith OK. I will do it

antinucleon avatar May 13 '16 22:05 antinucleon

The three different benchmarks I outlined (A: kernel speed/accuracy, B: network convergence, C: framework profiling) are independent of each other, can be done in any order, and can even be separate projects.

convnet-benchmarks and the proposed DeepMark are a very coarse-grained version of C, framework profiling (they just measure the whole-network forward/backward iteration time). You could get more information out of the benchmarks by drilling down to per-layer timings and, better still, to kernel timings. Then we would know why one framework is faster than another, which the current tests do not tell us.

A good graph-computation framework would profile itself continuously under normal operation. I do not know the extent to which TensorFlow and Neon do this already, but it is not hard to instrument code so that all layer and kernel calls are timed and summary statistics are generated. Memory use could of course be measured as well.
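
To give a sense of how little machinery that takes, here is a minimal sketch of a timing wrapper a framework (or a user) could drop around layer calls to collect per-layer statistics; the class and method names are made up for the example:

```python
# Minimal sketch: wrap each layer call, accumulate wall-clock timings per layer
# name, and print summary statistics afterwards.
import time
from collections import defaultdict
from statistics import mean

class LayerTimer:
    def __init__(self):
        self.samples = defaultdict(list)   # layer name -> list of timings in seconds

    def wrap(self, name, fn):
        """Return a callable that times `fn` under the given layer name."""
        def timed(*args, **kwargs):
            start = time.perf_counter()
            out = fn(*args, **kwargs)
            # For asynchronous GPU execution you would need to synchronize the
            # device before reading the clock, or the timings will be meaningless.
            self.samples[name].append(time.perf_counter() - start)
            return out
        return timed

    def report(self):
        for name, times in sorted(self.samples.items()):
            print(f"{name:24s} calls={len(times):5d} "
                  f"mean={mean(times) * 1e3:8.3f} ms total={sum(times) * 1e3:9.3f} ms")

# Usage sketch: conv1.forward = timer.wrap("conv1.forward", conv1.forward)
```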

So I see a good framework-profiling benchmark as:

  1. assisting framework maintainers to instrument their code for auto-benchmarking
  2. providing standard synthetic data generators for each framework to wrap as an input layer
  3. creating per-framework configuration files for each test
  4. running tests and collecting the benchmark data logged by the framework
  5. generating reports from benchmark data

Instrumenting frameworks for auto-benchmarking would also help users tune networks for fast operation and help framework maintainers identify what to optimize. As such, the DeepMark profiling report would be a tool used not just for benchmark contests, but as part of the everyday machine-learning workflow.
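
For step 2 in the list above, the synthetic data generator can stay deliberately simple: a fixed-seed random batch generator, so every framework sees identical input shapes and statistics. A sketch (the shapes are just examples):

```python
# Sketch of a shared synthetic data generator: fixed seed and fixed shapes so
# every framework is benchmarked against identical inputs.
import numpy as np

def synthetic_batches(batch_size=32, shape=(3, 224, 224), num_classes=1000,
                      num_batches=100, seed=0):
    rng = np.random.RandomState(seed)
    for _ in range(num_batches):
        images = rng.rand(batch_size, *shape).astype(np.float32)
        labels = rng.randint(0, num_classes, size=batch_size).astype(np.int64)
        yield images, labels

# Each framework would wrap this as an input layer / data iterator in its own idiom.
```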

andravin avatar May 14 '16 00:05 andravin

Andrew,

No-one here is saying that we don't want correctness. I think it's fair to say we would all be happy to see tests for correctness. I think it's also fair to say that most frameworks already have unit tests and other tests for correctness. And I think it's fair to say that anything that hasn't been tested generally doesn't work, so I would be very surprised to see a framework without any kind of testing :-)

I think the only question is: adding correctness tests alongside the exact models in this benchmark, and doing it in a way that shows numerical equivalence, down to say 4 significant figures, across all weights and activations, is a phenomenal amount of work that goes far beyond what is already in place for our correctness checking. If you are saying that you are going to handle this effort, and all we have to do is sit back and look at the results, then I don't think there is any discussion required :-)

hughperkins avatar May 14 '16 09:05 hughperkins

You can count me in again for Theano (either vanilla or using Lasagne), but I'd prefer not to be the only one.

f0k avatar May 14 '16 12:05 f0k

I can help @f0k with Theano, but only after the 23rd.

pranv avatar May 14 '16 15:05 pranv

@hughperkins please read my posts more carefully. I wrote that deepmark could be a framework profiling tool rather than just a forward/backward iteration timing tool.

TensorFlow is now instrumented for profiling: https://github.com/tensorflow/tensorflow/issues/1824

Numeric accuracy tests, on the other hand, can be a separate project and should be at the level of individual kernels.
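
A sketch of what a kernel-level accuracy check could look like: run the candidate kernel and a slow float64 reference convolution on the same random inputs and report the worst relative error. The `candidate_conv` callable is a stand-in for whatever kernel is under test:

```python
# Sketch: check a candidate convolution kernel against a slow float64 reference
# on random inputs and report the worst-case relative error.
import numpy as np

def reference_conv2d(x, w):
    """Naive float64 direct convolution: NCHW input, OIHW weights, stride 1, no padding."""
    x, w = x.astype(np.float64), w.astype(np.float64)
    n, c, h, wi = x.shape
    o, _, kh, kw = w.shape
    out = np.zeros((n, o, h - kh + 1, wi - kw + 1))
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            patch = x[:, :, i:i + kh, j:j + kw]                  # (n, c, kh, kw)
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return out

def max_relative_error(candidate_conv, n=8, c=16, h=32, o=16, k=3, seed=0):
    """Worst relative error of candidate_conv(x, w) versus the float64 reference."""
    rng = np.random.RandomState(seed)
    x = rng.randn(n, c, h, h).astype(np.float32)
    w = rng.randn(o, c, k, k).astype(np.float32)
    ref = reference_conv2d(x, w)
    got = np.asarray(candidate_conv(x, w), dtype=np.float64)
    return np.max(np.abs(got - ref) / (np.abs(ref) + 1e-6))

# e.g. max_relative_error(my_winograd_conv)   # candidate_conv: the kernel under test
```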

andravin avatar May 14 '16 15:05 andravin

> @hughperkins please read my posts more carefully. I wrote that deepmark could be a framework profiling tool rather than just a forward/backward iteration timing tool.

Hi Andrew, fair point :-D

hughperkins avatar May 14 '16 16:05 hughperkins

/cc @nyanp in case he is interested in doing this for tinycnn. But I don't know whether it could be ready for this with full model coverage and GPU backend support.

bhack avatar May 14 '16 20:05 bhack

I apologize if my comment further derails this thread, but this seems like an appropriate time to bring it up. I wonder if anyone is interested in strong-scaling the time-to-convergence mentioned earlier? This would require what @andravin calls tasks A and B above. It will soon be necessary to demonstrate convergence and the resulting accuracy at various network/dataset scales.

I'm less worried about functional profiling, since external tools already do this fairly well and the algorithmic bottlenecks in SGD are a good first-order place to optimize. It's also good to see the effectiveness of quantization (1-bit and 8-bit SGD) mentioned, as I suspect these will become important features in frameworks to come, especially with regard to scaling.

I've been working with CNTK on many-node systems recently and have become interested in how other platforms' model parallelism scales in terms of time-to-solution (which is a tricky thing the community needs to define sooner or later). I really like the idea of having a common number to compare across frameworks (samples per second / epoch work well but don't capture wall time), which has drawn needed attention from developers to performance at the particular scale most used today. The main issue is that larger and more complex models need to be used to demonstrate strong scaling as system size increases beyond a few GPUs. I've run into this problem on large systems because there just isn't enough work to distribute at current problem sizes.

That said, it would be incredibly useful to have a set of even two or three (small, medium, large) benchmarks (convolution + fully connected) which use the same training data but at distinct scales in terms of number of features, input dimension, and hidden units, such that we could measure and estimate how well model/data parallelism performs across MANY CPUs and GPUs; it would also help us pin down performance bottlenecks as a side effect. The convergence and accuracy criteria then become important, though, as well as going beyond the current memory requirements.

This is a dream list of criteria to meet, and no small task, but I wonder whether anyone else would be interested in such an effort, or whether it's already going on somewhere?

I think the profiling and benchmark efforts here are great and incredibly beneficial in creating standards / best practices across frameworks going forward. Thanks for all the work and for a great thread to read at the least!

jbalma avatar May 15 '16 05:05 jbalma

Excellent, @pranv and @f0k. @craffel was interested in doing part of the Theano benchmarking as well.

soumith avatar May 16 '16 19:05 soumith

I'd hope this also supports CPU training/testing. My mojo-cnn project seems to be relatively fast on CPU for smaller problems, and I know Intel is still trying to push CPU use with their DNN wrappers for MKL to better compete with GPUs.

gnawice avatar May 17 '16 17:05 gnawice

@jbalma yes I find strong scaling the training problem to be very interesting, and of course that is an active area of research. I hope you find a good forum for pursuing it. Let me know if I can help.

I would also point out, with regard to profiling, that an external profiling tool will of course not be able to correlate timings with the graph structure of the network, so it cannot group statistics by layer, which is essential for understanding the performance of the network. I think all successful machine learning frameworks will eventually be instrumented for profiling, and TensorFlow appears to be taking the lead there. Now imagine the power of being able to compare network performance graphs across different frameworks because they were all designed to output profiler stats in the same format.
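
A sketch of what such a shared stats format could look like: one flat JSON record per layer per iteration, with identical fields regardless of which framework produced it, so a report generator can diff two runs directly (the field names are invented for illustration):

```python
# Hypothetical common profiler output: one JSON record per layer per iteration,
# appended to a .jsonl file that any report generator can consume.
import json

record = {
    "framework": "torch",            # or "tensorflow", "neon", "mxnet", ...
    "network": "resnet50",
    "layer": "conv2_1/3x3",
    "phase": "forward",              # or "backward"
    "iteration": 17,
    "time_ms": 1.83,
    "mem_bytes": 12845056,
}

with open("profile.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")   # one record per line; easy to merge and compare
```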

andravin avatar May 19 '16 06:05 andravin