[October 2015] Intel are CPU magicians. But there's no one weird trick....


Intel recently released a small blog-post claiming crazy-talk speeds for ConvNets on their Haswell CPU line. I took their Caffe implementation, painfully installed the dependencies, and the numbers look almost too good to be true. Either someone refutes me, or these are very cool numbers.

Link to blog-post: https://software.intel.com/en-us/articles/single-node-caffe-scoring-and-training-on-intel-xeon-e5-series-processors

A full [forward + backward] on AlexNet on a Desktop 6-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz takes an average of 164 ms (EDIT: 268 ms).

Just for comparison, the latest and greatest NVIDIA Titan-X does the same round-trip in 96 ms. An older generation GPU like Tesla K40 is slower, pegging at around 200+ ms.

I tried to get VGG working but ran into assertions about unimplemented code paths; regardless, if AlexNet is really this fast, the others will probably be in the ballpark.

Can someone else try the Intel stuff? I need a couple more sanity checks before I can believe this result. Look at how little time they are spending in the convolution layers, even the biggest ones: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L329-L365

soumith avatar Oct 15 '15 00:10 soumith

For comparison, here's the log of Caffe + OpenBLAS numbers on the same machine (It's the Digits box ;-) ) https://github.com/soumith/convnet-benchmarks/blob/cpu/caffe/output_alexnet.log

soumith avatar Oct 15 '15 00:10 soumith

More info is in the CPU branch: https://github.com/soumith/convnet-benchmarks/tree/cpu

The alexnet-owt protobuf, with the same architecture I use for the GPU versions is here: https://github.com/soumith/convnet-benchmarks/blob/cpu/caffe/imagenet_winners/alexnet.prototxt

The intel-adapted version is here: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/models/intel_alexnet/alexnet.prototxt

soumith avatar Oct 15 '15 01:10 soumith

well, assuming i didn't mess up the analysis, and used the right inputs/etc, a runtime of 0.164s on the (non-intel) alexnet-owt prototxt you linked above, for a batch of 128 forward and backward, implies 3.77TF/s.

AFAIK, haswell can do at most 32 FLOPs/cycle/core. for your 6-core cpu @ 3.5 GHz, that would be 672GF/s peak.
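as a sketch, here's that back-of-the-envelope in python (the 618GF fwd+bwd total is from the boda output below; 32 FLOPs/cycle/core is my assumption about haswell FMA throughput):

# peak vs. measured throughput sanity check (assumed: 32 SP FLOP/cycle/core,
# nominal 3.5 GHz clock -- real AVX clocks dip lower, making this generous)
cores, flop_per_cycle, clock_ghz = 6, 32, 3.5
peak_gf = cores * flop_per_cycle * clock_ghz   # 672 GF/s peak
measured_gf = 618.0 / 0.164                    # boda fwd+bwd total / runtime
print("%.0f GF/s peak, %.0f GF/s measured -> %.1fx peak"
      % (peak_gf, measured_gf, measured_gf / peak_gf))  # ~5.6x peak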

so, i guess that seems pretty fishy overall (i.e. perf ~6X peak). i might suspect benchmarking error, such as accidentally running in GPU mode with who-knows-what backend (i.e. BLAS, cudnn v?, i dunno). it's not clear that intel themselves were claiming perf anything like that in their blog post, but i didn't try to run the #s from their post.

then again, i have no idea what the intel code might be doing (got scared off by the license, so didn't dig into it), but if there are some algorithmic changes and/or anything that means they're not doing the same set of FLOPS, then all bets are off. but of course such improvements might port to GPUs as well. or not; i'd believe there are algorithms more suited to CPUs that trade uniformity/complexity for doing fewer raw FLOPS.

for ref, here's the #s i'm working from:

moskewcz@maaya:~/git_work/boda/run/tr1$ boda cnet_ana  --in-model=alexnet_owl --print-ops=1 --in-sz=227 && python ../../pysrc/flops.py  --per-layer=1 --backward 1 --num-imgs=128 --runtime=.164

conv1 FWD 18.7GF 182MB  --- BACK_GRAD 18.7GF  --- BACK_DIFF 18.7GF  BACKWARD_BYTES 261MB 
conv2/5x5_s1 FWD 61.7GF 104MB  --- BACK_GRAD 61.7GF  --- BACK_DIFF 61.7GF  BACKWARD_BYTES 131MB 
conv3/3x3_s1 FWD 33.3GF 60.5MB  --- BACK_GRAD 33.3GF  --- BACK_DIFF 33.3GF  BACKWARD_BYTES 82.4MB 
conv4/3x3_s1 FWD 44.4GF 67.8MB  --- BACK_GRAD 44.4GF  --- BACK_DIFF 44.4GF  BACKWARD_BYTES 110MB 
conv5/3x3_s1 FWD 29.6GF 53.7MB  --- BACK_GRAD 29.6GF  --- BACK_DIFF 29.6GF  BACKWARD_BYTES 81.8MB 
fc6 FWD 13.2GF 214MB  --- BACK_GRAD 13.2GF  --- BACK_DIFF 13.2GF  BACKWARD_BYTES 426MB 
fc7 FWD 4.29GF 71.3MB  --- BACK_GRAD 4.29GF  --- BACK_DIFF 4.29GF  BACKWARD_BYTES 141MB 
fc8 FWD 1.05GF 19.0MB  --- BACK_GRAD 1.05GF  --- BACK_DIFF 1.05GF  BACKWARD_BYTES 37.5MB 
total _inxp time:  0s
-- INPUT: NUM_IMGS=128 --
-- INPUT: RUNTIME=0.164s --
-- INPUT: POWER=200W --
--- FWD TOTALS ---
618GF 3.77TF/s
2.04GB 12.5GB/s AI=303F/B
32.8J 18.8GF/s/W
moskewcz@maaya:~/git_work/boda/run/tr1$ 

moskewcz avatar Oct 15 '15 02:10 moskewcz

@moskewcz 3.77TF/s doesn't hold true if you switch to FFT- or Winograd-based convolutions.

References: https://en.wikipedia.org/wiki/Convolution_theorem http://arxiv.org/abs/1509.09308

soumith avatar Oct 15 '15 02:10 soumith

"With these optimizations time to train AlexNet* network on full ILSVRC-2012 dataset to 80% top5 accuracy reduces from 58 days to about 5 days."

The benchmark used dual E5-2699-v3 CPUs, which have 18 cores at 2.3 GHz => 2 * 18 cores * 32 FLOPs/cycle * 2.3 GHz = 2.65 TFLOP/s

Sounds about right.

TitanX running Nervanagpu probably about 1 day?

I would guess Intel just implemented a more efficient direct convolution for many-core Intel CPUs. I do not see any indication they are using fast algorithms.

andravin avatar Oct 15 '15 03:10 andravin

So anyway the numbers Intel reported sound plausible, but your numbers don't. :-)

andravin avatar Oct 15 '15 03:10 andravin

again, if i got my #s right: if we assume 70M images (~65 epochs * 1.1M images/epoch, not sure if that's a good value or not) in 5 days to train alexnet_owl as per the blog post, that implies 783GF/s -- given the peak #s that andravin gave above, that would be ~35% efficiency, which is perhaps pretty impressive but believable. it'd be good to know the actual # of epochs/images/etc to get a real value; i could easily be off by quite a bit on those guesses. corrections welcome.
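as a sketch, the arithmetic (all inputs here are the guesses above, not measurements):

# implied sustained throughput for the blog post's training claim
images = 70e6                   # guess: ~65 epochs * 1.1M images/epoch
seconds = 5 * 24 * 3600         # 5 days = 432000 s
total_flop = 338e15             # fwd+bwd total from the boda run below
gf_per_s = total_flop / seconds / 1e9
print("%.0f GF/s sustained" % gf_per_s)  # ~783 GF/s, roughly a third of the
                                         # ~2.65 TF/s dual-socket peak above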

mwm

moskewcz@maaya:~/git_work/boda/run/tr1$ boda cnet_ana  --in-model=alexnet_owl --print-ops=1 --in-sz=227 && python ../../pysrc/flops.py  --per-layer=1 --backward 1 --num-imgs=70000000 --runtime=432000
conv1 FWD 10.2PF 99.5TB  --- BACK_GRAD 10.2PF  --- BACK_DIFF 10.2PF  BACKWARD_BYTES 143TB 
conv2/5x5_s1 FWD 33.7PF 56.2TB  --- BACK_GRAD 33.7PF  --- BACK_DIFF 33.7PF  BACKWARD_BYTES 70.2TB 
conv3/3x3_s1 FWD 18.2PF 31.6TB  --- BACK_GRAD 18.2PF  --- BACK_DIFF 18.2PF  BACKWARD_BYTES 42.1TB 
conv4/3x3_s1 FWD 24.3PF 35.1TB  --- BACK_GRAD 24.3PF  --- BACK_DIFF 24.3PF  BACKWARD_BYTES 56.2TB 
conv5/3x3_s1 FWD 16.2PF 28.1TB  --- BACK_GRAD 16.2PF  --- BACK_DIFF 16.2PF  BACKWARD_BYTES 42.1TB 
fc6 FWD 7.19PF 4.66TB  --- BACK_GRAD 7.19PF  --- BACK_DIFF 7.19PF  BACKWARD_BYTES 8.17TB 
fc7 FWD 2.35PF 2.29TB  --- BACK_GRAD 2.35PF  --- BACK_DIFF 2.35PF  BACKWARD_BYTES 3.44TB 
fc8 FWD 573TF 1.43TB  --- BACK_GRAD 573TF  --- BACK_DIFF 573TF  BACKWARD_BYTES 2.57TB 
total _inxp time:  0s
-- INPUT: NUM_IMGS=70000000 --
-- INPUT: RUNTIME=432000.0s --
-- INPUT: POWER=200W --
--- FWD TOTALS ---
338PF 783GF/s
627TB 1.45GB/s AI=540F/B
86.4MJ 3.91GF/s/W
moskewcz@maaya:~/git_work/boda/run/tr1$

moskewcz avatar Oct 15 '15 03:10 moskewcz

.. and having looked a bit at Caffe's CPU implementation, im2col is single-threaded, and will be a pretty nasty bottleneck in a 36-core system.

andravin avatar Oct 15 '15 04:10 andravin

@moskewcz your numbers sound plausible to me.. and so Intel's post really points to what a disaster out-of-the-box Caffe performance must be on many-core CPUs.

andravin avatar Oct 15 '15 04:10 andravin

@andravin @moskewcz thanks. I'm going to investigate a bit on why the numbers are so much fluffier on my machine. For a start, I'll probably start an end-to-end training run and see what happens....

soumith avatar Oct 15 '15 04:10 soumith

sounds like a plan. make sure you fire up nvidia-smi while you're running it ... ;)

moskewcz avatar Oct 15 '15 05:10 moskewcz

@moskewcz I've already verified that it's running on CPU and using intel code-paths, simply by collecting samples from the stack and looking at hotspots.

soumith avatar Oct 15 '15 14:10 soumith

hmm, well, i was mostly joking and i mostly believe you. however, i'm not sure that what you say precludes the GPU being active. in fact, if, say, the new intel layers were running on the CPU, but all/some conv layers were on the GPU, you'd probably see perf similar to what you reported. and if you look at the CPU usage/stack, it'll be pegged at 100%, and it'll always be inside the intel code if you stop it ...

i'm really just suggesting that, given the fishiness of the #s, some form(s) of sanity checking are in order. in particular, for example, did you compile in CPU only mode? again, i don't really think that's the issue, but if (for example) intel ran/compiled on boxes without GPUs, then maybe something unexpected happens with their code/build on a box that has GPUs.

but i'm not really fixated on the maybe-running-on-GPU idea, there are plenty of other places for errors. batch size issues, shared library wackiness, straight-up user error, etc ...

on a side note, thanks for all your hard work running these benchmarks!

mwm

moskewcz avatar Oct 15 '15 14:10 moskewcz

caffe is getting no access to the GPUs, I disabled it at the driver level. I just fixed the protobuf to force itself to do the backward phase (it was conveniently deciding that it doesn't need to do the backward). That brought the backward times up, and overall it stands at 268ms / mini-batch now. I'm working on training it fully with the imagenet lmdb. Let's see. https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L329-L365

soumith avatar Oct 15 '15 18:10 soumith

The 2-column AlexNet Intel is benchmarking in the announcement (different from the 1-col AlexNet "One weird trick" variant in Soumith's benchmark) has 1449 MFLOPs per image in the forward pass and 2x that in the backward pass, ignoring biases, LRN, activations, pooling, and loss. Taking numbers from Intel's announcement we have:

Forward pass: 1449 MFLOP * 731 images/sec = 1.059 TFLOP/s
Forw+Backw pass: 3 * 1449 MFLOP * 271 images/sec = 1.178 TFLOP/s

which is easily believable (exact max FLOPs on those Intel CPUs to be posted later).
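The same numbers as a quick Python sketch (1449 MFLOP/image and the images/sec figures are from the announcement; backward is taken as 2x forward, hence the factor of 3):

mflop_fwd = 1449                           # 2-col AlexNet forward, MFLOP/image
scoring_tfs = mflop_fwd * 731 / 1e6        # 731 images/sec -> ~1.06 TFLOP/s
training_tfs = 3 * mflop_fwd * 271 / 1e6   # 271 images/sec -> ~1.18 TFLOP/s
print(scoring_tfs, training_tfs)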

ozabluda avatar Oct 16 '15 17:10 ozabluda

@soumith>A full [forward + backward] on AlexNet on a Desktop 6-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz takes an average of 164ms EDIT: 268 ms. [...] I need a couple more sanity checks before I can believe this result. Look at how little time they are spending in the convolution layers, even the biggest ones: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L329-L365

The i7-5930K AVX2 clock is lower than the 3.50 GHz base clock. I don't recall the exact value, but it seems to be ~3.2 GHz. It can issue 2 AVX256 (8-wide) SP MADs (=2 FLOP each) per clock, for a total of 2 * 8 * 2 = 32 FLOP/clock.

32 FLOP/clock * 3.0 GHz * 6 cores = 576 GFLOP/s.

Your numbers at the url above (output from Intel's Caffe) seem to be per image for conv and per minibatch for fc, and are comfortably below that (except for fc6 backward, which must be an artifact of Caffe timing), so they are totally believable. In fact, there is a lot of room for improvement; they are not that much better than your numbers for OpenBLAS (except for conv1):

layer              MFLOP/image  ms/image   Intel GFLOP/s  OpenBLAS GFLOP/s
conv1 forward:     141          0.726      194            59
conv1 backward:    282          0.672      420            67
conv2 forward:     448          3.722      120            159
conv2 backward:    896          6.22       144            167
conv3 forward:     224          2.323      96             112
conv3 backward:    448          3.604      124            148
conv4 forward:     299          3.851      78             90
conv4 backward:    598          6.344      94             116
conv5 forward:     199          2.621      76             90
conv5 backward:    398          4.375      91             119
fc6 forward:       75           38.597/mb  250            232
fc6 backward:      151          32.152/mb  601            243
fc7 forward:       34           18.549/mb  232            231
fc7 backward:      67           15.504/mb  554            293
fc8 forward:       8            4.967/mb   211            249
fc8 backward:      16           3.932/mb   533            278
Forward:           1428         90.621     104            94
Backward:          2856         72.961     132            123
Forward-Backward:  4231         1684       121            112

ozabluda avatar Oct 16 '15 18:10 ozabluda

@ozabluda i think your analysis of the intel #s looks good and is believable. as per an above comment, we're guessing ~2.65TFLOPs peak for the dual-socket 36-core machine intel used for the announcement. so again it comes out to ~35% or so efficiency.

but, i think there are some issues with your per-layer analysis in your second comment. firstly, i don't think we can trust the per-layer #s from the caffe log too much; for example the pack+relu1 times are >> the conv1 time, so i'd assume there's some timing wonkiness there -- time and/or work being shifted among layers for example.

but, perhaps more importantly (and confusingly):

  1. the 1684 ms is for 10 iterations/batches. this is the value that got corrected to ~2680ms, with a corresponding 268ms forward+backward per batch. confusingly, the other two #s for forward and backward (the ~73ms back / ~91ms fwd) are per single iteration/batch. the idea is that they are the 'min' batch times across iterations, and thus in theory more indicative of the steady-state per-batch performance (which does seem to be the case). so for your forward-backward line you probably want to add the times of the forward and backward lines and ignore the overall combined time. alternately you could divide it by the iteration count, which will yield a similar value.
  2. the 268ms is for a 128 image batch, not a single image. i believe your flop #s are for a single image (i have ~6.1GF for the no-groups regular alexnet per image, so i'd guess that your 4.2GF / image is right for the 'original' 2-groups version), so you're off by a factor of 128 in flops.

PS: using 268ms / batch and 4.2GF / image, that yields a still-implausible ~2TF/s for the 6-core digits box, and again it seems to disagree with the more-reasonable intel-announced #s, so i'm still assuming benchmarking error.
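spelled out (same 4.2GF/image and 128 images/batch assumptions as above):

# implied throughput from the corrected batch time (assumed values as above)
gf_per_image, batch, batch_ms = 4.2, 128, 268.0
tf_per_s = gf_per_image * batch / (batch_ms / 1e3) / 1e3
print(tf_per_s)  # ~2.0 TF/s vs. the ~0.67 TF/s peak of the 6-core box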

moskewcz avatar Oct 16 '15 19:10 moskewcz

There is no such thing as an AVX2 clock.

andravin avatar Oct 16 '15 20:10 andravin

@moskewcz I also noticed that Intel's Caffe seems to report timings for conv layers per image and for fc per minibatch. I corrected the table above (I also realized Soumith's numbers are for 1-col AlexNet, while Intel's are for 2-col AlexNet). Please check if it makes sense to you now.

AVX2 (32 SP ops/clock) can't run at the base clock frequency, so it throttles down to a lower "AVX clock". Although maybe that is only true for AVX-512, which none of the CPUs in question have.

ozabluda avatar Oct 16 '15 20:10 ozabluda

@ozabluda hmm, i'm not sure what you changed, but i guess it looks more/differently wrong to me now, still as per my (1) and (2). AFAIK all the caffe timings are supposedly per batch/iteration, not per image (as per my comment section (2)). and in this case, they look like garbage, as per my comment section (1). FWIW it's been a while since i dug into the caffe timing code and it has changed over time, but on the whole i've always found it hard to work with / understand; i'm mostly just looking at things here from the top level and using my own calculations, so i'm not the best one to comment on the details of the caffe-reported #s.

moskewcz avatar Oct 16 '15 20:10 moskewcz

@moskewcz Stock Caffe timings sure are per minibatch (like Soumith's OpenBLAS timings). Intel's port timings do look like garbage (say 0.726ms for conv1), unless they are per image (except for fc), in which case they totally make sense (and approximately equal to stock Caffe/OpenBLAS). See my table above.

ozabluda avatar Oct 16 '15 20:10 ozabluda

@andravin> The benchmark used dual E5-2699-v3 CPUs, which have 18 cores at 2.3 GHz => 2x18x32FLOPs/cyclex2.3Ghz=2.65TFLOPs

Actual AVX base clock is 1.9 GHz (see quote below).

2 CPUs * 18 cores * 32 FLOPs/cycle * 1.9 GHz = 2.189 TFLOP/s

I am almost willing to bet that the scaling to the second CPU is extremely poor in this iteration of Intel's code, i.e. 2 CPUs are not that much faster than 1 CPU.

To cope with the huge difference between the power consumption of Integer and AVX code, Intel is introducing new base and Turbo Boost frequencies for all their SKUs; these are called AVX base/Turbo. For example, the E5-2693 v3 will start from a base frequency of 2.3GHz and turbo up to 3.3GHz when running non-AVX code. When it encounters AVX code however, it will not be able to boost its clock to more than 3GHz during a 1 ms window of time. If the CPU comes close to thermal and TDP limits, clock speed will drop down to 1.9GHz, the "AVX base clock". http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/5

ozabluda avatar Oct 16 '15 21:10 ozabluda

@ozabluda Ah, I did not know about this feature of Xeon processors, thanks. So it is Xeon only? soumith's Core(TM) i7-5930K will not have this? My i7-5775C seems to sustain AVX2 256-bit FMA instructions at regular turbo boost speed with liquid cooling.

andravin avatar Oct 16 '15 21:10 andravin

I tracked down AVX base frequency specs for haswell e5 processors here: https://www.microway.com/knowledge-center-articles/detailed-specifications-intel-xeon-e5-2600v3-haswell-ep-processors/

Would be nice to find an official Intel source. I suspect this is only a feature of the big Xeon chips.

andravin avatar Oct 16 '15 21:10 andravin

@soumith What command line did you use? README.txt says:

For timing
#> ./build/tools/caffe time \
   -iterations <number of iterations> \
   --model=models/intel_alexnet/train_val.prototxt

When I run that on my 4-core i7-5775C I get:

I1016 16:39:40.395843 15816 caffe.cpp:333]      conv1   forward: 379.242 ms.
I1016 16:39:40.395848 15816 caffe.cpp:336]      conv1   backward: 354.405 ms.
[...]
I1016 16:39:40.396093 15816 caffe.cpp:341] Min Forward pass: 2879.16 ms.
I1016 16:39:40.396098 15816 caffe.cpp:343] Min Backward pass: 5410.64 ms.
I1016 16:39:40.396102 15816 caffe.cpp:345] Min Forward-Backward: 83316 ms.
I1016 16:39:40.396107 15816 caffe.cpp:347] Total Time: 83316 ms.
[...]
Total FP jobs:8192 jpt:2048 residue:0
Total BP jobs:106496

Most telling are the Total FP/BP jobs numbers, which are exactly equal to 256X the values in your log file. 256 is the batch size specified in train_val.prototxt.

andravin avatar Oct 17 '15 00:10 andravin

@soumith Oh I see now you are using your own prototxt file, not the one that was provided by Intel. Obviously there is something wrong that is causing your prototxt to use minibatch size 1.

andravin avatar Oct 17 '15 03:10 andravin

Actually I get reasonable numbers using your alexnet.prototxt too. So I am not sure what is wrong with your setup.

andravin avatar Oct 17 '15 04:10 andravin

@andravin:

Ah, I did not know about this feature of Xeon processors, thanks. So it is Xeon only? My i7-5775C seems to sustain AVX2 256-bit FMA instructions at regular turbo boost speed with liquid cooling.

I think all CPUs have it, if they overheat. Liquid cooling helps (I noticed that with my liquid-cooled Haswell as well). Can your CPU run AVX2 256-bit FMA instructions at regular turbo boost speed on all cores simultaneously, or just one?

I tracked down AVX base frequency specs for haswell e5 processors here: [microway]

This is awesome, thank you.

ozabluda avatar Oct 17 '15 04:10 ozabluda

@moskewcz:

i think there are some issues with your per-layer analysis in your second comment. firstly, i don't think we can trust the per-layer #s from the caffe log too much; for example the pack+relu1 times are >> the conv1 time, so i'd assume there's some timing wonkiness there -- time and/or work being shifted among layers for example.

I think something caused conv layers to report time per image, while everything else is per minibatch.

but, perhaps more importantly (and confusingly):

  1. the 1684 ms is for 10 iterations/batches. this is the value that got corrected to ~2680ms, with a corresponding 268ms forward+backward per batch. confusingly, the other two #s for forward and backward (the ~73ms back / ~91ms fwd) are per single iteration/batch.

My calculations are per-layer. Total Forward/Backward are also calculated from per-layer (reported numbers are all screwed up), exactly as you suggest.

[...] so for your forward-backward line you probably want to add the times of the forward and backwards lines and ignore the overall combined time. alternately you could divide it by the iteration count which will yield a similar value. 2) the 268ms is for a 128 image batch, not a single image.

I ignore the 2680/268 numbers.

i believe your flop #s are for a single image

that's right.

(i have ~6.1GF for the no-groups regular alexnet per image, so i'd guess that your 4.2GF / image is right for the 'original' 2-groups version), so you're off by a factor of 128 in flops.

I have 4.231 GF/image for the 'original' 2-groups version and 4.285 GF/image for the "One weird trick" 1-col version, ignoring biases, LRN, activations, pooling, and loss. Your 6.1 GF/image is probably the 'original' 2-groups version without groups, but that's not what the 1-col version is (the number of filtermaps is different).

PS: using 268ms / batch, and 4.2GF / image, that yields a still-implausible ~2TF/s for the 6-core >digits box, and again it seems to disagree with the more-reasonable intel announced #s, so i'm still >assuming benchmarking error.

My calculated "total time" conv*128+fc comes to 4524 ms/minibatch. I ignore 268, because it doesn't correspond to anything in the per-layer I can think of. 90ms and 72ms correspond to the sum, but is incorrect because conv is per image and everything else is per minibatch.

ozabluda avatar Oct 17 '15 06:10 ozabluda

@andravin thanks for the log on your side. I suppose doing pure benchmarking instead of having that lmdb data layer up front might be having side-effects on the intel caffe. I'll follow up on Monday.

soumith avatar Oct 17 '15 22:10 soumith

@ozabluda Here are official Intel documents about avx and frequencies for Xeon E5 v3, does not mention other processors, which of course leaves us wondering: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf

Still haven't found anything authoritative for i7. Probably have to ask Intel.

Also I want to make clear that I think @soumith 's log file indicates that the batch size was just 1 image. Not sure why his alexnet.prototxt gives me batch size 128 and behaves differently for him.

andravin avatar Oct 18 '15 00:10 andravin

@andravin>Here are official Intel documents about avx and frequencies for Xeon E5 v3, does not mention other processors, which of course leaves us wondering:

Thank you. These are good. I think all Intel CPUs are in practice limited only by TDP (which liquid cooling helps with), even though Intel also lists current and power limits. Overclocked Intel CPUs are known to suck 350W on Prime95 without damage, and 400W maybe with long-term damage, and Intel CPUs don't prevent it, if cooled.

The doc says that AVX will never go over "AVX Max All Core Turbo" (even though the doc implies that it should only be true for AVX2).

Also I want to make clear that I think @soumith 's log file indicates that the batch size was just 1 image.

I don't think so. There is:

input_dim: 128
Top shape: 128 3 227 227 (19787136)

https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L5 https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L183

and timings for all non-conv layers (fc, relu, pool) look like they are for minibatch size=128. It looks more like only conv layer timings are per image for whatever reason.

ozabluda avatar Oct 18 '15 03:10 ozabluda

@soumith:

@moskewcz 3.77TF/s doesn't hold true if you switch to FFT or Winograd based convolutions. http://arxiv.org/abs/1509.09308

This is pretty awesome, @andravin

ozabluda avatar Oct 19 '15 23:10 ozabluda

Thanks, @ozabluda I'm looking forward to the first truly efficient implementations of the fast Winograd convnet algorithms. The first draft of the paper was just a teaser. ;-)

andravin avatar Oct 20 '15 04:10 andravin

For one 4x4 block, F(2x2, 3x3), standard direct convolution uses (3 * 3) * (2 * 2) = 36 multiplications and 6 * 4 = 24 additions, for a total of 36+24=60 FLOP.

Ignoring the amortized filter transform [1] and the amortized inverse transform [2], @andravin's implementation of Winograd's convolutions uses 4*4=16 multiplications and 80/3 amortized additions, for a total of 16+80/3=42+2/3 FLOP.

Utilization is 60/(16+80/3)= 140.625%, which is how he gets results in Table 6 (max efficiency 134.0% on conv4.2).

I tried counting the absolute minimum number of amortized additions, ignoring the filter [1] and inverse [2] transforms, assuming an infinite image and infinite CPU registers.

I counted 24 additions for data per block.

This gives us 16+24=40 FLOP. Compared to standard direct 60 FLOP, we have 60/40=150% max possible utilization.
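A quick Python sketch of these counts (my accounting above; the 80/3 amortized additions are the paper's):

direct = 3*3 * 2*2 + 6*4        # 36 multiplications + 24 additions = 60 FLOP
winograd_paper = 4*4 + 80/3.0   # 16 multiplications + 80/3 amortized additions
winograd_min = 4*4 + 24         # 16 multiplications + my 24-addition minimum
print(direct / winograd_paper)  # ~1.406 -> the 140.625% utilization above
print(direct / winograd_min)    # 1.5 -> 150% max possible utilization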

[1] The paper says the filter transform uses 28 FLOP per input channel. For conv4.2 the image is 24x24, which makes the filter FLOP negligible.

[2] The paper says the inverse transform uses 24 additions, amortized over input channels, which is negligible for all layers except conv1.2, but even there it's a win: 60/(16+24+24/3)=1.25. Not sure why it is not in Table 6.

C^T d C =
d00-d20-d02+d22   d20-d22+d10-d12   d20-d22-d10+d12   d10-d12-d30+d32
d01-d21+d02-d22   d21+d22+d11+d12   d21+d22-d11-d12   d11+d12-d31-d32
-d01+d21+d02-d22  -d21+d22-d11+d12  -d21+d22+d11-d12  -d11+d12+d31-d32
d01-d21-d03+d23   d21-d23+d11-d13   d21-d23-d11+d13   d11-d13-d31+d33

ozabluda avatar Oct 20 '15 23:10 ozabluda

@ozabluda Thanks, one thing I think you are missing is that transformed data can be re-used for convolution with every filter. So the data transform FLOPs can be amortized over the number of filters.

Anyway, I don't want to hijack this issue, so please continue the conversation about Winograd convnet algorithms at https://www.reddit.com/r/MachineLearning/comments/3nocg5/fast_algorithms_for_convolutional_neural_networks/

andravin avatar Oct 21 '15 01:10 andravin

@andravin Aha! This is why you keep referring to only the number of multiplications in "arithmetic complexity reduction".

ozabluda avatar Oct 21 '15 16:10 ozabluda

Ok, so today I finally finished building my caffe lmdb for imagenet, and I ran the intel benchmarks with the lmdb data layer etc. etc. (just like how they want it to be).

The numbers are not as impressive anymore (as expected).

  • Caffe + MKL - ~ 5100 ms
  • IntelCaffe + MKL - ~ 3052 ms
  • Speedup: 1.67x

References: Caffe + MKL: https://github.com/soumith/convnet-benchmarks/blob/cpu/caffe/output_alexnet_mkl.log#L320-L329

IntelCaffe + MKL: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L362-L371

soumith avatar Oct 22 '15 05:10 soumith

In related news, I just finished the first winograd fprop/bprop fp32 kernel. It is fully fused and requires no additional memory. But the big news is that it runs fastest at a minibatch size of 8. And by fast I mean close to 10 virtual Tflops. It is at full utilization and is primarily power limited. The tile size is K32xN8, so it should be pretty versatile over a wide range of dimensions. Even C=3 performance is pretty good at 1.7 vTflops.

I have a fair amount of tuning I want to try with it to see if I can boost cache performance, then I'll move on to the grad weight update kernel which I already have sketched out.

Then after that I'll take a stab at the bigger and faster transforms. I'm hoping to hit about 3x the performance of direct conv, and also at N=8. But those will be much trickier to fit in a fully fused kernel.

Special thanks to @andravin for lots of fruitful discussion on bringing this about.

scott-gray avatar Oct 22 '15 05:10 scott-gray

@scott-gray that sounds super exciting. Can't wait to bench it.

soumith avatar Oct 22 '15 05:10 soumith

I second that :) Can't wait to try this out! With the pervasiveness of 3x3 convolutions nowadays, this could be a game changer.

benanne avatar Oct 22 '15 08:10 benanne

Hi, I'm one of the developers who worked on this package. I've looked at the run.sh and the only suggestion I have is to enable OpenMP thread affinity by setting KMP_AFFINITY=compact,granularity=fine (assuming that the CPU has HyperThreading enabled). This probably should be done for the baseline run as well.

Looking at the logs, I see that the speedups for the convolution layers are not as high as we'd expect, but we never ran on a 6-core machine, so maybe our expectations are wrong. The CPU convolution layers often call tall and skinny SGEMMs which have limited scalability for a 2x18-core machine. But on a 6-core machine the gap between the SGEMM-based convolution and the approach we used may be much more narrow.

Also, it's weird that the fc layers run slower in the new package because we did not modify that code.

Here's a link to a whitepaper explaining how the CPU changes frequency when executing AVX2 instructions.

rsdubtso avatar Oct 22 '15 13:10 rsdubtso

Thanks, @rsdubtso. We found that whitepaper, but it only explicitly mentions Xeon E5 v3 processors. Are other processors (e.g. i7) affected by AVX2 frequencies, and if so, where can we find documentation of the AVX2 frequencies for those processors? Anyway, I opened an Intel forum ticket for it here: https://communities.intel.com/thread/87851

andravin avatar Oct 22 '15 16:10 andravin

Nice work Scott! Looking forward to playing with it

nicolasvasilache avatar Oct 22 '15 16:10 nicolasvasilache

@scott-gray strikes again! Well done.

andravin avatar Oct 22 '15 17:10 andravin

@soumith @rsdubtso

The numbers are not as impressive anymore (as expected).

  • Caffe + MKL - ~5100 ms
  • IntelCaffe + MKL - ~3052 ms
  • Speedup: 1.67x

Actually, there is tremendous improvement in the convolutional layers (still far from the 614 GFLOP/s peak), even bigger improvements in the pool and activation layers (don't matter much), a huge regression in the fc layers - should be easy to fix, and a huge regression in the data layer (even easier to fix). Caffe/MKL is also much faster than the Caffe/OpenBLAS you benchmarked earlier. There is a timing bug in conv1 backward (implausible GFLOP/s). Also, this benchmark was run with minibatch=256, inconsistent with all the others, where minibatch=128.

i7-5930K (614 GF/s peak)

layer                     MFLOP   OpenBLAS            Caffe+MKL           IntelCaffe
                                  ms        GF/s      ms        GF/s      ms        GF/s
conv1 forward:            141     304.795   59.2      357.069   101.1     88.196    409.3
conv1 backward:           282     536.807   67.2      330.767   218.3     93.893    768.9
conv1/relu forward:               21.8936             47.1887             7.79
conv1/relu backward:              28.5025             57.7991             12.544
pool1/3x3_s2 forward:             85.0495             216.237             8.542
pool1/3x3_s2 backward:            45.7551             92.3998             18.194
conv2/5x5_s1 forward:     448     361.393   158.7     533.456   215.0     251.792   455.5
conv2/5x5_s1 backward:    896     687.499   166.8     1007.32   227.7     684.775   335.0
conv2/relu forward:               15.8821             32.5319             5.629
conv2/relu backward:              20.6075             41.8427             9.129
pool2/3x3_s2 forward:             67.7179             138.228             6.104
pool2/3x3_s2 backward:            35.3347             71.5279             13.084
conv3/3x3_s1 forward:     224     254.672   112.6     207.55    276.3     126.165   454.5
conv3/3x3_s1 backward:    448     385.527   148.7     415.731   275.9     285.18    402.2
conv3/relu forward:               7.8402              15.1693             2.503
conv3/relu backward:              9.814               19.5894             4.08
conv4/3x3_s1 forward:     299     424.084   90.2      321.758   237.9     169.798   450.8
conv4/3x3_s1 backward:    598     660.748   115.8     658.584   232.5     382.091   400.7
conv4/relu forward:               5.3955              10.159              1.559
conv4/relu backward:              6.7793              13.0562             2.705
conv5/3x3_s1 forward:     199     282.846   90.1      218.284   233.4     113.286   449.7
conv5/3x3_s1 backward:    398     428.887   118.8     435.634   233.9     256.046   397.9
conv5/relu forward:               5.4022              10.1855             1.583
conv5/relu backward:              6.4006              12.9499             2.86
pool5/3x3_s2 forward:             34.1529             53.2655             1.958
pool5/3x3_s2 backward:            15.0692             31.1371             4.043
fc6 forward:              75      41.5847   232.4     42.6512   453.1     72.547    266.4
fc6 backward:             151     79.5084   243.1     77.2451   500.4     132.581   291.6
fc7 forward:              34      18.6208   230.7     20.1991   425.3     33.675    255.1
fc7 backward:             67      29.3293   292.9     37.3001   460.6     61.519    279.3
fc8 forward:              8       4.2152    248.8     5.7591    364.1     8.904     235.5
fc8 backward:             16      7.5515    277.7     9.0703    462.4     15.573    269.3
Average Forward:          1428    1935.58   94.4      2259.9    161.8     1026.8    356.1
Average Backward:         2856    2984.16   122.5     3312.14   220.8     1982.95   368.8
Average Forward-Backward: 4285    4919.8    111.5     5572.1    196.9     30845     355.6
Total Time:                                           55721               30845

ozabluda avatar Oct 22 '15 19:10 ozabluda

@ozabluda just note that IntelCaffe uses "minimum time over all runs" for the per-layer numbers, whereas regular Caffe uses "average time over all runs". That's one reason why I didn't do a per-layer breakdown.

soumith avatar Oct 22 '15 19:10 soumith

@soumith, does it really matter? Typically only the first iteration differs much from the others.

ozabluda avatar Oct 22 '15 19:10 ozabluda

@scott-gray:

In related news, I just finished the first winograd fprop/bprop fp32 kernel. It is fully fused and requires no additional memory. But the big news is that it runs fastest at a minibatch size of 8. And by fast I mean close to 10 virtual Tflops. It is full utilization and is primarily power limited. The tile size is K32xN8 so it should be pretty versatile over a wide range of dimensions. Even C=3 performance is pretty good at a 1.7 vTflops. I have a fair amount of tuning I want to try with it to see if I can boost cache performance,

Awesome. Fastest at a minibatch size of N=8 is awesome, but weird (cache performance?), because, in addition to less work for the GPU, you amortize filter transforms over a smaller N.

10 virtual Tflops / 6.144 actual Tflops = 163% "utilization" (using @andravin's terminology). Why so little? Gimme, gimme, gimme :-). Assuming you implemented F(2x2,3x3), the max theoretical utilization [1] is (60+4)/(16+4)=320% [2]. For N=8, we can't neglect the filter transform (28 FLOP): (60+4)/(16+4+28/8)=272%.

[1] by my calculation, different from paper, please correct me if I am wrong

[2] except for the first layer (conv1.1), where we can't neglect the 24 FLOP in the inverse transform amortized over only C=3 input channels, and there are only 2 reductions per output per input channel, for a total max theoretical utilization of (60+2 * 4/3)/(16+(2 * 4+24)/3)=235%. For N=8: (60+2 * 4/3)/(16+(2 * 4+24)/3 + 28/8) = 208%. On GPU (but not CPU) conv1.1 is i/o bound anyway, so no utilization improvement is actually possible.
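All four estimates as a quick Python sketch (same FLOP conventions as above, including footnotes [1] and [2]):

full = (60 + 4) / (16 + 4.0)                       # 3.20x for K>>1, N>>1
n8 = (60 + 4) / (16 + 4 + 28 / 8.0)                # 2.72x with filter xform, N=8
conv11 = (60 + 2 * 4 / 3.0) / (16 + (2 * 4 + 24) / 3.0)  # 2.35x, conv1.1, C=3
conv11_n8 = (60 + 2 * 4 / 3.0) / (16 + (2 * 4 + 24) / 3.0 + 28 / 8.0)  # 2.08x
print(full, n8, conv11, conv11_n8)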

ozabluda avatar Oct 22 '15 20:10 ozabluda

@rsdubtso your suggested flags didn't make much difference -- IntelCaffe went from 3052 ms to 3000 ms

soumith avatar Oct 22 '15 20:10 soumith

@ozabluda: Yes this is F(2x2,3x3). This requires a batch of 16 gemms. I'm able to fit this all in one block for K=32 and 4 overlapping coordinates of x,y, each with 8 units of minibatch. So really it's 16 32x32 tiles. The in-block overlap is key as that's what gives you such a high L1 hit rate, otherwise you'd be bandwidth bound on L2. I use the standard 8-register gemm blocking so that means 64 FFMA's per outer product. But instead of having 1 big loop of 8 outer products, I split it in two loops of 4 (or 256 FFMAs each). 1 loop (128 threads) does the image transform inline and the other (128 threads) the filter transform (256 threads total). I can fit two blocks on an SM to cover bar.sync latencies.

Anyway, all the transform logic, pointer arithmetic, predicating and loop logic requires 138 clock-consuming instructions interspersed with the FFMAs. This drops performance about 138/512=27%. This kernel is so dense with memory operations (even if they're mostly cache hits) that there's little opportunity for the boost clock to add much performance. I even have a bit of instruction cache thrashing going on because the total loop size is slightly over the size of the instruction cache.

With more shared memory and/or registers I'd have a lot more headroom to increase the tile size a bit and reduce bandwidth (as well as transform overhead). Perhaps Pascal will provide that.

Anyway, I'll have a much more detailed write up forthcoming (probably as an addition to Andrew's paper).

scott-gray avatar Oct 22 '15 21:10 scott-gray

Awesome, looking forward to that!

benanne avatar Oct 22 '15 21:10 benanne

@rsdubtso,

Hi, I'm one of the developers who worked on this package. [...] Looking at the logs, I see that the speedups for the convolution layers are not as high as we'd expect, but we never ran on a 6-core machine, so maybe our expectations are wrong. The CPU convolution layers often call tall and skinny SGEMMs which have limited scalability for a 2x18-core machine. But on a 6-core machine the gap between the SGEMM-based convolution and the approach we used may be much more narrow.

Thank you very much for your work. Intel's announcement was on 2xE5-2699v3 (18 core):

Forward pass: 1449 MFLOP * 731 images/sec = 1.059 TFLOP/s
Forw+Backw pass: 3 * 1449 MFLOP * 271 images/sec = 1.178 TFLOP/s

I estimate peak FLOP/s for E5-2699v3 like so: 18 cores * 32 FLOP/cycle * 1.9 GHz (AVX base clock) = 1.094 TFLOP/s

The performance numbers above look to me much more like ~100% utilization on 1 CPU with very poor scalability to 2 CPUs than 50% utilization on 1 CPU with excellent scalability to 2 CPUs. It would be nice to have performance numbers for 1 CPU (especially the more "normal" 16-core E5-2698v3 and such).

FWIW, on my dual 8-core E5-2640v3 2.60 GHz, scalability of Caffe/OpenBLAS to the second CPU is almost zero. On 1 CPU, scalability from 4 to 8 cores is so poor that my 4-core i5-4670K 3.50 GHz outperforms it by 1.5-2.2x in convolutional layers. I didn't try MKL yet.

ozabluda avatar Oct 22 '15 21:10 ozabluda

@ozabluda F(2x2,3x3) has a maximum speedup of (2x2x3x3)/(4x4) = 2.25. In general the max speedup for F(mxn, rxs) is (m n r s) / ((m+r-1)(n+s-1))
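A quick sketch of that formula in Python:

def max_speedup(m, n, r, s):
    # direct multiplies per F(m x n, r x s) tile vs. Winograd multiplies
    return (m * n * r * s) / float((m + r - 1) * (n + s - 1))

print(max_speedup(2, 2, 3, 3))  # 2.25
print(max_speedup(4, 4, 3, 3))  # 4.0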

andravin avatar Oct 22 '15 22:10 andravin

@andravin

F(2x2,3x3) has a maximum speedup of (2x2x3x3)/(4x4) = 2.25

This is the number in the paper below eq. (9). But this is multiplications only. What about additions? Standard direct convolution also uses 6*4=24 additions, for a total of 36+24=60 FLOP.

Both direct and Winograd convolutions also use 4 unamortized additions (1 reduction per output, asymptotically for K>>1), for the max possible utilization: (60+4)/(16+4)=3.2

It's quite possible that I got it wrong. Grateful for corrections.

ozabluda avatar Oct 22 '15 23:10 ozabluda

Because multiplication, addition, and multiply accumulate all have the same throughput, I count them all equally. That not only makes the analysis simpler, but gives you a more accurate accounting of how many arithmetic instructions you will need to execute in order to implement the algorithm.

In any case, if you wanted to count FLOPs instead of floating point instructions (FLIPs?), you would have to count the additions used in the reductions across channels, which make up for the additions that are missing from your accounting of Winograd convolution FLOPs.

andravin avatar Oct 23 '15 00:10 andravin

I see. It would totally make sense to introduce FLIP, AKA "arithmetic complexity" from your paper. Direct uses 36 FLIP and Winograd F(2x2,3x3) uses 16 FLIP. We can't use "FLOP", because existing terminology is too ingrained with 2 FLOP per 1 MAC/FMAC. For example, in this thread, this is how Titan-X has 6.144 TFLOP/s (3.072 TFLIP/s), Haswell can do 32 FLOP/clock (16 FLIP/clock), AlexNet (1-col) forward pass w/direct convolutions has 1428 MFLOP (714 MFLIP) per image, etc.

FLIP (additions) may be slightly cheaper than FLOP, because they generate less heat, and the CPU/GPU may clock higher, maybe for E5-2699v3 in the range of 1.9 GHz (AVX base clock) vs 2.6 GHz (AVX Max All Core Turbo Frequency): 2.6/1.9=1.4

@scott-gray's, sorry I couldn't quite follow, do you count this in FLOP or FLIP:

138 clock consuming instructions interspersed with the FFMAs. This drops performance about 138/512=27%

FLIP: 2.25 * (1-138/512)=1.64 (i.e. you are already at the theoretical (practical) max)
FLOP: 3.2 * (1-138/512)=2.34 (i.e. ???)

ozabluda avatar Oct 23 '15 03:10 ozabluda

2.25(1-138/512)=1.64 was how I was calculating it. Basically any instruction in the gemm loop that isn't dual issued dilutes the number of FFMA's that can be processed. In this case there are a lot (138) but it turns out to not be such a bad thing as this kernel is right on the edge of being bandwidth limited. I'm working on a scanning back and forth square wave block id remapping to see if that increases cache hits and drops power use a bit so the boost clock can kick in more. I'll also do the fp16 version too. I'm pretty sure that will have a lot more headroom. Though this kernel has very few remaining instruction slots to insert the F2F.F32.F16s (dual issued) but I'm pretty sure I can squeak them in.

scott-gray avatar Oct 23 '15 03:10 scott-gray

@andravin,

Are other processors (eg i7) affected by AVX2 frequencies, if so where can we find documentation of the AVX2 frequencies for those processors?

Probably the CPU support folks will have a better answer than I can find.

@soumith,

your suggested flags didn't make much difference -- IntelCaffe went from 3052 ms to 3000 ms

Thanks. That means that the OS already did a fine job scheduling the threads...

@ozabluda,

Performance numbers above look to me much more like ~100% utilization on 1 CPU with very poor scalability to 2 CPUs than 50% utilization on 1 CPU with excellent scalability to 2 CPU. It would be nice to have performance numbers for 1 CPU (especially more "normal" 16core E5-2698v3 and such)

Thanks, this is an interesting observation. If I interpret you correctly, you're saying that we're running at 50% efficiency of the whole machine. My anecdotal evidence from the times when we were tuning the benchmark is that the conv layers do scale pretty well with the number of threads, but I do not have any numbers handy. I'll post new numbers from single-socket runs next week.

I'll also try to find out if we see the regression in fc layers in our setup. This came as quite a surprise...

rsdubtso avatar Oct 23 '15 14:10 rsdubtso

@scott-gray, I think your 10 vTFLOP/s, 163% utilization for F(2x2,3x3), K32xN8 is more impressive than you modestly describe. For N>>1, K>>1 the max theoretical utilization is 36/16=2.25. But for N=8, you can't neglect the filter transform (28 FLIP from the paper), amortized over N. For K=32, you can't neglect the data transform (32 FLIP from the paper), amortized over K: 36/(16+28/8+32/32)=1.76 [1]. Those 28+32=60 FLIP are part of your 138 "apparent overhead" instructions, but only 138-60=78 instructions are "true overhead", i.e. 78/512=15%. Or am I counting this incorrectly: should I be assuming that the filter transform could have been reused across tiles (when N>8), ditto the data transform (when K>32)?

[1] As @andravin describes in the paper, data transform over overlapped regions can be reused, and instead of 32, I counted min theoretical 24 additions (in his CPU implementation he has 80/3=26.7 additions), but it doesn't matter in this case: 36/(16+28/8+24/32)=1.78

The tile size is K32xN8 so it should be pretty versatile over a wide range of dimensions. Even C=3 performance is pretty good at a 1.7 vTflops.

this must be a typo. For C=3, you can't neglect the inverse transform, amortized over C, which is 24 additions (from the paper): 36/(16+28/8+32/32+24/3)=1.26. 1.26 * 6.144 TF/s = 7.76 vTF/s. You probably meant 6.7 vTflops.
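The same estimates as a quick Python sketch (FLIP counts from the paper: 28 filter, 32 data, 24 inverse):

K, N = 32, 8
c_large = 36 / (16 + 28.0 / N + 32.0 / K)             # ~1.76x for C >> 3
c_eq_3 = 36 / (16 + 28.0 / N + 32.0 / K + 24 / 3.0)   # ~1.26x for C = 3
print(c_large * 6.144, c_eq_3 * 6.144)                # ~10.8 / ~7.76 vTF/s on a 6.144 TF/s Titan-X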

ozabluda avatar Oct 23 '15 17:10 ozabluda

Sharing transform code from overlapped regions is much harder in practice than it may seem. Working around all the constraints means you need an additional pass through shared memory or perhaps warp shuffles.. which adds more overhead. What I have right now with each thread computing one transform works pretty well.

As far as performance goes, the bigger your tile size the fewer repeated transforms you have to make, but there's only so much I can fit in the limited shared memory available. I could do a non-fused kernel and only do the minimum number of transforms, but that also adds a lot of overhead and I'm pretty sure it won't run as fast as a fused kernel that is computing extra transforms inline. On the filter side there's also the benefit of only needing 9 loads instead of 16 if you do the transform inline.

But counting the non-ffma clock-consuming instructions is all you need to figure out max performance. And this bears out in my testing.

For C=3, that's just 1 pass through the gemm loop at 3/4 utilization. But all the gemm setup + output code also has overhead that's hugely amplified by such a small time spent in gemm.

scott-gray avatar Oct 23 '15 17:10 scott-gray

@scott-gray on reddit:

Once I have these kernels well tuned I'll move onto the much more complicated F(4x4,3x3) transform where as much as a 4x speedup may be possible (though on the GPU there's no avoiding the small transform overhead or the inefficiencies in the awkward 6x6 dimensions).

As paper says, for F(4x4,3x3), everything infinite, asymptotically, speedup would indeed be 144/36=4.

But for K32xN8 tiling block (can it be that large?), taking floating point instruction counts from the paper, theoretical max possible utilization for is:

For C>>3: 144/(36+156/32+72/8)=2.9 For C==3: 144/(36+156/32+72/8+90/3)=1.8

ozabluda avatar Oct 24 '15 17:10 ozabluda

I clarified the tiling I was using in a later post here. It's actually 32x32, not 32x8. The 32x8 is what is visible to the user, but 32x32 is how it actually works. The outer product dims of the batched gemm are K and Y/4_X/4_N. So I don't just have 8 points of N on the outer product, but 4 sets of x,y coordinates of 8 points of N arranged in a 2x2 superblock. With the 2 units of overlap in each direction, this hugely increases the utilization of the L1 cache and it's what makes it possible for this kernel to have such dense global loads (16 loads in ~256 cycles is a lot).

I'm actually working on a 2x1 superblock for fp16 (2xy points of 16n) so as to eliminate the half empty 32 byte transaction size.

scott-gray avatar Oct 24 '15 18:10 scott-gray

I clarified the tiling I was using in a later post here. It's actually 32x32, not 32x8. The 32x8 is what is visible to the user, but 32x32 is how it actually works.

I think I kinda understand the main idea of how you get high L1 utilization, removing the L2 bottleneck, but I don't understand how this can help with the max theoretical peak FLIP/s calculation I am making. You still can't amortize the filter transform over the "effective" N=32, only over the real N=8. Or can you?

ozabluda avatar Oct 24 '15 19:10 ozabluda

x and y also factor into the number of image transforms you need, not just n. So 32 is the unit you need to use when calculating redundant transforms.

scott-gray avatar Oct 24 '15 19:10 scott-gray

Aha! I get it now.

For F(4x4,3x3) the correct formula for K32xN8, X2xY2 (=4 tiles) is:

For C>>3: 144/(36+156/32+72/4/8)=3.3
For C==3: 144/(36+156/32+72/4/8+90/4/3)=2.8

For the overlapped data transform, the correct number of FLIP is actually smaller than 156.

In the last convolutional layer of VGG, the image dimension is 6x6, preventing X2xY2 superblock tiling. For that layer:

For C>>3: 144/(36+156/32+72/8)=2.9

For F(2x2,3x3) the correct formula for K32xN8, X2xY2 (=4 tiles) is:

For C>>3: 36/(16+32/32+28/4/8)=2.01 (1.63 actually achieved)
For C==3: 36/(16+32/32+28/4/8+24/4/3)=1.81

For the overlapped data transform, the correct number of FLIP is actually smaller than 32, but, since it's at least 24 (by my calculation), it doesn't matter for K=32:

For C>>3: 36/(16+24/32+28/4/8)=2.04
For C==3: 36/(16+24/32+28/4/8+24/4/3)=1.83
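All four superblock cases as a quick Python sketch (the /4 factors are the transforms amortized over the X2xY2 superblock; I use my 24-FLIP data transform count for F(2x2,3x3)):

f4_big_c = 144 / (36 + 156 / 32.0 + 72 / 4.0 / 8)              # ~3.3x, F(4x4,3x3), C>>3
f4_c3 = 144 / (36 + 156 / 32.0 + 72 / 4.0 / 8 + 90 / 4.0 / 3)  # ~2.8x, C=3
f2_big_c = 36 / (16 + 24 / 32.0 + 28 / 4.0 / 8)                # ~2.04x, F(2x2,3x3), C>>3
f2_c3 = 36 / (16 + 24 / 32.0 + 28 / 4.0 / 8 + 24 / 4.0 / 3)    # ~1.83x, C=3
print(f4_big_c, f4_c3, f2_big_c, f2_c3)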

ozabluda avatar Oct 24 '15 20:10 ozabluda

@scott-gray

The tile size is K32xN8 so it should be pretty versatile over a wide range of dimensions. Even C=3 performance is pretty good at a 1.7 vTflops.

initially, I thought this was a typo (as Titan-X has 6.144 real Tflops). Now I think it may mean an awesome 1.7x utilization (for C=3, the theoretical max utilization is 1.8, see previous comment), although it's weird, because C=3 is i/o bound.

ozabluda avatar Oct 24 '15 23:10 ozabluda

@ozabluda, here's some data from a 2xE5-2697v3 machine (sorry, did not have a desktop machine with a proper OS handy).

My colleague timed IntelCaffe on 14 and 28 cores (1 and 2 sockets). Affinity setup: KMP_AFFINITY=granularity=fine,compact,1,0. MKL version was 11.3.0.

There are quite a few cases where the ratio is less than 2 and even some cases where it is less than 1, but the most time-consuming layers have scaled pretty well. The total ratio is 1.83.

layer / direction        ms (14 cores)   ms (28 cores)   ratio 14/28
data forward: 79.75 69.87 1.14
data backward: 0.00 0.00 N/A
pack1 forward: 9.35 4.74 1.97
pack1 backward: 0.00 0.00 2.00
conv1 forward: 133.78 66.79 2.00
conv1 backward: 101.10 56.35 1.79
relu1 forward: 12.29 5.94 2.07
relu1 backward: 17.14 8.42 2.04
norm1 forward: 45.06 22.82 1.97
norm1 backward: 67.51 32.20 2.10
pool1 forward: 16.82 8.57 1.96
pool1 backward: 27.53 13.77 2.00
conv2 forward: 163.55 107.55 1.52
conv2 backward: 416.93 208.84 2.00
relu2 forward: 7.83 3.77 2.08
relu2 backward: 10.92 5.34 2.05
norm2 forward: 28.15 14.92 1.89
norm2 backward: 43.41 20.60 2.11
pool2 forward: 10.23 5.31 1.93
pool2 backward: 17.34 8.76 1.98
conv3 forward: 105.66 52.77 2.00
conv3 backward: 228.13 114.76 1.99
relu3 forward: 2.22 1.02 2.17
relu3 backward: 3.63 1.82 1.99
conv4 forward: 81.77 40.92 2.00
conv4 backward: 176.46 88.66 1.99
relu4 forward: 2.21 0.88 2.51
relu4 backward: 3.71 1.81 2.04
conv5 forward: 56.16 28.06 2.00
conv5 backward: 120.98 60.74 1.99
relu5 forward: 1.26 0.42 2.96
relu5 backward: 2.79 0.82 3.40
pool5 forward: 2.30 1.06 2.17
pool5 backward: 3.82 0.81 4.73
unpack6 forward: 0.35 0.17 2.02
unpack6 backward: 0.24 0.19 1.30
fc6 forward: 23.39 13.01 1.80
fc6 backward: 41.84 22.64 1.85
relu6 forward: 0.06 0.14 0.38
relu6 backward: 0.12 0.05 2.58
drop6 forward: 4.06 4.04 1.00
drop6 backward: 0.12 0.07 1.68
fc7 forward: 11.14 5.99 1.86
fc7 backward: 18.95 10.08 1.88
relu7 forward: 0.05 0.13 0.39
relu7 backward: 0.09 0.05 1.90
drop7 forward: 4.04 4.04 1.00
drop7 backward: 0.16 0.09 1.69
fc8 forward: 2.59 1.87 1.38
fc8 backward: 5.26 3.10 1.70
loss forward: 11.03 11.27 0.98
loss backward: 0.18 0.22 0.79
Min Forward 818.78 477.70 1.71
Min Backward 1309.69 661.51 1.98
Min Forward-Backward: 21607.00 11820.00 1.83

rsdubtso avatar Oct 27 '15 06:10 rsdubtso

@rsdubtso Taking the minimum timing of each layer rather than the average is a bit misleading and is not standard in benchmarking. I think you should consider changing that, even though the overall difference might be minor.

soumith avatar Oct 27 '15 14:10 soumith

@rsdubtso, thank you, these are great. I see great scalability to 2 sockets, with ~50% utilization (either one or two sockets), exactly the opposite of my earlier guesses. The next natural experiments would be to run it on 1, 2, 4, 8 cores to see where utilization breaks down (are you using AVX2 MADD?)

E5-2697v3 has: 14 cores * 32 FLOPs/cycle * 2.2 GHz (AVX core freq) = 986 GFLOP/s (AVX boost goes from 2.9-3.3 GHz, depending on the number of cores active)

Note that you ran 2-col AlexNet with minibatch=256, while @soumith ran 1-col AlexNet with minibatch=256

IntelCaffe mb=256, E5-2697v3

layer                 MFLOP   14-core ms   Util    28-core ms   Util
conv1 forward: 211 133.78 41% 66.79 41%
conv1 backward: 422 101.1 108% 56.35 97%
conv2 forward: 448 163.55 71% 107.55 54%
conv2 backward: 896 416.93 56% 208.84 56%
conv3 forward: 299 105.66 73% 52.77 74%
conv3 backward: 598 228.13 68% 114.76 68%
conv4 forward: 224 81.77 71% 40.92 71%
conv4 backward: 448 176.46 66% 88.66 66%
conv5 forward: 150 56.16 69% 28.06 69%
conv5 backward: 300 120.98 64% 60.74 64%
fc6 forward: 75 23.39 84% 13.01 75%
fc6 backward: 75 41.84 47% 22.64 43%
fc7 forward: 34 11.14 79% 5.99 74%
fc7 backward: 34 18.95 47% 10.08 44%
fc8 forward: 8 2.59 80% 1.87 56%
fc8 backward: 8 5.26 39% 3.1 34%
Conv+fc Forward 1449 578.04 65% 316.96 59%
Conv+fc Backward 2781 1109.65 65% 565.17 64%
Conv+fc Forward-Backward: 4231 1687.69 51% 882.13 46%

ozabluda avatar Oct 27 '15 18:10 ozabluda

@rsdubtso The data transfer time seems very long and does not scale well. Could you offer more details on how it is designed? Thanks!

gujunli avatar Oct 27 '15 19:10 gujunli

@rsdubtso Also, the relu forward seems much slower on two sockets. Why is that? And drop6 and drop7 seem to still use one socket even when you have two sockets? The scaling ratio is 1.

gujunli avatar Oct 27 '15 19:10 gujunli

@andravin

Are other processors (eg i7) affected by AVX2 frequencies, if so where can we find documentation of the AVX2 frequencies for those processors?

Probably the CPU support folks will have a better answer that I can find.

I asked around, and here's what I was told: AVX frequency is not SW-visible. But even desktop processors have a fused 'AVX' frequency that they throttle down to when executing heavy instructions. I could not find the fused frequency for the i7 CPU mentioned above, but you can find it out using prime95 v27.9 or later, for example. However, current-related throttling may occur before you hit the TDP budget limit related to heavy instructions.

rsdubtso avatar Oct 28 '15 06:10 rsdubtso

Hi all, I worked with @rsdubtso on the package too.

@soumith, you are right, we should've pointed out that we report timings for the fastest iteration. Though, if you use the same package for comparing 'intel_alexnet' and 'bvlc_alexnet', the comparison will be quite representative.

@gujunli relu1-5 scale well; relu6-7 seem to be too small for scaling across sockets. drop6 and drop7 use an rng (not parallelized), which most likely takes most of the time. We didn't optimize the drop layer, except for adding parallelization on the loop.

@ozabluda, @gujunli, I reran the package on the same machine @rsdubtso did. The only change here is that I put the database on /tmp (local hard drive). @rsdubtso reported timings when the DB was on Lustre FS (a distributed cluster filesystem). That was the reason why the timings were poor for the data layer. We didn't change the data layer much, only added simple parallelization of the image-minibatch preparation.

Iterations: 10

layer direction omp omp omp omp omp cmp cmp cmp cmp
28 14 8 4 2 28.vs.14 14.vs.8 8.vs.4 4.vs.2
-------- -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
data forward: 18.51 23.93 24.57 24.94 29.82 0.64 0.58 0.50 0.59
data backward: 0 0 0 0.00 0 N/A N/A N/A 0
pack1 forward: 4.70 9.41 10.11 14.26 25.03 1.00 0.61 0.70 0.87
pack1 backward: 0.00 0.00 0.00 0.00 0.00 0.5 0.57 0.5 0.5
conv1 forward: 66.96 133.95 212.41 345.24 612.40 1.00 0.90 0.81 0.88
conv1 backward: 56.47 101.13 170.32 328.05 650.71 0.89 0.96 0.96 0.99
relu1 forward: 5.96 12.46 12.29 12.86 19.19 1.04 0.56 0.52 0.74
relu1 backward: 8.4 17.15 17.41 20.12 32.11 1.02 0.58 0.57 0.79
norm1 forward: 22.99 44.78 64.66 126.46 251.12 0.97 0.82 0.97 0.99
norm1 backward: 31.86 63.13 67.77 95.95 169.04 0.99 0.61 0.70 0.88
pool1 forward: 8.44 16.41 27.14 54.10 106.53 0.97 0.94 0.99 0.98
pool1 backward: 13.90 27.66 27.97 34.21 54.56 0.99 0.57 0.61 0.79
conv2 forward: 105.79 164.55 282.32 561.36 1120.98 0.77 0.98 0.99 0.99
conv2 backward: 208.96 416.54 712.46 1415.67 2826.54 0.99 0.97 0.99 0.99
relu2 forward: 3.79 7.81 7.76 8.32 12.37 1.02 0.56 0.53 0.74
relu2 backward: 5.35 10.90 11.19 12.94 20.60 1.01 0.58 0.57 0.79
norm2 forward: 14.71 28.47 41.42 81.56 162.21 0.96 0.83 0.98 0.99
norm2 backward: 20.93 40.58 43.59 60.95 108.32 0.96 0.61 0.69 0.88
pool2 forward: 5.25 10.19 16.94 33.66 66.53 0.97 0.95 0.99 0.98
pool2 backward: 8.76 17.97 17.89 21.02 33.87 1.02 0.56 0.58 0.80
conv3 forward: 52.78 105.76 182.96 363.68 725.27 1.00 0.98 0.99 0.99
conv3 backward: 115.63 228.32 396.63 784.72 1562.49 0.98 0.99 0.98 0.99
relu3 forward: 1.04 2.27 2.26 2.73 4.27 1.09 0.56 0.60 0.78
relu3 backward: 1.88 3.62 3.95 4.55 7.17 0.96 0.62 0.57 0.78
conv4 forward: 40.89 81.86 139.66 275.17 547.34 1.00 0.97 0.98 0.99
conv4 backward: 88.91 176.64 301.33 595.64 1183.89 0.99 0.97 0.98 0.99
relu4 forward: 0.89 2.21 2.24 2.68 4.55 1.23 0.57 0.59 0.84
relu4 backward: 1.82 3.76 3.91 4.60 7.18 1.02 0.59 0.58 0.77
conv5 forward: 28.07 56.34 94.83 185.45 368.19 1.00 0.96 0.97 0.99
conv5 backward: 60.71 120.89 204.81 401.62 797.17 0.99 0.96 0.98 0.99
relu5 forward: 0.42 1.20 1.33 1.65 3.00 1.43 0.63 0.61 0.90
relu5 backward: 0.81 2.78 2.80 3.11 4.79 1.71 0.57 0.55 0.76
pool5 forward: 1.06 2.25 3.72 7.35 14.68 1.06 0.94 0.98 0.99
pool5 backward: 0.81 3.85 3.79 4.57 7.23 2.37 0.56 0.60 0.79
unpack6 forward: 0.16 0.37 0.48 0.81 1.59 1.17 0.73 0.83 0.98
unpack6 backward: 0.18 0.24 0.38 0.74 1.46 0.67 0.88 0.97 0.98
fc6 forward: 13.01 23.54 36.53 72.21 137.74 0.90 0.88 0.98 0.95
fc6 backward: 22.70 41.92 67.19 133.06 264.57 0.92 0.91 0.99 0.99
relu6 forward: 0.14 0.06 0.03 0.06 0.13 0.21 0.33 0.94 0.96
relu6 backward: 0.04 0.11 0.09 0.19 0.36 1.25 0.46 1.06 0.94
drop6 forward: 4.01 4.09 4.11 4.28 4.61 0.50 0.57 0.52 0.53
drop6 backward: 0.07 0.12 0.18 0.37 0.72 0.86 0.85 1.00 0.95
fc7 forward: 6.05 11.21 16.81 33.46 65.28 0.92 0.85 0.99 0.97
fc7 backward: 10.07 18.99 30.49 60.42 118.79 0.94 0.91 0.99 0.98
relu7 forward: 0.12 0.05 0.03 0.06 0.13 0.20 0.39 0.94 0.96
relu7 backward: 0.05 0.09 0.12 0.17 0.32 0.88 0.72 0.71 0.94
drop7 forward: 4.01 4.08 4.08 4.28 4.60 0.50 0.57 0.52 0.53
drop7 backward: 0.08 0.17 0.20 0.38 0.73 1.00 0.67 0.91 0.96
fc8 forward: 1.85 2.59 4.47 8.55 15.92 0.69 0.98 0.95 0.93
fc8 backward: 3.09 5.28 7.53 14.66 28.83 0.85 0.81 0.97 0.98
loss forward: 11.19 11.00 10.82 10.91 10.73 0.49 0.56 0.50 0.49
loss backward: 0.22 0.17 0.17 0.17 0.20 0.39 0.56 0.50 0.57
all Forward 423.95 761.99 1206.21 2237.89 4318.62 0.89 0.90 0.92 0.96
all Backward 662.77 1303.28 2093.61 4000.3 7884.59 0.98 0.91 0.95 0.98
all Fwd-Bwd 11080 20916 33242 62807 122290 0.94 0.91 0.94 0.97

A small comment on the cmp columns: the reported formula for X.vs.Y is (time_Y/time_X)*(Y/X) -- i.e. the parallelization efficiency of scaling from Y to X threads.
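For concreteness, a tiny Python rendering of that formula (the norm1 backward numbers are taken from the table above):

```python
def parallel_efficiency(time_x, time_y, x, y):
    """Efficiency of scaling from y to x OpenMP threads: actual speedup
    (time_y / time_x) divided by the ideal speedup (x / y)."""
    return (time_y / time_x) * (y / x)

# norm1 backward from the table: 63.13 ms on 14 threads, 31.86 ms on 28
print(f"{parallel_efficiency(31.86, 63.13, 28, 14):.2f}")  # 0.99
```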

emfomenk avatar Oct 28 '15 07:10 emfomenk

@rsdubtso:

I could not find the frequency fused for the i7 CPU mentioned above, but you can find it out using prime95 v27.9 or later, for example. However, current-related throttling may occur earlier than you hit TDP budget limit related to heavy instructions.

Thank you for checking. Even though Intel's documentation does list current and power limits, I think all(?) Intel CPUs are in practice limited only by TDP. For example, overclocked Intel CPUs are known to suck 400W on Prime95, likely with long-term damage, and Intel CPUs don't prevent it, if cooled:

From official Asus overclocking guide: “”” In our testing to date, the average overclocked frequency for 5960X processors is 4.5GHz. Very good processors will achieve 4.6GHz fully stable with less than 1.30Vcore. […] Users should avoid running Prime95 small FFTs on 5960X CPUs when overclocked. Over 4.4GHz, the Prime software pulls 400W of power through the CPU. “”” http://rog.asus.com/365052014/overclocking/rog-overclocking-guide-core-for-5960x-5930k-5820k/

ozabluda avatar Oct 28 '15 17:10 ozabluda

@emfomenk, thanks for the excellent table. Looking only at the conv and fc layers, I see excellent scalability 2=>4=>8=>14=>28 in the conv layers (except 2=>4=>8=>14 in conv1 forward and 14=>28 in conv2 forward), and some degradation in scalability at 8=>14 and 14=>28 in the fc layers. Updating my utilization table for 2 cores, we see that conv+fc forward utilization improved 65%=>73% (maybe some of that is due to the AVX clock boost?), while conv+fc backward didn't improve much (65%=>68%). So on 2 cores utilization improves for forward but barely for backward. Now the only thing missing is 1 core :-)

An E5-2697v3 restricted to 2 cores has: 2 cores * 32 FLOP/cycle * 2.2 GHz (AVX core frequency) = 141 GFLOP/s (AVX turbo goes from 2.9 to 3.3 GHz, depending on the number of active cores)

IntelCaffe mb=256 E5-2697v3

layer            MFLOP   14-core ms  Util   28-core ms  Util   2-core ms   Util
conv1 forward: 211 133.78 41% 66.79 41% 612.4 63%
conv1 backward: 422 101.1 108% 56.35 97% 650.71 118%
conv2 forward: 448 163.55 71% 107.55 54% 1120.98 73%
conv2 backward: 896 416.93 56% 208.84 56% 2826.54 58%
conv3 forward: 299 105.66 73% 52.77 74% 725.27 75%
conv3 backward: 598 228.13 68% 114.76 68% 1562.49 69%
conv4 forward: 224 81.77 71% 40.92 71% 547.34 74%
conv4 backward: 448 176.46 66% 88.66 66% 1183.89 69%
conv5 forward: 150 56.16 69% 28.06 69% 368.19 74%
conv5 backward: 300 120.98 64% 60.74 64% 797.17 68%
fc6 forward: 75 23.39 84% 13.01 75% 137.74 100%
fc6 backward: 75 41.84 47% 22.64 43% 264.57 52%
fc7 forward: 34 11.14 79% 5.99 74% 65.28 95%
fc7 backward: 34 18.95 47% 10.08 44% 118.79 52%
fc8 forward: 8 2.59 80% 1.87 56% 15.92 91%
fc8 backward: 8 5.26 39% 3.1 34% 28.83 50%
Conv+fc Forward 1449 578.04 65% 316.96 59% 3593.12 73%
Conv+fc Backward 2781 1109.65 65% 565.17 64% 7432.99 68%
Conv+fc F/B: 4231 1687.69 51% 882.13 46% 11026.11 63%
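As a sanity check on the Util columns, a small sketch (it assumes, per the table header, that the MFLOP column is per image and the minibatch is 256):

```python
def utilization(mflop_per_image, time_ms, peak_gflops, batch=256):
    """Fraction of peak FLOP/s achieved by a layer."""
    achieved_gflops = mflop_per_image * batch / time_ms  # MFLOP/ms == GFLOP/s
    return achieved_gflops / peak_gflops

peak_2core = 2 * 32 * 2.2    # 140.8 GFLOP/s, per the formula above
peak_14core = 14 * 32 * 2.2  # 985.6 GFLOP/s

print(f"{utilization(211, 612.40, peak_2core):.0%}")   # conv1 forward, 2 cores  -> 63%
print(f"{utilization(211, 133.78, peak_14core):.0%}")  # conv1 forward, 14 cores -> 41%
```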

ozabluda avatar Oct 28 '15 20:10 ozabluda

Just a quick update: we recently released a technical preview of Multinode Caffe. The link: https://software.intel.com/en-us/articles/caffe-training-on-multi-node-distributed-memory-systems-based-on-intel-xeon-processor-e5

The results are shown for AlexNet. We use data parallelism for the first half of the net (from data through pool5) and model parallelism for the second half (from fc6 to the end). The behavior of Multinode Caffe almost duplicates the behavior of Singlenode Caffe. This puts some limitations on scalability. Still, we were able to achieve 12.3x, 19.2x and 29.4x speed-ups on 16, 32 and 64 nodes respectively.
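The scaling efficiency implied by those speed-ups, for reference:

```python
# speedup / nodes, from the numbers reported above
for nodes, speedup in [(16, 12.3), (32, 19.2), (64, 29.4)]:
    print(f"{nodes} nodes: {speedup / nodes:.0%}")  # 77%, 60%, 46%
```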

emfomenk avatar Nov 02 '15 15:11 emfomenk

@emfomenk, thank you for the summary. Sorry, I don't understand what you mean by

The behavior of Multinode Caffe almost duplicates the behavior of Singlenode Caffe. This puts some limitations on scalability.

I also don't understand from the article what the effective minibatch is for, say, 64 nodes. Is it still 256, i.e. 4 per node? For multinode synchronous SGD, it's probably best to switch to the 1-column AlexNet from the "One weird trick..." paper and follow the paper.

ozabluda avatar Nov 03 '15 18:11 ozabluda

@ozabluda, I mean that the algorithm in Multinode Caffe (the underlying math) is the same as in Singlenode Caffe: Forward, Backward, SGD, the same parameters and so on. In particular it means that there are not many possibilities to parallelize the work.

The only difference in the multinode version (from the math point of view) is a slightly modified SGD solver, which applies the diff right after the backward step of the current layer (this was done to be able to benefit from MPI parallelization in the current approach). This modification doesn't seem to affect convergence -- at least we were able to train AlexNet in the same number of iterations as in the singlenode case.

Regarding the minibatch: minibatch=256 was used for 16 nodes, minibatch=512 for 32 nodes, and minibatch=1024 for 64 nodes. So each node (in the 16-node case) took 256/16=16 images in its "local" minibatch.
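In other words, the per-node load is held constant (weak scaling):

```python
# Per-node "local" minibatch for each configuration mentioned above
for nodes, minibatch in [(16, 256), (32, 512), (64, 1024)]:
    print(f"{nodes} nodes: {minibatch}/{nodes} = {minibatch // nodes} images per node")
```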

Yes, you are right that there are much better ways to implement multinode training (though the math would be slightly different...), but the original idea was just to show that it is possible to implement good parallelization even for this particular model.

emfomenk avatar Nov 04 '15 15:11 emfomenk

I mean that the algorithm in Multinode Caffe (the underlying math) is the same as in Singlenode Caffe: Forward, Backward, SGD, the same parameters and so on. In particular it means that there are not many possibilities to parallelize the work.

I see. Does that mean it is approximately the same as single-node multi-GPU Caffe? What about the parameter update step? Is it centralized, or also distributed, like in single-node multi-GPU Caffe?

Article says """reached 80% top-5 accuracy in just over 5 hours on a 64-node""". Is that 90 epochs with minibatch=1024? AlexNet from the original paper reached 81.8% after 90 epochs with minbatch=128.

P.S. Graph incorrectly says "E5-2697 v3 18 cores"

ozabluda avatar Nov 04 '15 20:11 ozabluda

Article says """reached 80% top-5 accuracy in just over 5 hours on a 64-node""". Is that 90 epochs with minibatch=1024? AlexNet from the original paper reached 81.8% after 90 epochs with minbatch=128.

Correction: the 81.8% top-5 from the paper was obtained by averaging predictions over 5 crops plus their horizontal reflections. Standard Caffe "test" does 1 random crop with no reflections, for which 80.2-80.4% top-5 is reached in 60-70 epochs, depending. How many epochs did it take with minibatch=1024?

ozabluda avatar Nov 05 '15 17:11 ozabluda

@ozabluda, sorry, but I am not familiar with multi-GPU Caffe. I need to look at the code.

In Multinode Caffe, for the first half of the net the parameter updates are centralized (since parallelization happens over the minibatch, all convolution parameters are the same on all nodes). For the second half the updates are distributed, since the fully-connected layers' weights are distributed across the nodes.

Just to be aligned: one epoch == one full pass over the database. We always ran Caffe (singlenode and multinode versions) for 90 epochs (this number was simply fixed). We saw that accuracy didn't improve much after epoch ~40-50, but I didn't save intermediate snapshots and can't say for sure what the accuracy was after 60 or 70 epochs right now. If you want, I can rerun the training and report the top-5 accuracy for those epoch numbers.

emfomenk avatar Nov 05 '15 17:11 emfomenk

We always ran Caffe (singlenode and multinode versions) for 90 epochs

Great. I think the web article should say that explicitly, especially since it is actually faster than one could guess from """reached 80% top-5 accuracy""", which could mean as few as 40 epochs, as you noticed:

We saw that accuracy didn't improve much after epoch ~40-50 [...] If you want, I can rerun the training and report the top-5 accuracy for those epoch numbers.

Thank you for the offer, knowing that it's 90 epochs is good enough for me.

Off-topic part:

I am actually more interested in a number more precise than 80% (something like 80.xx% would be better) for minibatch=1024 [1], single model, single crop, top-5 and top-1 (Caffe can do both simultaneously). I am also interested in your final accuracy for minibatch=256 and 512. As you noticed, with a growing number of nodes you have to increase the minibatch size, which negatively affects accuracy.

[1] BTW, did you increase the learning rate 4x compared to minibatch=256? If yes, how did that affect accuracy? How about increasing the learning rate by sqrt(4)=2x?
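For reference, the two scaling heuristics this question contrasts, as a sketch (0.01 is the base_lr from bvlc_alexnet's solver.prototxt; which rule is better is exactly the open question):

```python
def scaled_lr(minibatch, base_lr=0.01, base_minibatch=256, rule="linear"):
    """Two common heuristics for growing the learning rate with the minibatch."""
    k = minibatch / base_minibatch
    return base_lr * (k if rule == "linear" else k ** 0.5)

print(scaled_lr(1024, rule="linear"))  # 0.04 -- the 4x in the question
print(scaled_lr(1024, rule="sqrt"))    # 0.02 -- the sqrt(4)=2x alternative
```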

ozabluda avatar Nov 05 '15 21:11 ozabluda

This somewhat explains how Intel's Multi-node Caffe works https://github.com/BVLC/caffe/pull/3252

ozabluda avatar Nov 10 '15 01:11 ozabluda

Please take a look at https://communities.intel.com/community/itpeernetwork/datastack/blog/2015/11/12/myth-busted-general-purpose-cpus-can-t-tackle-deep-neural-network-training-part-2 for more information on the technical details of the Intel Multinode Caffe tech preview, which actually uses one weird trick... :) Unfortunately we didn't play with the learning rate; it was always the same (the default from bvlc_alexnet/solver.prototxt).

emfomenk avatar Nov 17 '15 17:11 emfomenk

Speaking of accuracy, this could be used as a baseline: https://github.com/BVLC/caffe/wiki/Models-accuracy-on-ImageNet-2012-val

ducha-aiki avatar Nov 19 '15 15:11 ducha-aiki

There is now an official Intel OpenCL PR at https://github.com/BVLC/caffe/pull/3355. /cc @gongzg

bhack avatar Nov 21 '15 10:11 bhack

@scott-gray>Yes this is F(2x2,3x3). [...] I'm able to fit this all in one block for K=32 and 4 overlapping coordinates of x,y each with with 8 units of minibatch. [...] The in block overlap is key as that's what gives you such a high L1 hit rate, otherwise the you'd be bandwidth bound on L2.

With F(2x2,3x3) and a 2x2 superblock we have a tile size of 6x6. In the two other dimensions the tile size is K=32 x N=8. The outer loop is over input channels (C). With 4-byte fp32, for each 6x6 superblock (= tile) we have:

Filters: 32*3*3*4 = 1152 bytes
Input: 6*6*8*4 = 1152 bytes
Output: 4*4*32*8*4 = 16384 bytes
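Spelling out that arithmetic (a trivial check; fp32 = 4 bytes):

```python
# Per-tile working set for F(2x2,3x3) with a 2x2 superblock, K=32, N=8
filters = 32 * 3 * 3 * 4      # K=32 filters of 3x3 fp32     -> 1152 bytes
inputs  = 6 * 6 * 8 * 4       # 6x6 input tile, N=8 images   -> 1152 bytes
outputs = 4 * 4 * 32 * 8 * 4  # 4x4 outputs, K=32, N=8       -> 16384 bytes
print(filters, inputs, outputs)  # 1152 1152 16384
```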

1 loop (128 threads) does the image transform inline and the other (128 threads) the filter transform (256 threads total). I can fit two blocks on an SM to cover bar.sync latencies.

Do I understand correctly that the filters and input go to L1 (24 KB per SM) and the output is accumulated in registers (64K 32-bit registers per SM)? Do you use shared memory (96 KB per SM) at all? What limits it to two blocks on an SM?

I clarified the tiling I was using in a later post here. It's actually 32x32, not 32x8. The 32x8 is what is visible to the user, but 32x32 is how it actually works. The outer-product dims of the batched gemm are K and Y/4 * X/4 * N. So I don't just have 8 points of N on the outer product, but 4 sets of x,y coordinates, each with 8 points of N, arranged in a 2x2 superblock. With the 2 units of overlap in each direction, this hugely increases the utilization of the L1 cache, and it's what makes it possible for this kernel to have such dense global loads (16 loads in ~256 cycles is a lot).

Do I understand correctly that the 4 thread blocks (256 threads each) that work on the same 2x2 superblock really know nothing about each other, and rely solely on L1 for transparent data reuse?
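For anyone following along, a minimal NumPy check that the standard F(2x2,3x3) transform matrices from the Lavin & Gray paper (arXiv:1509.09308) reproduce direct correlation — this sketches only the math, not the blocking/SASS details discussed here:

```python
import numpy as np

# F(2x2,3x3) transform matrices from Lavin & Gray
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))  # one 4x4 input tile
g = rng.standard_normal((3, 3))  # one 3x3 filter

U = G @ g @ G.T          # filter transform (4x4)
V = BT @ d @ BT.T        # input/image transform (4x4)
Y = AT @ (U * V) @ AT.T  # 16 elementwise multiplies -> 2x2 output tile

# Direct 'valid' cross-correlation (what convnet conv layers compute)
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)])
assert np.allclose(Y, ref)
```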

ozabluda avatar Jul 07 '16 20:07 ozabluda

You can find the latest code for F(2x2,3x3) here:

https://github.com/NervanaSystems/neon/blob/master/neon/backends/kernels/sass/xconv_winograd_2x2_3x3_32x32.sass

This kernel uses 256 threads, 128 registers per thread, and 32 KB of shared memory. This means the threads and registers limit occupancy to 2 blocks per SM and 4 warps per scheduler.
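That occupancy claim can be checked with back-of-the-envelope arithmetic (assuming Maxwell GM200/Titan X per-SM limits: 2048 threads, 64K 32-bit registers, 96 KB shared memory, 4 schedulers):

```python
threads, regs_per_thread, smem_kb = 256, 128, 32  # kernel resources above

blocks_by_regs    = (64 * 1024) // (threads * regs_per_thread)   # 65536/32768 = 2
blocks_by_threads = 2048 // threads                              # 8
blocks_by_smem    = 96 // smem_kb                                # 3
blocks = min(blocks_by_regs, blocks_by_threads, blocks_by_smem)  # registers win: 2
warps_per_scheduler = blocks * (threads // 32) // 4              # 16 warps / 4 = 4
print(blocks, warps_per_scheduler)  # 2 blocks/SM, 4 warps/scheduler
```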

The shared memory is mainly used for storing the computed transforms and facilitating the batched gemm. The gemm tile is 32x32 and we have 16 of them in the same block. This means we only have enough shared memory to store 4 outer product lines at a time, double buffered. So the gemm loops are unrolled 4 times. We use 2 separate loops to compute the image and filter transforms inline.

When superblocking is in effect, you can get a lot of L1 cache hits, reducing the bandwidth demand on L2.

This implementation is currently significantly more efficient than the one found in cuDNN 5.0 and up.

scott-gray avatar Jul 07 '16 20:07 scott-gray