
Inconsistencies with nn

Open · szagoruyko opened this issue 8 years ago · 13 comments

Let's track them here:

  • There is no nn.SpatialLogSoftMax, should be addressed by https://github.com/torch/nn/pull/560
  • There is no nn.SpatialCrossEntropyCriterion ~~(and cudnn test is broken)~~
  • nn.TemporalConvolution does not have padH support and the current implementation of cudnn.TemporalConvolution needs modifications to support cudnn.convert in R4
  • ~~nn.SpatialBatchNormalization does not support 5D inputs in R4~~
  • ~~nn.SpatialConvolution and cudnn.SpatialConvolution in R3 does not support noBias() (will cause error on conversion)~~
  • nn.SpatialConvolution does not support groups (will cause error on cudnn.convert cudnn -> nn)
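
To make the last item concrete, here is a minimal sketch of the failing direction, assuming a CUDA-enabled install (the module names are real; the pcall is only there to show where the error would surface):

```lua
require 'nn'
require 'cunn'
require 'cudnn'

-- grouped convolution exists on the cudnn side (last constructor argument = groups) ...
local net = nn.Sequential()
net:add(cudnn.SpatialConvolution(64, 64, 3, 3, 1, 1, 1, 1, 2))  -- groups = 2
net:cuda()

-- ... but nn.SpatialConvolution has no groups argument, so converting the grouped
-- layer back with cudnn.convert(net, nn) is expected to error, per the item above.
local ok, err = pcall(cudnn.convert, net, nn)
print(ok, err)
```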

szagoruyko · Jan 26 '16 19:01

  • ~~nn.SpatialBatchNormalization has running_var, cudnn.SpatialBatchNormalization has running_std~~ fixed in R5

szagoruyko · Jan 31 '16 15:01

@szagoruyko nn.SpatialConvolution now supports :noBias()
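
For reference, a small sketch (not from the original comment) of what :noBias() does on the nn side, so a bias-free layer survives conversion:

```lua
require 'nn'

-- noBias() drops the bias and gradBias tensors and returns the module itself
local conv = nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1):noBias()
print(conv.bias)                                      -- nil after noBias()
print(conv:forward(torch.randn(1, 3, 8, 8)):size())   -- 1x16x8x8
```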

apaszke · Apr 23 '16 12:04

@apaszke thanks, updated the comments

szagoruyko · Apr 23 '16 18:04

We have observed that cudnn.convert doesn't work for all modules; for example, cudnn.ClippedReLU doesn't get translated into nn despite the mention of API compatibility.

adroit91 · Jul 18 '16 03:07

@adroit91 we could convert ClippedReLU to HardTanh. @ibmua it should be easy to implement groups with THNN, a simple for loop I think? @fmassa
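
A quick sketch of that ClippedReLU -> HardTanh mapping, hedged: it assumes cudnn.ClippedReLU(ceiling) clamps to [0, ceiling], which nn.HardTanh(0, ceiling) reproduces:

```lua
require 'nn'
require 'cunn'
require 'cudnn'

local ceiling = 6
local clipped  = cudnn.ClippedReLU(ceiling):cuda()  -- min(max(x, 0), ceiling)
local hardtanh = nn.HardTanh(0, ceiling):cuda()     -- clamp(x, 0, ceiling)

local x = torch.CudaTensor(4, 8):uniform(-2 * ceiling, 2 * ceiling)
local y1 = clipped:forward(x:clone())
local y2 = hardtanh:forward(x:clone())
print((y1 - y2):abs():max())  -- expected ~0 if the two modules agree
```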

szagoruyko · Jul 30 '16 22:07

Hi @ibmua. You misunderstand the purpose of nn.Parallel. It is not parallel compute; it is a container pattern that executes parallel branches. It won't be faster, or use CPU threads...
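
A tiny illustration of that point (not from the original thread): nn.Parallel just slices its input along a dimension and runs each sub-module on one slice, sequentially, in the same thread:

```lua
require 'nn'

-- nn.Parallel(inputDim, outputDim): module i gets input:select(inputDim, i),
-- and the outputs are concatenated along outputDim -- no threads, no parallel compute.
local p = nn.Parallel(1, 1)
p:add(nn.Linear(10, 3))
p:add(nn.Linear(10, 3))

local out = p:forward(torch.randn(2, 10))  -- two 10-dim rows, one per branch
print(out:size())                          -- 6 (two 3-dim outputs joined)
```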

soumith · Aug 03 '16 02:08

@ibmua wrt the performance variance as you change the number of groups: CPU/GPU performance is not always linear wrt the amount of compute. If you have very little work, it also does not use the GPU compute fully, which is for example what I suspect is happening in the groups=2 vs groups=4 case :)

soumith · Aug 03 '16 16:08

So I've been trying to research the grouped-convolution theme, but just found out today that these guys have already gone deep into this with a lot of hardware: https://arxiv.org/pdf/1605.06489v1.pdf It proves the high importance of groups, especially on a CPU. I'm sure they'd get a comparable actual speedup on GPUs; my guess is that it's not that large only because cuDNN's implementation of them is crappy. I've made a plain non-Winograd kernel for a fully-grouped forward pass that was ~20x faster than cuDNN v5.1, at least on many tests I've tried. cuDNN is just not optimized for groups, especially large ones. My bet is that their actual CPU speedup is also modest compared to what's possible.

I wanted to write the kernel and supporting code for Torch, but the data structures are almost completely undocumented, and from the rest of the source code I can't make out what's being done. The code is a define upon a define, and where things are defined is completely unclear; the whole thing is just a mess that's impossible to comprehend.

Edit: Oh, so I've looked at the code https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua and I see you're actually simulating grouped convolutions by launching kernels consecutively; they're not actually a part of cuDNN. That explains it.
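
For anyone following along, here is a hedged nn-level sketch of that same "one ordinary convolution per group, run consecutively" idea (the groupedConv helper is hypothetical, not an existing API):

```lua
require 'nn'

-- Hypothetical helper: emulate a grouped convolution by slicing the input channels,
-- running one plain nn.SpatialConvolution per group, and joining the results.
local function groupedConv(nIn, nOut, kW, kH, dW, dH, padW, padH, groups)
   assert(nIn % groups == 0 and nOut % groups == 0)
   local branches = nn.ConcatTable()
   for g = 1, groups do
      local branch = nn.Sequential()
      -- channel slice for this group (dim 2 of a batched NCHW input)
      branch:add(nn.Narrow(2, (g - 1) * nIn / groups + 1, nIn / groups))
      branch:add(nn.SpatialConvolution(nIn / groups, nOut / groups, kW, kH, dW, dH, padW, padH))
      branches:add(branch)
   end
   return nn.Sequential():add(branches):add(nn.JoinTable(2))
end

local conv = groupedConv(64, 64, 3, 3, 1, 1, 1, 1, 4)
print(conv:forward(torch.randn(2, 64, 16, 16)):size())  -- 2x64x16x16
```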

I wonder what sense it makes for NVIDIA to close-source cuDNN. I can't see any sanity in that.

Edit 2: Interesting to note that grouped convs are also a form of volumetric local pooling.

ibmua · Sep 15 '16 23:09

Okay, now that ResNeXt is out (https://arxiv.org/pdf/1611.05431.pdf), I'm hoping I'm not the only one here who understands the importance of native grouped convolutions? Groups are exactly the only thing added vs the older ResNet. And it's not groups=2 or groups=4, it's groups=32 and the like. The current codebase is totally unsuitable for that.

ibmua · Feb 26 '17 00:02

From https://arxiv.org/pdf/1611.05431.pdf : "Performance. For simplicity we use Torch’s built-in grouped convolution implementation, without special optimization. We note that this implementation was brute-force and not parallelization-friendly. On 8 GPUs of NVIDIA M40, training 32×4d ResNeXt-101 in Table 3 takes 0.95s per mini-batch, vs. 0.70s of ResNet-101 baseline that has similar FLOPs. We argue that this is a reasonable overhead. We expect carefully engineered lower-level implementation (e.g., in CUDA) will reduce this overhead. We also expect that the inference time on CPUs will present less overhead. Training the 2×complexity model (64×4d ResNeXt-101) takes 1.7s per mini-batch and 10 days total on 8 GPUs."

A definite knock on your door.

Really flatters me that I've been researching the very same concept as Kaiming & co. throughout Aug-Sept. I couldn't come up with an optimal structure though, while I've tried plenty, partly because I don't have such a shitton of hardware, nor is there any framework with an adequate implementation of grouped convs that would let me try out different things on my comparably much more limited hardware - 2 GPUs in total (tried on CIFAR only, of course; no way I could run ImageNet). =( I wonder how many failed attempts with slightly different structures they've had along the way. =) And I totally wonder why they didn't hire someone like Scott Gray to implement grouped convs, which would probably have cost less than the additional processing power did. I wanted to implement the thing myself at the CUDA level, without the Winograd optimization, and even learned CUDA for that very purpose, but all of the existing frameworks turned out too opaque for me to potentially integrate any code. Also, I remember some were probably not very CUDA-friendly in terms of the way the data was formatted in them; I think Torch was one of those. Scott can probably overcome that problem, as I recall he wanted to write some fast kernels for very small batches, which might have a common solution with these problems of requesting data from the GPU's RAM.

ibmua · Feb 26 '17 00:02

Actually, taking a closer look, Kaiming's paper doesn't have a lot of novelty vs https://arxiv.org/pdf/1605.06489v1.pdf which I've already linked to; basically it's a follow-up on that study, more of a confirmation on the subject. I'm guessing there's quite some room for improvement, and I'm very unsure that his 1->3->1 blocks are actually optimal, since 1x1 convs are extremely GPU-RAM-throughput-hungry and having more channels also consumes a lot more memory. While I was researching this very same thing I considered those implications and was quite discouraged myself for those reasons. Kaiming, on the other hand, completely ignores that issue in his paper, as well as the fact that what he's comparing is in fact a wider ResNet with groups to a narrower one without. Not his best paper, IMHO. But still, it proved the point that fast grouped convs are completely necessary.

ibmua · Feb 26 '17 19:02

NVIDIA said they're planning to release some implementation of groups in their next cuDNN.

ibmua · Feb 28 '17 19:02

So grouped convs are now available in cuDNN v7: https://developer.nvidia.com/cudnn
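
Usage at the Lua level would presumably stay the same as the existing groups argument; a hedged sketch, assuming bindings that forward the group count to cuDNN v7's native grouped convolution:

```lua
require 'cunn'
require 'cudnn'

-- Same constructor as before: the 9th argument is the group count. With cuDNN v7,
-- this could map onto the library's native grouped convolution instead of a per-group loop.
local conv = cudnn.SpatialConvolution(256, 256, 3, 3, 1, 1, 1, 1, 32):cuda()
local out  = conv:forward(torch.CudaTensor(8, 256, 14, 14):normal())
print(out:size())  -- 8x256x14x14
```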

ibmua · Aug 16 '17 03:08