
Add support for grouped convolutions

Open · arrufat opened this issue on Jan 13 '22 · 10 comments

In this PR I will try to add support for grouped convolutions in dlib. I never had any interest in or use for this kind of convolution until yesterday, when I read this paper: A ConvNet for the 2020s. The paper explores the main design ideas behind Transformer networks and adds them to a convolutional network. It makes use of recent additions to dlib:

  • https://github.com/davisking/dlib/pull/2204
  • https://github.com/davisking/dlib/pull/2213

Unfortunately, it also makes use of grouped convolutions, which are not currently supported in dlib. That was the motivation I needed. So far I've written:

  • [x] forward gpu
  • [x] backward gpu
  • [ ] forward cpu
  • [ ] backward cpu

The GPU part is relatively easy, since it's just a matter of using the cuDNN API. The CPU part might take longer to complete (I don't think I'll ever use it, but I will try to add it for completeness).
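For reference, the cuDNN side boils down to one extra call, cudnnSetConvolutionGroupCount, on the convolution descriptor. A minimal sketch of the relevant setup (error handling omitted; this is illustrative, not the PR's actual code):

```cpp
#include <cudnn.h>

// Minimal sketch: configuring a grouped convolution in cuDNN.
// The only grouped-convolution-specific step is cudnnSetConvolutionGroupCount;
// everything else is the usual descriptor setup. Error checks omitted.
void setup_grouped_conv(cudnnConvolutionDescriptor_t& conv_desc,
                        cudnnFilterDescriptor_t& filter_desc,
                        int out_channels, int in_channels, int groups,
                        int kernel_h, int kernel_w,
                        int pad_h, int pad_w, int stride_h, int stride_w)
{
    cudnnCreateConvolutionDescriptor(&conv_desc);
    cudnnSetConvolution2dDescriptor(conv_desc, pad_h, pad_w, stride_h, stride_w,
                                    1, 1, // dilation
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
    // This is the grouped convolution switch:
    cudnnSetConvolutionGroupCount(conv_desc, groups);

    // Note the filter's input-channel dimension is in_channels/groups.
    cudnnCreateFilterDescriptor(&filter_desc);
    cudnnSetFilter4dDescriptor(filter_desc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               out_channels, in_channels / groups,
                               kernel_h, kernel_w);
}
```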

I've already implemented the ConvNeXt models described in the paper, and the forward pass seems to work. Let me know if the approach is sensible.
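For context, the block from the paper maps onto dlib's DSL roughly as follows. This is only a sketch, not the PR's code: it assumes dlib's layer_norm and gelu layers (the recent additions linked above), and uses a plain dense 7x7 con as a stand-in for the depthwise (grouped) convolution this PR adds:

```cpp
#include <dlib/dnn.h>
using namespace dlib;

// ConvNeXt block: 7x7 depthwise conv -> LayerNorm -> 1x1 conv (4x
// expansion) -> GELU -> 1x1 conv -> residual connection.
// NOTE: the dense 7x7 con below is only a stand-in; the real block
// needs the depthwise (grouped, groups == dim) convolution this PR adds.
template <long dim, typename SUBNET>
using depthwise_stand_in = con<dim, 7, 7, 1, 1, SUBNET>;

template <long dim, typename SUBNET>
using convnext_block = add_prev1<
                       con<dim, 1, 1, 1, 1,
                       gelu<
                       con<4 * dim, 1, 1, 1, 1,
                       layer_norm<
                       depthwise_stand_in<dim,
                       tag1<SUBNET>>>>>>>;
```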

arrufat · Jan 13 '22

Yeah, that paper is nice. I like these new simple building blocks. I hope they do actually work well beyond simple ResNet modules.

pfeatherstone · Jan 16 '22

Warning: this issue has been inactive for 35 days and will be automatically closed on 2022-03-02 if there is no further activity.

If you are waiting for a response but haven't received one, it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.

dlib-issue-bot · Feb 21 '22

I got carried away with other stuff; I will finish this at some point, though :sweat_smile:

arrufat · Feb 21 '22

Can this be implemented in the Toeplitz matrix?

pfeatherstone · Mar 15 '22

Can this be implemented in the Toeplitz matrix?

I was planning to take inspiration from here. But feel free to step in; I don't think I will find time to do this soon, sadly enough :(
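For the record, the CPU forward pass reduces to running one dense convolution per group over a slice of the channels. A naive, self-contained sketch of the idea (NCHW layout, stride 1, no padding; illustrative only, not dlib code):

```cpp
#include <cstddef>
#include <vector>

// Naive grouped convolution forward pass (cross-correlation, as is the
// deep learning convention). Each of the `groups` groups convolves
// in_c/groups input channels with out_c/groups filters; a plain dense
// conv is the special case groups == 1.
void grouped_conv_forward(const std::vector<float>& input,   // [in_c][h][w]
                          const std::vector<float>& filters, // [out_c][in_c/g][kh][kw]
                          std::vector<float>& output,        // [out_c][oh][ow]
                          int in_c, int h, int w,
                          int out_c, int kh, int kw, int groups)
{
    const int ic_g = in_c / groups;   // input channels per group
    const int oc_g = out_c / groups;  // output channels per group
    const int oh = h - kh + 1, ow = w - kw + 1;
    output.assign((size_t)out_c * oh * ow, 0.f);

    for (int g = 0; g < groups; ++g)
    for (int oc = 0; oc < oc_g; ++oc)             // filter within this group
    for (int y = 0; y < oh; ++y)
    for (int x = 0; x < ow; ++x)
    {
        float acc = 0.f;
        for (int ic = 0; ic < ic_g; ++ic)         // only this group's channels
        for (int fy = 0; fy < kh; ++fy)
        for (int fx = 0; fx < kw; ++fx)
        {
            const int in_ch = g * ic_g + ic;
            acc += input[((size_t)in_ch * h + y + fy) * w + x + fx]
                 * filters[((((size_t)g * oc_g + oc) * ic_g + ic) * kh + fy) * kw + fx];
        }
        output[(((size_t)g * oc_g + oc) * oh + y) * ow + x] = acc;
    }
}
```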

arrufat · Mar 16 '22

I have a CPU implementation, but I have to go through some company bureaucracy to be allowed to upload it.

rTreutlein · Mar 25 '22

Just noticed the Toeplitz matrix isn't cached in the tensor_conv class. So there are a lot of allocations and deallocations. Right?
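To spell out what caching would look like here: keep the buffer as a member and grow it only when needed (hypothetical names, not dlib's actual internals):

```cpp
#include <vector>

// Hypothetical illustration of the caching idea: keep the im2col buffer
// as a member so repeated forward calls reuse the same allocation.
class cached_conv
{
public:
    void forward(/* ... tensors ... */ size_t required_size)
    {
        // resize only grows the buffer when needed; subsequent calls with
        // the same geometry perform no allocation at all.
        if (im2col_buffer.size() < required_size)
            im2col_buffer.resize(required_size);
        // ... fill im2col_buffer and run the GEMM as before ...
    }
private:
    std::vector<float> im2col_buffer; // reused across calls
};
```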

pfeatherstone · Mar 29 '22

Just noticed the Toeplitz matrix isn't cached in the tensor_conv class. So there are a lot of allocations and deallocations. Right?

There are, but there are other allocations and deallocations too. I doubt that one makes a meaningful difference to runtime speed considering everything else that goes on when someone is using this kind of model.

davisking · Mar 30 '22

@davisking OK, I trust you. I had a quick look at the ncnn repo, and they have something like 100 specialisations of the convolution layer for different parameters: kernel sizes, groups, whether or not the layer is 8-bit quantized, different architectures, and so on. It looks like way too much work to do something similar in dlib to get CPU conv performance up to standard. Sorry, this is unrelated to this PR; just a passing observation.

pfeatherstone · Mar 31 '22

Yeah, that's how you do conv fast on the CPU. The Toeplitz matrix thing is a weird hack. I did it (as did others) because it's just a super easy way to support all the various conv types with all their different strides and all that. But fast conv code looks like this kind of stuff: https://github.com/davisking/dlib/blob/master/dlib/image_transforms/spatial_filtering.h#L126. Or other similar kinds of setups. It depends on which kind of conv we are talking about.
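In other words: direct loops over the output with the kernel size known at compile time, instead of materializing a Toeplitz/im2col matrix first. A simplified sketch of that general pattern (not the actual dlib code):

```cpp
// Simplified sketch of a direct 2D correlation, the general pattern used
// by fast CPU conv code: iterate the output, keep the small fixed-size
// kernel hot in registers/cache, no intermediate Toeplitz/im2col matrix.
template <long KH, long KW>
void direct_filter(const float* in, float* out, long h, long w,
                   const float (&kernel)[KH][KW])
{
    const long oh = h - KH + 1, ow = w - KW + 1;
    for (long y = 0; y < oh; ++y)
    {
        for (long x = 0; x < ow; ++x)
        {
            float acc = 0.f;
            // With KH/KW known at compile time the compiler can fully
            // unroll and vectorize this inner loop.
            for (long fy = 0; fy < KH; ++fy)
                for (long fx = 0; fx < KW; ++fx)
                    acc += in[(y + fy) * w + (x + fx)] * kernel[fy][fx];
            out[y * ow + x] = acc;
        }
    }
}
```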

davisking · Apr 7 '22