Alternative grad sample algorithm for Conv
Implementation of convolution backward with a convolution. Original implementation due to @zou3519 (https://gist.github.com/zou3519/080f3a296f190ea1730d97396d5267d6).
The original code has been extended to handle the general case (i.e., groups, dilation and stride).
There is still one minor problem that I couldn't find a nice solution to: in some cases, the backward will produce a grad sample that is slightly bigger than the correct one (e.g. a kernel of size 3 with stride 2 and an input of size 6). The current solution is to just drop the extra trailing entries along the spatial dimensions (line 52 in grad_sample/conv.py)
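For readers landing here, this is a minimal sketch of the trick, modeled on `torch.nn.grad.conv2d_weight` and assuming `groups=1` (the function name and signature are illustrative, not the PR's actual code). It folds the batch into the group dimension, uses the backprops as kernels, swaps stride and dilation, and trims the result as discussed above:

```python
import torch
import torch.nn.functional as F

def conv2d_grad_sample(x, backprops, kernel_size, stride=1, padding=0, dilation=1):
    """Per-sample Conv2d weight grads via one grouped convolution (groups=1 sketch).

    x: layer input (B, C_in, H, W); backprops: grad w.r.t. output (B, C_out, H_out, W_out).
    """
    B, C_in = x.shape[:2]
    C_out = backprops.shape[1]
    kH, kW = kernel_size

    # Fold the batch into the group dimension: every (sample, input-channel)
    # pair becomes its own group, and the backprops act as the kernels.
    g = backprops.repeat(1, C_in, 1, 1).reshape(B * C_in * C_out, 1, *backprops.shape[2:])
    x = x.reshape(1, B * C_in, *x.shape[2:])

    # In the backward convolution, stride and dilation swap roles.
    gw = F.conv2d(x, g, None, dilation, padding, stride, groups=B * C_in)
    gw = gw.reshape(B, C_in, C_out, *gw.shape[2:]).transpose(1, 2)

    # The spatial size can exceed the true kernel by up to (stride - 1);
    # drop the trailing rows/columns (the trimming discussed above).
    return gw.narrow(3, 0, kH).narrow(4, 0, kW)
```

A quick sanity check is to compare the output against gradients computed sample-by-sample with autograd.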
@samdow has an implementation of this (a per-sample-grad rule for conv that uses conv) in her per-sample-gradients prototype. Maybe she can shed some light on this edge case?
You guys totally have the right solution: drop anything that's too big from the right or bottom. Here's the reasoning I have for it; let me know if it doesn't make sense or if it would help to draw some pictures:
For strides larger than 1, there's a range of kernel sizes that will give us the same output size. This is because the kernel doesn't have enough space to tile another matmul.
For instance, in the example you give of kernel size 3, stride 2, input size 6 (assuming no padding and a single group 😃 ), we run one matmul with the top-left corner of the kernel at (0,0) and the bottom-right at (2,2), and one with the top-left at (0,2) and the bottom-right at (2,4). We can't run another because its top-left would be at (0,4) and the kernel wouldn't fit on the image. The same happens as we move down the rows, so we do 2 matmuls per row.
The thing to notice here is that we would also end up doing 2 matmuls per row with a kernel of size 4: one with the top-left corner of the kernel at (0,0) and the bottom-right at (3,3), another with the top-left at (0,2) and the bottom-right at (3,5). So in both cases the output size of the convolution is [2,2].
When we're doing the backward pass, both kernel sizes present exactly the same picture (an output of size 2, an input of size 6, stride 2, no padding, a single group). So we compute the derivative with respect to the kernel as if the kernel were of the largest possible size that could have produced these outputs. When the actual kernel is smaller than that, the extra elements on the right have no real meaning, they're just garbage: those kernel elements didn't exist in the first place, so they contributed nothing to the convolution outputs we're looking at.
An important insight here is that the kernel gradient, before dropping the rightmost elements, will be at most (stride - 1) larger than the actual kernel size.
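A quick way to see this in code (illustrative shapes only):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 6, 6)

# Kernels of size 3 and 4 both produce a 2x2 output at stride 2:
for k in (3, 4):
    out = F.conv2d(x, torch.randn(1, 1, k, k), stride=2)
    print(k, tuple(out.shape[2:]))  # -> 3 (2, 2) and 4 (2, 2)

# The backward pass only sees (input=6, output=2, stride=2), so the kernel
# gradient it recovers has the largest compatible size, 4x4. Computing it as
# a convolution with dilation=2 (stride and dilation swap roles) shows this:
g = torch.randn(1, 1, 2, 2)
gw = F.conv2d(x, g, dilation=2)
print(tuple(gw.shape[2:]))  # -> (4, 4); narrow to (3, 3) when the true kernel is 3
```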
Also a small note unrelated to this: we found empirically that there was a batch size threshold at which the unfold-based approach became faster (about 256 in the examples we tried). I know this is experimental, but it might be something interesting for you all to be on the lookout for!
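The crossover point will depend on layer shapes and hardware, but a rough probe in this spirit (toy shapes, crude CPU timing; not the benchmark referenced above) might look like:

```python
import time
import torch
import torch.nn.functional as F

def avg_time(fn, reps=10):
    fn()  # warm-up; use torch.utils.benchmark for anything rigorous
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

C_in, C_out, k, H = 4, 8, 3, 16  # illustrative shapes
for B in (64, 256, 1024):
    x = torch.randn(B, C_in, H, H)
    g = torch.randn(B, C_out, H - k + 1, H - k + 1)  # stride 1, no padding

    def unfold_way():
        cols = F.unfold(x, k)  # (B, C_in*k*k, L)
        return torch.einsum("bol,bil->boi", g.reshape(B, C_out, -1), cols)

    def conv_way():
        w = g.repeat(1, C_in, 1, 1).reshape(B * C_in * C_out, 1, *g.shape[2:])
        return F.conv2d(x.reshape(1, B * C_in, H, H), w, groups=B * C_in)

    print(f"B={B}: unfold {avg_time(unfold_way):.4f}s, conv {avg_time(conv_way):.4f}s")
```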
@samdow thanks for the detailed explanation, that's also what I thought; it seems there is no smarter way to avoid the extra computation, so it's probably good to keep as is. I also ran a quick benchmark and found that for Resnet18 it led to a significant slowdown (around 3x). I'm not sure what's going on; my main guess is that it's overkill to do a "convolution" where we only consider two large maps (activations and backprops) and a few "offsets" (i.e. ±1 in H and W).
@ffuuugor I'm thinking of committing this as an alternative grad_sampler, but keeping the unfold-based one as the default. Do we have a clean way to support multiple grad_samplers? Or should I not commit it, to avoid dead code?
Is there any benefit other than code clarity? Like memory or support for higher dimensions?
If yes, I'd say it's fine to leave it as an alternative grad_sampler - not wrapped in register_grad_sampler, but available for people to register if they want certain trade-offs.
Otherwise, I think it'll be really good material for the future post (the whole saga with multiple attempts and benchmarking), but I'd vote against keeping the code just for historical purposes
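For context, here's a rough sketch of what opting into an alternative sampler looks like with Opacus's `register_grad_sampler` decorator (the exact hook signature varies by Opacus version; the unfold-based body is just a stand-in, assuming `groups=1` and integer padding):

```python
from typing import Dict

import torch
import torch.nn as nn
import torch.nn.functional as F
from opacus.grad_sample import register_grad_sampler

# Re-registering for nn.Conv2d overrides the built-in sampler; the hook
# receives the forward activations and the backprops for the layer output.
@register_grad_sampler(nn.Conv2d)
def my_conv_grad_sampler(
    layer: nn.Conv2d, activations: torch.Tensor, backprops: torch.Tensor
) -> Dict[nn.Parameter, torch.Tensor]:
    # Illustrative body (unfold-based, groups=1); swap in the conv-based
    # computation from this PR to get the alternative algorithm.
    B = activations.shape[0]
    cols = F.unfold(
        activations, layer.kernel_size, layer.dilation, layer.padding, layer.stride
    )
    grads = torch.einsum(
        "bol,bil->boi", backprops.reshape(B, layer.out_channels, -1), cols
    )
    ret = {layer.weight: grads.reshape(B, *layer.weight.shape)}
    if layer.bias is not None:
        ret[layer.bias] = backprops.sum(dim=(2, 3))
    return ret
```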
> I'd vote against keeping the code just for historical purposes
Isn't Opacus-lab the right place for this?
@alexandresablayrolles Could you please sum up and close this PR with the corresponding issue #145?
Decided to keep both algorithms, the original one and this new implementation:
- Both are useful in different cases.
- This can be a good demo of using a custom grad sampler and should probably be covered in the docs
@alexandresablayrolles to finish this change
@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@alexandresablayrolles has updated the pull request. You must reimport the pull request before landing.