[FEA] Additional improvements to separable convolution

Open grlee77 opened this issue 3 years ago • 1 comments

Ideas for further improvements to separable convolution

Listing some thoughts for further improvements on top of #355.

Reduce redundant data fetching

It would be nice to adapt the code from #355 to handle more efficiently the case where several separable filters are applied in sequence. For example, a Gaussian filter is typically applied along each axis in turn. The current implementation handles one axis at a time in two stages: (1) load a tile into shared memory, then (2) perform the convolution on the tile.

Thus, when applying a separable filter in the nD case, stage 1 is repeated n times. It would be preferable to do the tiled load just once (taking into account any boundary padding along all dimensions) and then, once the tile is in shared memory, apply each filter in turn along its axis.
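To get a feel for the cost of a single up-front load, here is a minimal sketch of the tile-size arithmetic. It assumes odd, centered 1D filters and a fixed block of output pixels per axis. The names `padded_tile_shape` and `tile_bytes` are hypothetical and are not part of #355.

```python
from math import prod


def padded_tile_shape(block_shape, filter_sizes):
    """Shape of the tile one block must load to filter along every axis.

    block_shape  -- output pixels per block along each axis (assumed tiling)
    filter_sizes -- length of the 1D filter applied along each axis
    """
    if len(block_shape) != len(filter_sizes):
        raise ValueError("need one filter size per axis")
    shape = []
    for n, size in zip(block_shape, filter_sizes):
        if size % 2 == 0:
            raise ValueError("this sketch assumes odd, centered filters")
        radius = size // 2
        # A halo of `radius` pixels is needed on both sides of the axis.
        shape.append(n + 2 * radius)
    return tuple(shape)


def tile_bytes(block_shape, filter_sizes, itemsize=4):
    """Bytes of shared memory for the padded tile (itemsize=4 for float32)."""
    return prod(padded_tile_shape(block_shape, filter_sizes)) * itemsize


print(padded_tile_shape((8, 16, 16), (5, 5, 5)))  # (12, 20, 20)
print(tile_bytes((8, 16, 16), (5, 5, 5)))         # 19200
```

The halo grows along every axis at once, so the single tile is larger than any of the per-axis tiles. Whether it still fits in shared memory for a given block size and filter length would need to be checked case by case.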

Accept multiple filters and outputs

We could potentially also accelerate other operations, such as gradient and Hessian filters, where a series of small separable filters is applied in various combinations to create multiple outputs. This is another case where it seems like we should be able to do just one load of a tile into shared memory and then apply all filters to it.
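As a CPU-side illustration of the idea (not the CUDA kernel itself), the sketch below reads one 2D tile once and produces two gradient-like outputs from different combinations of two small 1D filters. The helper names, the Sobel-style taps, and the "valid" boundary handling are all assumptions made for the example.

```python
def correlate_rows(tile, taps):
    """Correlate each row of a 2D list with `taps`, keeping valid positions only."""
    k = len(taps)
    return [[sum(row[j + t] * taps[t] for t in range(k))
             for j in range(len(row) - k + 1)]
            for row in tile]


def transpose(tile):
    return [list(col) for col in zip(*tile)]


def correlate_cols(tile, taps):
    """Same as correlate_rows, but along axis 0."""
    return transpose(correlate_rows(transpose(tile), taps))


def apply_filter_sets(tile, filter_sets):
    """Apply several separable filters to one already-loaded tile.

    filter_sets maps an output name to (taps along axis 0, taps along axis 1).
    """
    outputs = {}
    for name, (taps0, taps1) in filter_sets.items():
        outputs[name] = correlate_rows(correlate_cols(tile, taps0), taps1)
    return outputs


smooth = [1, 2, 1]   # small smoothing filter
deriv = [-1, 0, 1]   # small derivative filter

# The tile is "loaded" once: a 4x4 ramp f(y, x) = x + 10 * y.
tile = [[x + 10 * y for x in range(4)] for y in range(4)]

out = apply_filter_sets(tile, {
    "d_axis0": (deriv, smooth),   # derivative along axis 0, smoothing along axis 1
    "d_axis1": (smooth, deriv),   # smoothing along axis 0, derivative along axis 1
})
print(out["d_axis0"])  # [[80, 80], [80, 80]]
print(out["d_axis1"])  # [[8, 8], [8, 8]]
```

In a kernel, the dictionary of filter sets would correspond to the list of filters and output arrays passed in, with the tile living in shared memory rather than a Python list.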

Extend to multi-channel image support

Right now the kernels only run on scalar pixel types, but we could potentially adapt them to float2, float3, float4, or an arbitrary number of channels.

Extend to nD

The 3D case could be adapted to a general nD implementation. We can only use up to 3 grid/block dimensions in CUDA, but could ravel multiple physical dimensions into one of these.
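The index arithmetic for raveling can be sketched as follows. This is only one possible layout, assuming C-order axes with all leading axes collapsed into a single launch dimension. The names `launch_extents` and `unravel` are hypothetical.

```python
from math import prod


def launch_extents(shape):
    """Collapse all leading axes into one so that at most 3 extents remain."""
    if len(shape) <= 3:
        return tuple(shape)
    return (prod(shape[:-2]),) + tuple(shape[-2:])


def unravel(linear_index, shape):
    """Convert a C-order linear index back into per-axis coordinates."""
    coords = []
    for extent in reversed(shape):
        linear_index, c = divmod(linear_index, extent)
        coords.append(c)
    return tuple(reversed(coords))


shape = (2, 3, 4, 5, 6)              # a 5D image
print(launch_extents(shape))         # (24, 5, 6)

# A thread at raveled index 17 along the collapsed dimension recovers
# its coordinates on the three leading axes:
print(unravel(17, shape[:-2]))       # (1, 1, 1)
```

Inside a kernel the same divmod chain would run once per thread, so the extra cost is a few integer divisions per output pixel.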

Aug 03 '22 16:08 grlee77

Quote from @crisluengo ( https://github.com/rapidsai/cucim/pull/355#issuecomment-1206839904 ):

@grlee77 I haven't been able to read all of the code, but I did a search for "sym" and didn't find anything.

It is usually worth while to make a special case for symmetric filters. For example, instead of (pseudocode) img[i] * filter[i] + img[-i] * filter[-i] (where filter[i] == filter[-i]), you would do (img[i] + img[-i]) * filter[-i], saving about half the multiplications.

In DIPlib we have a special case for both the symmetric case and the antisymmetric case (where filter[i] == -filter[-i], e.g. derivative filters).
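The folding trick described in the quote can be sketched in plain Python as below. This is an illustration of the arithmetic only, not DIPlib's or cuCIM's implementation. The function names and the `sign` parameter are assumptions made for the example.

```python
def correlate_general(signal, taps):
    """Reference 1D correlation with an odd-length filter, valid positions only."""
    r = len(taps) // 2
    return [sum(signal[i + k] * taps[k + r] for k in range(-r, r + 1))
            for i in range(r, len(signal) - r)]


def correlate_folded(signal, half_taps, sign=1):
    """Correlation that exploits filter symmetry.

    half_taps[k] is the tap at offset +k (k = 0..r); the tap at offset -k
    is assumed to be sign * half_taps[k].
    sign=+1: symmetric filter.
    sign=-1: antisymmetric filter (the center tap must then be 0).
    """
    r = len(half_taps) - 1
    out = []
    for i in range(r, len(signal) - r):
        acc = signal[i] * half_taps[0]
        for k in range(1, r + 1):
            # One multiplication per pair of mirrored samples.
            acc += (signal[i + k] + sign * signal[i - k]) * half_taps[k]
        out.append(acc)
    return out


signal = [3, 1, 4, 1, 5, 9, 2, 6]

# Symmetric example: binomial smoothing filter.
sym = correlate_folded(signal, [6, 4, 1], sign=1)
assert sym == correlate_general(signal, [1, 4, 6, 4, 1])

# Antisymmetric example: central-difference derivative filter.
anti = correlate_folded(signal, [0, 1], sign=-1)
assert anti == correlate_general(signal, [-1, 0, 1])
```

For a filter of length 2r + 1 the folded loop uses r + 1 multiplications per output sample instead of 2r + 1, which matches the "about half" estimate in the quote.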

I think we will want to capture this along with the other improvements already noted in this issue.

Aug 05 '22 21:08 jakirkham