[FEA] Further improvements to histogram-based median
A histogram-based median for 2D 8-bit and 16-bit integer images was implemented in #317. There are some areas where it could be further improved in the future.
improved boundary handling
The histogram-based median proposed in #317 currently relies on edge padding to support all of the available boundary modes. Because of this padding, the kernel cannot write directly into a user-provided output array. Ideally we would implement the boundary handling fully inside the kernels, which would avoid the overhead of padding the input before the kernel and cropping the result afterward.
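As a rough illustration of what in-kernel boundary handling would involve, here is a minimal Python sketch of the per-axis index mapping for the non-constant modes. The function name `map_index` is hypothetical, and the mode names and semantics are assumed to follow the `scipy.ndimage` conventions; this is not the code from #317.

```python
def map_index(i, n, mode):
    """Map a possibly out-of-range index i onto [0, n) for an axis of length n.

    Hypothetical helper; mode semantics assumed to match scipy.ndimage:
      reflect: d c b a | a b c d | d c b a
      mirror:  d c b   | a b c d | c b a
      nearest: a a a a | a b c d | d d d d
      wrap:    a b c d | a b c d | a b c d
    """
    if n <= 0:
        raise ValueError("axis length must be positive")
    if mode == "nearest":
        return min(max(i, 0), n - 1)
    if mode == "wrap":
        # Python's % is non-negative for a positive modulus; a CUDA kernel
        # would need an explicit correction for negative i.
        return i % n
    if mode == "reflect":
        period = 2 * n
        i %= period
        return i if i < n else period - 1 - i
    if mode == "mirror":
        if n == 1:
            return 0
        period = 2 * n - 2
        i %= period
        return i if i < n else period - i
    raise ValueError(f"unsupported mode: {mode!r}")


# Indices -2..5 on an axis of length 4, edge-inclusive reflection
print([map_index(i, 4, "reflect") for i in range(-2, 6)])
```

With a mapping like this applied when reading the input, the kernel could index the unpadded image directly and write straight into the caller's output array. The `constant` mode is left out because it substitutes a fill value rather than remapping an index.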
extension to 3D data
It should be possible to extend this concept to 3D, but more care will be required to stay within GPU memory limits. I think one could store histograms on a per-plane rather than per-line basis in that case, which would reduce the number of histogram copies that need to be kept in memory.
Use CUB for Histogram operations
Look into using CUB's BlockScan or similar for the histogram median lookups.
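To make the connection to a scan concrete: looking up the median in a histogram amounts to an inclusive prefix sum over the bin counts followed by a search for the first bin whose cumulative count exceeds the target position. The sketch below shows that logic on the CPU in plain Python, with `itertools.accumulate` standing in for the inclusive sum a block-level scan would compute. The function name `histogram_rank_lookup` is hypothetical and this is not the lookup used in #317.

```python
from bisect import bisect_right
from itertools import accumulate


def histogram_rank_lookup(hist, pos):
    """Return the bin holding the element at 0-based sorted position pos.

    hist[b] is the number of window pixels with value b.
    """
    cumulative = list(accumulate(hist))  # inclusive prefix sum of bin counts
    if not cumulative or not 0 <= pos < cumulative[-1]:
        raise ValueError("pos must lie within the number of counted pixels")
    # Index of the first cumulative count strictly greater than pos
    return bisect_right(cumulative, pos)


# A 3x3 window flattened, with values in a 16-bin range
window = [3, 1, 4, 1, 5, 9, 2, 6, 5]
hist = [0] * 16
for value in window:
    hist[value] += 1

median_pos = len(window) // 2
print(histogram_rank_lookup(hist, median_pos))  # same as sorted(window)[median_pos]
```

On the GPU the prefix sum would be computed cooperatively by the threads of a block, and the search step could be done by having each thread compare its own cumulative count against the target position.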
Extend testing with 16-bit ranges beyond 10-bit depth
Locally, a 12-bit range (e.g. [0, 4095], as is common in DICOM) works, but testing with a range of values larger than 1024 fails on remote tests. We need to investigate why this is.
Also, this same implementation from #317 can be used to accelerate CuPy's percentile_filter or rank_filter just by changing the median_pos variable passed to the kernel to the position that corresponds to the desired rank. I don't think we currently use either of those filters anywhere in cuCIM, though.
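For reference, here is a small sketch of how the position could be derived for each filter type. The helper names are hypothetical, and the conversion rules are my understanding of the convention `scipy.ndimage` uses (negative ranks count from the end, negative percentiles are offset by 100), so they should be checked against SciPy/CuPy before relying on them.

```python
def position_from_rank(filter_size, rank):
    """0-based sorted position for a rank_filter-style rank (hypothetical helper)."""
    if rank < 0:
        rank += filter_size  # e.g. -1 selects the largest element
    if not 0 <= rank < filter_size:
        raise ValueError("rank out of range for this filter size")
    return rank


def position_from_percentile(filter_size, percentile):
    """0-based sorted position for a percentile_filter-style percentile."""
    if percentile < 0:
        percentile += 100.0
    if not 0 <= percentile <= 100:
        raise ValueError("invalid percentile")
    if percentile == 100:
        return filter_size - 1
    return int(filter_size * percentile / 100.0)


filter_size = 3 * 3                                # number of pixels in a 3x3 footprint
print(filter_size // 2)                            # median position
print(position_from_percentile(filter_size, 50))   # same position via percentile
print(position_from_rank(filter_size, -1))         # maximum filter
```

Under these assumptions the kernel itself would stay unchanged; only the value passed as median_pos differs between the median, rank, and percentile cases.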