Fuse segment partitioning with sorting of small segments
I've found a corner case in cub::DeviceSegmentedSort that can be optimized. If the input contains many unit (single-item) segments, performance is suboptimal: we read the segment offsets twice, and for unit segments the offset reads can outweigh the segment data itself. This is the case for the following matrix:

It contains 226,196,185 rows, with a median of one item per row.
| | seg rad sort | seg sort | ext sort |
|---|---|---|---|
| elapsed | 46 | 0.63 | 0.042 |
The extended sort above consists of using cub::DeviceRadixSort on extended_key = ((segment_id << 32) + key). In general this approach is about three times slower than cub::DeviceSegmentedSort, but in the case above it's much faster: cub::DeviceSegmentedSort is 73 times faster than cub::DeviceSegmentedRadixSort, and cub::DeviceRadixSort with extended keys is 15 times faster than cub::DeviceSegmentedSort.
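For concreteness, here is a minimal sketch of the extended-key approach, assuming 32-bit keys and 32-bit segment ids. The `build_extended_keys` kernel and the precomputed per-item segment ids are illustrative assumptions (in practice the ids would be derived from the segment offsets):

```cpp
#include <cub/cub.cuh>
#include <cstdint>

// Pack each item's segment id into the high 32 bits of a 64-bit key.
// `segment_ids` is an assumed input: one segment id per item, derived
// beforehand from the segment offsets.
__global__ void build_extended_keys(const std::uint32_t *keys,
                                    const std::uint32_t *segment_ids,
                                    std::uint64_t *extended_keys,
                                    std::size_t num_items)
{
  std::size_t i =
    blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
  if (i < num_items)
  {
    extended_keys[i] =
      (static_cast<std::uint64_t>(segment_ids[i]) << 32) | keys[i];
  }
}

cudaError_t extended_sort(std::uint64_t *d_keys_in,
                          std::uint64_t *d_keys_out,
                          int num_items)
{
  // A single device-wide radix sort of the fused (segment_id, key) values
  // orders items by segment first and by key within each segment.
  void *d_temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                 d_keys_in, d_keys_out, num_items);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cudaError_t status =
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
  cudaFree(d_temp_storage);
  return status;
}
```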
The initial idea would be to fuse the three-way partitioning kernel with sorting of small segments (cub::StableOddEvenSort should be fine), just like in the BatchedMemcpy PR. That way, the downstream kernels would never have to process segments of up to two or so items; a sketch follows below.
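A rough sketch of what the fused step could look like. All names here are hypothetical, not the actual CUB kernels, and the bucketing of larger segments is elided:

```cpp
#include <cstdint>

// Hypothetical device-side helper for the fused partitioning kernel:
// while classifying a segment by size, sort it in place if it is tiny,
// so no downstream kernel ever sees it. With a slightly larger threshold
// the in-place step could use cub::StableOddEvenSort on a thread-local
// array instead of the explicit compare-and-swap below.
__device__ void classify_or_sort(std::uint32_t *keys,
                                 std::uint32_t begin,
                                 std::uint32_t end)
{
  const std::uint32_t size = end - begin;

  if (size <= 2)
  {
    // Segments of zero or one item are already sorted; a two-item
    // segment needs at most one compare-and-swap.
    if (size == 2 && keys[begin] > keys[begin + 1])
    {
      const std::uint32_t tmp = keys[begin];
      keys[begin]             = keys[begin + 1];
      keys[begin + 1]         = tmp;
    }
    // Nothing is emitted: the segment is fully handled here.
  }
  else
  {
    // Emit the segment into the small/medium/large partition consumed
    // by the existing DeviceSegmentedSort kernels (elided).
  }
}
```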