RFE: Add device-wide segmented scan primitives
Using CUB device-wide primitives, I wrote an implementation of the algorithm described in "Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions" by John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra for computing the Euclidean projection of a vector onto the ℓ1-ball of a given radius.
A reference implementation of the algorithm written in MATLAB by John Duchi is available at: https://web.stanford.edu/~jduchi/projects/DuchiShSiCh08/ProjectOntoL1Ball.m
One of the steps of the algorithm is to compute a cumulative sum (inclusive-sum) of a vector:
sv = cumsum(u);
If there is only one vector, this can be accomplished using cub::DeviceScan::InclusiveSum(). For multiple vectors, it would be nice to have available a device-wide segmented inclusive sum operation.
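For reference, a minimal sketch of the single-vector case using the usual CUB two-call pattern (the first call with a null workspace pointer queries the required temporary storage size, the second call runs the scan); the pointer and function names here are illustrative:

#include <cub/cub.cuh>

// Compute d_sv = inclusive prefix sum of d_u for a single vector of n floats.
// d_u and d_sv are device pointers.
void inclusive_sum(const float* d_u, float* d_sv, int n)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    // First call: determine how much temporary storage is needed.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_u, d_sv, n);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Second call: perform the scan.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_u, d_sv, n);
    cudaFree(d_temp_storage);
}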
Have you considered thrust::inclusive_scan_by_key? [1]
[1] https://thrust.github.io/doc/group__segmentedprefixsums.html
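For context, a minimal sketch of the segmented case in Thrust, where segments are encoded as runs of equal consecutive keys (the key and value data below are made up for illustration):

#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    // Two segments (lengths 3 and 4) encoded as runs of equal keys.
    int   keys_h[] = {0, 0, 0, 1, 1, 1, 1};
    float vals_h[] = {1, 2, 3, 4, 5, 6, 7};
    thrust::device_vector<int>   keys(keys_h, keys_h + 7);
    thrust::device_vector<float> vals(vals_h, vals_h + 7);
    thrust::device_vector<float> out(7);

    // The scan restarts wherever the key changes:
    // out = {1, 3, 6,   4, 9, 15, 22}
    thrust::inclusive_scan_by_key(keys.begin(), keys.end(),
                                  vals.begin(), out.begin());
    return 0;
}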
One reason for wanting a CUB primitive is that it would probably be much faster than Thrust. The CUB-versus-Thrust performance comparison on the cub::DeviceScan reference page shows CUB well ahead of Thrust's prefix sum operations. CUB's interface is also much nicer with respect to memory management, since the caller allocates and owns the temporary storage instead of the library allocating it internally on every call.
I second this request.
What would your preferred interface be? An auxiliary array of dense segment offsets? A pair of auxiliary arrays (one for segment-begins, the other for segment-ends, which could be aliased to the same array)? A companion array of flags?
I would be happy with the same interface as the segmented reductions: a pair of offset arrays (segment-begins and segment-ends) that could be aliased to the same array.
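For concreteness, a hypothetical declaration shaped like the existing cub::DeviceSegmentedReduce::Sum; nothing named DeviceSegmentedScan exists in CUB today, so this is only the requested shape of the API, not a real entry point:

// Hypothetical API, modeled on cub::DeviceSegmentedReduce::Sum.
// Segment i spans [d_begin_offsets[i], d_end_offsets[i]) in d_in.
// For contiguous segments, d_end_offsets can alias d_begin_offsets + 1,
// so a single offsets array of length num_segments + 1 suffices.
cudaError_t DeviceSegmentedScan::InclusiveSum(
    void*        d_temp_storage,     // null on the first (sizing) call
    size_t&      temp_storage_bytes,
    const float* d_in,
    float*       d_out,
    int          num_segments,
    const int*   d_begin_offsets,
    const int*   d_end_offsets);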
+1; @ekelsen and I are both very interested in this from the TensorFlow side.
And to throw one more thing on the pile, it would be great to have at least the option of deterministic results, even for pseudo-associative operations like floating-point addition. That might require the two-pass algorithm instead of the one-pass, but scan is rarely a performance-limiting step in numerical calculations, so the option to trade some performance for reproducibility would be welcome.