RFE: Add device-wide segmented scan primitives
Using CUB device-wide primitives, I wrote an implementation of the algorithm described in "Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions" by John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra for computing the Euclidean projection of a vector onto the ℓ1-ball of a given radius.
A reference implementation of the algorithm written in MATLAB by John Duchi is available at: https://web.stanford.edu/~jduchi/projects/DuchiShSiCh08/ProjectOntoL1Ball.m
One of the steps of the algorithm is to compute a cumulative sum (inclusive-sum) of a vector:
sv = cumsum(u);
If there is only one vector, this can be accomplished using cub::DeviceScan::InclusiveSum(). For multiple vectors, it would be nice to have available a device-wide segmented inclusive sum operation.
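For reference, a minimal sketch of the single-vector case using the usual CUB two-call pattern (the first call with a null workspace pointer queries the required temporary storage size, the second call runs the scan); the pointer and function names here are illustrative:

#include <cub/cub.cuh>

// Compute d_sv = inclusive prefix sum of d_u for a single vector of n floats.
// d_u and d_sv are device pointers.
void inclusive_sum(const float* d_u, float* d_sv, int n)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    // First call: determine how much temporary storage is needed.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_u, d_sv, n);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Second call: perform the scan.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_u, d_sv, n);
    cudaFree(d_temp_storage);
}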
Have you considered thrust::inclusive_scan_by_key? [1]
[1] https://thrust.github.io/doc/group__segmentedprefixsums.html
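For context, a minimal sketch of the segmented case in Thrust, where segments are encoded as runs of equal consecutive keys (the key and value data below are made up for illustration):

#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    // Two segments (lengths 3 and 4) encoded as runs of equal keys.
    int   keys_h[] = {0, 0, 0, 1, 1, 1, 1};
    float vals_h[] = {1, 2, 3, 4, 5, 6, 7};
    thrust::device_vector<int>   keys(keys_h, keys_h + 7);
    thrust::device_vector<float> vals(vals_h, vals_h + 7);
    thrust::device_vector<float> out(7);

    // The scan restarts wherever the key changes:
    // out = {1, 3, 6,   4, 9, 15, 22}
    thrust::inclusive_scan_by_key(keys.begin(), keys.end(),
                                  vals.begin(), out.begin());
    return 0;
}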
One reason for wanting a CUB primitive is that it would probably be much faster than Thrust. The CUB-versus-Thrust performance comparison on the cub::DeviceScan reference page shows CUB well ahead of Thrust's prefix sum operations. CUB's interface is also much nicer with respect to memory management, since the caller allocates and owns the temporary storage instead of the library allocating it internally on every call.
I second this request.
What would your preferred interface be? An auxiliary array of dense segment offsets? A pair of auxiliary arrays (one for segment-begins, the other for segment-ends, which could be aliased to the same array)? A companion array of flags?
I would be happy with the same interface as the segmented reductions: a pair of offset arrays (segment-begins and segment-ends) that could be aliased to the same array.
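For concreteness, a hypothetical declaration shaped like the existing cub::DeviceSegmentedReduce::Sum; nothing named DeviceSegmentedScan exists in CUB today, so this is only the requested shape of the API, not a real entry point:

// Hypothetical API, modeled on cub::DeviceSegmentedReduce::Sum.
// Segment i spans [d_begin_offsets[i], d_end_offsets[i]) in d_in.
// For contiguous segments, d_end_offsets can alias d_begin_offsets + 1,
// so a single offsets array of length num_segments + 1 suffices.
cudaError_t DeviceSegmentedScan::InclusiveSum(
    void*        d_temp_storage,     // null on the first (sizing) call
    size_t&      temp_storage_bytes,
    const float* d_in,
    float*       d_out,
    int          num_segments,
    const int*   d_begin_offsets,
    const int*   d_end_offsets);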
+1; @ekelsen and I are both very interested in this from the TensorFlow side.
And to throw one more thing on the pile, it would be great to have at least the option of deterministic results, even for pseudo-associative operations like floating-point addition. That might require the two-pass algorithm instead of the one-pass, but scan is rarely a performance-limiting step in numerical calculations, so the option to trade some performance for reproducibility would be welcome.