cccl [FEA]: Add a new shuffle iterator to thrust

Is this a duplicate?

[x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

Thrust

Is your feature request related to a problem? Please describe.

I would like to be able to generate a (random access) thrust shuffle "iterator" type that allows me to lazily compute shuffle indices. This is useful for several different applications:

Random sampling, we only need the first N shuffle indices
Distributed shuffle, each node only computes the shuffle indices for it's elements
Using with other iterator types. e.g. permutation_iterator

Describe the solution you'd like

I would like a make_shuffle_iterator function. This can be implemented as a composition of:

counting iterator
transform iterator
the existing feistel_bijection used internally for thrust::shuffle
a new iterate_until_in_range functor

template<class URNG>
auto make_shuffle_iterator(size_t num_elements, URNG&& g) {
  return thrust::transform_iterator(thrust::counting_iterator(0ull), iterate_until_in_range(num_elements, feistel_bijection(num_elements, g)));
}

And iterate_until_in_range can be implemented as follows:

struct iterate_until_in_range {
  size_t num_elements;
  feistel_bijection bijection;

  size_t operator()(size_t x) {
    do { x = bijection(x); } while(x >= num_elements);
    return x;
  }
}

Describe alternatives you've considered

Alternatively, we could just expose the feistel_bijection or iterate_until_in_range (with a name like random_bijection) to allow users to implement this composition themselves.

Additional context

No response

Apr 23 '25 02:04 djns99

I am happy to implement this myself, but I would appreciate some design guidance on what the most suitable API would be. I couldn't find anything already in thrust with quite the same behaviour

Apr 23 '25 02:04 djns99

Hi @djns99 thanks a lot for proposing this feature, I do happen to often want something like that for tests.

However, I see some potential pitfalls:

c++ commonly assumes that operator[] on a random access iterator has constant amortized cost, which would not be the case here
I am worried about multiple threads incrementing the same iterator and hammering the RNG
I believe this is something that is highly dependent on the explicit use case. How many elements do I want. if there are only few precomputing might be beneficial, my guestimate would be something around 10 - 20 elements would be better off with inplace storage of the indices.

Nothing here is blocking but something to keep in mind

I would like to get the opinion of the other maintainers

Apr 23 '25 08:04 miscco

Hi @miscco, Actually this function does have a constant cost, each element is entirely independent and can be computed in any order. The RNG is only invoked once on construction and not used during the iteration either.

Apr 23 '25 08:04 djns99

I haven't done the formal analysis, in the worst case one call to iterate_until_in_range can take O(n) time. However, for each iteration it can be roughly modelled as a (worst case) 1/2 chance of needing to iterate another time. So the chance of doing x iterations is proportional to 1/2^x for any given element. (Assuming a high quality bijection which I think we have)

Apr 23 '25 08:04 djns99

Hi @djns99

Thanks a lot for expressing interest in contributing a shuffle iterator! This would definitely be a very welcome contribution! As Michael mentioned, I mostly have testing scenarios in mind.

I only have a very vague understanding of Feistel Networks and found its property of being able to generate a random bijective mapping for power-of-two sizes very attractive.

Apr 23 '25 10:04 elstehle

To elaborate on how this works, feistel_bijection works on a power of two. To generalize this to a non power of two, we round up to a power of two and then use iterate_until_in_range, which simply calls the bijection again with its result if it is outside of the desired range.

The basic proof is this:

We have the target range [0, n) and the domain [0, 2^m) with n <= 2^m. A bijection has two properties: Every input value has a unique output (injective), and every output can be reached (surjective). Because of these properties if you repeatedly chain a bijective function with its previous output then it must eventually form a cycle starting with your original value. We can thus make the following statements:

Since the original value is always in the target range, this means we will always terminate will a value in range (i.e. in the worst case cycling back to the original value)
This will not make any duplicates because no intermediate value on the chain can be reached by starting in the target range, without first going through the original value which would be in range and end the chain

While that's probably not the clearest explanation, the point is we can show that this is still a bijection, now on the range [0, n).

The cons:

The concern here (which is why we used an inclusive scan when we implemented thrust::shuffle) is that in the worst case the number of iterations for a single element could be 2^m - n (which is at most n - 1 if you round to nearest power 2). So in the case of thrust::shuffle we did not accept this non-determinism, because performance with the inclusive scan was still entirely limited by moving the actual array elements. In the usual case this cost is amortized across all elements (e.g. every element does ~2 iterations)

The pros

It is exceedingly unlikely that for a large n that there will be a single cycle that traverses all the out of range elements without landing in range earlier. My comment above about 1/2^x is an upper bound on that probability, the actual probability should be smaller still. The upside of this approach is now each value can be computed entirely independent with no communication/dependencies.

So in the case of the iterator where the goal is to be able to compute the result lazily I think it is perfectly fine to use this approach.

Hopefully this helps clear it up

Apr 23 '25 10:04 djns99