cccl [FEA]: Tabulate Output Iterator

Is this a duplicate?

[X] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

Thrust

Is your feature request related to a problem? Please describe.

There's an open question if Kokkos scan API can be implemented in Thrust/CUB. Kokkos provides operator with is_final parameter and an index in input sequence. This allows implementing prefix sum algorithms that pack data into multiple arrays based on runtime index. This is somewhat similar to N-way partition, but is not limited to it.

Describe the solution you'd like

I believe this functionality can be implemented with a new output iterator on Thrust end. This iterator should provide a binary operator with an index in the output sequence and an incoming value:

auto it = make_tabulate_output_iterator([](int index, double value) { print(index, value); });

*(it + 1) = 42; // prints 1, 42
*(it + 2) = 43; // prints 2, 43

This iterator might be used to implement N-way partition. After discussing this with @elstehle, it might also be helpful in our tests of large problem sizes.

Describe alternatives you've considered

No response

Additional context

No response

Mar 05 '24 09:03 gevtushenko

auto it = make_tabulate_output_iterator([](int index, double value) { print(index, value); });
*(it + 1) = 42; // prints 1, 42
*(it + 2) = 43; // prints 2, 43

I understand this example, but I don't grok the connection to the Kokkos scan. Can you give a sketch of how one would use this with scan to do something you can't do today?

Mar 05 '24 13:03 jrhemstad

@jrhemstad here's a use case for is_final that I had in mind:

constexpr int N = 3;
using offsets_t = cuda::std::array<int, N>;
using pointers_t = cuda::std::array<int*, N>;

__host__ __device__ int pack_index(int index) {
  // Some data-dependent computation here
  return index % N; 
}

struct output_op_t {
  int n;
  int *input;
  offsets_t *counts;
  pointers_t pointers;

  __host__ __device__ void operator()(int index, offsets_t offsets) {
    pointers[pack_index(index)][offsets[pack_index(index)]] = input[index];

    if (index == n - 1) {
      counts[0] = offsets;
    }
  }
};

int main() {
  const int n = 8;
  pointers_t pointers;
  thrust::device_vector<int> vi = { 7, 6, 5, 4, 3, 2, 1, 0 };
  cuda::std::array<thrust::device_vector<int>, N> packs;
  thrust::device_vector<offsets_t> counter(1);
  int *pi = thrust::raw_pointer_cast(vi.data());
  offsets_t *pc = thrust::raw_pointer_cast(counter.data());
  for (int i = 0; i < N; i++) {
    packs[i].resize(n);
    pointers[i] = thrust::raw_pointer_cast(packs[i].data());
  }

  auto input_iterator = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                        [] __host__ __device__(int index) {
                                                          offsets_t offsets{};
                                                          offsets[pack_index(index)] = 1;
                                                          return offsets;
                                                        });

  thrust::exclusive_scan(thrust::device,
                         input_iterator,
                         input_iterator + n,
                         make_tabulate_output_iterator(output_op_t{n, pi, pc, pointers}),
                         offsets_t{},
                         [] __host__ __device__(offsets_t lhs, offsets_t rhs) -> offsets_t {
                           offsets_t result{};
                           for (int i = 0; i < N; i++) {
                             result[i] = lhs[i] + rhs[i];
                           }
                           return result;
                         });

  offsets_t counts = counter[0];
  std::cout << counts[0] << " / " << counts[1] << " / " << counts[2] << std::endl;

  for (int j = 0; j < N; j++) {
    for (int i = 0; i < counts[j]; i++) {
      std::cout << "p" << j << "[" << i << "]: " << static_cast<int>(packs[j][i]) << std::endl;
    }
  }
}

// prints:
// 3 / 2 / 2
// p0[0]: 7
// p0[1]: 4
// p0[2]: 1
// p1[0]: 6
// p1[1]: 3
// p2[0]: 5
// p2[1]: 2

The code packs input vector into a number of output buffers based on some query:

This is somewhat similar to N-way partition, but is not limited to it.

I saw is_final being used in this context. We can't place this packing inside binary operator, because we don't know when final accumulator is being computed by CUB. Since we write into output iterator once, transform output iterator is equivalent to injecting user code under is_final. What's missing is the output index. We can't solve this using transform output iterator, because we don't know the index in the output sequence.

I can imagine other use cases, for instance, discarding values for invalid elements.

Mar 05 '24 23:03 gevtushenko

@karthikeyann this is very reiminiscent of what you had proposed a long time ago here: https://github.com/NVIDIA/thrust/pull/1575

I'm curious for your thoughts on this. Would the tabulate_output_iterator described above fit the use case you'd originally had in mind for https://github.com/NVIDIA/thrust/pull/1575 ?

Mar 06 '24 14:03 jrhemstad

Yes. That name would be good fit.

Mar 12 '24 16:03 karthikeyann

cccl cccl copied to clipboard

[FEA]: Tabulate Output Iterator

Is this a duplicate?

Area

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

cccl
cccl copied to clipboard