[FEA] Faster path for calculating total output symbols in FST
Is your feature request related to a problem? Please describe.
Create a faster path for calculating the total number of output symbols in the FST, without redundantly computing intermediate values such as the state-transition vectors and write offsets.
Also, if the impact is significant, provide a way for the FST kernel to avoid redundantly computing the total number of output symbols when it is passed a discard iterator.

Describe the solution you'd like
A reduce kernel that computes the total number of output symbols.

Describe alternatives you've considered
Used `make_discard_iterator` in https://github.com/rapidsai/cudf/pull/16978. The compiler should optimize away the redundant parts, but some are still computed. @elstehle showed this has a ~10% runtime impact; the impact on the FST alone would be much larger.

Additional context
https://github.com/rapidsai/cudf/pull/16978#issuecomment-2408909217 — the benchmark shows an additional ~10% impact on overall JSON reader runtime.
After discussions with @elstehle, here's a brief overview of two possible approaches considered:
- Approach 1: Split `SimulateDFAKernel` after the offsets for translated characters are computed
  - Write the offsets to global memory, and read the last element to get the size of the translated buffer
  - Launch another kernel that uses this global offsets array and runs the DFA to translate the input
- Approach 2: Split `SimulateDFAKernel` after the number of translated characters per thread is computed
  - `DeviceReduce` to compute the total translated buffer size
  - In the next kernel, recompute the number of translated characters per thread, i.e., run the DFA simulation again for the start states
  - Prefix sum over the number of translated characters per thread, then run the final DFA
Though approach 2 has fewer global memory reads and writes, recomputing the number of translated characters per thread can be expensive. The plan is to start with approach 1 and implement `DeviceMultiStageTransduce`, which takes a function object that allocates memory for output offsets and returns output iterators for the transduced output and output indices.
For approach 2,

> which takes a function object that allocates memory for output offsets and returns output iterators for the transduced output and output indices
Instead, can we take an approach similar to CUB, where we pass `nullptr` and `DeviceMultiStageTransduce` decides to call `DeviceReduce` and estimates the output size; if it's not `nullptr`, it calculates the transduced output and output indices. Will this work?
> Instead, can we take an approach similar to CUB, where we pass `nullptr` and `DeviceMultiStageTransduce` decides to call `DeviceReduce` and estimates the output size; if it's not `nullptr`, it calculates the transduced output and output indices. Will this work?
The output iterators don't have to be pointers but can be arbitrary iterators. It will be difficult to agree on a sentinel value (like `nullptr` for pointers) that indicates "only return the output size but do not write to these iterators". Maybe we could do it if the output iterator types are `cub::NullType`.
I don't have a clear favorite. I think the advantage of the callback is that I expect it to be easier to implement, because we don't have to figure out the right way to properly re-enter the `Dispatch` instance after allocating the right outputs.
Here's an example function object implementation that we had in mind:

```cpp
template <typename OutT, typename OutIndexT>
struct allocate_and_get_output_iterators {
  rmm::device_uvector<OutT>& stack_operations_out;
  rmm::device_uvector<OutIndexT>& stack_operations_indexes;
  rmm::cuda_stream_view stream;

  ::cuda::std::tuple<OutT*, OutIndexT*> operator()(OffsetT num_transduced_out)
  {
    stack_operations_out.resize(num_transduced_out, stream);
    stack_operations_indexes.resize(num_transduced_out, stream);
    return ::cuda::std::make_tuple(
      stack_operations_out.data(),  // or a thrust::make_transform_iterator(stack_operations_out.data(), my_op_t{}), with a matching return type
      stack_operations_indexes.data());
  }
};
```
Another question that came up while I was thinking about all this:
Do we still want to keep the code path for when we are running the FST in "one go" (i.e., continue using decoupled look-back to merge the output offset computation and writing the translated output)? In previous discussions I always assumed the answer was "yes", because I assume it to be more performant. I would like to hear how you are thinking about this.
> Do we still want to keep the code path for when we are running the FST in "one go" (i.e., continue using decoupled look-back to merge the output offset computation and writing the translated output)? In previous discussions I always assumed the answer was "yes", because I assume it to be more performant. I would like to hear how you are thinking about this.
IMO, we can retain the decoupled look-back approach and introduce a 'low-memory' policy that adopts this new approach, i.e., one that writes the output offsets to global memory.