Implementing reduction for non-commutative operations
I'd like to investigate implementing a reduction for associative, but non-commutative operations. Related to https://github.com/NVIDIA/thrust/issues/1434
This kind of algorithm comes in handy when establishing the global context for parsing a regular language in chunks in parallel, like the token languages for JSON, CSV, etc. in cuDF. On a smaller scale, a segmented version of the algorithm can also be used to parse digits into integers.
Interface-wise, I don't see a reason to deviate from what `DeviceReduce::Reduce` does, so I imagined something like:
```cpp
template <typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T>
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::ReduceOrdered(
    void*           d_temp_storage,
    size_t&         temp_storage_bytes,
    InputIteratorT  d_in,
    OutputIteratorT d_out,
    int             num_items,
    ReductionOpT    reduction_op,
    T               init,
    cudaStream_t    stream            = 0,
    bool            debug_synchronous = false);
```
For the implementation, I would use the following hierarchy:
- Host-side: Call the reduction kernel once to compute block-local partial reductions, and a second time on the block partials to compute the final result.
- Kernel: Pretty much identical to `DeviceReduceKernel`.
- Agent: Similar to `AgentReduce`, only using `LoadDirectBlocked` instead of `LoadDirectStriped`, and doing a local sequential reduction before a block reduction.
- Block reduction: I haven't dug deep enough into the different specializations to see where/if they reorder inputs, but at least `BlockReduceRaking` supports it according to the documentation.
Now that I think of it, it might even make sense to reuse all the DeviceReduce infrastructure and only swap out/specialize the agent?
Hello @upsj! A few ideas on this:
- You could add `BlockLoadAlgorithm` as a template parameter to `AgentReducePolicy`.
- I wouldn't advise using `LoadDirectBlocked` in this context. The current tuning uses something like 20 items per thread, and loading 20 items per thread with that stride wouldn't utilize the memory system efficiently. Instead, experimenting with the actual block load facilities might be beneficial here: a partially specialized agent for `BLOCK_LOAD_TRANSPOSE` or `BLOCK_LOAD_WARP_TRANSPOSE` seems reasonable. In any case, benchmarks are needed to check this statement.
- A new specialization might require different tuning.
- I'm not sure about the name; you might consider spelling it as `ReduceNonCommutative`, or even requiring this property as an argument: `Reduce(..., properties<NonCommutative>{})`.
You can try using the following benchmark as a starting point for your research.
Thanks for your input, I think that gave me everything I need for the first steps. On top of the `BlockLoadAlgorithm`, I need to add an `init` parameter to the agent: for incomplete tiles, generically determining which loaded elements to include in the `ThreadReduce` seems difficult, so passing `init` as the out-of-bounds element seems safer and simpler. Also, vectorization will probably need to happen in the `BlockLoad`, and the decision whether to vectorize needs to be made by the policy, not the agent. I think I'll first open a PR adding the `BlockLoadAlgorithm` parameter and then deal with the actual non-commutative algorithm.