Implementing reduction for non-commutative operations
I'd like to investigate implementing a reduction for associative, but non-commutative operations. Related to https://github.com/NVIDIA/thrust/issues/1434
This kind of algorithm comes in handy when establishing the global context for parsing a regular language in chunks in parallel, like the token languages for JSON, CSV, etc. in cuDF. On a smaller scale, a segmented version of the algorithm can also be used to parse digits into integers.
Interface-wise, I don't see a reason to deviate from what `DeviceReduce::Reduce` does, so I imagined something like:
```cpp
template <typename InputIteratorT, typename OutputIteratorT, typename ReductionOpT, typename T>
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceReduce::ReduceOrdered(
    void*           d_temp_storage,
    size_t&         temp_storage_bytes,
    InputIteratorT  d_in,
    OutputIteratorT d_out,
    int             num_items,
    ReductionOpT    reduction_op,
    T               init,
    cudaStream_t    stream            = 0,
    bool            debug_synchronous = false);
```
For the implementation, I would use the following hierarchy:
- Host-side: Call the reduction kernel once to compute block-local partial reductions, and a second time on the block partials to compute the final result.
- Kernel: Pretty much identical to `DeviceReduceKernel`.
- Agent: Similar to `AgentReduce`, only using `LoadDirectBlocked` instead of `LoadDirectStriped`, and doing a local sequential reduction before a block reduction.
- Block reduction: I haven't dug deep enough into the different specializations to see where/if they reorder inputs, but at least `BlockReduceRaking` supports it according to the documentation.
Now that I think of it, it might even make sense to reuse all the DeviceReduce infrastructure and only swap out/specialize the agent?
Hello @upsj! A few ideas on this:
- You could add `BlockLoadAlgorithm` as a template parameter to `AgentReducePolicy`.
- I wouldn't advise using `LoadDirectBlocked` in this context. The current tuning uses something like 20 items per thread, and loading 20 items per thread with that stride wouldn't utilize the memory system efficiently. Instead, experimenting with the actual block load facilities might be beneficial here: a partially specialized agent for `BLOCK_LOAD_TRANSPOSE` or `BLOCK_LOAD_WARP_TRANSPOSE` seems reasonable. In any case, benchmarks are needed to check this statement.
- A new specialization might require different tuning.
- I'm not sure about the name; you might consider spelling it as `ReduceNonCommutative`, or even requiring this property as an argument: `Reduce(..., properties<NonCommutative>{})`.
You can try using the following benchmark as a starting point for your research.
Thanks for your input, I think that gave me everything I need for the first steps. On top of the `BlockLoadAlgorithm`, I need to add an `init` parameter to the agent: for incomplete tiles, generically determining which loaded elements to include in the `ThreadReduce` seems difficult, so passing `init` as the out-of-bounds element seems safer and simpler. Also, vectorization will probably need to happen in the `BlockLoad`, and the decision whether to vectorize needs to be made by the policy, not the agent. I think I'll first open a PR adding the `BlockLoadAlgorithm` parameter and then deal with the actual non-commutative algorithm.