[FEA]: Explore alternative tuning policy design
Is this a duplicate?
- [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
CUB
Is your feature request related to a problem? Please describe.
The c.parallel project re-uses CUB tuning policies through ptx-json. This approach has two disadvantages:
Maintainability
Policies have to be registered with a special macro:
```cpp
CUB_DETAIL_POLICY_WRAPPER_DEFINE(
  ReduceAgentPolicy,
  (GenericAgentPolicy),
  (BLOCK_THREADS, BlockThreads, int),
  (ITEMS_PER_THREAD, ItemsPerThread, int),
  (VECTOR_LOAD_LENGTH, VectorLoadLength, int),
  (BLOCK_ALGORITHM, BlockAlgorithm, cub::BlockReduceAlgorithm),
  (LOAD_MODIFIER, LoadModifier, cub::CacheLoadModifier))
```
Essentially, this duplicates the policy description, making the policy structure harder to maintain. The ptx-json machinery also imposes a certain structure on CUB code that is hard to maintain and extend. We assume that there is a more maintainable way to re-use tuning policies between c.parallel and CUB.
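For context, here is a simplified sketch (not the actual CUB definition) of the kind of policy struct whose members the macro restates; every `(MACRO_NAME, MemberName, type)` triple in the macro duplicates a field the policy already carries:

```cpp
#include <cub/cub.cuh> // cub::BlockReduceAlgorithm, cub::CacheLoadModifier

// Illustrative only: a reduce-style agent policy, simplified. The macro above
// must list BLOCK_THREADS, ITEMS_PER_THREAD, etc. a second time even though
// the policy struct already declares them.
template <int BlockThreads,
          int ItemsPerThread,
          int VectorLoadLength,
          cub::BlockReduceAlgorithm BlockAlgorithm,
          cub::CacheLoadModifier LoadModifier>
struct ReduceAgentPolicySketch
{
  static constexpr int BLOCK_THREADS      = BlockThreads;
  static constexpr int ITEMS_PER_THREAD   = ItemsPerThread;
  static constexpr int VECTOR_LOAD_LENGTH = VectorLoadLength;

  static constexpr cub::BlockReduceAlgorithm BLOCK_ALGORITHM = BlockAlgorithm;
  static constexpr cub::CacheLoadModifier LOAD_MODIFIER      = LoadModifier;
};
```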
Performance
The ptx-json approach relies on forming strings at compile time during kernel JIT-ting and later parsing these JSON strings. Looking at the radix sort test, parsing the JSON takes negligible time. Forming the strings at compile time, on the other hand, is noticeable. Radix sort is notorious for long kernel build times, yet even there, forming the ptx-json strings accounts for 8% of build time. The ratio might be higher for simpler kernels like reduce. We assume that there is a more performant way to re-use tuning policies between c.parallel and CUB.
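To make the cost concrete, here is a minimal sketch of the general technique (not the actual ptx-json implementation): strings are concatenated in constexpr contexts, and the compiler repeats that work for every kernel instantiation.

```cpp
#include <array>
#include <cstddef>

// Concatenate two string literals at compile time, keeping a single '\0'.
template <std::size_t N, std::size_t M>
constexpr std::array<char, N + M - 1> concat(const char (&a)[N], const char (&b)[M])
{
  std::array<char, N + M - 1> out{};
  for (std::size_t i = 0; i < N - 1; ++i) { out[i] = a[i]; }         // copy a without its '\0'
  for (std::size_t i = 0; i < M; ++i)     { out[N - 1 + i] = b[i]; } // copy b including '\0'
  return out;
}

// Each tuning value embedded this way triggers its own compile-time
// string-building work, repeated per kernel instantiation.
constexpr auto tuning_json = concat("{\"block_threads\":", "128}");
static_assert(tuning_json[0] == '{');
```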
Describe the solution you'd like
@bernhardmgruber suggested a constexpr-based solution, something along the lines of:
```cpp
// Aggregate usable both as a constexpr value and as a non-type template parameter:
struct ct_tuning_instance
{
  int block_size;
  int items_per_thread;
};

template <ct_tuning_instance tuning>
struct tuning_instance
{
  int block_size       = tuning.block_size;
  int items_per_thread = tuning.items_per_thread;
};

constexpr ct_tuning_instance get_sm80(cccl_type_enum type)
{
  switch (type)
  {
    case int32: return {128, 2};
    ...
  }
}

constexpr ct_tuning_instance get_sm90(cccl_type_enum type)
{
  switch (type)
  {
    case int32: return {256, 1};
    ...
  }
}

template <class InputIt>
struct default_hub
{
  using sm80 = tuning_instance<get_sm80(get_property1_at_ct<value_type<InputIt>>())>;
  using sm90 = tuning_instance<get_sm90(get_property1_at_ct<value_type<InputIt>>())>;
};

template <class InputItT, ...,
          class PolicyHub = default_hub<InputItT, ...>>
struct Alg { ...
```
This way, we could use the get_sm80 and get_sm90 functions on the C++ end to get tunings at compile time, but also pass runtime enum values on the c.parallel end to get a tuning policy without relying on NVRTC. Potentially, this could simplify tuning re-use without breaking users who pass custom tuning policies to the CUB dispatch layer.
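A minimal, self-contained sketch of that dual use follows. The names and tuning values are hypothetical, modeled on the snippet above (pick_tuning in particular is an invented helper, not a CCCL API):

```cpp
#include <cstdio>

// Hypothetical names modeled on the sketch above; not the actual CCCL API.
enum class cccl_type_enum { int32, int64 };

struct ct_tuning_instance
{
  int block_size;
  int items_per_thread;
};

constexpr ct_tuning_instance get_sm80(cccl_type_enum type)
{
  switch (type)
  {
    case cccl_type_enum::int32: return {128, 2};
    default:                    return {128, 1};
  }
}

constexpr ct_tuning_instance get_sm90(cccl_type_enum type)
{
  switch (type)
  {
    case cccl_type_enum::int32: return {256, 1};
    default:                    return {256, 1};
  }
}

// CUB side: called in a constant expression, so the result can
// parameterize kernel templates at compile time.
static_assert(get_sm80(cccl_type_enum::int32).block_size == 128);

// c.parallel side: the same functions run at runtime on enum values
// decoded from the C API, with no NVRTC involved.
ct_tuning_instance pick_tuning(int sm, cccl_type_enum type)
{
  return sm >= 90 ? get_sm90(type) : get_sm80(type);
}

int main()
{
  const ct_tuning_instance t = pick_tuning(90, cccl_type_enum::int32);
  std::printf("block_size=%d items_per_thread=%d\n", t.block_size, t.items_per_thread);
}
```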
This issue is about developing a proof-of-concept, constexpr-based tuning re-use implementation for evaluation purposes. The issue can be closed with:
- a branch showing an example of constexpr-based tuning used in device reduce instead of ptx-json
- a performance comparison of the ptx-json and constexpr-based build steps for device reduce
Describe alternatives you've considered
No response
Additional context
No response
The current state in #6544 completely removes the need for JSON handling and policy wrappers by using constexpr functions, which can be used at runtime in CCCL.C and at compile time in CUB and inside a kernel.
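For illustration, a hedged sketch of the "inside a kernel" part, building on the hypothetical cccl_type_enum / ct_tuning_instance / get_sm80 / get_sm90 helpers from the snippets above (this is not the #6544 code):

```cpp
// Hedged sketch, not the #6544 implementation: the same constexpr tuning
// functions evaluated inside device code, dispatched on __CUDA_ARCH__.
// Compile with nvcc and --expt-relaxed-constexpr (or mark the tuning
// functions __host__ __device__).
__global__ void example_kernel(int* out)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
  constexpr ct_tuning_instance tuning = get_sm90(cccl_type_enum::int32);
#else
  constexpr ct_tuning_instance tuning = get_sm80(cccl_type_enum::int32);
#endif
  // The tuning is a constant expression in device code too, so it can size
  // shared memory or select block-level algorithms.
  __shared__ int scratch[tuning.block_size];
  scratch[threadIdx.x] = static_cast<int>(threadIdx.x) * tuning.items_per_thread;
  __syncthreads();
  if (threadIdx.x == 0)
  {
    *out = scratch[0];
  }
}
```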
> a branch showing an example of constexpr-based tuning used in device reduce instead of ptx-json

This is already done in #6544.
> a performance comparison of the ptx-json and constexpr-based build steps for device reduce

This may be needed in #6544 to confirm that the new approach does not introduce any regressions.
The new design, as described in #6544, received design approval yesterday. That concludes the exploration requested here.