Proposed optimized versions of `next_pow2` and `prev_pow2` in util.h
• The current implementations of the `next_pow2` and `prev_pow2` functions have O(log n) complexity. Faster, fully bitwise versions for CPUs and NVIDIA GPUs are proposed, running in O(1) instead of O(log n). The bitwise versions are efficient because they use no loop and perform only a fixed number of operations, propagating the most significant set bit to the right with bitwise shifts; for a fixed type width (such as 32 or 64 bits), this turns O(log n) into O(1). A sketch of the technique follows this list.
• Added descriptions to the `next_pow2`, `prev_pow2` and `is_pow2` functions.
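For reference, here is a minimal sketch of the shift/or bit-propagation technique described above, for a 32-bit unsigned value (the function name is illustrative; the actual templated code in util.h may differ):

```cpp
#include <cstdint>

// Round n up to the next power of 2 using only shifts and ORs.
// The ORs propagate the most significant set bit into every lower
// position; a fixed 5 steps cover all 32 bits, hence O(1).
constexpr uint32_t next_pow2_shift_or(uint32_t n) {
  if (n <= 1)
    return 1;      // convention for 0 and 1; also avoids wrap-around at n == 0
  n -= 1;          // so that exact powers of 2 map to themselves
  n |= n >> 1;     // top 2 bits set
  n |= n >> 2;     // top 4 bits set
  n |= n >> 4;     // top 8 bits set
  n |= n >> 8;     // top 16 bits set
  n |= n >> 16;    // all bits below the MSB now set
  return n + 1;    // 0b0111...1 + 1 == the next power of 2
}
```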
Category:
Refactoring (Redesign of existing code that doesn't affect functionality)
Description:
- Refactoring to improve performance
Additional information:
Affected modules and functionalities:
updated `include/dali/core/util.h`
Key points relevant for the review:
Focus on the implementation details.
Tests:
Existing tests apply
- [ ] Existing tests apply
- [ ] New tests added
- [ ] Python tests
- [ ] GTests
- [x] Benchmark
- [ ] Other
- [ ] N/A
Checklist
Documentation
- [ ] Existing documentation applies
- [ ] Documentation updated
- [ ] Docstring
- [ ] Doxygen
- [ ] RST
- [ ] Jupyter
- [ ] Other
- [ ] N/A
DALI team only
Requirements
- [ ] Implements new requirements
- [ ] Affects existing requirements
- [ ] N/A
REQ IDs: N/A
JIRA TASK: N/A
Greptile Overview
Greptile Summary
This PR optimizes the `next_pow2` and `prev_pow2` utility functions in `include/dali/core/util.h` by replacing O(log n) loop-based implementations with O(1) bitwise operations. The optimization introduces hardware intrinsics (`__clz`/`__clzll`) for CUDA device code to efficiently count leading zeros, while providing a portable CPU fallback using bit-propagation algorithms. Additionally, a new `is_pow2` function is added using the standard `(n & (n - 1)) == 0` bit-manipulation technique.
These utility functions are fundamental building blocks used throughout DALI's data-processing pipelines for memory-alignment calculations, buffer sizing, and tensor dimension management. The performance improvement is particularly significant for GPU workloads, where these functions may be called frequently in device kernels. The change maintains API compatibility while providing comprehensive documentation of edge-case behavior.
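For context, the `(n & (n - 1)) == 0` test mentioned above clears the lowest set bit, so it is zero exactly when at most one bit is set. A hedged sketch (the name, signature, and zero-handling here are illustrative — the review notes below flag the n = 0 edge case in the actual PR code):

```cpp
#include <cstdint>

// n & (n - 1) clears the lowest set bit, so the result is 0 iff n has
// at most one bit set. The explicit n > 0 check excludes 0, which the
// bit trick alone would misclassify as a power of 2.
constexpr bool is_pow2_sketch(int64_t n) {
  return n > 0 && (n & (n - 1)) == 0;
}
```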
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| `include/dali/core/util.h` | 2/5 | Critical optimization to power-of-2 utility functions with performance improvements, but contains constexpr violations and incorrect edge-case handling |
Confidence score: 2/5
- This PR has significant implementation issues that could break existing functionality and compilation
- Score lowered due to constexpr violations in the CPU fallback code, incorrect `is_pow2` behavior for n = 0, and potential edge-case failures in the type promotion logic
- Pay close attention to the constexpr compatibility and mathematical correctness of the `is_pow2` function
- The current implementation has, in fact, `O(log2(N))` complexity, not `O(N)` as claimed in the description.
- The proposed `log2(N)` solution has no advantage over the existing code - in fact, it's much more complex and very likely slower.
GCC and Clang support `__builtin_clz`, which is a constexpr function. The following code works fine (of course, it's just for `int`).
```cpp
constexpr int next_pow2(int x)
{
  if (x <= 1)
    return 1;
  int y = 1 << (31 - __builtin_clz(x));
  return x > y ? y << 1 : y;
}
```
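For example, for `x = 5`: `__builtin_clz(5)` is 29, so `y = 1 << 2 = 4`, and since `5 > 4` the function returns `8`; for `x = 4`, `y = 4` is returned unchanged.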
Regarding `prev_pow2` - it makes no sense to implement it in terms of `next_pow2`, because it's actually much simpler when using `__clz`:
```cpp
constexpr int prev_pow2(int x)
{
  if (x < 1)
    return 0;
  return 1 << (31 - __builtin_clz(x));
}
```
I'd recommend keeping separate implementations for those two functions.
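As a side note, the Greptile summary above mentions the `__clz`/`__clzll` intrinsics for CUDA device code; here is a hedged sketch of how a host/device split might look (the dispatch and function name are illustrative, not the PR's actual code — note that the device intrinsic is not constexpr, so this combined version is merely inline):

```cpp
// Illustrative sketch only (compiles with nvcc). On the device we use the
// CUDA __clz intrinsic; on the host, the GCC/Clang __builtin_clz builtin.
__host__ __device__ inline int prev_pow2_clz(int x) {
  if (x < 1)
    return 0;
#ifdef __CUDA_ARCH__
  return 1 << (31 - __clz(x));          // device path: CUDA intrinsic
#else
  return 1 << (31 - __builtin_clz(x));  // host path: GCC/Clang builtin
#endif
}
```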
> GCC and Clang support `__builtin_clz`, which is a constexpr function. The following code works fine (of course, it's just for `int`).
I can also add support for MSVC with `_lzcnt_u32` / `_lzcnt_u64`, if needed.
> GCC and Clang support `__builtin_clz`, which is a constexpr function. The following code works fine (of course, it's just for `int`).
>
> I can also add support for MSVC with `_lzcnt_u32` / `_lzcnt_u64`, if needed.
I wouldn't do it, since our code won't compile with MSVC anyway and there's no way we could test it.
* I would first like to check the performance gains/losses with and without `__builtin_clz` in [my benchmark](https://github.com/ivHeisser/DALI/blob/pow2_bench/dali/benchmark/pow2_bench.cu#L167) before moving forward with `__builtin_clz`.
A quick test from my side (running on a single core of an Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz):
| Type width | CLZ (ns/number) | shift/or (ns/number) |
|---|---|---|
| 64-bit | 0.963019 | 1.52322 |
| 32-bit | 0.924483 | 1.39187 |
| 16-bit | 0.995852 | 1.99418 |
| 8-bit | 1.06897 | 1.13313 |
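For reference, a minimal sketch of how such a per-number timing could be structured (this is a hypothetical harness, not the linked pow2_bench.cu; the lambda shows a 32-bit CLZ-based `next_pow2` as one measured variant):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Time f over a buffer of inputs and report nanoseconds per number.
template <typename F>
double bench_ns_per_number(F f, const std::vector<uint32_t> &in) {
  uint32_t sink = 0;  // accumulate results so the loop isn't optimized away
  auto t0 = std::chrono::steady_clock::now();
  for (uint32_t v : in)
    sink += f(v);
  auto t1 = std::chrono::steady_clock::now();
  asm volatile("" : : "r"(sink));  // keep sink observable (GCC/Clang)
  return std::chrono::duration<double, std::nano>(t1 - t0).count() / in.size();
}

int main() {
  std::mt19937 rng(42);
  std::vector<uint32_t> in(1 << 22);
  for (auto &v : in)
    v = (rng() & 0x3FFFFFFFu) + 1;  // keep inputs where next_pow2 fits in 32 bits
  double t = bench_ns_per_number([](uint32_t v) {
    return v <= 1 ? 1u : 1u << (32 - __builtin_clz(v - 1));
  }, in);
  printf("CLZ: %g ns/number\n", t);
  return 0;
}
```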