DALI icon indicating copy to clipboard operation
DALI copied to clipboard

proposed more optimized versions of next_pow2 and prev_pow2 in util.h

Open ivHeisser opened this issue 2 months ago • 6 comments

• The current implementation of functions <next_pow2> and <prev_pow2> has complexity O(log(n)). A more optimal version of these functions for CPUs and NVIDIA GPUs has been proposed. Fully bitwise versions run in O(1) for CPU and NVIDIA GPU instead of O(log(n)). The efficiency of the bitwise version is that it doesn't use a loop, but performs only a fixed number of operations, propagating the most significant set bit to the right using bitwise shifts. This turns O(log n) into O(1) (for a fixed type size, such as 32 or 64 bits).

• Added description to <next_pow2>, <prev_pow2> and <is_pow2> functions.

Category:

Refactoring (Redesign of existing code that doesn't affect functionality)

Description:

  • Refactoring to improve performance

Additional information:

Affected modules and functionalities:

updated include/dali/core/utils.h

Key points relevant for the review:

focus on code realization.

Tests:

Existing tests apply

If you select Existing tests apply option, please list which test cases cover the introduced functionality. For example:

  • test_operator_gaussian_blur.py: test_gaussian*
  • tensor_list_test.cc: TensorListVariableBatchSizeTest* --->
  • [ ] Existing tests apply
  • [ ] New tests added
    • [ ] Python tests
    • [ ] GTests
    • [ *] Benchmark
    • [ ] Other
  • [ ] N/A

Checklist

Documentation

  • [ ] Existing documentation applies
  • [ ] Documentation updated
    • [ ] Docstring
    • [ ] Doxygen
    • [ ] RST
    • [ ] Jupyter
    • [ ] Other
  • [ ] N/A

DALI team only

Requirements

  • [ ] Implements new requirements
  • [ ] Affects existing requirements
  • [ ] N/A

REQ IDs: N/A

JIRA TASK: N/A

ivHeisser avatar Nov 15 '25 16:11 ivHeisser

Greptile Overview

Greptile Summary

This PR optimizes the next_pow2 and prev_pow2 utility functions in include/dali/core/util.h by replacing O(log n) loop-based implementations with O(1) bitwise operations. The optimization introduces hardware intrinsics (__clz/__clzll) for CUDA device code to efficiently count leading zeros, while providing a portable CPU fallback using bit propagation algorithms. Additionally, a new is_pow2 function is added using the standard (n & (n-1)) == 0 bit manipulation technique.

These utility functions are fundamental building blocks used throughout DALI's data processing pipelines for memory alignment calculations, buffer sizing operations, and tensor dimension management. The performance improvement is particularly significant for GPU workloads where these functions may be called frequently in device kernels. The change maintains API compatibility while providing comprehensive documentation for edge case behavior.

Important Files Changed

Filename Score Overview
include/dali/core/util.h 2/5 Critical optimization to power-of-2 utility functions with performance improvements but contains constexpr violations and incorrect edge case handling

Confidence score: 2/5

  • This PR has significant implementation issues that could break existing functionality and compilation
  • Score lowered due to constexpr violations in CPU fallback code, incorrect is_pow2 behavior for n=0, and potential edge case failures in type promotion logic
  • Pay close attention to the constexpr compatibility and mathematical correctness of the is_pow2 function

greptile-apps[bot] avatar Nov 15 '25 16:11 greptile-apps[bot]

  1. The current implementation has, in fact O(log2(N)) complexity, not O(N) as claimed in the description.
  2. The proposed log2(N) solution has no advantage over existing code - in fact, it's much more complex and very likely slower.

GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).

constexpr int next_pow2(int x)
{
    if (x <= 1)
        return 1;
    int y = 1 << (31 - __builtin_clz(x));
    return x > y ? y << 1 : y;
}

Regarding prev_pow2 - it makes no sense to implement it in terms of next_pow2, because it's actually much simpler when using __clz:

constexpr int prev_pow2(int x)
{
    if (x < 1)
        return 0;
    return 1 << (31 - __builtin_clz(x));
}

I'd recommend keeping separate implementations for those two functions.

mzient avatar Nov 17 '25 09:11 mzient

GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).

I can also add the support of MSVC with _lzcnt_u32 / _lzcnt_u64, if it is needed.

ivHeisser avatar Nov 17 '25 21:11 ivHeisser

GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).

I can also add the support of MSVC with _lzcnt_u32 / _lzcnt_u64, if it is needed.

I wouldn't do it, since our code won't compile with MSVC anyway and there's no way we could test it.

mzient avatar Nov 18 '25 10:11 mzient

  1. The current implementation has, in fact O(log2(N)) complexity, not O(N) as claimed in the description.
  2. The proposed log2(N) solution has no advantage over existing code - in fact, it's much more complex and very likely slower.

GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).

constexpr int next_pow2(int x)
{
    if (x <= 1)
        return 1;
    int y = 1 << (31 - __builtin_clz(x));
    return x > y ? y << 1 : y;
}

Regarding prev_pow2 - it makes no sense to implement it in terms of next_pow2, because it's actually much simpler when using __clz:

constexpr int prev_pow2(int x)
{
    if (x < 1)
        return 0;
    return 1 << (31 - __builtin_clz(x));
}

I'd recommend keeping separate implementations for those two functions.

  • I would like first to check the productivity gains/losses with and without __builtin_clz in my benchmark before move forward to __builtin_clz.

ivHeisser avatar Dec 02 '25 02:12 ivHeisser

  1. The current implementation has, in fact O(log2(N)) complexity, not O(N) as claimed in the description.
  2. The proposed log2(N) solution has no advantage over existing code - in fact, it's much more complex and very likely slower.

GCC and Clang support __builtin_clz which is a constexpr function. The following code works fine (of course, it's just for int).

constexpr int next_pow2(int x)
{
    if (x <= 1)
        return 1;
    int y = 1 << (31 - __builtin_clz(x));
    return x > y ? y << 1 : y;
}

Regarding prev_pow2 - it makes no sense to implement it in terms of next_pow2, because it's actually much simpler when using __clz:

constexpr int prev_pow2(int x)
{
    if (x < 1)
        return 0;
    return 1 << (31 - __builtin_clz(x));
}

I'd recommend keeping separate implementations for those two functions.

* I would like first to check the productivity gains/losses with and without `__builtin_clz` in [my benchmark](https://github.com/ivHeisser/DALI/blob/pow2_bench/dali/benchmark/pow2_bench.cu#L167) before move forward to `__builtin_clz`.

A quick test from my side (running on a single core of an Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz):

Testing 64-bit types
CLZ: 0.963019 ns/number
shift/or: 1.52322 ns/number
Testing 32-bit types
CLZ: 0.924483 ns/number
shift/or: 1.39187 ns/number
Testing 16-bit types
CLZ: 0.995852 ns/number
shift/or: 1.99418 ns/number
Testing 8-bit types
CLZ: 1.06897 ns/number
shift/or: 1.13313 ns/number

mzient avatar Dec 02 '25 13:12 mzient