STL icon indicating copy to clipboard operation
STL copied to clipboard

`<bit>`: Could `has_single_bit()` be faster?

Open pps83 opened this issue 9 months ago • 6 comments

Seems like it should convert to popcount(x) == 1 if it's available on the target arch, but looking at the code, doesn't seem like compiler would do it:

_NODISCARD constexpr bool has_single_bit(const _Ty _Val) noexcept {
    return _Val != 0 && (_Val & (_Val - 1)) == 0;
}

https://github.com/microsoft/STL/blob/f2a2933dd65d9e8d3fa698a97b6074f7ef00e1fd/stl/inc/bit#L87

pps83 avatar Mar 25 '25 15:03 pps83

I'm not sure.

For the current implementation I'd expect something as following (assuming x64 arch, as it is primary optimization target):

or   reg1, reg1
jz   label1
lea  reg2, [reg1-1]
test reg1, reg2
label1:

And for popcount as following;

popcnt reg2, reg1
cmp    reg2, 1

Which of these is better would highly depend on the context. I'd expect the current branchy function to win for predictable input, and the popcnt branchless function towin for unpredictable input.

In sight of this uncertainty, I would avoid making any changes and introducing unnecessary complexity.

AlexGuteniev avatar Mar 25 '25 17:03 AlexGuteniev

Oh, it also seems CPU vendor specific. AMDs have cheaper popcnt, could be the winner most of the time there.

AlexGuteniev avatar Mar 25 '25 18:03 AlexGuteniev

According to GCC codegen, popcnt is the winner if available, yes.

If not, the best answer is (v ^ (v-1)) > v-1.

https://godbolt.org/z/d61bxW4r1

Alcaro avatar Mar 26 '25 13:03 Alcaro

Ideally, our compilers would do the right thing here, so we may need to report optimization bug(s) to C1XX and/or Clang.

StephanTLavavej avatar Mar 26 '25 21:03 StephanTLavavej

According to GCC codegen, popcnt is the winner if available, yes.

Yes, seems like gcc/clang use popcount for cpus starting from nehalem: https://godbolt.org/z/s4MnvcfW7 Nehalem was released in 2008.

If not, the best answer is (v ^ (v-1)) > v-1.

https://godbolt.org/z/d61bxW4r1

this one seems to be better for msvc and others for generic cpu. Looks like gcc/clang use this one as the codegen matches std::has_single_bit for x64/arm cpus

pps83 avatar Mar 27 '25 00:03 pps83

I just edited that file in github.dev and made that PR directly from the browser. Crazy, full blown code editor in browser :)

Image

pps83 avatar Mar 27 '25 00:03 pps83

@pps83 are you interested in popcnt PR? Otherwise I can do it.

AlexGuteniev avatar May 23 '25 07:05 AlexGuteniev

@pps83 are you interested in popcnt PR? Otherwise I can do it.

I can do it, but I'll need some assistance to be able to do it quickly. I just looked into it and here's how it can be done.

bit includes __msvc_bit_utils.hpp. __msvc_bit_utils.hpp defines _HAS_POPCNT_INTRINSICS, but it gets undefined. I need to change that, to keep the define so that inside bit I could do something like that:

_EXPORT_STD template <_Standard_unsigned_integral _Ty>
_NODISCARD constexpr bool has_single_bit(const _Ty _Val) noexcept {
#if _HAS_POPCNT_INTRINSICS
    if (_STD is_constant_evaluated()) {
        return _Val != 0 && (_Val & (_Val - 1)) == 0;
    }
    return _Unchecked_popcount(_Val) == 1;
#else
    return _Val != 0 && (_Val & (_Val - 1)) == 0;
#endif // _HAS_POPCNT_INTRINSICS
}

there is already _HAS_COUNTL_ZERO_INTRINSICS define that similarly gets defined inside __msvc_bit_utils.hpp and used by bit header.

Is that ok?

pps83 avatar May 23 '25 09:05 pps83

https://github.com/microsoft/STL/pull/5534

pps83 avatar May 23 '25 09:05 pps83

there is already _HAS_COUNTL_ZERO_INTRINSICS define that similarly gets defined inside __msvc_bit_utils.hpp and used by bit header.

Is that ok?

Yes, I think so.

AlexGuteniev avatar May 23 '25 10:05 AlexGuteniev