primecount icon indicating copy to clipboard operation
primecount copied to clipboard

By 2025 change default setting to WITH_DIV32=OFF in CMakeLists.txt

Open kimwalisch opened this issue 3 years ago • 4 comments

The current default WITH_DIV32=ON increases branch mispredictions and only makes sense on CPUs with slow 64-bit integer division.

kimwalisch avatar May 04 '22 06:05 kimwalisch

One option is to put the burden of choosing ON vs OFF for WITH_DIV32 on the compiler instead. For example, clang today includes a dynamic check for actual operand sizes in uint64_t division:

https://godbolt.org/z/Kxonbvz75

gcc does not:

https://godbolt.org/z/oKWf4j1M1

zielaj avatar May 04 '22 14:05 zielaj

I have benchmarked WITH_DIV32 on an Intel i5-8400 CPU & Ubuntu x64:

  • clang++-14 & WITH_DIV32=OFF:
$ ./primecount 1e15 -l -s

=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 1000000000000000
a = 1951957
threads = 6

=== phi(x, a) ===
Status: 100%                                      
phi = 29844568470713
Seconds: 4.228

29844570422669
Seconds: 4.229
  • clang++-14 & WITH_DIV32=ON:
$ ./primecount 1e15 -l -s

=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 1000000000000000
a = 1951957
threads = 6

=== phi(x, a) ===
Status: 100%                                      
phi = 29844568470713
Seconds: 4.069

29844570422669
Seconds: 4.070
  • g++-11 & WITH_DIV32=OFF:
$ ./primecount 1e15 -s -l

=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 1000000000000000
a = 1951957
threads = 6

=== phi(x, a) ===
Status: 100%                                      
phi = 29844568470713
Seconds: 7.965

29844570422669
Seconds: 7.965

So the current WITH_DIV32=ON is still slightly faster than clang's default code generation and much faster than GCC's default code generation. Therefore it makes sense to wait a few years before using WITH_DIV32=OFF by default. Also note that the default Gourdon formula benefits much less from WITH_DIV32=ON.

Funnily using floating point division for uint64_t/uint32_t is even faster on my PC. Theoretically if both operands are <= 2^53 then double precision floats can represent all integers <= 2^53 exactly:

#if defined(__DBL_MANT_DIG__) && \
    __DBL_MANT_DIG__ == 53

/// Used for  (64-bit / 32-bit) =  64-bit.
template <typename X, typename Y>
typename std::enable_if<(sizeof(X) == sizeof(uint64_t) && sizeof(Y) <= sizeof(uint32_t)), X>::type
fast_div(X x, Y y)
{
    if (x <= (X)(1ll << 53))
        return (X)((double)x / (double)y);
    else {
        // Unsigned integer division is usually
        // faster than signed integer division.
        using UX = typename std::make_unsigned<X>::type;
        using UY = typename std::make_unsigned<Y>::type;
        return (UX)x / (UY)y;
    }
}

#endif
  • clang++-14 & WITH_DIV32=ON:
$ ./primecount 1e15 -s -l

=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 1000000000000000
a = 1951957
threads = 6

=== phi(x, a) ===
Status: 100%                                      
phi = 29844568470713
Seconds: 3.537

29844570422669
Seconds: 3.538

I don't have much experience with floating point numbers, I am not sure if we can safely rely on casting to floating point, doing the division & casting back to integer. Also I expect that in a few years when CPUs have fast integer division this code will run slower.

kimwalisch avatar May 06 '22 14:05 kimwalisch

And here is the same benchmark using GCC11 on my latest Intel i5-12600 Alder Lake CPU from 2021:

  • g++-11 & WITH_DIV32=ON:
$ ./primecount 1e16 -s -l
=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 10000000000000000
a = 5761455
threads = 16

=== phi(x, a) ===
Status: 100%                                      
phi = 279238335272471
Seconds: 11.240

279238341033925
Seconds: 11.245
  • g++-11 & WITH_DIV32=OFF:
$ ./primecount 1e16 -s -l
=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 10000000000000000
a = 5761455
threads = 16

=== phi(x, a) ===
Status: 100%                                      
phi = 279238335272471
Seconds: 12.383

279238341033925
Seconds: 12.388
  • g++-11 & floating point division:
$ ./primecount 1e16 -s -l
=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 10000000000000000
a = 5761455
threads = 16

=== phi(x, a) ===
Status: 100%                                      
phi = 279238335272471
Seconds: 10.506

279238341033925
Seconds: 10.507

Integer division is already much faster on this newer CPU to the extent that the performance of the 3 different solutions is quite close (for clang14 the performance is even closer). This is another hint that it indeed makes sense to change the default setting to WITH_DIV32=OFF in a few years.

kimwalisch avatar May 06 '22 17:05 kimwalisch

This comment is perhaph against the spirit of this thread, but since we've been discussing integer division optimizations, here's yet another possibility. I've noticed that a lot of CPU in primecount is spent dividing the same number by a sequence of numbers. On processors with slow dividers (eg. Skylake), this use case can be optimized.

Here are my benchmark results for primecount -t1 1e16 on a laptop with a Skylake processor. For newer processors (with fast dividers), the results would probably be very different.

compiler -march orig (seconds) new (seconds) improvement
clang skylake 4.884 4.573 6.35%
clang x86 4.970 4.769 4.04%
gcc skylake 4.903 4.738 3.36%
gcc x86 5.264 4.977 5.46%

EDIT(5/13): After looking closer at this, it seems that in the divisions "a / b" performed, the divisors (b) also repeat (for different dividends a). This means that libdivide is a much better fit here than this technique. This technique is only useful when numbers the divisors do not repeat.

zielaj avatar May 11 '22 19:05 zielaj

Both GCC & Clang now optimize integer division by default, hence I have now disable WITH_DIV32by default in CMakeLists.txt: https://github.com/kimwalisch/primecount/commit/00dcb71b5e7008da5c5e9f48a47a26317d0b6c4c

kimwalisch avatar Mar 23 '24 22:03 kimwalisch