By 2025, change the default setting to WITH_DIV32=OFF in CMakeLists.txt
The current default WITH_DIV32=ON increases branch mispredictions and only makes sense on CPUs with slow 64-bit integer division.
One option is to put the burden of choosing ON vs OFF for WITH_DIV32 on the compiler instead. For example, clang today emits a dynamic check for the actual operand sizes in uint64_t division:
https://godbolt.org/z/Kxonbvz75
gcc does not:
https://godbolt.org/z/oKWf4j1M1
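For context, here is roughly the idea behind WITH_DIV32=ON (a minimal sketch under my own naming, not primecount's actual code): branch on the operand sizes and use the much cheaper 32-bit division whenever both operands fit into 32 bits. The branch is what causes the extra mispredictions mentioned above.

```cpp
#include <cstdint>

// Hypothetical sketch of the WITH_DIV32 fast path (not primecount's
// actual implementation): 32-bit DIV is much faster than 64-bit DIV
// on older x64 CPUs, but the extra branch can be mispredicted.
inline uint64_t div32_fast_div(uint64_t x, uint64_t y)
{
  if (x <= UINT32_MAX && y <= UINT32_MAX)
    return (uint32_t) x / (uint32_t) y;
  else
    return x / y;
}
```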
I have benchmarked WITH_DIV32 on an Intel i5-8400 CPU & Ubuntu x64:
- clang++-14 & WITH_DIV32=OFF:

```
$ ./primecount 1e15 -l -s
=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 1000000000000000
a = 1951957
threads = 6
=== phi(x, a) ===
Status: 100%
phi = 29844568470713
Seconds: 4.228
29844570422669
Seconds: 4.229
```
- clang++-14 & WITH_DIV32=ON:

```
$ ./primecount 1e15 -l -s
=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 1000000000000000
a = 1951957
threads = 6
=== phi(x, a) ===
Status: 100%
phi = 29844568470713
Seconds: 4.069
29844570422669
Seconds: 4.070
```
- g++-11 & WITH_DIV32=OFF:

```
$ ./primecount 1e15 -s -l
=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 1000000000000000
a = 1951957
threads = 6
=== phi(x, a) ===
Status: 100%
phi = 29844568470713
Seconds: 7.965
29844570422669
Seconds: 7.965
```
So the current WITH_DIV32=ON is still slightly faster than clang's default code generation and much faster than GCC's default code generation. Therefore it makes sense to wait a few years before using WITH_DIV32=OFF by default. Also note that the default Gourdon formula benefits much less from WITH_DIV32=ON.
Funnily enough, using floating point division for uint64_t/uint32_t is even faster on my PC. Since double precision floats represent all integers <= 2^53 exactly, the division can be done in double precision whenever both operands are <= 2^53:
```cpp
#include <cstdint>
#include <type_traits>

#if defined(__DBL_MANT_DIG__) && \
    __DBL_MANT_DIG__ == 53

/// Used for (64-bit / 32-bit) = 64-bit.
template <typename X, typename Y>
typename std::enable_if<(sizeof(X) == sizeof(uint64_t) &&
                         sizeof(Y) <= sizeof(uint32_t)), X>::type
fast_div(X x, Y y)
{
  // Both operands are exactly representable as doubles here,
  // so truncating the double quotient yields the exact result.
  if (x <= (X) (1ull << 53))
    return (X) ((double) x / (double) y);
  else
  {
    // Unsigned integer division is usually
    // faster than signed integer division.
    using UX = typename std::make_unsigned<X>::type;
    using UY = typename std::make_unsigned<Y>::type;
    return (UX) x / (UY) y;
  }
}

#endif
```
- clang++-14 & floating point division (WITH_DIV32=ON build):

```
$ ./primecount 1e15 -s -l
=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 1000000000000000
a = 1951957
threads = 6
=== phi(x, a) ===
Status: 100%
phi = 29844568470713
Seconds: 3.537
29844570422669
Seconds: 3.538
```
I don't have much experience with floating point numbers, so I am not sure whether we can safely rely on casting to floating point, doing the division and casting the result back to an integer. I also expect that in a few years, when CPUs have fast integer division, this code will run slower.
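One cheap way to gain confidence (a sanity check I sketched for this comment, not a proof and not part of primecount) is to compare fast_div() against plain integer division near the 2^53 cutoff:

```cpp
#include <cassert>
#include <cstdint>

// Assumes fast_div() from the snippet above is in scope.
// Exhaustively compares the floating point fast path against
// plain integer division for dividends just below the 2^53 cutoff.
int main()
{
  for (uint64_t x = (1ull << 53) - 100000; x <= (1ull << 53); x++)
    for (uint32_t y = 1; y <= 10; y++)
      assert(fast_div(x, y) == x / y);
  return 0;
}
```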
And here is the same benchmark using GCC 11 on my newer Intel i5-12600 Alder Lake CPU from 2021:
- g++-11 & WITH_DIV32=ON:
```
$ ./primecount 1e16 -s -l
=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 10000000000000000
a = 5761455
threads = 16
=== phi(x, a) ===
Status: 100%
phi = 279238335272471
Seconds: 11.240
279238341033925
Seconds: 11.245
```
- g++-11 & WITH_DIV32=OFF:
```
$ ./primecount 1e16 -s -l
=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 10000000000000000
a = 5761455
threads = 16
=== phi(x, a) ===
Status: 100%
phi = 279238335272471
Seconds: 12.383
279238341033925
Seconds: 12.388
```
- g++-11 & floating point division:
```
$ ./primecount 1e16 -s -l
=== pi_legendre(x) ===
pi(x) = phi(x, a) + a - 1
x = 10000000000000000
a = 5761455
threads = 16
=== phi(x, a) ===
Status: 100%
phi = 279238335272471
Seconds: 10.506
279238341033925
Seconds: 10.507
```
Integer division is already much faster on this newer CPU, to the extent that the performance of the 3 different solutions is quite close (with clang 14 the gap is even smaller). This is another hint that it indeed makes sense to change the default setting to WITH_DIV32=OFF in a few years.
This comment is perhaps against the spirit of this thread, but since we've been discussing integer division optimizations, here's yet another possibility. I've noticed that a lot of CPU time in primecount is spent dividing the same number by a sequence of numbers. On processors with slow dividers (e.g. Skylake), this use case can be optimized.
Here are my benchmark results for `primecount -t1 1e16` on a laptop with a Skylake processor. For newer processors (with fast dividers), the results would probably be very different.
| compiler | -march | orig (seconds) | new (seconds) | improvement |
|---|---|---|---|---|
| clang | skylake | 4.884 | 4.573 | 6.35% |
| clang | x86 | 4.970 | 4.769 | 4.04% |
| gcc | skylake | 4.903 | 4.738 | 3.36% |
| gcc | x86 | 5.264 | 4.977 | 5.46% |
EDIT (5/13): After looking closer at this, it seems that in the divisions "a / b" performed, the divisors (b) also repeat (for different dividends a). This means that libdivide is a much better fit here than this technique. This technique is only useful when the divisors do not repeat.
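For readers unfamiliar with libdivide, here is a minimal sketch of that approach (my own illustration, not primecount code; the function name is made up):

```cpp
#include <cstddef>
#include <cstdint>
#include "libdivide.h" // https://github.com/ridiculousfish/libdivide

// When the same divisor b is reused across many dividends a,
// precompute b's "magic numbers" once, then each a / b becomes a
// few multiplies and shifts instead of a hardware division.
uint64_t sum_of_quotients(const uint64_t* a, size_t n, uint64_t b)
{
  libdivide::divider<uint64_t> fast_b(b); // one-time precomputation
  uint64_t sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += a[i] / fast_b; // overloaded operator/, no div instruction
  return sum;
}
```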
Both GCC & Clang now optimize integer division by default, hence I have now disabled WITH_DIV32 by default in CMakeLists.txt: https://github.com/kimwalisch/primecount/commit/00dcb71b5e7008da5c5e9f48a47a26317d0b6c4c
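For anyone on an older CPU with slow 64-bit division, the fast path can presumably still be re-enabled at configure time in the usual CMake way (assuming the option name stays WITH_DIV32):

```
$ cmake -DWITH_DIV32=ON .
$ cmake --build .
```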