std.math: change gcd's implementation to use Stein's algorithm instead of Euclid's
according to my tests (on an x86_64 machine, which has a hardware ctz instruction), Stein's algorithm is on par with Euclid's for small (≤ 100) random inputs, and roughly 40% faster for large random inputs (random u64s).
It might be worth testing this on a CPU that doesn't have a hardware ctz instruction or equivalent, just to make sure Stein's algorithm is still a performance improvement there, too! Wikipedia informs me that some modern CPUs are like this and gives the ARM Cortex-M series as an example.
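for context, here is a minimal sketch of the binary (Stein) GCD over u64 - just the shape of the approach, not the exact code in this PR's diff:

const std = @import("std");

fn binaryGcd(a: u64, b: u64) u64 {
    if (a == 0) return b;
    if (b == 0) return a;
    var x = a;
    var y = b;
    // the common factor of two: min(@ctz(a), @ctz(b)) == @ctz(a | b)
    const shift = @ctz(x | y);
    // make x odd; the loop keeps it odd
    x >>= @intCast(@ctz(x));
    while (true) {
        // strip factors of two from y, then subtract the smaller odd value from the larger
        y >>= @intCast(@ctz(y));
        if (x > y) std.mem.swap(u64, &x, &y);
        y -= x;
        if (y == 0) return x << @intCast(shift);
    }
}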
Can you share your testing methodology?
Since you're trading integer division for bit-twiddling, I imagine Stein's would be faster. Here's an example of both implementations compiled for an ARMv4T core (the same as the GBA uses). Note that gcdStd calls __aeabi_udivmod, which ends up calling div_u32, so you're trading one form of bit-twiddling for another.
Can you share your testing methodology?
@andrewrk
sure, it's not anything sophisticated - I'm running through 10,000,000 random pairs of numbers, computing their gcd once with each algorithm (and also once just XORing the values, to make sure the RNG doesn't account for much time on its own...).
the three programs are compiled with -OReleaseFast and performance is compared with poop.
the program for Euclid's is given below; the only difference between it and the others is the call to std.math.gcd:
const std = @import("std");

const N = u64;

pub fn main() void {
    // init with a runtime known seed
    var rand = std.Random.Xoroshiro128.init(std.os.argv.len);
    var res: N = 0;
    for (0..10_000_000) |_| {
        res +%= @truncate(std.math.gcd(rand.random().int(N), rand.random().int(N)));
    }
    // do something with the result... can't let LLVM be too smart...
    std.debug.print("{}\n", .{res});
}
I copied @Fri3dNstuff's test and ran it on the machines I have access to. TL;DR: Stein wins every time, except in these cases:
- Riscv64 without specifying the Zbb extension. This uses a software ctz, and is slightly slower.
- The POWER10 cfarm server, for some reason?
Some devices are too constrained to run perf, or it is disabled in their kernel. For these, time was used instead.
@andrewrk I could only ever get poop running on x86_64. Every other target gave errors along these lines:
/nix/store/nmq6p6jcmqdpj2bdccb0pna3k4fzpn0f-zig-0.14.0-dev.1261+c26206112/lib/std/posix.zig:7195:23: 0x106c59f in perf_event_open (poop)
/private/tmp/poop/src/main.zig:201:54: 0x106719b in main (poop)
/nix/store/nmq6p6jcmqdpj2bdccb0pna3k4fzpn0f-zig-0.14.0-dev.1261+c26206112/lib/std/start.zig:614:37: 0x105ba57 in posixCallMainAndExit (poop)
???:?:?: 0x0 in ??? (???)
Is this a known issue?
https://lemire.me/blog/2024/04/13/greatest-common-divisor-the-extended-euclidean-algorithm-and-speed/ suggests a hybrid approach is likely superior to the plain binary approach, and incidentally is also what libc++ has switched to
I recommend looking at the code for gcd in Algorithmica. It claims to have a faster implementation than what this PR is suggesting, due to data dependencies between instructions.
See my implementation in zig here: https://github.com/ProkopRandacek/zig/blob/better-gcd/lib/std/math/gcd.zig.
@scheibo
https://lemire.me/blog/2024/04/13/greatest-common-divisor-the-extended-euclidean-algorithm-and-speed/ suggests a hybrid approach is likely superior to the plain binary approach, and incidentally is also what libc++ has switched to
translating the C++ code to Zig and comparing it with the Stein implementation on randomly generated values of different lengths yields nearly identical running times (at least on my machine). below is the translated code - maybe I missed something?
var x: N = a;
var y: N = b;
if (x < y) {
    const tmp = y;
    y = x;
    x = tmp;
}
if (y == 0) return x;
x %= y;
if (x == 0) return y;
const i = @ctz(x);
const j = @ctz(y);
const shift = @min(i, j);
x >>= @intCast(i);
y >>= @intCast(j);
while (true) {
    // underflow is legal
    const diff = x -% y;
    if (x > y) {
        x = y;
        y = diff;
    } else {
        y -= x;
    }
    // the shift amount must be less than the bit size
    if (diff != 0) y >>= @intCast(@ctz(diff));
    if (y == 0) return x << @intCast(shift);
}
@ProkopRandacek
See my implementation in zig here: https://github.com/ProkopRandacek/zig/blob/better-gcd/lib/std/math/gcd.zig.
I have tested it against the Stein implementation in the pull request, with random u63 values for both (cast to u64s in the pull request implementation) and indeed your implementation runs 15 to 20 percent faster (!)
the algorithm, however, uses signed integers (expecting the caller to only pass in non-negatives) - do you know if there's a simple way to adapt the algorithm to accept and return unsigned numbers? calling an i65 version of the algorithm for full-range u64s and casting back the result seems very expensive...
do you know if there's a simple way to adapt the algorithm to accept and return unsigned numbers
I don't think there is. And I think that in the name of performance it is more than reasonable to change the function signature from (a: anytype, b: anytype) to (a: i64, b: i64). You probably rarely need to calculate gcd of numbers that don't fit in this range.
Making the function clever and automatically casting u64 to i65 is probably bad because as you said it makes the code a lot slower.
@Fri3dNstuff That's only because std::gcd can accept signed ints. Note that in the libc++ PR there is a static assert that checks if the type of a or b is unsigned. We don't have to care about that.
Btw, I've updated my benchmark with your translation and it's better on all targets except my mipsel router for some reason.
You probably rarely need to calculate gcd of numbers that don't fit in this range.
Big rationals say hi. :)
Of course, you can make the implementation use a different algorithm based on the type of the arguments to be able to use the faster path when possible
You probably rarely need to calculate gcd of numbers that don't fit in this range.
Big rationals say hi. :)
On second thought, you're right.
Of course, you can make the implementation use a different algorithm based on the type of the arguments to be able to use the faster path when possible
My worry is that u64 is common and generally regarded as a fast data type, yet using it here gives you the slow implementation. Good design here would be to let users fall into the pit of success and use the i64 version. Only when they really need a larger int type and know that there is a cost, should they be given the general implementation.
Here that could look like having 2 functions: gcd(i64, i64) and gcdSlow(anytype, anytype).
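A rough sketch of what that split could look like (the body of the fast path is just a stand-in that forwards to the generic routine so the example stays self-contained; the real one would be the optimised signed algorithm):

const std = @import("std");

/// Fast path: word-sized, non-negative inputs.
pub fn gcd(a: i64, b: i64) i64 {
    std.debug.assert(a >= 0 and b >= 0);
    std.debug.assert(a != 0 or b != 0);
    // the optimised signed/binary implementation would go here
    return @intCast(gcdSlow(@as(u64, @intCast(a)), @as(u64, @intCast(b))));
}

/// Generic fallback for wider or exotic unsigned integer types
/// (plain Euclid, for brevity).
pub fn gcdSlow(a: anytype, b: anytype) @TypeOf(a, b) {
    var x = a;
    var y = b;
    while (y != 0) {
        const r = x % y;
        x = y;
        y = r;
    }
    return x;
}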
I have played around with @ProkopRandacek's algorithm, attempting to make it use unsigned numbers. surprisingly, this new version is 18% faster on my machine. can anyone please confirm that this version is indeed faster, and not just a fluke of my computer? if it is, I'll update the pull request with a generic version of this.
here's the algorithm. I ran both it and @ProkopRandacek's through 10,000,000 pairs of random u63s, wrap-summing their results (as described in a comment of mine above). both algorithms produce the same sum: 223029890 - the signed algorithm obviously produces an i64 as output, but the result is in the positive range, so they agree; we "got lucky" with that...
fn gcd(a: u64, b: u64) u64 {
    std.debug.assert(a != 0 or b != 0);
    var x = a;
    var y = b;
    if (x == 0) return y;
    if (y == 0) return x;
    var xz = @ctz(x);
    const yz = @ctz(y);
    const shift = @min(xz, yz);
    y >>= @intCast(yz);
    while (true) {
        x >>= @intCast(xz);
        var diff = y -% x;
        if (diff == 0) return y << @intCast(shift);
        xz = @ctz(diff);
        if (x > y) diff = -%diff;
        y = @min(x, y);
        x = diff;
    }
}
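as a quick correctness spot check (separate from the timing runs; this assumes the function above is in scope, with std imported as in the earlier snippets), it can be compared against the current std.math.gcd directly:

test "unsigned adaptation agrees with std.math.gcd" {
    // a few fixed, nonzero pairs chosen for illustration; the benchmark above used random u63s
    const pairs = [_][2]u64{
        .{ 48, 18 },
        .{ 1, 1 },
        .{ 270, 192 },
        .{ 7, 13 },
        .{ 1 << 40, 1 << 20 },
    };
    for (pairs) |p| {
        try std.testing.expectEqual(std.math.gcd(p[0], p[1]), gcd(p[0], p[1]));
    }
}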
I have refined the algorithm a bit further, and managed to squeeze a few more percent of performance out of it. I believe these improvements are applicable generally (and aren't just LLVM liking the new version better, specifically in the case of my machine) - but without tests on other machines I can't be sure...
changed the loop a bit, and shuffled some declarations around - this seems to help with register allocation. also, I think I understand why this version is faster than @ProkopRandacek's signed one: the negation of diff and the modification of y happen right beside each other, so there is only a need to cmp x, y and cmov based on the result! below is the refined algorithm:
fn gcd(a: u64, b: u64) u64 {
    std.debug.assert(a != 0 or b != 0);
    var x = a;
    var y = b;
    if (x == 0) return y;
    if (y == 0) return x;
    const xz = @ctz(x);
    const yz = @ctz(y);
    const shift = @min(xz, yz);
    x >>= @intCast(xz);
    y >>= @intCast(yz);
    var diff = y -% x;
    while (diff != 0) {
        const zeros = @ctz(diff);
        if (x > y) diff = -%diff;
        y = @min(x, y);
        x = diff >> @intCast(zeros);
        diff = y -% x;
    }
    return y << @intCast(shift);
}
for the generic case, I think it would be best to just use the usize version for integer lengths not larger than usize, and cast the result - x86, for example, does not have a ctz instruction for 8-bit operands (only 16, 32, and 64) - further, using exotic-length integers will make LLVM sprinkle &s everywhere, for the wrapping subtractions.
does anyone know of an architecture where the algorithm works better for some integer length shorter than usize? I know nearly nothing beyond x86 :/
@Fri3dNstuff seems to work fine for me. I altered the benchmark to generate ints that are 1 bit less than usize before extending to usize again (i.e. u31 -> u32, u63 -> u64). This new version still comes out on top.
I also added a build option to change the bit width of the random integers. They're extended to usize, but you can hack it to change that however you want.
I changed the implementation to use the optimised version, @The-King-of-Toasters, thank you so much for the perf tests!
there is still the question of expanding the width of the integers used in the calculation, in case the user wants to use some exotic-length ints (currently the algorithm performs quite badly with them). LLVM generates a bunch of n & std.math.maxInt(@TypeOf(n))s to zero out parts of the number whenever -% is used...
we may have to make a table of the fast integer sizes for each architecture, and do some comptime switches for the implementation based on that (on my machine it seems to be fastest to always convert to u64s before the calculation, and @intCast back at the end)
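a rough sketch of the kind of widening wrapper being discussed - the names are made up, and the always-widen-to-u64 choice is just an assumption based on my x86_64 measurements, not a per-architecture table:

const std = @import("std");

pub fn gcdWidened(a: anytype, b: anytype) @TypeOf(a, b) {
    const T = @TypeOf(a, b);
    // unsigned inputs assumed, as std.math.gcd requires
    if (comptime @bitSizeOf(T) <= 64) {
        // widen to u64 so @ctz and -% map to full-width instructions,
        // then cast the result back down
        return @intCast(gcd64(a, b));
    }
    // wider-than-64-bit types fall through to the generic implementation
    return std.math.gcd(a, b);
}

// stand-in for the optimised u64 routine above
fn gcd64(a: u64, b: u64) u64 {
    return std.math.gcd(a, b);
}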