
std.math: change gcd's implementation to use Stein's algorithm instead of Euclid's

Open Fri3dNstuff opened this issue 1 year ago • 4 comments

according to my tests (on an x86_64 machine, which has a @ctz instruction) Stein's algorithm is on par with Euclid's on small (≤ 100) random inputs, and roughly 40% faster on large random inputs (random u64s).
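For context, the two algorithms under discussion can be sketched in C roughly like this (illustrative sketches only, not the Zig implementations being benchmarked - Euclid's does one hardware divide per step, Stein's uses only shifts, subtraction, and count-trailing-zeros):

```c
#include <assert.h>
#include <stdint.h>

/* Euclid's algorithm: repeated remainder, one divide per iteration. */
static uint64_t gcd_euclid(uint64_t a, uint64_t b) {
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Stein's (binary) algorithm: shifts, subtraction, and ctz only. */
static uint64_t gcd_stein(uint64_t a, uint64_t b) {
    if (a == 0) return b;
    if (b == 0) return a;
    int shift = __builtin_ctzll(a | b); /* shared powers of two */
    b >>= __builtin_ctzll(b);           /* make b odd */
    do {
        a >>= __builtin_ctzll(a);       /* a is nonzero here */
        if (a < b) {                    /* keep a >= b */
            uint64_t t = a;
            a = b;
            b = t;
        }
        a -= b;                         /* odd - odd = even */
    } while (a != 0);
    return b << shift;                  /* restore the common factor */
}
```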

Fri3dNstuff avatar Aug 14 '24 17:08 Fri3dNstuff

It might be worth testing this on a CPU that doesn't have a hardware ctz instruction or equivalent, just to make sure Stein's algorithm is still a performance improvement there, too! Wikipedia informs me that some modern CPUs are like this and gives the ARM-Mx series as an example.
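On such targets the compiler has to lower @ctz to software. One common fallback is the classic de Bruijn multiply-and-lookup trick (constant and table from Bit Twiddling Hacks); a C sketch, for a sense of what "software ctz" costs:

```c
#include <assert.h>
#include <stdint.h>

/* Portable count-trailing-zeros for nonzero x: isolate the lowest set
   bit, multiply by a de Bruijn constant, and look up the position. */
static int ctz32_soft(uint32_t x) {
    static const int table[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    return table[((x & -x) * 0x077CB531u) >> 27];
}
```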

SilasLock avatar Aug 14 '24 21:08 SilasLock

Can you share your testing methodology?

andrewrk avatar Aug 16 '24 16:08 andrewrk

Since you're trading integer division for bit-twiddling, I imagine Stein's would be faster. Here's an example of both implementations compiled for a v4t core (the same core the GBA uses). Note that gcdStd calls __aeabi_udivmod, which ends up calling div_u32, so you're trading one form of bit twiddling for another.

The-King-of-Toasters avatar Aug 17 '24 00:08 The-King-of-Toasters

Can you share your testing methodology?

@andrewrk

sure, it's not anything sophisticated - I'm running through 10,000,000 random pairs of numbers, computing their gcd, once with each algorithm (also once just xoring the values, to make sure the rng doesn't account for much time on its own...).

the three programs are compiled with -OReleaseFast and performance is compared with poop.

the program for Euclid's is given below; the only difference between it and the others is the call to std.math.gcd:

```zig
const std = @import("std");
const N = u64;

pub fn main() void {
    // init with a runtime-known seed
    var rand = std.Random.Xoroshiro128.init(std.os.argv.len);
    var res: N = 0;
    for (0..10_000_000) |_| {
        res +%= @truncate(std.math.gcd(rand.random().int(N), rand.random().int(N)));
    }
    // do something with the result... can't let LLVM be too smart...
    std.debug.print("{}\n", .{res});
}
```

Fri3dNstuff avatar Aug 17 '24 19:08 Fri3dNstuff

I copied @Fri3dNstuff's test and ran it on the machines I have access to. TL;DR: Stein wins every time, except in a couple of cases:

  • riscv64 without the Zbb extension specified. This falls back to a software ctz and is slightly slower.
  • The POWER10 cfarm server for some reason?

Some devices are too constrained to run perf, or it is disabled in their kernels. For those, time was used instead.

@andrewrk I could only ever get poop running on x86_64. Every other target gave errors along these lines:

/nix/store/nmq6p6jcmqdpj2bdccb0pna3k4fzpn0f-zig-0.14.0-dev.1261+c26206112/lib/std/posix.zig:7195:23: 0x106c59f in perf_event_open (poop)
/private/tmp/poop/src/main.zig:201:54: 0x106719b in main (poop)
/nix/store/nmq6p6jcmqdpj2bdccb0pna3k4fzpn0f-zig-0.14.0-dev.1261+c26206112/lib/std/start.zig:614:37: 0x105ba57 in posixCallMainAndExit (poop)
???:?:?: 0x0 in ??? (???)

Is this a known issue?

The-King-of-Toasters avatar Aug 24 '24 14:08 The-King-of-Toasters

https://lemire.me/blog/2024/04/13/greatest-common-divisor-the-extended-euclidean-algorithm-and-speed/ suggests a hybrid approach is likely superior to the purely binary approach; incidentally, that is also what libc++ has switched to.

scheibo avatar Aug 25 '24 19:08 scheibo

I recommend looking at the code for gcd in algorithmica. It claims a faster implementation than what this PR suggests, due to reduced data dependencies between instructions.

See my implementation in zig here: https://github.com/ProkopRandacek/zig/blob/better-gcd/lib/std/math/gcd.zig.

ProkopRandacek avatar Aug 25 '24 19:08 ProkopRandacek

@scheibo

https://lemire.me/blog/2024/04/13/greatest-common-divisor-the-extended-euclidean-algorithm-and-speed/ suggests a hybrid approach is likely superior to the purely binary approach; incidentally, that is also what libc++ has switched to.

translating the C++ code to Zig and comparing it with the Stein implementation on randomly generated values of different lengths yields nearly identical running times (at least on my machine). below is the translated code - maybe I missed something?

```zig
fn gcdHybrid(a: N, b: N) N {
    var x: N = a;
    var y: N = b;
    if (x < y) {
        const tmp = y;
        y = x;
        x = tmp;
    }
    if (y == 0) return x;
    x %= y;
    if (x == 0) return y;

    const i = @ctz(x);
    const j = @ctz(y);
    const shift = @min(i, j);
    x >>= @intCast(i);
    y >>= @intCast(j);

    while (true) {
        // underflow is intentional (wrapping subtraction)
        const diff = x -% y;
        if (x > y) {
            x = y;
            y = diff;
        } else {
            y -= x;
        }

        // diff != 0 guarantees @ctz(diff) < bit size, so the shift is safe
        if (diff != 0) y >>= @intCast(@ctz(diff));

        if (y == 0) return x << @intCast(shift);
    }
}
```
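A C transliteration of the same hybrid loop (one Euclidean remainder step, then the binary loop; illustrative, with a plain Euclid reference and a small xorshift PRNG for cross-checking - all names here are my own):

```c
#include <assert.h>
#include <stdint.h>

/* Euclid's algorithm as a reference implementation. */
static uint64_t gcd_ref(uint64_t a, uint64_t b) {
    while (b != 0) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* Hybrid gcd: one remainder step to shrink the larger operand,
   then the binary (Stein) loop. Unsigned wraparound stands in for
   Zig's -% operator; ctz is unaffected by the wrap. */
static uint64_t gcd_hybrid(uint64_t a, uint64_t b) {
    uint64_t x = a, y = b;
    if (x < y) { uint64_t t = x; x = y; y = t; }
    if (y == 0) return x;
    x %= y;
    if (x == 0) return y;
    int i = __builtin_ctzll(x);
    int j = __builtin_ctzll(y);
    int shift = i < j ? i : j;     /* shared powers of two */
    x >>= i;
    y >>= j;
    for (;;) {
        uint64_t diff = x - y;     /* wraps when x < y; intended */
        if (x > y) { x = y; y = diff; } else { y -= x; }
        if (diff != 0) y >>= __builtin_ctzll(diff);
        if (y == 0) return x << shift;
    }
}

/* xorshift64 PRNG for reproducible pseudo-random pairs. */
static uint64_t next(uint64_t *s) {
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}
```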

@ProkopRandacek

See my implementation in zig here: https://github.com/ProkopRandacek/zig/blob/better-gcd/lib/std/math/gcd.zig.

I have tested it against the Stein implementation in the pull request, with random u63 values for both (cast to u64s in the pull request implementation) and indeed your implementation runs 15 to 20 percent faster (!)

the algorithm, however, uses signed integers (expecting the caller to only pass in non-negatives) - do you know if there's a simple way to adapt the algorithm to accept and return unsigned numbers? calling an i65 version of the algorithm for full-range u64s and casting back the result seems very expensive...

Fri3dNstuff avatar Aug 26 '24 06:08 Fri3dNstuff

do you know if there's a simple way to adapt the algorithm to accept and return unsigned numbers

I don't think there is. And I think that in the name of performance it is more than reasonable to change the function signature from (a: anytype, b: anytype) to (a: i64, b: i64). You probably rarely need to calculate gcd of numbers that don't fit in this range.

Making the function clever and automatically casting u64 to i65 is probably bad because as you said it makes the code a lot slower.

ProkopRandacek avatar Aug 26 '24 08:08 ProkopRandacek

@Fri3dNstuff That's only because std::gcd can accept signed ints. Note that in the libc++ PR there is a static assert that checks whether the type of a or b is unsigned. We don't have to care about that.

Btw, I've updated my benchmark with your translation and it's better on all targets except my mipsel router for some reason.

The-King-of-Toasters avatar Aug 26 '24 09:08 The-King-of-Toasters

You probably rarely need to calculate gcd of numbers that don't fit in this range.

Big rationals say hi. :)

Of course, you can make the implementation use a different algorithm based on the type of the arguments to be able to use the faster path when possible

scheibo avatar Aug 26 '24 14:08 scheibo

You probably rarely need to calculate gcd of numbers that don't fit in this range.

Big rationals say hi. :)

On second thought, you're right.

Of course, you can make the implementation use a different algorithm based on the type of the arguments to be able to use the faster path when possible

My worry is that u64 is a common and generally fast data type, yet using it here gives you the slow implementation. Good design would let users fall into the pit of success and use the i64 version; only when they really need a larger int type, and know there is a cost, should they reach for the general implementation.

Here that could look like having 2 functions: gcd(i64, i64) and gcdSlow(anytype, anytype).

ProkopRandacek avatar Aug 26 '24 15:08 ProkopRandacek

I have played around with @ProkopRandacek's algorithm, attempting to make it use unsigned numbers. surprisingly, this new version is 18% faster on my machine. can anyone confirm that this version is indeed faster, and not just a fluke of my computer? if it is, I'll update the pull request with a generic version of this.

here's the algorithm. I ran both it and @ProkopRandacek's through 10,000,000 pairs of random u63s, wrap-summing their results (as described in a comment of mine above). both algorithms produce the same sum: 223029890 - the signed algorithm obviously produces an i64 as output, but the result is in the positive range, so they agree; we "got lucky" with that...

```zig
fn gcd(a: u64, b: u64) u64 {
    std.debug.assert(a != 0 or b != 0);

    var x = a;
    var y = b;

    if (x == 0) return y;
    if (y == 0) return x;

    var xz = @ctz(x);
    const yz = @ctz(y);
    const shift = @min(xz, yz);
    y >>= @intCast(yz);

    while (true) {
        x >>= @intCast(xz);
        var diff = y -% x;
        if (diff == 0) return y << @intCast(shift);
        xz = @ctz(diff);
        if (x > y) diff = -%diff;
        y = @min(x, y);
        x = diff;
    }
}
```

Fri3dNstuff avatar Aug 27 '24 05:08 Fri3dNstuff

I have refined the algorithm a bit further, and managed to squeeze a few more percent of performance out of it. I believe these improvements are applicable generally (and aren't just LLVM liking the new version better, specifically in the case of my machine) - but without tests on other machines I can't be sure...

changed the loop a bit, and shuffled some declarations around - this seems to help with register allocation. also, I think I understand why this version is faster than @ProkopRandacek's signed one: the negation of diff and the modification of y happen right beside each other; there is only a need to cmp x, y and cmov based on the result! below is the refined algorithm:

```zig
fn gcd(a: u64, b: u64) u64 {
    std.debug.assert(a != 0 or b != 0);

    var x = a;
    var y = b;

    if (x == 0) return y;
    if (y == 0) return x;

    const xz = @ctz(x);
    const yz = @ctz(y);
    const shift = @min(xz, yz);
    x >>= @intCast(xz);
    y >>= @intCast(yz);
    var diff = y -% x;

    while (diff != 0) {
        const zeros = @ctz(diff);
        if (x > y) diff = -%diff;
        y = @min(x, y);
        x = diff >> @intCast(zeros);
        diff = y -% x;
    }
    return y << @intCast(shift);
}
```
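For anyone wanting to sanity-check the refined loop outside Zig, here is a C transliteration (my own, illustrative), cross-checked against a plain Euclid reference; unsigned wraparound stands in for Zig's -% operators:

```c
#include <assert.h>
#include <stdint.h>

/* Euclid's algorithm as a reference implementation. */
static uint64_t gcd_ref(uint64_t a, uint64_t b) {
    while (b != 0) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* C transliteration of the refined unsigned binary-gcd loop. */
static uint64_t gcd_refined(uint64_t a, uint64_t b) {
    if (a == 0) return b;
    if (b == 0) return a;
    uint64_t x = a, y = b;
    int xz = __builtin_ctzll(x);
    int yz = __builtin_ctzll(y);
    int shift = xz < yz ? xz : yz;   /* shared powers of two */
    x >>= xz;
    y >>= yz;
    uint64_t diff = y - x;           /* may wrap; intended */
    while (diff != 0) {
        int zeros = __builtin_ctzll(diff);
        if (x > y) diff = 0 - diff;  /* wrapping negate */
        y = x < y ? x : y;           /* y = min(x, y): the cmp/cmov pair */
        x = diff >> zeros;
        diff = y - x;
    }
    return y << shift;
}

/* xorshift64 PRNG for reproducible pseudo-random pairs. */
static uint64_t next(uint64_t *s) {
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}
```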

for the generic case, I think it would be best to just use the usize version for integer widths no larger than usize and cast the result - x86, for example, only has ctz for 16, 32, and 64 bits, not 8. further, using exotic-width integers makes LLVM sprinkle &s everywhere for the wrapping subtractions.
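The widening idea can be sketched in C (illustrative; a plain Euclid gcd stands in for the fast wide routine): zero-extend to the native word, compute, and truncate back - safe because gcd(a, b) never exceeds max(a, b).

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit gcd (Euclid here, for brevity) as the "fast native" routine. */
static uint64_t gcd64(uint64_t a, uint64_t b) {
    while (b != 0) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* Widen-then-truncate wrapper for a narrow type: the result fits in
   uint8_t because gcd(a, b) <= max(a, b) < 256. */
static uint8_t gcd8(uint8_t a, uint8_t b) {
    return (uint8_t)gcd64((uint64_t)a, (uint64_t)b);
}
```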

does anyone know of an architecture where the algorithm works better for some integer length shorter than usize? I know nearly nothing beyond x86 :/

Fri3dNstuff avatar Aug 27 '24 08:08 Fri3dNstuff

@Fri3dNstuff seems to work fine for me. I altered the benchmark to generate ints that are 1 bit less than usize before extending to usize again (i.e. u31 -> u32, u63 -> u64). This new version still comes out on top.

I also added a build option to change the bit width of the random integers. They're extended to usize, but you can hack it to change it how you want.

The-King-of-Toasters avatar Aug 27 '24 13:08 The-King-of-Toasters

I changed the implementation to use the optimised version. @The-King-of-Toasters, thank you so much for the perf tests!

there is still the question of expanding the width of the integers used in the calculation, in case the user wants to use some exotic-length ints (currently the algorithm performs quite badly with them). LLVM generates a bunch of n & std.math.maxInt(@TypeOf(n))s to zero out parts of the number whenever -% is used...

we may have to make a table of the fast integer sizes for each architecture, and do some comptime switches for the implementation based on that (on my machine it seems to be fastest to always convert to u64s before the calculation, and @intCast back at the end)

Fri3dNstuff avatar Aug 28 '24 07:08 Fri3dNstuff