Finish removing the BigInts from * for FD{Int128}!
Finally implements the fast-multiplication optimization from https://github.com/JuliaMath/FixedPointDecimals.jl/pull/45! :)
This PR optimizes multiplication for FixedDecimal{Int64} and FixedDecimal{Int128}. In the process, we also undid an earlier optimization that is no longer needed after Julia 1.8, which makes multiplication about 2x faster for the smaller int types as well! 🎉
This is a follow-up to https://github.com/JuliaMath/FixedPointDecimals.jl/pull/93, which introduced an Int256 type for widemul. However, even after that PR, the fldmod still required 2 BigInt allocations.
Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int128 and for (U)Int256 (the latter briefly widening to Int512 😅) to perform the fldmod by the constant 10^f coefficient.
After this PR, FD multiplication performance scales linearly with the number of bits: FD{Int128} has no allocations and is only about 2x slower than the 64-bit version. :) And it makes all the other multiplications ~2x faster too.
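To illustrate the div-by-const idea this PR hand-implements for the wide integer types, here is a small self-contained sketch (the names `C`, `SHIFT`, `MAGIC`, `div_by_1000`, and `fldmod_by_1000` are illustrative, not the PR's actual API): division by a known constant is replaced by a widening multiply with a precomputed "magic" reciprocal followed by a shift, with no hardware division instruction. The sketch divides a UInt32 by 1000 via a UInt64 multiply; the PR does the analogous thing for (U)Int128/(U)Int256 by widening up to 512 bits.

```julia
# Divide by the constant 1000 without a division instruction:
# precompute MAGIC ≈ 2^SHIFT / 1000, then q = (x * MAGIC) >> SHIFT.
const C = UInt32(1000)
const SHIFT = 38                                  # 32 bits + 6 extra bits of precision
const MAGIC = UInt64(cld(UInt64(1) << SHIFT, C))  # ⌈2^38 / 1000⌉ = 274877907

# Widening multiply + shift; exact for every UInt32 input at this SHIFT.
div_by_1000(x::UInt32) = UInt32((UInt64(x) * MAGIC) >> SHIFT)

# fldmod against the constant coefficient, as needed after an FD multiply:
# recover the remainder from the exact quotient.
function fldmod_by_1000(x::UInt32)
    q = div_by_1000(x)
    return q, x - q * C
end
```

The extra precision bits in `SHIFT` are what make the truncated multiply agree with true division across the whole input range; the PR has to pick analogous constants for the 128- and 256-bit cases.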
master:
julia> using FixedPointDecimals, BenchmarkTools
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
83.750 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
84.916 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
249.083 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
4.660 ms (248829 allocations: 4.70 MiB)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)
# unsigned
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
56.791 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
60.000 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
172.958 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
5.408 ms (308621 allocations: 6.14 MiB)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)
After PR #93:
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
83.750 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
85.000 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
248.958 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
4.673 ms (160798 allocations: 3.22 MiB)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)
# unsigned
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
56.791 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
60.041 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
173.000 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
4.750 ms (190708 allocations: 4.82 MiB)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)
After this PR:
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
48.458 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
57.000 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
90.708 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
180.125 μs (0 allocations: 0 bytes)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)
# unsigned
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
42.708 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
51.250 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
80.042 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
162.417 μs (0 allocations: 0 bytes)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)
Huzzah! I finally feel like we can bring the ideas from https://github.com/JuliaMath/FixedPointDecimals.jl/pull/45 to this package, and they're still valuable. It turns out that LLVM already applies this optimization automatically for Int128 div-by-const (thus * for FixedDecimal{Int64} was already fast). But this PR adds support for FixedDecimal{Int128}, which needed a custom function for Int256 div-by-const. :)
See here, for my note on FixedDecimal{Int64} support: https://github.com/JuliaMath/FixedPointDecimals.jl/pull/45#issuecomment-2163836966
Now that Julia has the effects system, this doesn't need `@pure`, so it's actually possible to do this!
And now that we've introduced BitIntegers.jl, the code is way simpler. 😊 Woohoo!
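As a rough sketch of what the effects system enables here (hypothetical helper, not the PR's actual code): a pure helper computing the constant 10^f coefficient can be marked foldable, so calls with constant arguments constant-fold at compile time — the role `@pure` used to play, but with semantics the compiler can actually check.

```julia
# Hypothetical illustration: mark a coefficient helper as foldable so the
# compiler may constant-fold it for each (type, precision) pair, instead of
# relying on the old (unsafe) `@pure` annotation. Requires Julia 1.8+.
Base.@assume_effects :foldable coefficient(::Type{T}, f::Int) where {T<:Integer} = T(10)^f
```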
Update 1: It turned out that (of course) * is too fast for @btime to measure accurately on its own. I've adjusted the benchmark to perform repeated multiplications, and the true performance numbers are now included in the original comment above.
Update 2: It turns out that this is actually faster for all int types (>2x faster for FD{Int64}), even though LLVM was already doing a version of this optimization on its own. I'm honestly pretty surprised by that... From what I can tell, for FD{Int32} the generated code is identical except for two instruction changes: one changing the order of the registers passed to add (which shouldn't matter), and one changing a `hi` condition to a `gt` condition:
And yet somehow it's 25% faster (80µs -> 60µs).
Aha, the native-code difference comes from the whole benchmark function. Before, the multiply was outlined and reached through a function call; now it's inlined. ~~And I think the whole loop is constant-folded away, so it's actually sort of cheating. 😅 But still, allowing that optimization seems like a positive thing!~~ The inlining is the main change, I think, for FD{Int32}.
julia> function bench(fd)
           for _ in 1:10000
               fd = fd * fd
           end
           fd
       end
bench (generic function with 1 method)
# Before:
julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1163 ; -- Begin function julia_bench_1163
.p2align 2
_julia_bench_1163: ; @julia_bench_1163
; %bb.0: ; %guard_exit4
sub sp, sp, #48
stp x20, x19, [sp, #16] ; 16-byte Folded Spill
stp x29, x30, [sp, #32] ; 16-byte Folded Spill
ldr w0, [x0]
mov w19, #10000
Lloh0:
adrp x20, "_j_*_1165"@GOTPAGE
Lloh1:
ldr x20, [x20, "_j_*_1165"@GOTPAGEOFF]
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
str w0, [sp, #12]
add x0, sp, #12
add x1, sp, #12
blr x20
subs x19, x19, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit16
ldp x29, x30, [sp, #32] ; 16-byte Folded Reload
ldp x20, x19, [sp, #16] ; 16-byte Folded Reload
add sp, sp, #48
ret
.loh AdrpLdrGot Lloh0, Lloh1
; -- End function
.subsections_via_symbols
After:
julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1161 ; -- Begin function julia_bench_1161
.p2align 2
_julia_bench_1161: ; @julia_bench_1161
; %bb.0: ; %guard_exit7
ldr w0, [x0]
mov w8, #10000
mov x9, #57148
movk x9, #36175, lsl #16
movk x9, #28311, lsl #32
movk x9, #33554, lsl #48
mov x10, #-1000
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
umull x11, w0, w0
umulh x11, x11, x9
lsr x12, x11, #9
mul x13, x12, x10
umaddl x13, w0, w0, x13
ubfx x11, x11, #9, #1
cmp x13, #500
cset w13, hi
csel w11, w11, w13, eq
add w0, w11, w12
subs x8, x8, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit19
ret
; -- End function
.subsections_via_symbols
EDIT: And even with a runtime variable for the loop count, so the loop can't be optimized away, the new code is still just as much faster:
julia> @noinline function bench(fd, N)
           for _ in 1:N
               fd = fd * fd
           end
           fd
       end
bench (generic function with 2 methods)
julia> @btime bench($(FixedDecimal{UInt32,3}(1.234)), $10000)
48.416 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)
I think Tomas should be able to review when he's back from vacation. He's out until Thursday. :) Thanks for the offer, @omus! :)
Okay, after reviewing and integrating all of Tomas' changes, and cleaning things up, I think this is a great improvement and is ready to go! :) Tomas and I are going to look through it one more time, but otherwise I think this is fully ready for review. Thanks!
I also updated the PR comment with the final perf numbers, which look 👌
By the way, div/mod allocations are fixed in the most recent version of BitIntegers when running on 1.11+, see #48.
Oh wow, very nice! Thanks @EdsterG. 💪
This should still help quite a bit, but the baseline for comparison would be better in 1.11, then! 💪