Finish removing the BigInts from * for FD{Int128}!
Finally implements the fast-multiplication optimization from https://github.com/JuliaMath/FixedPointDecimals.jl/pull/45! :)
This PR optimizes multiplication for FixedDecimal{Int64} and FixedDecimal{Int128}. In the process, we also undid an earlier optimization that is no longer needed after Julia 1.8, which makes multiplication about 2x faster for the smaller int types as well! 🎉
This is a follow-up to https://github.com/JuliaMath/FixedPointDecimals.jl/pull/93, which introduced an Int256 type for widemul. However, even after that PR, the fldmod still required 2 BigInt allocations.
Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int128 and for (U)Int256 (the latter briefly widening to Int512 😅) to perform the fldmod by the constant 10^f coefficient.
After this PR, FD multiplication performance scales linearly with the number of bits: FD{Int128} has no allocations and is only about 2x slower than the 64-bit version. :) And it makes all the other multiplications ~2x faster too.
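To illustrate the div-by-const idea this PR hand-implements for the wide integer types, here is a small self-contained sketch (the names `C`, `SHIFT`, `MAGIC`, `div_by_1000`, and `fldmod_by_1000` are illustrative, not the PR's actual API): division by a known constant is replaced by a widening multiply with a precomputed "magic" reciprocal followed by a shift, with no hardware division instruction. The sketch divides a UInt32 by 1000 via a UInt64 multiply; the PR does the analogous thing for (U)Int128/(U)Int256 by widening up to 512 bits.

```julia
# Divide by the constant 1000 without a division instruction:
# precompute MAGIC ≈ 2^SHIFT / 1000, then q = (x * MAGIC) >> SHIFT.
const C = UInt32(1000)
const SHIFT = 38                                  # 32 bits + 6 extra bits of precision
const MAGIC = UInt64(cld(UInt64(1) << SHIFT, C))  # ⌈2^38 / 1000⌉ = 274877907

# Widening multiply + shift; exact for every UInt32 input at this SHIFT.
div_by_1000(x::UInt32) = UInt32((UInt64(x) * MAGIC) >> SHIFT)

# fldmod against the constant coefficient, as needed after an FD multiply:
# recover the remainder from the exact quotient.
function fldmod_by_1000(x::UInt32)
    q = div_by_1000(x)
    return q, x - q * C
end
```

The extra precision bits in `SHIFT` are what make the truncated multiply agree with true division across the whole input range; the PR has to pick analogous constants for the 128- and 256-bit cases.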
master:
julia> using FixedPointDecimals, BenchmarkTools
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
83.750 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
84.916 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
249.083 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
4.660 ms (248829 allocations: 4.70 MiB)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)
# unsigned
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
56.791 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
60.000 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
172.958 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
5.408 ms (308621 allocations: 6.14 MiB)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)
After PR #93:
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
83.750 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
85.000 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
248.958 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
4.673 ms (160798 allocations: 3.22 MiB)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)
# unsigned
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
56.791 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
60.041 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
173.000 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
4.750 ms (190708 allocations: 4.82 MiB)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)
After this PR:
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
48.458 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
57.000 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
90.708 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
180.125 μs (0 allocations: 0 bytes)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)
# unsigned
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
42.708 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
51.250 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
80.042 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)
julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
162.417 μs (0 allocations: 0 bytes)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)
Huzzah! I finally feel like we can bring the ideas from https://github.com/JuliaMath/FixedPointDecimals.jl/pull/45 to this package, and they're still valuable. It turns out that LLVM already applies this optimization automatically for Int128 div-by-const (thus * for FixedDecimal{Int64} was already fast). But this PR adds support for FixedDecimal{Int128}, which needed a custom function for Int256 div-by-const. :)
See here, for my note on FixedDecimal{Int64} support: https://github.com/JuliaMath/FixedPointDecimals.jl/pull/45#issuecomment-2163836966
Now that Julia has the effects system, this doesn't need `@pure`, so it's actually possible to do this!
And now that we've introduced BitIntegers.jl, the code is way simpler. 😊 Woohoo!
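As a rough sketch of what the effects system enables here (hypothetical helper, not the PR's actual code): a pure helper computing the constant 10^f coefficient can be marked foldable, so calls with constant arguments constant-fold at compile time — the role `@pure` used to play, but with semantics the compiler can actually check.

```julia
# Hypothetical illustration: mark a coefficient helper as foldable so the
# compiler may constant-fold it for each (type, precision) pair, instead of
# relying on the old (unsafe) `@pure` annotation. Requires Julia 1.8+.
Base.@assume_effects :foldable coefficient(::Type{T}, f::Int) where {T<:Integer} = T(10)^f
```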
Update 1: It turned out that (of course) * is too fast for @btime to measure accurately on its own. I've adjusted the benchmark to perform repeated multiplications, and the true performance numbers are now included in the original comment above.
Update 2: It turns out that this is actually faster for all int types (>2x faster for FD{Int64}), even though LLVM was already doing a version of this optimization on its own. I'm honestly pretty surprised by that... From what I can tell, for FD{Int32} the generated code is identical except for two instruction changes: one changing the order of the registers passed to add (which shouldn't matter), and one changing a `hi` condition to a `gt` condition:
And yet somehow it's 25% faster (80µs -> 60µs).
Aha, the native-code difference comes from the whole benchmark function. Before, the multiply was outlined and reached through a function call; now it's inlined. ~~And I think the whole loop is constant-folded away, so it's actually sort of cheating. 😅 But still, allowing that optimization seems like a positive thing!~~ The inlining is the main change, I think, for FD{Int32}.
julia> function bench(fd)
           for _ in 1:10000
               fd = fd * fd
           end
           fd
       end
bench (generic function with 1 method)
# Before:
julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1163 ; -- Begin function julia_bench_1163
.p2align 2
_julia_bench_1163: ; @julia_bench_1163
; %bb.0: ; %guard_exit4
sub sp, sp, #48
stp x20, x19, [sp, #16] ; 16-byte Folded Spill
stp x29, x30, [sp, #32] ; 16-byte Folded Spill
ldr w0, [x0]
mov w19, #10000
Lloh0:
adrp x20, "_j_*_1165"@GOTPAGE
Lloh1:
ldr x20, [x20, "_j_*_1165"@GOTPAGEOFF]
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
str w0, [sp, #12]
add x0, sp, #12
add x1, sp, #12
blr x20
subs x19, x19, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit16
ldp x29, x30, [sp, #32] ; 16-byte Folded Reload
ldp x20, x19, [sp, #16] ; 16-byte Folded Reload
add sp, sp, #48
ret
.loh AdrpLdrGot Lloh0, Lloh1
; -- End function
.subsections_via_symbols
After:
julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1161 ; -- Begin function julia_bench_1161
.p2align 2
_julia_bench_1161: ; @julia_bench_1161
; %bb.0: ; %guard_exit7
ldr w0, [x0]
mov w8, #10000
mov x9, #57148
movk x9, #36175, lsl #16
movk x9, #28311, lsl #32
movk x9, #33554, lsl #48
mov x10, #-1000
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
umull x11, w0, w0
umulh x11, x11, x9
lsr x12, x11, #9
mul x13, x12, x10
umaddl x13, w0, w0, x13
ubfx x11, x11, #9, #1
cmp x13, #500
cset w13, hi
csel w11, w11, w13, eq
add w0, w11, w12
subs x8, x8, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit19
ret
; -- End function
.subsections_via_symbols
EDIT: And even with a runtime variable for the loop count, so the loop can't be optimized away, the new code is still just as much faster:
julia> @noinline function bench(fd, N)
           for _ in 1:N
               fd = fd * fd
           end
           fd
       end
bench (generic function with 2 methods)
julia> @btime bench($(FixedDecimal{UInt32,3}(1.234)), $10000)
48.416 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)
I think Tomas should be able to review when he's back from vacation. He's out until Thursday. :) Thanks for the offer, @omus! :)
Okay, after reviewing and integrating all of Tomas' changes, and cleaning things up, I think this is a great improvement and is ready to go! :) Tomas and I are going to look through it one more time, but otherwise I think this is fully ready for review. Thanks!
I also updated the PR comment with the final perf numbers, which look 👌
By the way, div/mod allocations are fixed in the most recent version of BitIntegers when running on 1.11+, see #48.
Oh wow, very nice! Thanks @EdsterG. 💪
This should still help quite a bit, but the baseline for comparison would be better in 1.11, then! 💪