Speed up * for FD{Int64} via custom `fldmod_by_const`
Add a custom implementation of `fldmod(x::T, y)` for when `y` is a constant, and call it whenever `T` is at least 128 bits wide.

This improves performance for `*(a::FD{Int64}, b::FD{Int64})`, because `*` passes the result of `widemul(a, b)` into `fldmod`, and LLVM isn't able to optimize the inlined `fldmod(::Int128, ::Const{Int128})` nearly as well as we'd like. So in that case we opt into a custom implementation of `fldmod`, which runs significantly faster, making the multiply faster.
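To see where that `fldmod` call sits, here is a hedged sketch of the multiply (in Python for illustration; the package itself is Julia, and the names and rounding here are simplified):

```python
def fd_mul_sketch(a: int, b: int, coef: int) -> int:
    """Multiply two fixed-decimal values stored as scaled integers.

    `a` and `b` stand in for the Int64 payloads of two FD{Int64,f} values,
    and `coef` for the scale coefficient 10^f."""
    wide = a * b                # stands in for widemul(a, b) -> Int128
    q, r = divmod(wide, coef)   # stands in for fldmod(wide, coef) -- the hot spot
    # Simplified round-half-up; the package actually rounds to nearest, ties to even.
    return q + (1 if 2 * r >= coef else 0)

# FD{Int64,2}: 1.50 * 2.50 == 3.75, stored internally as 375
print(fd_mul_sketch(150, 250, 100))  # -> 375
```

The `divmod` on the double-width product is the operation this PR replaces with a hand-optimized routine when the divisor is a compile-time constant.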
Specifically, the custom fldmod_by_const implementation is based on the assembly LLVM generates for fld(x, 100) when x is an Int64. One big reason LLVM can't emit similarly optimized code when x is an Int128 is that there isn't a single op to efficiently get the upper half of the "split-widemul". That requires adding a helper function, which I've called splitmul_upper.
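To make the trick concrete, here is a hedged Python sketch of both ideas (the real code is Julia; `split_upper` and `fld_by_const` are illustrative names, and signedness plus the word-size fixups of real magic-number division are simplified away by Python's big integers):

```python
def split_upper(a: int, b: int, bits: int) -> int:
    """Upper half of a double-width product, built from half-width pieces --
    the job done by the splitmul_upper helper described above."""
    mask = (1 << bits) - 1
    ah, al = a >> bits, a & mask
    bh, bl = b >> bits, b & mask
    t = ah * bl + ((al * bl) >> bits)
    u = al * bh + (t & mask)
    return ah * bh + (t >> bits) + (u >> bits)  # == (a * b) >> (2 * bits)

def fld_by_const(x: int, d: int, bits: int = 128) -> int:
    """Floor-divide a nonnegative `bits`-wide x by the constant d via a
    precomputed "magic" reciprocal and an upper-half multiply, instead of a
    hardware divide -- the optimization LLVM already applies on its own for
    64-bit fld(x, 100). Real fixed-width code must also make the magic fit a
    machine word (with a small shift/add fixup); we skip that detail here."""
    magic = -((-1 << (2 * bits)) // d)    # ceil(2^(2*bits) / d), precomputable
    return split_upper(x, magic, bits)    # == (x * magic) >> (2*bits) == x // d

print(fld_by_const(37500, 100, bits=64))  # -> 375
```

The point of `split_upper` is exactly the gap described above: once the input is 128 bits, there is no single instruction for the upper half of the product, so it has to be assembled from half-width multiplies.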
Note that this PR only directly improves performance for FD{Int64}: FD{Int128} still widens to BigInt, and we do not use the custom implementation in that case.
I'm following this up with https://github.com/RelationalAI-oss/FixedPointDecimals.jl/pull/7, which changes Int128 to widen to a 256-bit integer implementation, so that we can use the custom fldmod for FD{Int128} as well! :)
Unfortunately, I still haven't found a good solution for the automated benchmarks (I'm working on it!), but in the meantime I've re-run the benchmark locally. Here are the updated timings for just the FixedDecimal multiplication:
| | time (ns) | allocs |
|---|---|---|
| FD{Int32,2} | 2.09 | 0 |
| FD{Int64,2} | 8.25 | 0 |
| FD{Int128,2} | 2186.16 | 31000069 |
Compared to master (after merging https://github.com/JuliaMath/FixedPointDecimals.jl/pull/43), this PR reduces the time for FD{Int64,2} multiplication by 50%, down from the 16.61 ns reported here: https://github.com/JuliaMath/FixedPointDecimals.jl/pull/43#issue-236631643! :)
Coverage decreased (-3.2%) to 93.304% when pulling 11ee5e14899508dd706bfeeadf180dfb47c5c0da on RelationalAI-oss:rai-manual_mul_invcoeff into 315e5cb502b67aa5356d7ce56b0df94a00ec01c5 on JuliaMath:master.
Codecov Report
Merging #45 into master will decrease coverage by 3.12%. The diff coverage is 87.09%.

```diff
@@            Coverage Diff             @@
##           master      #45      +/-   ##
==========================================
- Coverage   96.49%   93.36%   -3.13%
==========================================
  Files           1        2       +1
  Lines         171      226      +55
==========================================
+ Hits          165      211      +46
- Misses          6       15       +9
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/FixedPointDecimals.jl | 95.85% <100%> (-0.64%) | :arrow_down: |
| src/fldmod_by_const.jl | 85.96% <85.96%> (ø) | |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 315e5cb...11ee5e1.
In this PR, we use the following check so that we only fall back to the custom implementation of fldmod when it's needed, i.e. when LLVM can't do the optimization itself:
https://github.com/JuliaMath/FixedPointDecimals.jl/pull/45/files#diff-ea4e0198fdaaab3db0fa802fdb33c161R178
It might look a little weird as currently phrased, because only FD{Int64} actually uses it: the Int64 widens to Int128, and then we call fldmod(::Int128, 100). In this PR, FD{Int128} doesn't get funneled into fldmod_by_const, because Int128 widens to BigInt.
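As a hedged restatement of that gating (the real check is Julia dispatch at the diff line linked above; the names here are made up for illustration):

```python
def custom_fldmod_by_const(x: int, d: int):
    # Placeholder standing in for the handwritten magic-number implementation.
    return divmod(x, d)

def fldmod_dispatch(x: int, d: int, widened_bits: int):
    """Only take the handwritten path when the widened intermediate is at
    least 128 bits wide; for divides of 64 bits or less, LLVM already emits
    good magic-number code on its own."""
    if widened_bits >= 128:
        return custom_fldmod_by_const(x, d)  # the optimized path this PR adds
    return divmod(x, d)                      # stands in for plain fldmod
```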
In the next PR (https://github.com/RelationalAI-oss/FixedPointDecimals.jl/pull/7), though, FD{Int128} uses the same implementation as well, because it widens to Int256, and then we call fldmod_by_const(Int256). Here are the performance results from that change: https://github.com/RelationalAI-oss/FixedPointDecimals.jl/pull/7#issuecomment-448774559.
Happy New Year you all! :) Hope you had a nice holiday. :)
I've cleaned this up a bit after some time away, and I think I've addressed all the outstanding comments! :) Please take another look when you get time.
Cheers!
I'll try to re-review this week. I'll note the CI is failing on Julia 0.6.
(Note: I've resolved the threads I've addressed. Any remaining unresolved comments have questions where I'm waiting for feedback. Thanks again for this thorough review!)
(One other change: I marked `coefficient` as `Base.@pure`, to ensure that it's always available at compile time to insert into the `Val()` statement in `fldmod_by_const`. Without this, I was seeing type instability for `f >= 3`.)
Excellent news! We can now close this PR because Julia+LLVM now do this optimization themselves, automatically! 🎉
*(::FD{Int64}, ::FD{Int64}) took ~8 ns before; since Julia 1.6 it takes 0.8 ns, and the fld-by-const is optimized away. 😎
Before: [benchmark screenshot]
After: [benchmark screenshot]
(The images are scaled to fit the browser width, but the total time is 10x less.)