Speed up * for FD{Int64} via custom `fldmod_by_const`
Add a custom implementation of `fldmod(x::T, y)` for when `y` is a constant, and call it whenever `T` is at least 128 bits wide.

This improves performance for `*(a::FD{Int64}, b::FD{Int64})`, because `*` passes the result of `widemul(a, b)` into `fldmod`, and LLVM isn't able to optimize the inlined `fldmod(::Int128, ::Const{Int128})` nearly as well as we'd like. So in that case we opt into a custom implementation of `fldmod`, which runs significantly faster, making the multiply faster.
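To see where that `fldmod` call sits, here is a hedged sketch of the multiply (in Python for illustration; the package itself is Julia, and the names and rounding here are simplified):

```python
def fd_mul_sketch(a: int, b: int, coef: int) -> int:
    """Multiply two fixed-decimal values stored as scaled integers.

    `a` and `b` stand in for the Int64 payloads of two FD{Int64,f} values,
    and `coef` for the scale coefficient 10^f."""
    wide = a * b                # stands in for widemul(a, b) -> Int128
    q, r = divmod(wide, coef)   # stands in for fldmod(wide, coef) -- the hot spot
    # Simplified round-half-up; the package actually rounds to nearest, ties to even.
    return q + (1 if 2 * r >= coef else 0)

# FD{Int64,2}: 1.50 * 2.50 == 3.75, stored internally as 375
print(fd_mul_sketch(150, 250, 100))  # -> 375
```

The `divmod` on the double-width product is the operation this PR replaces with a hand-optimized routine when the divisor is a compile-time constant.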
Specifically, the custom fldmod_by_const implementation is based on the assembly LLVM generates for fld(x, 100) when x is an Int64. One big reason LLVM can't emit similarly optimized code when x is an Int128 is that there isn't a single op to efficiently get the upper half of the "split-widemul". That requires adding a helper function, which I've called splitmul_upper.
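To make the trick concrete, here is a hedged Python sketch of both ideas (the real code is Julia; `split_upper` and `fld_by_const` are illustrative names, and signedness plus the word-size fixups of real magic-number division are simplified away by Python's big integers):

```python
def split_upper(a: int, b: int, bits: int) -> int:
    """Upper half of a double-width product, built from half-width pieces --
    the job done by the splitmul_upper helper described above."""
    mask = (1 << bits) - 1
    ah, al = a >> bits, a & mask
    bh, bl = b >> bits, b & mask
    t = ah * bl + ((al * bl) >> bits)
    u = al * bh + (t & mask)
    return ah * bh + (t >> bits) + (u >> bits)  # == (a * b) >> (2 * bits)

def fld_by_const(x: int, d: int, bits: int = 128) -> int:
    """Floor-divide a nonnegative `bits`-wide x by the constant d via a
    precomputed "magic" reciprocal and an upper-half multiply, instead of a
    hardware divide -- the optimization LLVM already applies on its own for
    64-bit fld(x, 100). Real fixed-width code must also make the magic fit a
    machine word (with a small shift/add fixup); we skip that detail here."""
    magic = -((-1 << (2 * bits)) // d)    # ceil(2^(2*bits) / d), precomputable
    return split_upper(x, magic, bits)    # == (x * magic) >> (2*bits) == x // d

print(fld_by_const(37500, 100, bits=64))  # -> 375
```

The point of `split_upper` is exactly the gap described above: once the input is 128 bits, there is no single instruction for the upper half of the product, so it has to be assembled from half-width multiplies.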
Note that this PR only directly improves performance for FD{Int64}: FD{Int128} still widens to BigInt, and we do not use the custom implementation in that case.
I'm following this up with https://github.com/RelationalAI-oss/FixedPointDecimals.jl/pull/7, which changes Int128 to widen to a 256-bit integer implementation, so that we can use the custom fldmod for FD{Int128} as well! :)
Unfortunately, I still haven't found a good solution for the automated benchmarks (I'm working on it!), but in the meantime I've re-run the benchmark locally. Here are the updated timings for just the FixedDecimal multiplication:
| | time (ns) | allocs |
|---|---|---|
| FD{Int32,2} | 2.09 | 0 |
| FD{Int64,2} | 8.25 | 0 |
| FD{Int128,2} | 2186.16 | 31000069 |
Compared to master (after merging https://github.com/JuliaMath/FixedPointDecimals.jl/pull/43), this PR reduces the time for FD{Int64,2} multiplication by 50%, down from the 16.61 ns reported here: https://github.com/JuliaMath/FixedPointDecimals.jl/pull/43#issue-236631643! :)
Coverage decreased (-3.2%) to 93.304% when pulling 11ee5e14899508dd706bfeeadf180dfb47c5c0da on RelationalAI-oss:rai-manual_mul_invcoeff into 315e5cb502b67aa5356d7ce56b0df94a00ec01c5 on JuliaMath:master.
Codecov Report
Merging #45 into master will decrease coverage by 3.12%. The diff coverage is 87.09%.

```diff
@@            Coverage Diff             @@
##           master      #45      +/-   ##
==========================================
- Coverage   96.49%   93.36%   -3.13%
==========================================
  Files           1        2       +1
  Lines         171      226      +55
==========================================
+ Hits          165      211      +46
- Misses          6       15       +9
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/FixedPointDecimals.jl | 95.85% <100%> (-0.64%) | :arrow_down: |
| src/fldmod_by_const.jl | 85.96% <85.96%> (ø) | |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 315e5cb...11ee5e1.
In this PR, we use the following check so that we only fall back to the custom implementation of fldmod when it's needed, i.e. when LLVM can't do the optimization itself:
https://github.com/JuliaMath/FixedPointDecimals.jl/pull/45/files#diff-ea4e0198fdaaab3db0fa802fdb33c161R178
It might look a little weird as currently phrased, because only FD{Int64} actually uses it: the Int64 widens to Int128, and then we call fldmod(::Int128, 100). In this PR, FD{Int128} doesn't get funneled into fldmod_by_const, because Int128 widens to BigInt.
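As a hedged restatement of that gating (the real check is Julia dispatch at the diff line linked above; the names here are made up for illustration):

```python
def custom_fldmod_by_const(x: int, d: int):
    # Placeholder standing in for the handwritten magic-number implementation.
    return divmod(x, d)

def fldmod_dispatch(x: int, d: int, widened_bits: int):
    """Only take the handwritten path when the widened intermediate is at
    least 128 bits wide; for divides of 64 bits or less, LLVM already emits
    good magic-number code on its own."""
    if widened_bits >= 128:
        return custom_fldmod_by_const(x, d)  # the optimized path this PR adds
    return divmod(x, d)                      # stands in for plain fldmod
```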
In the next PR (https://github.com/RelationalAI-oss/FixedPointDecimals.jl/pull/7), though, FD{Int128} uses the same implementation as well, because it widens to Int256, and then we call fldmod_by_const(Int256). Here are the performance results from that change: https://github.com/RelationalAI-oss/FixedPointDecimals.jl/pull/7#issuecomment-448774559.
Happy New Year you all! :) Hope you had a nice holiday. :)
I've cleaned this up a bit after some time away, and I think I've addressed all the outstanding comments! :) Please take another look when you get time.
Cheers!
I'll try to re-review this week. I'll note the CI is failing on Julia 0.6.
(Note: I've resolved the threads I've addressed. Any remaining unresolved comments have questions where I'm waiting for feedback. Thanks again for this thorough review!)
(One other change: I marked `coefficient` as `Base.@pure`, to ensure that it's always available at compile time to insert into the `Val()` statement in `fldmod_by_const`. Without this, I was seeing type instability for `f >= 3`.)
Excellent news! We can now close this PR because Julia+LLVM now do this optimization themselves, automatically! 🎉
*(::FD{Int64}, ::FD{Int64}) took ~8 ns before; since Julia 1.6 it takes 0.8 ns, and the fld-by-const is optimized away. 😎
Before: [benchmark screenshot]
After: [benchmark screenshot]
(The images are scaled to fit the browser width, but the total time is 10x less.)