latency: fix cat invalidations
This patch removes the invalidation of `cat` and thus reduces the OneHotArrays loading time from about 4.5 s back to about 0.5 s (its normal load time).

Loading OneHotArrays alone in a fresh Julia process takes about 0.57 s. However, loading LinearMaps first pushes the OneHotArrays load time up to about 5 s:
```julia
julia> @time using OneHotArrays # v0.2.4
  0.569807 seconds (539.89 k allocations: 36.725 MiB, 4.48% compilation time)
```
```julia
# in a new Julia process
# LinearMaps master branch
julia> @time using LinearMaps
  0.023610 seconds (6.77 k allocations: 501.328 KiB)

julia> @time using OneHotArrays
  5.038436 seconds (13.72 M allocations: 514.807 MiB, 2.32% gc time, 0.51% compilation time)
```
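For reference, a quick way to reproduce this fresh-process comparison without opening REPLs by hand; this is just a convenience sketch and assumes `julia` is on the PATH and both packages are installed in the active environment:

```julia
# Each run starts a clean process, so earlier loads don't pollute the timings.
run(`julia --startup-file=no -e '@time using OneHotArrays'`)
run(`julia --startup-file=no -e '@time using LinearMaps; @time using OneHotArrays'`)
```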
With SnoopCompile I noticed that loading LinearMaps invalidates `cat` methods, and OneHotArrays extends `cat`:
```julia
julia> using SnoopCompileCore

julia> invalidations = @snoopr begin
           using LinearMaps
       end
174-element Vector{Any}:
 MethodInstance for Core.kwcall(::@NamedTuple{dims::Val{1}}, ::typeof(cat), ::Vector{Base.PkgId}, ::Vararg{Any})
 "invalidate_mt_cache"
 ...
```
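To see which existing methods got invalidated and by which new method, the flat list can be grouped into trees; this is just the standard SnoopCompile follow-up step, not part of this patch:

```julia
using SnoopCompile  # the full package, on top of SnoopCompileCore

# Group the raw invalidation list by the method insertion that triggered it;
# the cat-related entries show up in these trees.
trees = invalidation_trees(invalidations)
```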
This patch tweaks the `cat` method extension to dispatch on `Base._cat` instead of `cat`, and somehow it works:
```julia
# also a new Julia process
# on this PR branch
julia> @time using LinearMaps
  0.012467 seconds (11.94 k allocations: 741.312 KiB)

julia> @time using OneHotArrays # 🎉 back to normal
  0.576323 seconds (539.17 k allocations: 36.551 MiB, 4.37% compilation time)
```
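In case the idea is not clear from the description alone, here is a rough sketch, not the literal diff of this PR, of what "dispatch on `_cat` instead of `cat`" means: attach the block-concatenation method to the internal function that `cat(X...; dims)` forwards to in current Julia versions, instead of adding yet another keyword method to the heavily used public `cat` entry point. The concrete signature and body below are illustrative assumptions only.

```julia
using LinearMaps

# Before (roughly): a new keyword method on the public entry point, which is
# what triggers the Core.kwcall invalidations shown above.
# Base.cat(As::Union{LinearMap,AbstractVecOrMat}...; dims::Val{1}) = vcat(As...)

# After (roughly): dispatch on the internal function that `cat` forwards to.
# Signature and body are illustrative, not the exact code in this PR.
Base._cat(dims::Val{1}, As::Union{LinearMap,AbstractVecOrMat}...) = vcat(As...)
```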
P.S. I figure this is related to https://github.com/JuliaLang/julia/pull/45028 (because of the `Core.kwcall` hint in the invalidations list), but I don't know why this works. Maybe @aviatesk or @timholy knows the magic behind it?
Commenting out the `Base.cat` method linked above in OneHotArrays also reduces the latency, but I figure it's better to fix it here, since OneHotArrays looks more innocent, while this package's code generates many `cat` methods in a seemingly wild way.
FWIW, if we reduce `for k in 1:8` to something like `for k in 1:4`, the latency isn't as severe either. As an experiment in the opposite direction:
```diff
diff --git a/src/blockmap.jl b/src/blockmap.jl
index 72c2bb9..49ba024 100644
--- a/src/blockmap.jl
+++ b/src/blockmap.jl
@@ -497,7 +497,7 @@ BlockDiagonalMap(maps::LinearMap...) =
 # since the below methods are more specific than the Base method,
 # they would redefine Base/SparseArrays behavior
-for k in 1:8 # is 8 sufficient?
+for k in 1:16 # is 8 sufficient?
     Is = ntuple(n->:($(Symbol(:A, n))::AbstractVecOrMatOrQ), Val(k-1))
     # yields (:A1, :A2, :A3, ..., :A(k-1))
     L = :($(Symbol(:A, k))::LinearMap)
```
This makes `using OneHotArrays` take forever.
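To make the explosion concrete, here is a small standalone illustration (not LinearMaps source code) of the expansion pattern in that loop: for each `k`, the first `k-1` arguments are plain arrays and the `k`-th is the first `LinearMap`, so the number of generated signatures, and hence the method competition on the concatenation functions, grows with the upper bound.

```julia
# Print the argument-type tuples the loop would splice into method definitions.
# `AbstractVecOrMat` is used here for simplicity instead of `AbstractVecOrMatOrQ`.
for k in 1:3
    Is = ntuple(n -> :($(Symbol(:A, n))::AbstractVecOrMat), Val(k - 1))
    L = :($(Symbol(:A, k))::LinearMap)
    # prints tuples like (A1::AbstractVecOrMat, A2::LinearMap)
    println(Expr(:tuple, Is..., L))
end
```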
Codecov Report
Attention: Patch coverage is 50.00000% with 4 lines in your changes missing coverage. Please review.
Project coverage is 99.43%. Comparing base (c7a3c56) to head (0956320).
:exclamation: Current head 0956320 differs from pull request most recent head da9eded
Please upload reports for the commit da9eded to get more accurate results.
| Files | Patch % | Lines |
|---|---|---|
| src/blockmap.jl | 50.00% | 4 Missing :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #216      +/-   ##
==========================================
+ Coverage   99.30%   99.43%   +0.12%
==========================================
  Files          22       22
  Lines        1591     1595       +4
==========================================
+ Hits         1580     1586       +6
+ Misses         11        9       -2
```
First I was afraid this would bring back #123 (see #124 for the offending method signature), but this works since you keep the method signature as is. Unfortunately, on my machine, this PR increases loading time from 0.05 seconds to about 7 seconds on v1.9!
Looks like this isn't the right fix, then. I'll try to see if there's a better solution.
Interestingly, this seems to work well on Julia v1.10+. Unfortunately, the load time of OneHotArrays.jl doubles from v1.10 to v1.11 and master even without any interference from LinearMaps.jl. But this PR increases the load time of LinearMaps.jl only slightly, and it decreases the load time of OneHotArrays.jl after LinearMaps.jl dramatically (by roughly a factor of 8).
I like this, and I think we should push it even further: ride the generic call chain and hook in somewhere much deeper, where there is almost no method competition. It doesn't work yet, but I've pushed it so I can continue working on it from work tomorrow.