
Speeding up "time to first result"?

Open fingolfin opened this issue 1 year ago • 12 comments

On my M1 Mac with Julia 1.8-rc1:

julia> @time @eval maximal_order(quadratic_field(-1)[1]);
  5.577463 seconds (2.85 M allocations: 140.472 MiB, 0.34% gc time, 99.97% compilation time)

julia> @time @eval maximal_order(quadratic_field(-1)[1]);
  0.000288 seconds (121 allocations: 4.948 KiB)

This is not a simple matter of caching, as it is also fast with different arguments, e.g.

julia> @time @eval maximal_order(quadratic_field(-3)[1]);
  0.003174 seconds (635 allocations: 30.745 KiB)

Most of this time seems to be spent in compiling. Indeed, starting Julia with -O1 things are much faster:

julia> @time @eval maximal_order(quadratic_field(-1)[1]);
  2.370229 seconds (554.39 k allocations: 25.480 MiB, 99.94% compilation time)

I've used @snoopl from SnoopCompile to get some insights into this. So I did

@snoopl "func_names.csv" "llvm_timings.yaml" begin
    using Hecke
    maximal_order(quadratic_field(-1)[1])
end

The resulting data files can be loaded separately:

julia> times, info = SnoopCompile.read_snoopl("func_names.csv", "llvm_timings.yaml");

julia> sort(times)[end][1]   # where is most time spent?
0.428102083

julia> sort(times)[end-1][1]  # second worst place?
0.068432708

julia> sum(k for (k,v) in times)
5.172362951000003
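
(The number of compiled method instances can be read off the same data; assuming times has one entry per LLVM-compiled method instance, it is simply its length:)

julia> length(times)   # number of LLVM compilation entries recorded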

So yeah, we spend a lot of time in compilation. It compiles 999 method instances, it seems.

Unfortunately there don't seem to be many functions to help analyze this further...

Attached is the list of method instances, sorted alphabetically; perhaps this already gives some ideas... methods.txt

fingolfin avatar Jul 08 '22 15:07 fingolfin

As @thofma pointed out: it may well be that in the end, we can't do anything about this other than building a custom baseimage. Quite true, but perhaps @noinline and @nospecialize or some other tricks can also help at least a bit?
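
For illustration, such a hint could look like this (a hypothetical helper, not actual Hecke code):

# Hypothetical sketch: @nospecialize stops Julia from compiling a fresh
# specialization of this method for every concrete argument type, and
# @noinline keeps the (now slower) body out of callers.
@noinline function _describe_ideal(@nospecialize(I))
    return string(typeof(I))
end

The trade-off is that calls inside such a method go through dynamic dispatch, so this only pays off on paths that are not performance-critical.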

fingolfin avatar Jul 08 '22 15:07 fingolfin

@fingolfin @fieker There is a PR over at julia, and someone (don't want to ping them) used Hecke as a benchmark: https://github.com/JuliaLang/julia/pull/44527#issuecomment-1283106385. The improved timings are impressive.

P.S.: Caching the compiled stuff will yield

/Users/mose/.julia/compiled/v1.9/Hecke/lhBAK_LpNp2.dylib
-rwxr-xr-x  1 mose  staff   131M 20 Oct 00:39 

Amazing how much stuff we have.

thofma avatar Oct 29 '22 09:10 thofma

P.S.: Caching the compiled stuff will yield

/Users/mose/.julia/compiled/v1.9/Hecke/lhBAK_LpNp2.dylib
-rwxr-xr-x  1 mose  staff   131M 20 Oct 00:39 

Amazing how much stuff we have.

People are looking into reducing the bloat. From what I heard, most of those libraries include serialised types in the data section; the machine code is only a fraction of the whole library.

giordano avatar Oct 29 '22 15:10 giordano

Just tried https://github.com/JuliaLang/julia/pull/47184 and with it, I get:

du -hs ~/.julia/compiled/v1.10/Hecke/*
 79M	/Users/mhorn/.julia/compiled/v1.10/Hecke/lhBAK_YE8QW.dylib
5.2M	/Users/mhorn/.julia/compiled/v1.10/Hecke/lhBAK_YE8QW.ji
 79M	/Users/mhorn/.julia/compiled/v1.10/Hecke/lhBAK_kKnCR.dylib
5.2M	/Users/mhorn/.julia/compiled/v1.10/Hecke/lhBAK_kKnCR.ji

which is already better (and in the end, who cares). It's really amazing how much faster it makes things. Indeed:

julia> @time @eval maximal_order(quadratic_field(-1)[1]);
  0.060866 seconds (81.06 k allocations: 5.365 MiB, 111.94% compilation time)

julia> @time @eval maximal_order(quadratic_field(-1)[1]);
  0.000282 seconds (124 allocations: 5.183 KiB)

So instead of 5.5 seconds for the first call, it's now 0.06 seconds. Fantastic!

Unfortunately, we don't benefit as much in Oscar, due to CxxWrap: if I first load Hecke and then load CxxWrap, we get this:

julia> using CxxWrap

julia> @time @eval maximal_order(quadratic_field(-1)[1]);
  3.255026 seconds (3.88 M allocations: 259.868 MiB, 1.72% gc time, 115.45% compilation time: 85% of which was recompilation)

I.e. CxxWrap invalidates tons of stuff. I've reported this before at the CxxWrap repository, but zero reaction so far. I think we'll have to handle this ourselves if we want to see it improved... AFAICT most (?) of the overhead comes from the TONS of automagic convert methods in CxxWrap, several of which apply far too generically. If we removed or restricted many of these, I think we'd get better results. Unfortunately this might break some CxxWrap users (including ours)... And in the end, it might not be possible w/o stripping out a major part of the CxxWrap functionality (all those automagic cconvert calls...)

Another hypothetical "solution" would be to have AbstractAlgebra and GAP.jl depend on CxxWrap and also load it -- of course nobody wants that, I am just saying it would solve the issue (I verified that it does), because then all those invalidations introduced by CxxWrap would already have happened at precompile time for all (?) our packages.

fingolfin avatar Dec 07 '22 08:12 fingolfin

Indeed, starting Julia with -O1 things are much faster:

You can force your module to always use -O1 (or even -O0 and/or --compile=min) if helpful. It helps with TTFX, but would you be worried about runtime speed after that with lower optimization? Is the speed acceptable on the default setting, or, as far as you know, on that non-default one? I'm not sure about this package, but is it mostly used interactively, and would even a 2x slowdown after first use (not saying that will happen, just asking about your tolerance) be ok? The PR for it is trivial (see how it's done in e.g. Plots.jl, or ask me, maybe I'll do a PR; you may need to add it for some or all of your dependencies in addition or instead). It's great to see 92x faster first use because of the new PR in Base! Still:
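
For reference, the per-module override alluded to here is Base.Experimental.@optlevel (this is how Plots.jl does it); a sketch, with the feature-detection guard packages commonly use:

module Hecke

# Compile this module's methods at -O1: the effective level is the
# minimum of the global -O setting and this value, so it can lower,
# but never raise, the optimization level.
if isdefined(Base, :Experimental) && isdefined(Base.Experimental, Symbol("@optlevel"))
    @eval Base.Experimental.@optlevel 1
end

# ... rest of the module ...

end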

Unfortunately, we don't benefit as much in Oscar, due to CxxWrap [..] I.e. CxxWrap invalidates tons of stuff

As you describe it, the order of using affects/causes the invalidations.

I don't know for sure about the invalidations, but I have a feeling they have a lot to do with inlining, which is on by default at -O2. You can check just that with:

julia -O2 --inline=no

I don't know exactly what the different optimization levels, e.g. -O1, mean (I don't think it's documented; maybe in LLVM, as the level might just be passed through to it). At that level, or at least at -O0, I'm pretty sure inlining is off, or at least the threshold is lowered. Compilation need not be slow at all (it isn't in fully non-optimizing compilers). With no inlining I have a hard time seeing a reason for invalidations.

However, the best optimization implies "whole program optimization", i.e. inlining, and that means compilation is always going to be slow unless the resulting machine code is stored. We are getting there now in Julia, for packages and the sysimage, but I think each package is cached individually, and that rules out inlining across packages (as for Python packages written in C, which seem plenty fast enough). I'm not sure if the new package-image caching forbids such inlining, or, more likely, in some cases still has to reoptimize/invalidate because of the inlining your default optimization allows.

PallHaraldsson avatar Dec 07 '22 10:12 PallHaraldsson

I am working on various fixes for CxxWrap which considerably improve the situation for me. See this comment for an overview.

As a datapoint, the timings printed by

using Hecke; using CxxWrap ; @time @eval maximal_order(quadratic_field(-3)[1]);

give the following timings (in seconds) and allocation counts, when run with various Julia versions and CxxWrap pull requests, and with the latest AbstractAlgebra DEV (which has some invalidation fixes).

Note that timings fluctuate quite a bit, and I did just two measurements each (the first one triggered precompilation and "warmed the caches" and I then noted down the second timings). So only take the relative order of magnitude into account. The number of allocations of course is stable, and thus perhaps a better proxy for the actual impact.

Configuration       Julia 1.8 seconds   Julia 1.8 allocations   Julia master (a40e24e1) seconds   Julia master allocations
CxxWrap main        7.565491            10.49 M                 1.996047                          2.50 M
with #335           7.554132            10.43 M                 2.054217                          2.50 M
with #337           7.382975            10.49 M                 1.991450                          2.50 M
with #338           7.069640            3.37 M                  1.190185                          1.32 M
with #335 + #337    7.550222            10.52 M                 2.032760                          2.50 M
with #335 + #338    6.759887            4.76 M                  0.025311                          74.92 k
with #337 + #338    6.720705            3.37 M                  1.218763                          1.32 M
with all three      6.944414            4.77 M                  0.026655                          74.92 k

So https://github.com/JuliaInterop/CxxWrap.jl/pull/338 has a major impact across the board, but only together with https://github.com/JuliaInterop/CxxWrap.jl/pull/335 does it achieve its full potential. Meanwhile https://github.com/JuliaInterop/CxxWrap.jl/pull/337 seems not that important (though I think this depends on what exactly you measure -- it definitely does get rid of invalidations).

Interestingly (and annoyingly) with Julia 1.8.4, it seems better to leave out https://github.com/JuliaInterop/CxxWrap.jl/pull/335 ... I have no idea so far why (but I also haven't made any efforts to find out so far).

fingolfin avatar Dec 28 '22 23:12 fingolfin

I'm a bit confused: in https://github.com/JuliaLang/julia/pull/44527#issuecomment-1283106385 I got a time of the order of 0.02 seconds without having to do anything, what changed since?

giordano avatar Dec 28 '22 23:12 giordano

You got those numbers without the using CxxWrap.

fingolfin avatar Dec 28 '22 23:12 fingolfin

Oh, cool 😄 So your work is basically to make sure loading also CxxWrap doesn't defeat all the gains from native code caching?

giordano avatar Dec 28 '22 23:12 giordano

@giordano indeed. I've long had my sights on CxxWrap as a likely culprit for slowdowns / invalidations, see https://github.com/JuliaInterop/CxxWrap.jl/issues/278 -- but past attempts had mixed to no effect on timings (as my table above illustrates ;-) ). Now with native code caching it suddenly becomes possible to benefit, yay!

fingolfin avatar Dec 29 '22 01:12 fingolfin

Just had a look at this again. Sadly, it seems the situation has gotten worse again.

With Julia 1.10.3, Hecke master and CxxWrap 0.15.1:

  • using Hecke ; @time @eval maximal_order(quadratic_field(-3)[1]); reports

      0.010076 seconds (16.18 k allocations: 1.077 MiB, 91.93% compilation time)

  • using Hecke; using CxxWrap ; @time @eval maximal_order(quadratic_field(-3)[1]); reports

      6.100303 seconds (19.22 M allocations: 1.274 GiB, 9.22% gc time, 99.88% compilation time: 48% of which was recompilation)

So I inspected the output of this:

using Hecke
using SnoopCompileCore
invalidations = @snoopr begin using CxxWrap end
using SnoopCompile
trees = invalidation_trees(invalidations)

which reports thousands of invalidations. I disabled the top method causing these and repeated this process one time. Afterwards the timings were fine again.
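
Disabling a method interactively for such an experiment can be done with Base.delete_method; here is a toy sketch of the technique (not the actual CxxWrap method):

# Toy sketch: define an overly broad conversion method, then remove it
# again at runtime, mimicking "disable the top invalidating method".
struct MyNum
    x::Int
end
(::Type{T})(n::MyNum) where {T<:Number} = T(n.x)

Int(MyNum(3))                                 # uses the broad method
Base.delete_method(which(Int, Tuple{MyNum}))  # Int(MyNum(3)) now errors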

For reference here are the changes I made to CxxWrap to avoid the invalidations in this specific example (not claiming I covered all):

diff --git a/src/CxxWrap.jl b/src/CxxWrap.jl
index 421bffe..d902fd4 100644
--- a/src/CxxWrap.jl
+++ b/src/CxxWrap.jl
@@ -118,7 +118,6 @@ Base.AbstractFloat(x::CxxNumber) = Base.AbstractFloat(to_julia_int(x))
 # Convenience constructors
 (::Type{T})(x::Number) where {T<:CxxNumber} = reinterpret(T, convert(julia_int_type(T), x))
 (::Type{T})(x::Rational) where {T<:CxxNumber} = reinterpret(T, convert(julia_int_type(T), x))
-(::Type{T1})(x::T2) where {T1<:Number, T2<:CxxNumber} = T1(reinterpret(julia_int_type(T2), x))::T1
 (::Type{T1})(x::T2) where {T1<:CxxNumber, T2<:CxxNumber} = T1(reinterpret(julia_int_type(T2), x))::T1
 Base.Bool(x::T) where {T<:CxxNumber} = Bool(reinterpret(julia_int_type(T), x))::Bool
 (::Type{T})(x::T) where {T<:CxxNumber} = x
@@ -137,7 +136,6 @@ end
 
 # Enum type interface
 abstract type CppEnum <: Integer end
-(::Type{T})(x::CppEnum) where {T <: Integer} = T(reinterpret(Int32, x))::T
 (::Type{T})(x::Integer) where {T <: CppEnum} = reinterpret(T, Int32(x))
 (::Type{T})(x::T) where {T <: CppEnum} = x
 import Base: +, |

Of course I am also not saying that making this change to CxxWrap is a solution (it would probably break functionality), nor even that CxxWrap is at fault (I dunno who is -- CxxWrap, Hecke, the Julia stdlib, ...)

fingolfin avatar May 06 '24 11:05 fingolfin

For the record, here are the top invalidations:

 inserting (::Type{T1})(x::T2) where {T1<:Number, T2<:Union{CxxSigned, CxxUnsigned}} @ CxxWrap.CxxWrapCore ~/Projekte/Julia/packages/JuliaInterop/CxxWrap.jl/src/CxxWrap.jl:121 invalidated:
   mt_backedges:  1: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for findnext(::typeof(Base.JuliaSyntax._has_nested_error), ::AbstractString, ::Integer) (0 children)
                  2: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for resize!(::BitVector, ::Integer) (0 children)
                  3: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for findnext(::Hecke.var"#4569#4571", ::AbstractString, ::Integer) (0 children)
                  4: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for findnext(::Hecke.var"#4250#4251", ::AbstractString, ::Integer) (0 children)
                  5: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for Nemo._generic_power(::AbsSimpleNumFieldOrderIdeal, ::Integer) (0 children)
                  6: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for findnext(::Hecke.var"#4260#4261", ::AbstractString, ::Integer) (0 children)
                  7: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for findnext(::Hecke.var"#1995#1997", ::AbstractString, ::Integer) (0 children)
                  8: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for findnext(::Hecke.var"#4253#4254", ::AbstractString, ::Integer) (0 children)
                  9: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for Base._array_for(::Type{T}, ::Base.HasLength, ::Integer) where T<:AbsNumFieldOrderIdeal (0 children)
                 10: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for unsafe_load(::Ptr{UInt64}, ::Integer) (1 children)
                 11: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for SubString(::AbstractString, ::Int64, ::Integer) (1 children)
                 12: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for pointer(::String, ::Integer) (2 children)
                 13: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for unsafe_store!(::Ptr{UInt16}, ::UInt16, ::Integer) (2 children)
                 14: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for unsafe_store!(::Ptr{UInt8}, ::UInt8, ::Integer) (2 children)
                 15: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for unsafe_store!(::Ptr{UInt32}, ::UInt32, ::Integer) (2 children)
                 16: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for Base.Iterators.partition(::UnitRange{Int64}, ::Integer) (3 children)
                 17: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for Base.to_shape(::Base.OneTo) (27 children)
                 18: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for convert(::Type{Int64}, ::Integer) (596 children)
                 19: signature Tuple{Type{UInt64}, Integer} triggered MethodInstance for convert(::Type{UInt64}, ::Integer) (3369 children)
   backedges: 1: superseding Integer(x::Integer) @ Core boot.jl:817 with MethodInstance for Integer(::Integer) (2 children)

Of particular note:

  • convert(::Type{Int64}, ::Integer) (596 children)
  • convert(::Type{UInt64}, ::Integer) (3369 children)

I then tried to use JET to find out more:

julia> using JET

julia> ascend(trees[end].mt_backedges[end][2])
Choose a call for analysis (q to quit):
     convert(::Type{UInt64}, ::Integer)
       cconvert(::Type{UInt64}, ::Integer)
         _growend!(::Vector, ::Integer)
           append!(::Vector, ::AbstractVector)
 >           _unit_group_generators_quaternion(::Union{Hecke.AlgAssAbsOrd, Hecke.AlgAssRelOrd})
               _unit_group_generators_maximal_simple(::Any)
                 __unit_reps_simple(::Any, ::Hecke.AlgAssAbsOrdIdl{StructureConstantAlgebra{QQFieldElem}, AssociativeAlgebraElem{QQFieldElem, StructureConstantAlgebra{QQFieldElem}}})
...

which suggests a type stability issue in Hecke (for this very first instance): in the call to append! it can't deduce the type of the second argument well enough, hence it can't prove that length applied to that second argument will produce an Int, hence it goes through general dispatch.
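
The mechanism can be reproduced in miniature (a toy example, unrelated to Hecke), using Base.inferencebarrier to hide the argument type from inference, just as the abstract type of O does below:

# Toy illustration: the barrier makes the type of xs opaque, so inference
# cannot prove length(xs) isa Int, and append!'s internal resizing code
# falls back to runtime dispatch -- the pattern JET flagged above.
grow_unstable!(v::Vector{Int}, xs) = append!(v, Base.inferencebarrier(xs))

# Fully inferable counterpart, for comparison:
grow_stable!(v::Vector{Int}, xs::Vector{Int}) = append!(v, xs)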

The calling function itself looks like this:

function _unit_group_generators_quaternion(O::Union{AlgAssRelOrd, AlgAssAbsOrd})
  gens1 = unit_group_modulo_scalars(O)
  u, mu = unit_group(base_ring(O))
  A = algebra(O)
  gens2 = [ O(A(elem_in_nf(mu(u[i])))) for i in 1:ngens(u) ]
  return append!(gens1, gens2)  # FIXME: generic dispatch
end

but unfortunately inference doesn't have a concrete type for O; it then can't deduce the type of u, and hence can't prove that ngens(u) is an Int...
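
One possible mitigation (a sketch, untested against Hecke) would be to cut the instability locally with a type assertion at the point where inference loses track:

function _unit_group_generators_quaternion(O::Union{AlgAssRelOrd, AlgAssAbsOrd})
  gens1 = unit_group_modulo_scalars(O)
  u, mu = unit_group(base_ring(O))
  A = algebra(O)
  # Sketch: asserting the Int makes the comprehension bounds inferable,
  # which may let append! below avoid generic dispatch.
  gens2 = [ O(A(elem_in_nf(mu(u[i])))) for i in 1:(ngens(u)::Int) ]
  return append!(gens1, gens2)
end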

Tracing further up, the type instability begins inside a call to _find_quaternion_algebra(::AbsSimpleNumFieldElem, ::Vector{AbsSimpleNumFieldOrderIdeal}, ::Vector{InfPlc}) (the arguments here are all concrete and fine) which then calls is_principal_with_data(::Hecke.AlgAssAbsOrdIdl) which is not fine:

        fl, x = is_principal_with_data(q * prod(__P[i]^Int(c.coeff[i]) for i in 1:length(__P)))

I'll dig some more, but it seems we are taxing JET (?) quite a bit here; it is slow as molasses.

fingolfin avatar May 06 '24 11:05 fingolfin