
A*A' significantly slower than A*A for ArbMatrices

Open nanleij opened this issue 4 years ago • 3 comments

Hi,

I was trying to see if I could use this instead of BigFloats for some of my computations, for which I need (among other things, like Cholesky decompositions and computing the minimum eigenvalue) to compute A*A' for matrices of size up to about 300x300, which is currently the bottleneck.

I came across the following remarkable difference:

julia> A = ArbMatrix(rand(BigFloat,100,100));

julia> @benchmark $A*$A
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     128.750 ms (0.00% GC)
  median time:      132.684 ms (0.00% GC)
  mean time:        133.845 ms (0.00% GC)
  maximum time:     148.832 ms (0.00% GC)
  --------------
  samples:          38
  evals/sample:     1

julia> @benchmark $A*$A'
BenchmarkTools.Trial:
  memory estimate:  249.02 MiB
  allocs estimate:  4080001
  --------------
  minimum time:     3.125 s (22.31% GC)
  median time:      3.331 s (26.56% GC)
  mean time:        3.331 s (26.56% GC)
  maximum time:     3.537 s (30.31% GC)
  --------------
  samples:          2
  evals/sample:     1

So multiplying with a transpose is significantly slower than a normal multiplication, possibly because of all the allocations that happen in the transposed case. Comparing it to BigFloats:

julia> B = rand(BigFloat,100,100);

julia> @benchmark $B*$B
BenchmarkTools.Trial:
  memory estimate:  218.35 MiB
  allocs estimate:  4100002
  --------------
  minimum time:     911.017 ms (11.87% GC)
  median time:      930.252 ms (11.95% GC)
  mean time:        933.218 ms (11.67% GC)
  maximum time:     962.344 ms (12.72% GC)
  --------------
  samples:          6
  evals/sample:     1

julia> @benchmark $B*$B'
BenchmarkTools.Trial:
  memory estimate:  218.35 MiB
  allocs estimate:  4100002
  --------------
  minimum time:     1.228 s (8.92% GC)
  median time:      1.237 s (9.04% GC)
  mean time:        1.238 s (8.64% GC)
  maximum time:     1.255 s (9.08% GC)
  --------------
  samples:          5
  evals/sample:     1

So for BigFloats, multiplying with the transpose is also a bit slower, but only by about 30% instead of roughly 2500%. Another option would be to use normal Julia matrices with Arb entries:

julia> C = rand(Arb,100,100);

julia> @benchmark $C*$C
BenchmarkTools.Trial:
  memory estimate:  124.66 MiB
  allocs estimate:  2040002
  --------------
  minimum time:     1.572 s (14.81% GC)
  median time:      1.807 s (20.96% GC)
  mean time:        1.872 s (25.84% GC)
  maximum time:     2.237 s (37.54% GC)
  --------------
  samples:          3
  evals/sample:     1

julia> @benchmark $C*$C'
BenchmarkTools.Trial:
  memory estimate:  124.66 MiB
  allocs estimate:  2040002
  --------------
  minimum time:     2.014 s (11.50% GC)
  median time:      2.029 s (19.55% GC)
  mean time:        2.166 s (22.57% GC)
  maximum time:     2.456 s (34.14% GC)
  --------------
  samples:          3
  evals/sample:     1

So this is more comparable to the BigFloats, with only about a factor of 2 difference in both cases. (Interestingly, it uses only about half the memory/allocations.)

To me it seems like the ArbMatrices fall back to the generic matrix multiplication in the case with the transpose. Is there an easy way around this?

PS: I also tried Nemo (which has similar speed for A*A and A*A'), but I also need the minimum eigenvalue and the Cholesky decomposition, which are currently not available in Nemo. So I would need to convert back and forth to BigFloats, or compute those myself. I'm not sure how to do that conversion with Nemo, but with Arblib it is easy. That is why I would very much like A*A' to be as fast as A*A.

nanleij avatar Apr 12 '21 10:04 nanleij

To me it seems like the ArbMatrices fall back to the generic matrix multiplication in the case with the transpose. Is there an easy way around this?

Arb only implements matrix multiplication A*B and has no special versions where A or B are transposed. The easiest thing is probably to realize A' as an ArbMatrix and then perform the multiplication. This is also how I would probably implement it here in Arblib. A pull request for this is very welcome 🙂
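
For example, a minimal user-level sketch of that workaround (the helper name mul_adjoint is made up for illustration; it only assumes that the ArbMatrix constructor accepts an adjoint and materializes it):

using Arblib, LinearAlgebra

# Multiply A by B' by first materializing B' as an ArbMatrix, so that
# Arblib's own *(::ArbMatrix, ::ArbMatrix) method (and hence Arb's fast
# matrix multiplication) is used instead of the generic fallback.
function mul_adjoint(A::ArbMatrix, B::ArbMatrix)
    Bt = ArbMatrix(B')   # realize the transpose as an ArbMatrix
    return A * Bt
end

A = ArbMatrix(rand(BigFloat, 100, 100));
mul_adjoint(A, A)        # same result as A*A', without the generic matmul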

saschatimme avatar Apr 12 '21 11:04 saschatimme

julia> @btime A*A;
  11.988 ms (10089 allocations: 4.11 MiB)

julia> @btime A*A';
  786.593 ms (8160005 allocations: 436.25 MiB)

julia> C = similar(A); D = similar(A); @btime Arblib.mul!($D, $A, Arblib.transpose!($C, $A));
  12.785 ms (85 allocations: 3.20 MiB)

we're hitting the generic matmul here:

julia> @which A*A
*(A::T, B::T) where T<:Union{AcbMatrix, AcbRefMatrix, ArbMatrix, ArbRefMatrix} in Arblib at /home/kalmar/.julia/dev/Arblib/src/matrix.jl:166

julia> @which A*A'
*(A::AbstractMatrix{T} where T, B::AbstractMatrix{T} where T) in LinearAlgebra at /home/kalmar/packages/julias/julia-1.6/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:151
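
A sketch of the kind of method such a PR could add (not actual Arblib code; it reuses the transpose!/mul! calls from the timing above and assumes the square, real case, as in A*A'):

using Arblib, LinearAlgebra

# Dispatch on the Adjoint wrapper so that A*A' no longer falls through to
# LinearAlgebra's generic matmul. Only a sketch: it assumes real entries and
# a square parent matrix, which is the A*A' case benchmarked here.
function Base.:*(A::ArbMatrix, B::Adjoint{<:Any,ArbMatrix})
    Bt = Arblib.transpose!(similar(parent(B)), parent(B))  # materialize B'
    C = similar(A)                                          # result (square case)
    return Arblib.mul!(C, A, Bt)
end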

kalmarek avatar Apr 12 '21 12:04 kalmarek

A*ArbMatrix(A') works like a charm! Thanks! I might look into a pull request later.

nanleij avatar Apr 12 '21 12:04 nanleij