LinearAlgebra.jl

Using Gram matrices to speed up `opnorm`

njericha opened this issue 10 months ago • 8 comments

In Julia version 1.11.2, it appears that I can speed up the calculation of opnorm by taking advantage of the identity

$$ \left\lVert A^\top A \right\rVert_{op} = \left\lVert A A^\top \right\rVert_{op} = \left\lVert A \right\rVert_{op}^2. $$

See https://en.wikipedia.org/wiki/Operator_norm#Operators_on_a_Hilbert_space .

In the following test, we can speed up the calculation of the operator norm for tall matrices $A$ by using $\sqrt{\left\lVert A^\top A \right\rVert_{op}}$ (and presumably wide matrices $A$ by using $\sqrt{\left\lVert AA^\top \right\rVert_{op}}$). Note that wrapping the Gram matrix with Symmetric also reduces the memory needed.

using Random: randn
using LinearAlgebra: opnorm, Symmetric
using BenchmarkTools

fopnorm1(X) = opnorm(X)                      # direct spectral norm
fopnorm2(X) = sqrt(opnorm(X'X))              # via the Gram matrix
fopnorm3(X) = sqrt(opnorm(Symmetric(X'X)))   # via the Gram matrix, tagged as symmetric

dimensions = (10000, 10)

b = @benchmark fopnorm1(X) setup=(X=randn(dimensions))
display(b)
b = @benchmark fopnorm2(X) setup=(X=randn(dimensions))
display(b)
b = @benchmark fopnorm3(X) setup=(X=randn(dimensions))
display(b)
BenchmarkTools.Trial: 6522 samples with 1 evaluation per sample.
 Range (min … max):  391.800 μs …   5.152 ms  ┊ GC (min … max): 0.00% … 84.40%
 Time  (median):     463.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   503.166 μs ± 153.189 μs  ┊ GC (mean ± σ):  4.63% ± 10.26%

   ▄▆▇██▇▇▆▄▃▂▂▁▁▁ ▁▂▂▁▁  ▁                                     ▂
  ▇█████████████████████████▇▇▇▆▅▆▆▅▄▆▃▅▃▅▄▄▄▁▃▄▄▅▅▆▇▇█▇▇▇██▇██ █
  392 μs        Histogram: log(frequency) by time       1.14 ms <

 Memory estimate: 787.58 KiB, allocs estimate: 13.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  82.300 μs … 483.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     97.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   99.002 μs ±  16.227 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▄▅▆▇▆▆▅▅▅▄▆▅█▅▆▅▃▃▂▂▁ ▁                                     ▂
  ███████████████████████████▆▇▇██▆█▅▇▇▇▇▆▇▇▇▇▇▆▆▇█▇█▇▆▇▇▇▅▅▄▅ █
  82.3 μs       Histogram: log(frequency) by time       163 μs <

 Memory estimate: 8.09 KiB, allocs estimate: 14.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  79.000 μs … 376.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     95.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   95.234 μs ±  12.877 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                █
  ▂▂▃▄▂▄▃▃▄▃▃▂▆▃█▅▄▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  79 μs           Histogram: frequency by time          149 μs <

 Memory estimate: 6.08 KiB, allocs estimate: 19.

This trick also works for square matrices, although possibly with higher memory requirements.

dimensions = (100, 100)

b = @benchmark fopnorm1(X) setup=(X=randn(dimensions))
display(b)
b = @benchmark fopnorm2(X) setup=(X=randn(dimensions))
display(b)
b = @benchmark fopnorm3(X) setup=(X=randn(dimensions))
display(b)
BenchmarkTools.Trial: 7594 samples with 1 evaluation per sample.
 Range (min … max):  531.600 μs …   5.870 ms  ┊ GC (min … max): 0.00% … 88.93%
 Time  (median):     597.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   623.509 μs ± 133.170 μs  ┊ GC (mean ± σ):  1.44% ±  5.50%

     ▂█▇▃
  ▂▂▄█████▇▅▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▂▁▁▂▂▂▁▁▂▂▂▂▂ ▃
  532 μs           Histogram: frequency by time         1.21 ms <

 Memory estimate: 137.97 KiB, allocs estimate: 14.
BenchmarkTools.Trial: 6847 samples with 1 evaluation per sample.
 Range (min … max):  574.900 μs …   8.099 ms  ┊ GC (min … max): 0.00% … 91.36%
 Time  (median):     669.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   695.524 μs ± 163.002 μs  ┊ GC (mean ± σ):  2.15% ±  6.82%

     ▃▄▆█▇▄
  ▂▃████████▆▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  575 μs           Histogram: frequency by time         1.41 ms <

 Memory estimate: 216.17 KiB, allocs estimate: 17.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  308.100 μs …   5.089 ms  ┊ GC (min … max): 0.00% … 90.75%
 Time  (median):     366.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   383.650 μs ± 123.907 μs  ┊ GC (mean ± σ):  2.79% ±  7.16%

    ▃▄█▄
  ▄▇████▇▆▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▂▁▁▂▂▂▂▂▂▂▂ ▃
  308 μs           Histogram: frequency by time          1.1 ms <

 Memory estimate: 194.64 KiB, allocs estimate: 24.

More testing might be needed to see if this performance boost carries over to other array types, like sparse matrices, and to other element types, like complex numbers. But it may be worth redefining opnorm if this performance can be reliably achieved.

njericha avatar Jan 27 '25 23:01 njericha

This would only be true for opnorm(X, 2) (the default p = 2), right? The opnorm function also supports other values of p, namely p = 1 and p = Inf.

ajinkya-k avatar Jan 29 '25 21:01 ajinkya-k

This would only be true for opnorm(X, 2) (the default p = 2), right? The opnorm function also supports other values of p, namely p = 1 and p = Inf.

Yes. The opnorms for other values of p are arguably easier to compute: https://en.wikipedia.org/wiki/Operator_norm#Table_of_common_operator_norms
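
For reference, a minimal sketch of why those cases are cheap: p = 1 reduces to the maximum absolute column sum and p = Inf to the maximum absolute row sum, so no factorization is needed (the helper names below are just for illustration).

using LinearAlgebra: opnorm

X = randn(100, 50)

# p = 1: maximum absolute column sum; p = Inf: maximum absolute row sum.
maxcolsum(X) = maximum(sum(abs, X; dims=1))
maxrowsum(X) = maximum(sum(abs, X; dims=2))

opnorm(X, 1)   ≈ maxcolsum(X)   # true
opnorm(X, Inf) ≈ maxrowsum(X)   # true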

njericha avatar Jan 29 '25 21:01 njericha

One potential concern could be accuracy. If you insert additional steps, then each comes with its own condition number and adds to the overall error accumulation. What comes to mind is solving tall linear systems Ax = b: you could solve them via the normal equations A^T A x = A^T b, but we don't do that, because forming A^T A squares the condition number and the final result is worse than solving with QR. I'm not sure this transfers directly, but just a thought.
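
To make the analogy concrete, here is a minimal, hypothetical sketch of the normal-equations effect (the test matrix construction, sizes, and condition number are only illustrative): the A'A-based solve typically loses roughly twice as many digits as the QR-based one.

using LinearAlgebra

# Build a 1000×50 matrix with condition number ~1e8 (illustrative).
U = Matrix(qr(randn(1000, 50)).Q)
V = Matrix(qr(randn(50, 50)).Q)
A = U * Diagonal(exp10.(range(0, -8, length=50))) * V'
x = randn(50)
b = A * x

x_qr     = A \ b          # QR-based least-squares solve
x_normal = (A'A) \ (A'b)  # normal equations: condition number is squared

norm(x_qr - x) / norm(x)      # small relative error
norm(x_normal - x) / norm(x)  # typically several orders of magnitude larger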

dkarrasch avatar Jan 30 '25 09:01 dkarrasch

I think the extra error is less of a concern here. The use of A'A ruins the smaller singular values but doesn't affect the larger values much.

julia> using LinearAlgebra

julia> Q, _ = qr(randn(100,100));

julia> B = Q*Diagonal(exp10.(range(-200, stop=0, length = size(Q, 1))))*Q';

julia> sqrt(opnorm(B'B))
0.9999999999999997

I've tried a few things and couldn't make this version break down. In contrast, this approach would be fatal for cond:

julia> sqrt(cond(B'B))
1.259066867835988e10

andreasnoack avatar Jan 30 '25 11:01 andreasnoack

I ran this:


using ProgressMeter: @showprogress   # opnorm and fopnorm3 as defined above

lopn = zeros(0)   # opnorm(X)
lopg = zeros(0)   # Gram-matrix version

@showprogress for p in 100:100:10000
    dimensions = (10000, p)
    X = randn(dimensions)
    push!(lopn, opnorm(X))
    push!(lopg, fopnorm3(X))
end

and this is what I got for the absolute difference between the two (I cut the loop off around 45 because it was taking too long):

[Image: plot of the absolute difference between opnorm(X) and fopnorm3(X) as a function of p]

ajinkya-k avatar Jan 30 '25 15:01 ajinkya-k

If you want to run this kind of empirical test, I would use X = randn(Float32, dimensions) to do the calculation in single precision, and compare to opnorm(Float64.(X)) as the "exact" values. (But the problem with testing on randn matrices is that they are all well-conditioned, so they aren't representative of cases where $A^T A$ is potentially problematic.)
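
For example, a minimal sketch of that test (sizes are just for illustration):

using LinearAlgebra: opnorm, Symmetric

X = randn(Float32, 10000, 10)          # do the computation in single precision
exact  = opnorm(Float64.(X))           # treat the double-precision result as "exact"
direct = opnorm(X)
gram   = sqrt(opnorm(Symmetric(X'X)))

abs(direct - exact) / exact            # error of the direct computation
abs(gram - exact) / exact              # error of the Gram-matrix version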

stevengj avatar Jan 30 '25 15:01 stevengj

@stevengj Thanks for the insight! I will keep that in mind! What's a good library for generating ill-conditioned matrices? The only one I know is this: https://github.com/JuliaLinearAlgebra/MatrixDepot.jl

ajinkya-k avatar Jan 30 '25 16:01 ajinkya-k

I've used this code in the past:

using LinearAlgebra: qr, Diagonal, diag

# generate a random (Haar-uniform) m×n real orthogonal-column matrix
# using algorithm adapted from https://arxiv.org/abs/math-ph/0609050
function randQ(m::Integer,n::Integer)
    m ≥ n || throw(ArgumentError("matrix must be tall"))
    QR = qr(randn(m,n))
    return QR.Q * Diagonal(sign.(diag(QR.R)))
end

# random m×n matrix with condition number κ and power-law singular values.
function randcond(m::Integer,n::Integer, κ::Real)
    κ ≥ 1 || throw(ArgumentError("κ=$κ should be ≥ 1"))
    r = min(m,n)
    σ = exp.(range(0, -log(κ), length=r))
    U = randQ(m,r)
    V = randQ(n,r)
    return U * Diagonal(σ) * V'
end
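
For example, one could sanity-check the Gram-matrix version on such matrices (the sizes and κ below are illustrative, and randcond is as defined above); consistent with andreasnoack's observation, only the small singular values should be damaged:

using LinearAlgebra: opnorm, Symmetric

A = randcond(1000, 100, 1e12)               # κ = 1e12, σ_max = 1 by construction
sqrt(opnorm(Symmetric(A'A))) ≈ opnorm(A)    # expected to hold: only small singular values are hurt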

stevengj avatar Jan 30 '25 20:01 stevengj

I think the extra error is less of a concern here. The use of A'A ruins the smaller singular values but doesn't affect the larger values much.

Right, it seems fine to me at first glance.

fopnorm3(X) = sqrt(opnorm(Symmetric(X'X)))

Note that this should be Hermitian, not Symmetric.
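
A minimal sketch of why the distinction matters for complex inputs (the name fopnorm3h is hypothetical): X'X is conjugate-symmetric, so Hermitian is the right wrapper, whereas Symmetric would mirror the upper triangle without conjugation and describe a different matrix.

using LinearAlgebra: opnorm, Hermitian

# Hypothetical corrected variant of fopnorm3 for real and complex inputs.
fopnorm3h(X) = sqrt(opnorm(Hermitian(X'X)))

X = randn(ComplexF64, 1000, 10)
fopnorm3h(X) ≈ opnorm(X)   # true; Symmetric(X'X) would not represent X'X here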

stevengj avatar Aug 14 '25 00:08 stevengj

Two things to be careful of in this approach are overflow/underflow and types (see the sketch after this list):

  1. A'A might overflow/underflow. Similar to how we compute the L2 norm, you should first check maximum(abs, A) — if the result is too large or too small, you'll want to rescale the matrix before multiplying, and then scale the resulting norm.
  2. you'll want to compute A'A using the type of the result of opnorm. For example, you could use B = float(A); B'B — otherwise, you could overflow for e.g. integer matrices.
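
A minimal sketch combining both points, assuming a hypothetical helper name gram_opnorm (the scaling strategy is only illustrative, not a proposed implementation):

using LinearAlgebra: opnorm, Hermitian

function gram_opnorm(A::AbstractMatrix)
    B = float(A)                    # work in the floating-point result type (point 2)
    s = maximum(abs, B)             # scale factor, as for the L2 norm (point 1)
    iszero(s) && return zero(real(eltype(B)))
    C = B ./ s                      # entries now have magnitude ≤ 1, so C'C is safe to form
    return s * sqrt(opnorm(Hermitian(C'C)))
end

gram_opnorm(fill(10^10, 3, 3))      # fine, even though (10^10)^2 overflows Int64
gram_opnorm(fill(1e200, 3, 3))      # fine, even though (1e200)^2 overflows Float64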

stevengj avatar Aug 14 '25 01:08 stevengj

Another speedup: when computing opnorm(A'A), you can exploit the fact that A'A is positive semidefinite and call eigvals(Hermitian(A'A), n:n) (where n = size(A, 2)), which computes only the most-positive eigenvalue. For a 1000x1000 matrix, this is about 30% faster than computing all the eigenvalues and picking the largest-magnitude one (which is what opnorm(::Hermitian) does).
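
A minimal sketch of that shortcut (sizes are illustrative):

using LinearAlgebra: eigvals, Hermitian, opnorm

A = randn(2000, 1000)
G = Hermitian(A'A)          # positive semidefinite, so λ_max is the most-positive eigenvalue
n = size(G, 1)

λmax = eigvals(G, n:n)[1]   # ask LAPACK for only the largest eigenvalue
sqrt(λmax) ≈ opnorm(A)      # true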

stevengj avatar Aug 15 '25 13:08 stevengj