[WIP] Allow TriangularRFP for diagonal blocks of L [ci skip]
- Allow blocks on the diagonal of `L` to be `TriangularRFP` from https://github.com/JuliaLinearAlgebra/RectangularFullPacked.jl
- In the process of making these changes I found that I needed to be more explicit about when a diagonal block is being treated as `Hermitian` and stored in the lower triangle (`copyscaleinflate!` and `rankUpdate!`) and when it is lower triangular (after `cholUnblocked!` and in the `rdiv!` call).
- I had previously been playing somewhat fast and loose with these distinctions, just passing around `Matrix` objects for the dense case, but that can trip you up.
- The timings shown below indicate that there is a considerable speed penalty for `TriangularRFP`. The slowdown is mostly in the `rankUpdate!` method, which involves many calls to `setindex!`, rather than in the `cholUnblocked!` method.
- It may be possible to speed up some of the calls to `setindex!` because of the way the loops in `rankUpdate!` work.
- There is a new named argument, `RFPthreshold`, in the `LinearMixedModel` constructor (but not yet in `MixedModel`).
- Right now this has a `[ci skip]` designation because I haven't updated all the tests to the more stringent requirements on the `Hermitian` and `Triangular` types.
- Tests should now pass (but apparently there are failures with some configurations).
- Fixes #813
Some timings on my laptop
```julia
(jl_6i3bqi) pkg> st
Status `/private/var/folders/d9/t94kd5m50rl1r3bchncn62sm0000gn/T/jl_6i3bqi/Project.toml`
  [ff71e718] MixedModels v4.34.0

julia> dat = MixedModels.dataset(:insteval)
Arrow.Table with 73421 rows, 7 columns, and schema:
 :s        String
 :d        String
 :dept     String
 :studage  String
 :lectage  String
 :service  String
 :y        Int8

julia> f = @formula(y ~ 1 + service + (1|s) + (1|d) + (1|dept));

julia> m1 = fit(MixedModel, f, dat; progress=false);

julia> @be objective(updateL!($m1))
Benchmark: 20 samples with 1 evaluation
 min    5.143 ms (38 allocs: 1.094 KiB)
 median 5.172 ms (38 allocs: 1.094 KiB)
 mean   5.176 ms (38 allocs: 1.094 KiB)
 max    5.270 ms (38 allocs: 1.094 KiB)
```
```julia
# using the db/RFP branch
julia> m1 = fit(MixedModel, f, dat; progress=false);

julia> BlockDescription(m1)
rows:    s        d             dept        fixed
2972:    Diagonal
1128:    Sparse   Diag/TrRFP
  14:    Dense    Sparse/Dense  Diag/Dense
   3:    Dense    Dense         Dense       Dense

julia> @be objective(updateL!($m1))
Benchmark: 4 samples with 1 evaluation
 min    26.432 ms (31 allocs: 768 bytes)
 median 26.525 ms (31 allocs: 768 bytes)
 mean   26.669 ms (31 allocs: 768 bytes)
 max    26.937 ms (31 allocs: 768 bytes)
```
```julia
julia> m2 = LinearMixedModel(f, dat; RFPthreshold=2000);

julia> BlockDescription(m2)
rows:    s        d             dept        fixed
2972:    Diagonal
1128:    Sparse   Diag/Dense
  14:    Dense    Sparse/Dense  Diag/Dense
   3:    Dense    Dense         Dense       Dense

julia> @be objective(updateL!($m2))
Benchmark: 20 samples with 1 evaluation
 min    5.144 ms (28 allocs: 656 bytes)
 median 5.173 ms (28 allocs: 656 bytes)
 mean   5.173 ms (28 allocs: 656 bytes)
 max    5.256 ms (28 allocs: 656 bytes)
```
There may be a way of cutting down on the cost of the rankUpdate! operation that evaluates L[2,2] - L[2,1] * L[2,1]', which accounts for the largest amount of time spent in updateL! when TriangularRFP is used.
The bottleneck is not evaluating each update; it is writing it back to the TriangularRFP matrix with setindex!. Because we are storing the lower triangle in the RFP format, we could use linear indexing to the underlying matrix for the first half of the columns of L[2,2]. The [i,j] entry in L[2,2] is exactly the [i,j] entry of L[2,2].parent, if the size of L[2,2] is odd, or the [i + 1, j] entry of L[2,2].parent, if the size is even.
The position of the elements in the lower right triangle of L[2,2] is more complicated to determine but we could either code that up or just fall back on the existing setindex! method. Furthermore, we could sort the rows of L[2,1] and L[2,2] by descending order of A[2,2].diag so that more of the updates are in the left half of L[2,2].
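The index arithmetic described above can be sketched in a few lines. This is a Python illustration (0-based indexing, whereas the Julia code is 1-based) of the standard LAPACK RFP layout for TRANSR='N', UPLO='L', which is what RectangularFullPacked.jl wraps; `rfp_index` is a hypothetical helper, not a function from either package:

```python
def rfp_index(n, i, j):
    """Map entry (i, j), i >= j, of an n x n lower triangle (0-based) to its
    (row, col) position in the RFP parent array (LAPACK TRANSR='N', UPLO='L').

    The parent is n x (n+1)//2 for odd n, (n+1) x n//2 for even n.
    """
    k = (n + 1) // 2  # number of columns in the parent array
    if j < k:
        # Left-half columns are in place for odd n, shifted down one row for even n.
        return (i + (n % 2 == 0), j)
    # The lower-right triangle is stored transposed in the top rows of the parent.
    return (j - k, i - k + (n % 2))
```

With this mapping every entry of the triangle lands in a distinct slot of the parent, and the left-half columns need only the cheap "same index or shifted down by one" rule, which is what makes direct indexing for the first half of the columns attractive.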
Does this seem worth trying, @palday @ajinkya-k?
I am willing to work on it and feel that it could be done relatively quickly but experience suggests that my "relatively quickly" may not be as quick as I imagine.
seems worth trying for sure.
I agree it's worth trying!
~Isn’t there an SYRK for rectangular full packed format?~
Sorry I totally misunderstood the code (https://github.com/JuliaLinearAlgebra/RectangularFullPacked.jl/blob/b159bfabd981c46b8842ff1ad857aa95f6b52cad/src/hermitian.jl#L43)
The setindex! slowness is only for the case where the argument A is a SparseMatrixCSC, correct?
@ajinkya-k I should have been more clear that I was speaking of the case where L[2,1] is sparse. Using RFP for L[2,2] only makes sense if L[2,2] is very large and, by design, L[2,1] is even larger (because L[1,1] is as large as possible). If L[2,1] needs to be stored as a dense matrix you have already got a much bigger problem to contend with.
No I think that was clear but I twisted myself into knots trying to read the file changes from the PR
Codecov Report
Attention: Patch coverage is 84.76821% with 23 lines in your changes missing coverage. Please review.
Project coverage is 96.73%. Comparing base (35747ac) to head (09ad439).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/remat.jl | 54.54% | 20 Missing :warning: |
| src/linalg/rankUpdate.jl | 91.66% | 2 Missing :warning: |
| src/pca.jl | 83.33% | 1 Missing :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main     #821      +/-   ##
==========================================
- Coverage   97.33%   96.73%   -0.60%
==========================================
  Files          36       36
  Lines        3495     3588      +93
==========================================
+ Hits         3402     3471      +69
- Misses         93      117      +24
```
| Flag | Coverage Δ | |
|---|---|---|
| current | 96.37% <84.10%> (-0.62%) | :arrow_down: |
| minimum | 96.73% <84.76%> (-0.60%) | :arrow_down: |
| nightly | 96.33% <84.66%> (-0.60%) | :arrow_down: |
@dmbates can you add a "fixes #813" to the PR description so that the issue is automatically closed when this is merged?
So I have a question about strategy. When using TriangularRFP for (usually) the [2,2] block of L, which implies that the [2,1] block will be a SparseMatrixCSC, the bottleneck is the rankUpdate! operation. It is slow because every assignment like C[i,j] += A[j, k] * A[i, k] has to go through a complicated mapping from the virtual indices [i,j] to the Cartesian indices in C.data.
When it is the lower triangle of C being stored in C.data (and C.transa is false, which is always the case for us) the first half of the columns of C are either in their original positions in C.data or shifted down by one position.
I did a calculation of the number of updates in this nonflipped configuration versus the number in the flipped configuration for the insteval model. It was about 3/4 nonflipped and 1/4 flipped, which is roughly the proportion you would expect based on the sizes of the chunks in C.
(980444, 306131)
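A count of that kind could be computed along the following lines. This is a Python sketch, not the actual MixedModels.countinds code: for each column of the sparse A = L[2,1], the rank update touches C[i,j] for every pair of nonzero rows i >= j in that column, and the update is "nonflipped" when j falls in the first half of the columns of C. The name `count_updates` is hypothetical:

```python
def count_updates(n, cols):
    """Count (nonflipped, flipped) rank-update targets C[i, j], i >= j,
    for an n x n RFP-stored block C, where `cols` is a list of sorted
    0-based nonzero row-index lists, one per column of the sparse factor A.

    An update is "nonflipped" when j < ceil(n / 2), i.e. the target lies
    in the cheaply indexed left half of the RFP parent array.
    """
    k = (n + 1) // 2
    nonflipped = flipped = 0
    for rows in cols:
        for a, j in enumerate(rows):
            m = len(rows) - a  # number of i >= j among this column's nonzeros
            if j < k:
                nonflipped += m
            else:
                flipped += m
    return nonflipped, flipped
```

Reordering the rows of A so that high-count rows get small indices moves more of the (i, j) pairs into the j < k region, which is what drives the improvement shown below.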
However, if we reorder the rows of A by decreasing row sums
```julia
julia> MixedModels.countinds(L[p, :])
(1221833, 64742)
```
we get about 95% non-flipped and that certainly seems worth it. So, at what point should the levels of the second grouping factor be reordered?
I think it will be rare to need to use RFP but, if L[2,1] is a SparseMatrixCSC there wouldn't be much harm in reordering the rows (i.e. permute the levels of the second grouping factor).
Is this explanation sufficiently intelligible to be able to offer an opinion?
@dmbates I think that change makes sense. I also wonder if this would slightly speed up things for the non-RFP case --- I've certainly seen this type of optimization have big impacts in other code.
If I'm thinking about this correctly, ordering by row-sums works out to ordering by something like "strength of crossing", right?
We don't make any guarantees about the internal ordering of levels, so that's not an issue.
For the time being I will do the simplified indexing part and not do the reordering of levels part. However, we can recommend reordering the levels before creating the model.
I wrote the enclosed to do the reordering of levels in a CategoricalArray by incidence. (I'm not a big fan of CategoricalArrays.jl but it seemed the easiest way to accomplish this.)
```julia
using CategoricalArrays, DataFrames, MixedModels

"""
    relevel_by_incidence!(v::CategoricalArray; rev=false)

Return `v` with the levels ordered by increasing (decreasing when `rev=true`) incidence.
"""
function relevel_by_incidence!(v::CategoricalArray; rev=false)
    ra = refarray(droplevels!(v))
    counts = zeros(Int, maximum(ra))
    for i in ra
        counts[i] += 1
    end
    levels!(v, levels(v)[sortperm(counts; rev)])
    return v
end

dat = DataFrame(MixedModels.dataset(:insteval))
dat.d = relevel_by_incidence!(categorical(dat.d); rev=true)
m = LinearMixedModel(@formula(y ~ 1 + service + (1|s) + (1|d) + zerocorr(1+service|dept)), dat)
m.A[3].diag
```