
Introduce OrderedSets

termi-official opened this pull request

Title should say everything. This should increase cache locality for some problems (e.g. when using our generated grids) and we now have the guarantee that the iteration order for subdofhandlers and faces can be precisely controlled.

Closes https://github.com/Ferrite-FEM/Ferrite.jl/issues/631 .
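
For context, a minimal illustration of the ordering guarantee (plain OrderedCollections, nothing Ferrite-specific):

using OrderedCollections

s  = Set([3, 1, 2])         # iteration order depends on hashing, not on insertion
os = OrderedSet([3, 1, 2])  # iteration order is the insertion order
collect(os) == [3, 1, 2]    # true; for cell/face sets this makes the loop order controllable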

termi-official avatar Nov 01 '23 14:11 termi-official

Okay, even with this there is still quite a big performance hit when using the SubDofHandler.

using Ferrite, SparseArrays, BenchmarkTools
grid = generate_grid(Quadrilateral, (1000, 1000));

ip = Lagrange{RefQuadrilateral, 1}()
qr = QuadratureRule{RefQuadrilateral}(2)
cellvalues = CellValues(qr, ip);

dh = DofHandler(grid)
add!(dh, :u, ip)
close!(dh);

K = create_sparsity_pattern(dh)

ch = ConstraintHandler(dh);

∂Ω = union(
    getfaceset(grid, "left"),
    getfaceset(grid, "right"),
    getfaceset(grid, "top"),
    getfaceset(grid, "bottom"),
);

dbc = Dirichlet(:u, ∂Ω, (x, t) -> 0)
add!(ch, dbc);

close!(ch)

function assemble_element!(Ke::Matrix, fe::Vector, cellvalues::CellValues)
    n_basefuncs = getnbasefunctions(cellvalues)
    ## Reset to 0
    fill!(Ke, 0)
    fill!(fe, 0)
    ## Loop over quadrature points
    for q_point in 1:getnquadpoints(cellvalues)
        ## Get the quadrature weight
        dΩ = getdetJdV(cellvalues, q_point)
        ## Loop over test shape functions
        for i in 1:n_basefuncs
            δu  = shape_value(cellvalues, q_point, i)
            ∇δu = shape_gradient(cellvalues, q_point, i)
            ## Add contribution to fe
            fe[i] += δu * dΩ
            ## Loop over trial shape functions
            for j in 1:n_basefuncs
                ∇u = shape_gradient(cellvalues, q_point, j)
                ## Add contribution to Ke
                Ke[i, j] += (∇δu ⋅ ∇u) * dΩ
            end
        end
    end
    return Ke, fe
end

function assemble_global(cellvalues::CellValues, K::SparseMatrixCSC, dh::DofHandler)
    ## Allocate the element stiffness matrix and element force vector
    n_basefuncs = getnbasefunctions(cellvalues)
    Ke = zeros(n_basefuncs, n_basefuncs)
    fe = zeros(n_basefuncs)
    ## Allocate global force vector f
    f = zeros(ndofs(dh))
    ## Create an assembler
    assembler = start_assemble(K, f)
    ## Loop over all cells
    for cell in CellIterator(dh)
        ## Reinitialize cellvalues for this cell
        reinit!(cellvalues, cell)
        ## Compute element contribution
        assemble_element!(Ke, fe, cellvalues)
        ## Assemble Ke and fe into K and f
        assemble!(assembler, celldofs(cell), Ke, fe)
    end
    return nothing
end


function assemble_global2(cellvalues::CellValues, K::SparseMatrixCSC, dh::DofHandler)
    ## Allocate the element stiffness matrix and element force vector
    n_basefuncs = getnbasefunctions(cellvalues)
    Ke = zeros(n_basefuncs, n_basefuncs)
    fe = zeros(n_basefuncs)
    ## Allocate global force vector f
    f = zeros(ndofs(dh))
    ## Create an assembler
    assembler = start_assemble(K, f)
    ## Loop over all cells
    for cell in CellIterator(dh.subdofhandlers[1])
        ## Reinitialize cellvalues for this cell
        reinit!(cellvalues, cell)
        ## Compute element contribution
        assemble_element!(Ke, fe, cellvalues)
        ## Assemble Ke and fe into K and f
        assemble!(assembler, celldofs(cell), Ke, fe)
    end
    return nothing
end

@btime assemble_global(cellvalues, K, dh); # 494.943 ms (12 allocations: 7.65 MiB)
@btime assemble_global2(cellvalues, K, dh); # 655.668 ms (15 allocations: 7.65 MiB)

cc @kimauth

termi-official avatar Nov 01 '23 14:11 termi-official

Did a quick benchmark, and I think this check: https://github.com/Ferrite-FEM/Ferrite.jl/blob/fe44e7f6297158e9bbb62fd4bc9adaa1d79929f2/src/Dofs/DofHandler.jl#L192

is the reason for the worse performance. At least it would make sense to provide a fast unsafe path IMO (or, as a quick solution for CellCache, just pass the sdh.dh instead). Theoretically, using the SubDofHandler could be faster by one lookup, because ndofs_per_cell is just one value...
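
A generic sketch of the checked vs. unchecked idea (function names and arguments are hypothetical, not Ferrite API):

# The default path validates that the cell belongs to the subdomain; the
# unsafe variant skips the membership test when the caller already iterates
# the SubDofHandler's own cellset. `do_work` stands in for the actual dof lookup.
do_work(cellid::Int) = cellid  # placeholder
function process_checked(cellset, cellid::Int)
    cellid in cellset || throw(ArgumentError("cell $cellid is not handled by this SubDofHandler"))
    return do_work(cellid)
end
process_unchecked(cellset, cellid::Int) = do_work(cellid)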


KnutAM avatar Nov 01 '23 15:11 KnutAM

Thanks for the pointer. If the performance issue is nothing conceptual, then I would leave the PR as it is (± updating the deps) and leave the fix for another PR.

termi-official avatar Nov 01 '23 16:11 termi-official

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 93.86%. Comparing base (ccf583d) to head (bcae4a4).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #834      +/-   ##
==========================================
+ Coverage   93.84%   93.86%   +0.01%     
==========================================
  Files          36       36              
  Lines        5293     5310      +17     
==========================================
+ Hits         4967     4984      +17     
  Misses        326      326              

:umbrella: View full report in Codecov by Sentry.

codecov-commenter avatar Nov 01 '23 18:11 codecov-commenter

Did a quick benchmark, and I think this check:

https://github.com/Ferrite-FEM/Ferrite.jl/blob/fe44e7f6297158e9bbb62fd4bc9adaa1d79929f2/src/Dofs/DofHandler.jl#L192

is the reason for the worse performance. At least it would make sense to provide a fast unsafe path IMO (or, as a quick solution for CellCache, just pass the sdh.dh instead).

Confirmed locally that dropping this check on this branch brings the two timings close to each other:

julia> @btime assemble_global(cellvalues, K, dh); # 494.943 ms (12 allocations: 7.65 MiB)
  478.872 ms (12 allocations: 7.65 MiB)

julia> @btime assemble_global2(cellvalues, K, dh); # 655.668 ms (15 allocations: 7.65 MiB)
  499.271 ms (15 allocations: 7.65 MiB)

where the comments show the reference timings with the check included. Note that the timings on my machine are not super stable. Some of the remaining time difference can be attributed to iterating the OrderedSet, which increases the memory pressure. Hence I think it is still worth investigating the use of ranges in the iterators. Maybe we can add a boolean to the SubDofHandler indicating whether its cellset is a contiguous range? Or we can take another look at https://github.com/Ferrite-FEM/Ferrite.jl/pull/625
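
A minimal sketch of the contiguous-range idea, written as a standalone helper (the function name and decision rule are illustrative, not part of Ferrite):

using OrderedCollections

# Collapse an OrderedSet of cell ids into a UnitRange when the ids are
# contiguous and ascending; iterating a range avoids the set's memory traffic.
function cellset_storage(cells::OrderedSet{Int})
    v = collect(cells)
    isempty(v) && return v
    if issorted(v) && v == first(v):last(v)
        return first(v):last(v)
    end
    return v
end

cellset_storage(OrderedSet(1:100))      # 1:100 (UnitRange)
cellset_storage(OrderedSet([1, 3, 2]))  # [1, 3, 2] (Vector, insertion order kept)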

Theoretically, using the SubDofHandler could be faster by one lookup, because ndofs_per_cell is just one value...

If that is the case, then we should do it and make the CellIterator iterate over the SubDofHandler when there is only one subdofhandler.

termi-official avatar Nov 02 '23 12:11 termi-official

Also, supersedes https://github.com/Ferrite-FEM/Ferrite.jl/pull/654/files

termi-official avatar Nov 14 '23 13:11 termi-official

Fixed up after #914.

fredrikekre avatar May 19 '24 13:05 fredrikekre

Would probably be nice to automatically sort the sets when adding a regular set to the grid, but can be done later.

KnutAM avatar May 19 '24 15:05 KnutAM

@KnutAM something isn't right here with the porous media example (probably something with FerriteMeshParser?), can you have a look?

fredrikekre avatar May 19 '24 22:05 fredrikekre

FWIW, there have been quite a few improvements to the Base Dict that have not been carried over to the OrderedDict. I am not sure this matters in practice, but:

  • If someone is interested it might be worth trying to bring those improvements over to OrderedDict.
  • It is worth benchmarking again on newer Julia versions (1.11).

KristofferC avatar May 20 '24 08:05 KristofferC

I'm just stuck with a feeling that this isn't the optimal solution, but I also don't have a better alternative right now.

I totally agree here on both points. I have tried different designs and none of them were really satisfactory to me.

Would probably be nice to automatically sort the sets when adding a regular set to the grid, but can be done later.

Why? Practically this only makes sense if the underlying grid has some ordering, as in our generated ones, and if your subdomain also has to follow this order. In meshes generated by e.g. Gmsh, the optimal ordering is almost never the ordering imposed by the element numbering, and you can hurt performance by sorting the sets.

* It is worth benchmarking again on newer Julia versions (1.11).

The Set or the OrderedSet version?

termi-official avatar May 20 '24 09:05 termi-official

I totally agree here on both points. I have tried different designs and none of them were really satisfactory to me.

What's wrong with this? We can relax the docs to say that getcellset etc. return a collection instead of explicitly an OrderedSet, if you want the freedom to change it in the future.

Would probably be nice to automatically sort the sets when adding a regular set to the grid

See the comments in the code. addcellset! etc. loop over the cells in order, so the order of the added set will match the grid order. If you bring your own collection, then its order is preserved.
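
For example (assuming this PR's OrderedSet-backed sets; the set names and the predicate are just illustrative):

using Ferrite

grid = generate_grid(Quadrilateral, (4, 4))

# Predicate-based: cells are visited in grid order, so the set is ordered by cell id.
addcellset!(grid, "lower_half", x -> x[2] <= 0)

# Bring-your-own collection: the order of the passed ids is preserved.
addcellset!(grid, "custom_order", [7, 3, 11])

collect(getcellset(grid, "lower_half"))    # ascending cell ids
collect(getcellset(grid, "custom_order"))  # [7, 3, 11]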

fredrikekre avatar May 20 '24 09:05 fredrikekre

something isn't right here with the porous media example (probably something with FerriteMeshParser?), can you have a look?

Hopefully https://github.com/Ferrite-FEM/Ferrite.jl/pull/834/commits/14d880092c3890e077a5c7ce81b350c26d74c4b9 fixes it. I also tested locally that it works without the overload with https://github.com/Ferrite-FEM/FerriteMeshParser.jl/pull/43. That is a breaking release for FerriteMeshParser.jl though, so need to validate that no other breaking changes are required to support Ferrite v1 before merging and releasing. Once that's done, I'll PR removing that hack, and do the same for FerriteGmsh.

KnutAM avatar May 20 '24 12:05 KnutAM

We can relax the docs to say that getcellset etc. return a collection instead of explicitly an OrderedSet, if you want the freedom to change it in the future.

I think it would make sense to say that it returns an ordered collection, since this PR shows that we want to keep that property.

Would probably be nice to automatically sort the sets when adding a regular set to the grid

See the comments in the code. addcellset! etc. loop over the cells in order, so the order of the added set will match the grid order. If you bring your own collection, then its order is preserved.

Right, that makes sense, nvm!

What's wrong with this?

I thought OrderedSet stored the information twice compared to a Set, but I seem to be wrong, so maybe this is the optimal approach (of course, a UnitRange is better when applicable but not always possible, and a Vector is smaller but has slow membership tests with in).

using OrderedCollections

# Memory footprint of 1000 (cell, facet) indices stored as Vector, Set, and OrderedSet.
struct Index
    t::NTuple{2,Int}
end
cells = rand(1:10000, 1000);
facets = rand(1:4, 1000);
inds = [Index((c, f)) for (c, f) in zip(cells, facets)];

julia> Base.summarysize.((inds, Set(inds), OrderedSet(inds)))
(16040, 35008, 32320)

KnutAM avatar May 20 '24 13:05 KnutAM

If you need a vector, we can expose the internal vector of the set.

fredrikekre avatar May 20 '24 15:05 fredrikekre

IIRC one open issue here is that we cannot directly parallelize over the ordered set. We want to add a how-to for this.
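
A minimal sketch of the usual workaround, assuming plain Julia threading (the per-cell work is a placeholder):

using OrderedCollections

cellset = OrderedSet(1:1000)

# An OrderedSet is not indexable, so materialize it into a Vector first;
# the threaded loop can then chunk the indices statically.
cellids = collect(cellset)
results = zeros(length(cellids))
Threads.@threads for i in eachindex(cellids)
    results[i] = float(cellids[i])  # placeholder for the per-cell assembly work
end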

termi-official avatar May 20 '24 16:05 termi-official