
Performance regression in collapsing blocked subspace

Open jorgensd opened this issue 1 year ago • 11 comments

Summarize the issue

Collapsing a blocked subspace is orders of magnitude slower than collapsing the scalar version, and the cost grows dramatically with the polynomial order.

First reported at: https://fenicsproject.discourse.group/t/collapsing-mixed-space-takes-unreasonably-long/15071

How to reproduce the bug

Run the MWE below.

Minimal Example (Python)

from mpi4py import MPI

from dolfinx import mesh, fem
from basix.ufl import element, mixed_element
import time

N = 5
P = 3

# Create unit cube mesh with hexahedral cells
start = time.perf_counter()
domain = mesh.create_unit_cube(MPI.COMM_WORLD, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W_el = mixed_element([Ve, Qe])
W = fem.functionspace(domain, W_el)
print(f"Meshing and creating mixed space: {time.perf_counter()-start}")
start2 = time.perf_counter()

V, WV_map = W.sub(0).collapse()
print(f"Collapsing V: {time.perf_counter()-start2}")
start3 = time.perf_counter()

Q, WQ_map = W.sub(1).collapse()
print(f"Collapsing Q: {time.perf_counter()-start3}")

Output (Python)

Meshing and creating mixed space: 0.00762048000001414
Collapsing V: 0.8559996170000659
Collapsing Q: 0.0018857679999655375

Version

main branch

DOLFINx git commit

9f33dabc270b8e3c31e815515ecac5eb2a5414b0

Installation

Docker using: ghcr.io/fenics/dolfinx/dolfinx:nightly

Additional information

No response

jorgensd avatar Jun 25 '24 16:06 jorgensd

dolfinx::fem::DofMap::collapse() is the problem. In particular, it is the compute_reordering_map call in build_dofmap_data that is slow (and scales poorly with polynomial order).
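For reference, one way to see this is to print DOLFINx's internal timing table right after the collapse. A minimal sketch, assuming the list_timings/TimingType names from the current Python API:

import time

from mpi4py import MPI

from dolfinx import mesh, fem
from dolfinx.common import TimingType, list_timings
from basix.ufl import element, mixed_element

# Same setup as the MWE above
domain = mesh.create_unit_cube(MPI.COMM_WORLD, 5, 5, 5, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), 3, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), 3)
W = fem.functionspace(domain, mixed_element([Ve, Qe]))

start = time.perf_counter()
V, WV_map = W.sub(0).collapse()
print(f"Collapsing V: {time.perf_counter() - start}")

# Dump the internal timers; the graph-reordering entries should account for
# most of the collapse time if the reordering is indeed the bottleneck
list_timings(MPI.COMM_WORLD, [TimingType.wall])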

jorgensd avatar Jun 25 '24 17:06 jorgensd

I think we should turn off data reordering in serial. Running:

from mpi4py import MPI

from dolfinx import mesh, fem, la, log
from basix.ufl import element, mixed_element
import time

# log.set_log_level(log.LogLevel.INFO)
N = 5
P = 4

# Create unit cube mesh with hexahedral cells
start = time.perf_counter()
domain = mesh.create_unit_cube(MPI.COMM_WORLD, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W_el = mixed_element([Ve, Qe])
W = fem.functionspace(domain, W_el)
print(f"Meshing and creating mixed space: {time.perf_counter()-start}")
start2 = time.perf_counter()
V, WV_map = W.sub(0).collapse()
print(f"Collapsing V: {time.perf_counter()-start2}")

start3 = time.perf_counter()

Q, WQ_map = W.sub(1).collapse()
print(f"Collapsing Q: {time.perf_counter()-start3}")

returns the following in serial:

Meshing and creating mixed space: 0.013051556999926106
Collapsing V: 7.033872703999805
Collapsing Q: 0.0023294779998650483

and with two processes:

Meshing and creating mixed space: 0.020997950000037235
Collapsing V: 0.70820857300032
Collapsing Q: 0.003953482000270014
Meshing and creating mixed space: 0.020988049000152387
Collapsing V: 0.7082019069998751
Collapsing Q: 0.003954929999963497

i.e. roughly a 10x speedup! With three processes:

Meshing and creating mixed space: 0.02864133799994306
Collapsing V: 0.34778579499970874
Collapsing Q: 0.003480805999970471
Meshing and creating mixed space: 0.027170727000338957
Collapsing V: 0.34781397200003994
Collapsing Q: 0.0034789139999702456
Meshing and creating mixed space: 0.03079914899990399
Collapsing V: 0.3478091659999336
Collapsing Q: 0.0034698189997470763

I still think it is remarkable that creating the collapsed space is more than 10x more expensive than creating the mixed space itself. Any comments @chrisrichardson @mscroggs @garth-wells?

jorgensd avatar Jun 25 '24 17:06 jorgensd

We shouldn't turn off re-ordering in serial.

Can someone bisect to find out when the regression was introduced? @chrisrichardson could it be related to recent changes for mixed topology?
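Below is a sketch of a pass/fail script that could be used with git bisect run. The 0.1 s threshold is arbitrary (well above the millisecond-scale scalar collapse, well below the regressed timings), and the basix.ufl/fem.functionspace names would need adjusting at commits that predate those APIs:

# check_collapse.py: exit non-zero if the blocked collapse is slow
import sys
import time

from mpi4py import MPI

from dolfinx import mesh, fem
from basix.ufl import element, mixed_element

domain = mesh.create_unit_cube(MPI.COMM_WORLD, 5, 5, 5, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), 3, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), 3)
W = fem.functionspace(domain, mixed_element([Ve, Qe]))

start = time.perf_counter()
W.sub(0).collapse()
elapsed = time.perf_counter() - start
print(f"Collapsing V: {elapsed}")

# Exit code 1 marks the commit as "bad" for git bisect
sys.exit(0 if elapsed < 0.1 else 1)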

garth-wells avatar Jun 26 '24 09:06 garth-wells

We shouldn't turn off re-ordering in serial.

Can someone bisect to find out when the regression was introduced? @chrisrichardson could it be related to recent changes for mixed topology?

It predates the 0.7 release. I haven't tested 0.6 yet.

jorgensd avatar Jun 26 '24 10:06 jorgensd

I don't think mixed topology changes would affect this. list_timings() reveals:

GPS: create_level_structure                                                 |  4351  0.001285  5.590000
Gibbs-Poole-Stockmeyer ordering                                             |     2  2.885000  5.770000

which seems strange. Looks like some kind of internal issue with the reordering. GPS is only called twice, but builds the level structure thousands of times.

chrisrichardson avatar Jun 26 '24 10:06 chrisrichardson

It is present in 0.6 as well: P=3

Meshing and creating mixed space: 0.5712962300000015
Collapsing V: 0.7795458059999874
Collapsing Q: 0.0005138619999911498

P=4

Meshing and creating mixed space: 0.6977521319999767
Collapsing V: 7.012910768000012
Collapsing Q: 0.0005572910000068987

jorgensd avatar Jun 26 '24 10:06 jorgensd

I added a line in DofMap.cpp:

  if (!reorder_fn)
  {
    reorder_fn = [](const graph::AdjacencyList<std::int32_t>& g)
    {
      double nnz_per_row
          = g.offsets().back() / (double)(g.offsets().size() - 1);

      spdlog::info("DofMap collapse: reordering graph with {} nnz per row",
                   nnz_per_row);

      return graph::reorder_gps(g);
    };
  }

[2024-06-26 11:30:44.174] [info] DofMap collapse: reordering graph with 106.171875 nnz per row

so it is quite a dense graph.

chrisrichardson avatar Jun 26 '24 10:06 chrisrichardson

@jorgensd Re: the speedup in parallel - you should do a weak-scaling test (e.g. double the number of cells when running with -np 2). I think you will then find it is not much different.
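Something along these lines (a sketch; N_base is arbitrary), growing N with the cube root of the communicator size so the number of cells per rank stays roughly constant:

import time

from mpi4py import MPI

from dolfinx import mesh, fem
from basix.ufl import element, mixed_element

comm = MPI.COMM_WORLD
N_base = 5  # cells per direction on one rank
N = round(N_base * comm.size ** (1 / 3))  # keep cells per rank roughly constant

domain = mesh.create_unit_cube(comm, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), 4, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), 4)
W = fem.functionspace(domain, mixed_element([Ve, Qe]))

start = time.perf_counter()
V, WV_map = W.sub(0).collapse()
if comm.rank == 0:
    print(f"N={N}, ranks={comm.size}, collapsing V: {time.perf_counter() - start}")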

chrisrichardson avatar Jun 26 '24 14:06 chrisrichardson

We probably build a sparsity graph for the dofs, which isn't a good use of compute effort for high-order elements. We should probably create a graph of vertices, facets, etc, and then lay the dofs for each entity 'on top'.
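As a rough back-of-the-envelope illustration (scalar Lagrange on a hexahedron): the reordering graph built on individual dofs grows like (P+1)^3 nodes per cell and also gets denser with the polynomial order (the ~106 nnz per row seen above), while an entity graph stays at a fixed 27 nodes per cell with density set only by the mesh topology.

for P in (1, 2, 3, 4):
    dofs_per_cell = (P + 1) ** 3        # tensor-product Lagrange on a hexahedron
    entities_per_cell = 8 + 12 + 6 + 1  # vertices + edges + faces + interior
    print(P, dofs_per_cell, entities_per_cell)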

garth-wells avatar Jun 26 '24 14:06 garth-wells

@jorgensd Re: the speedup in parallel - you should do a weak-scaling test (e.g. double the number of cells when running with -np 2). I think you will then find it is not much different.

As in not improving further?

jorgensd avatar Jun 26 '24 15:06 jorgensd

We probably build a sparsity graph for the dofs, which isn't a good use of compute effort for high-order elements. We should probably create a graph of vertices, facets, etc, and then lay the dofs for each entity 'on top'.

I think one of the main factors could also be the block size, as you can see that an unblocked (scalar) element is orders of magnitude faster than its vector counterpart.
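A sketch of one way to probe that (same mesh as above): compare collapsing the blocked P=3 vector subspace (3 * 64 = 192 dofs per cell) against a scalar subspace of degree 5 with a comparable dof count (216 per cell). If the block size is the culprit, the scalar collapse should stay much faster despite the similar graph size:

import time

from mpi4py import MPI

from dolfinx import mesh, fem
from basix.ufl import element, mixed_element

domain = mesh.create_unit_cube(MPI.COMM_WORLD, 5, 5, 5, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), 3, shape=(domain.geometry.dim,))  # blocked, 192 dofs/cell
Se = element("Lagrange", domain.basix_cell(), 5)                                # scalar, 216 dofs/cell
Qe = element("Lagrange", domain.basix_cell(), 1)
W_blocked = fem.functionspace(domain, mixed_element([Ve, Qe]))
W_scalar = fem.functionspace(domain, mixed_element([Se, Qe]))

start = time.perf_counter()
W_blocked.sub(0).collapse()
print(f"Collapsing blocked vector subspace (P=3): {time.perf_counter() - start}")

start = time.perf_counter()
W_scalar.sub(0).collapse()
print(f"Collapsing scalar subspace (P=5): {time.perf_counter() - start}")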

jorgensd avatar Jun 26 '24 21:06 jorgensd