
Performance regression in serial mesh library

Open · blechta opened this issue Jun 12 '18

DOLFIN:

time python3 -c"from dolfin import *; UnitCubeMesh(MPI.comm_world, 100, 100, 100)"

real    0m0.867s
user    0m0.761s
sys    0m0.106s

DOLFIN-X:

time python3 -c"from dolfin import *; UnitCubeMesh(MPI.comm_world, 100, 100, 100)"

real    0m15.661s
user    0m14.440s
sys    0m1.223s

blechta avatar Jun 12 '18 11:06 blechta

This is due to the serial implementation using the same code as in parallel. For example, it calculates the dual graph for partitioning the mesh (not needed in serial).

chrisrichardson avatar Jun 12 '18 12:06 chrisrichardson
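
For illustration (editor's sketch, not DOLFINx code): the dual graph has one node per cell and an edge between any two cells that share a facet. Partitioners such as SCOTCH operate on this graph, which is why building it dominates serial mesh creation. A toy construction for tetrahedra:

from collections import defaultdict
from itertools import combinations


def dual_graph(cells):
    """Toy dual graph: cells sharing a facet become neighbours.

    cells: list of vertex tuples, e.g. 4 vertices per tetrahedron.
    """
    facet_to_cells = defaultdict(list)
    for c, verts in enumerate(cells):
        # A facet of a d-simplex is any (d-1)-subset of its vertices;
        # sorting the vertices makes the key orientation-independent.
        for facet in combinations(sorted(verts), len(verts) - 1):
            facet_to_cells[facet].append(c)
    adjacency = defaultdict(set)
    for owners in facet_to_cells.values():
        for a, b in combinations(owners, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Two tetrahedra sharing the facet (1, 2, 3):
print(dict(dual_graph([(0, 1, 2, 3), (1, 2, 3, 4)])))  # {0: {1}, 1: {0}}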

Confirming that this is still an issue. DOLFINx:

time python3 -c "import dolfinx; from mpi4py import MPI; dolfinx.UnitCubeMesh(MPI.COMM_WORLD, 100, 100, 100)"

real	0m11.179s
user	0m9.546s
sys	0m1.611s

DOLFIN:

fenics@f596c6c6e934:/root/shared$ time python3 -c"from dolfin import *; UnitCubeMesh(MPI.comm_world, 100, 100, 100)"

real	0m1.046s
user	0m0.907s
sys	0m0.140s

jorgensd avatar Apr 11 '21 10:04 jorgensd

To avoid computing the dual graph in serial, we can pass a custom partitioner that sets the destination of every cell to process 0 (the only process in a serial run):

from mpi4py import MPI
import dolfinx
import numpy


def serial_partitioner(mpi_comm, nparts, tdim, cells, ghost_mode):
    # Assign every cell to rank 0 (the only rank in a serial run).
    dest = numpy.zeros(cells.num_nodes, dtype=numpy.int32)
    return dolfinx.cpp.graph.AdjacencyList_int32(dest)


mesh = dolfinx.UnitCubeMesh(
    MPI.COMM_WORLD, 100, 100, 100, partitioner=serial_partitioner)

dolfinx.list_timings(MPI.COMM_WORLD, [dolfinx.TimingType.wall])

With the custom partitioner:

real    0m6.091s
user    0m4.249s
sys    0m2.458s

[MPI_AVG] Summary of timings                                   |  reps  wall avg  wall tot
------------------------------------------------------------------------------------------
Build BoxMesh                                                  |     1  4.254135  4.254135
Build dofmap data                                              |     1  1.060930  1.060930
Compute SCOTCH graph re-ordering                               |     1  0.141512  0.141512
Compute dof reordering map                                     |     1  0.666798  0.666798
Compute local-to-local map                                     |     1  0.068058  0.068058
Compute-local-to-global links for global/local adjacency list  |     1  0.042887  0.042887
Distribute in graph creation AdjacencyList                     |     1  0.509226  0.509226
Fetch float data from remote processes                         |     1  0.029057  0.029057
Init dofmap from element dofmap                                |     1  0.343332  0.343332
SCOTCH: call SCOTCH_graphBuild                                 |     1  0.000490  0.000490
SCOTCH: call SCOTCH_graphOrder                                 |     1  0.121626  0.121626
TOPOLOGY: Create sets                                          |     1  0.735610  0.735610

With the standard partitioner:

real    0m9.116s
user    0m6.854s
sys    0m2.900s

[MPI_AVG] Summary of timings                                   |  reps  wall avg  wall tot
------------------------------------------------------------------------------------------
Build BoxMesh                                                  |     1  7.280152  7.280152
Build dofmap data                                              |     1  1.074936  1.074936
Compute SCOTCH graph re-ordering                               |     1  0.140453  0.140453
Compute dof reordering map                                     |     1  0.675271  0.675271
Compute graph partition (SCOTCH)                               |     1  0.338212  0.338212
Compute local part of mesh dual graph                          |     1  2.617081  2.617081
Compute local-to-local map                                     |     1  0.069505  0.069505
Compute non-local part of mesh dual graph                      |     1  0.047709  0.047709
Compute-local-to-global links for global/local adjacency list  |     1  0.044120  0.044120
Distribute in graph creation AdjacencyList                     |     1  0.515640  0.515640
Extract partition boundaries from SCOTCH graph                 |     1  0.029006  0.029006
Fetch float data from remote processes                         |     1  0.032896  0.032896
Get SCOTCH graph data                                          |     1  0.000000  0.000000
Init dofmap from element dofmap                                |     1  0.348799  0.348799
SCOTCH: call SCOTCH_dgraphBuild                                |     1  0.003080  0.003080
SCOTCH: call SCOTCH_dgraphHalo                                 |     1  0.035761  0.035761
SCOTCH: call SCOTCH_dgraphPart                                 |     1  0.190264  0.190264
SCOTCH: call SCOTCH_graphBuild                                 |     1  0.000497  0.000497
SCOTCH: call SCOTCH_graphOrder                                 |     1  0.120793  0.120793
TOPOLOGY: Create sets                                          |     1  0.739001  0.739001

IgorBaratta avatar Apr 11 '21 11:04 IgorBaratta

Updated syntax:

time python -c "from dolfinx.mesh import create_unit_cube, CellType; from mpi4py import MPI; create_unit_cube(MPI.COMM_WORLD, 100, 100, 100, cell_type=CellType.tetrahedron)"

garth-wells avatar Mar 07 '23 08:03 garth-wells

Old syntax:

def serial_partitioner(mpi_comm, nparts, tdim, cells, ghost_mode):
    dest = numpy.zeros(cells.num_nodes, dtype=numpy.int32)
    return dolfinx.cpp.graph.AdjacencyList_int32(dest)

New syntax:

def serial_partitioner(comm, n, m, topo):
    dest = np.zeros(topo.num_nodes, dtype=np.int32)
    return dolfinx.cpp.graph.AdjacencyList_int32(dest)

jorgensd avatar Mar 20 '23 11:03 jorgensd
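
Putting the pieces together, a minimal end-to-end sketch under the new API (editor's illustration; it assumes create_unit_cube accepts a partitioner keyword the way the old UnitCubeMesh constructor did):

from mpi4py import MPI
import numpy as np
import dolfinx
from dolfinx.mesh import create_unit_cube, CellType


def serial_partitioner(comm, n, m, topo):
    # Send every cell to rank 0 (valid for serial runs only).
    dest = np.zeros(topo.num_nodes, dtype=np.int32)
    return dolfinx.cpp.graph.AdjacencyList_int32(dest)


mesh = create_unit_cube(MPI.COMM_WORLD, 100, 100, 100,
                        cell_type=CellType.tetrahedron,
                        partitioner=serial_partitioner)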

OK, so we could (automatically) just call this "null" partitioner when running in serial. It knocks off about 25% of the time: on my Mac it goes down from about 11s to 8s. However, if we look at the timings with dolfinx.list_timings, we see:

Compute local part of mesh dual graph                                       |     1  2.886924  2.886924
Topology: create                                                            |     1  3.307674  3.307674

The local dual graph is still computed, because it is used for reordering. This probably didn't happen in old DOLFIN, which is why creating a simple mesh there is so fast. I really wonder if we shouldn't just close this issue as "won't fix"...

chrisrichardson avatar Jul 13 '23 14:07 chrisrichardson
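
A minimal sketch of that automatic fallback (editor's illustration; the dispatch helper is hypothetical, and null_partitioner is the snippet from the comments above):

from mpi4py import MPI
import numpy as np
import dolfinx


def null_partitioner(comm, n, m, topo):
    # Keep every cell on rank 0; only meaningful when comm.size == 1.
    dest = np.zeros(topo.num_nodes, dtype=np.int32)
    return dolfinx.cpp.graph.AdjacencyList_int32(dest)


def pick_partitioner(comm, default_partitioner):
    # Hypothetical dispatch: bypass graph partitioning on a single rank,
    # otherwise fall back to the library default (e.g. SCOTCH).
    return null_partitioner if comm.size == 1 else default_partitioner

As the timings above show, though, this would skip only the partitioning step; the dual graph computation used for reordering remains.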