Performance regression in serial mesh library
DOLFIN:
time python3 -c"from dolfin import *; UnitCubeMesh(MPI.comm_world, 100, 100, 100)"
real 0m0.867s
user 0m0.761s
sys 0m0.106s
DOLFIN-X:
time python3 -c"from dolfin import *; UnitCubeMesh(MPI.comm_world, 100, 100, 100)"
real 0m15.661s
user 0m14.440s
sys 0m1.223s
This is because the serial implementation uses the same code path as the parallel one. For example, it computes the dual graph for partitioning the mesh, which is not needed in serial.
Confirming that this is still an issue. DOLFINx:
time python3 -c"import dolfinx; from mpi4py import MPI; dolfinx.UnitCubeMesh(MPI.COMM_WORLD, 100, 100, 100)"
real 0m11.179s
user 0m9.546s
sys 0m1.611s
DOLFIN:
fenics@f596c6c6e934:/root/shared$ time python3 -c"from dolfin import *; UnitCubeMesh(MPI.comm_world, 100, 100, 100)"
real 0m1.046s
user 0m0.907s
sys 0m0.140s
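To see where that time goes, dolfinx's timing summary can be used. A minimal sketch, using the same list_timings call that appears further down in this thread; the relevant entry is "Compute local part of mesh dual graph":

import dolfinx
from mpi4py import MPI

# Build the mesh in serial, then print the wall-time summary.
mesh = dolfinx.UnitCubeMesh(MPI.COMM_WORLD, 100, 100, 100)
# The listing includes "Compute local part of mesh dual graph",
# which is pure partitioning overhead on a single rank.
dolfinx.list_timings(MPI.COMM_WORLD, [dolfinx.TimingType.wall])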
To avoid computing the dual graph in serial, we can pass a custom partitioner that sets the destination of all cells to process 0:
from mpi4py import MPI
import dolfinx
import numpy

def serial_partitioner(mpi_comm, nparts, tdim, cells, ghost_mode):
    # Null partitioner: send every cell to process 0 (a no-op in serial).
    dest = numpy.zeros(cells.num_nodes, dtype=numpy.int32)
    return dolfinx.cpp.graph.AdjacencyList_int32(dest)

mesh = dolfinx.UnitCubeMesh(
    MPI.COMM_WORLD, 100, 100, 100, partitioner=serial_partitioner)
dolfinx.list_timings(MPI.COMM_WORLD, [dolfinx.TimingType.wall])
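As a quick sanity check (a sketch; the index_map accessors below are assumed to be available in this dolfinx version), the null partitioner should leave every cell on rank 0:

# Sketch: check that all cells ended up on this (single) rank.
tdim = mesh.topology.dim
cell_map = mesh.topology.index_map(tdim)
# In serial both numbers should be 6 * 100**3 = 6,000,000 tetrahedra.
print(cell_map.size_local, cell_map.size_global)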
with custom partitioner:
real 0m6.091s
user 0m4.249s
sys 0m2.458s
[MPI_AVG] Summary of timings | reps wall avg wall tot
------------------------------------------------------------------------------------------
Build BoxMesh | 1 4.254135 4.254135
Build dofmap data | 1 1.060930 1.060930
Compute SCOTCH graph re-ordering | 1 0.141512 0.141512
Compute dof reordering map | 1 0.666798 0.666798
Compute local-to-local map | 1 0.068058 0.068058
Compute-local-to-global links for global/local adjacency list | 1 0.042887 0.042887
Distribute in graph creation AdjacencyList | 1 0.509226 0.509226
Fetch float data from remote processes | 1 0.029057 0.029057
Init dofmap from element dofmap | 1 0.343332 0.343332
SCOTCH: call SCOTCH_graphBuild | 1 0.000490 0.000490
SCOTCH: call SCOTCH_graphOrder | 1 0.121626 0.121626
TOPOLOGY: Create sets | 1 0.735610 0.735610
with standard partitioner:
real 0m9.116s
user 0m6.854s
sys 0m2.900s
[MPI_AVG] Summary of timings | reps wall avg wall tot
------------------------------------------------------------------------------------------
Build BoxMesh | 1 7.280152 7.280152
Build dofmap data | 1 1.074936 1.074936
Compute SCOTCH graph re-ordering | 1 0.140453 0.140453
Compute dof reordering map | 1 0.675271 0.675271
Compute graph partition (SCOTCH) | 1 0.338212 0.338212
Compute local part of mesh dual graph | 1 2.617081 2.617081
Compute local-to-local map | 1 0.069505 0.069505
Compute non-local part of mesh dual graph | 1 0.047709 0.047709
Compute-local-to-global links for global/local adjacency list | 1 0.044120 0.044120
Distribute in graph creation AdjacencyList | 1 0.515640 0.515640
Extract partition boundaries from SCOTCH graph | 1 0.029006 0.029006
Fetch float data from remote processes | 1 0.032896 0.032896
Get SCOTCH graph data | 1 0.000000 0.000000
Init dofmap from element dofmap | 1 0.348799 0.348799
SCOTCH: call SCOTCH_dgraphBuild | 1 0.003080 0.003080
SCOTCH: call SCOTCH_dgraphHalo | 1 0.035761 0.035761
SCOTCH: call SCOTCH_dgraphPart | 1 0.190264 0.190264
SCOTCH: call SCOTCH_graphBuild | 1 0.000497 0.000497
SCOTCH: call SCOTCH_graphOrder | 1 0.120793 0.120793
TOPOLOGY: Create sets | 1 0.739001 0.739001
Updated syntax:
time python -c "from dolfinx.mesh import create_unit_cube, CellType; from mpi4py import MPI; create_unit_cube(MPI.COMM_WORLD, 100, 100, 100, cell_type=CellType.tetrahedron)"
Old syntax:
def serial_partitioner(mpi_comm, nparts, tdim, cells, ghost_mode):
    dest = numpy.zeros(cells.num_nodes, dtype=numpy.int32)
    return dolfinx.cpp.graph.AdjacencyList_int32(dest)
New syntax:
def serial_partitioner(comm, n, m, topo):
    dest = np.zeros(topo.num_nodes, dtype=np.int32)
    return dolfinx.cpp.graph.AdjacencyList_int32(dest)
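For completeness, here is a self-contained sketch of wiring the new-style partitioner into create_unit_cube; the partitioner keyword and the dolfinx.common.list_timings location are assumptions about the current API:

import numpy as np
import dolfinx
from dolfinx.mesh import create_unit_cube, CellType
from mpi4py import MPI

def serial_partitioner(comm, n, m, topo):
    # Null partitioner: keep every cell on rank 0 (a no-op in serial).
    dest = np.zeros(topo.num_nodes, dtype=np.int32)
    return dolfinx.cpp.graph.AdjacencyList_int32(dest)

mesh = create_unit_cube(MPI.COMM_WORLD, 100, 100, 100,
                        cell_type=CellType.tetrahedron,
                        partitioner=serial_partitioner)
dolfinx.common.list_timings(MPI.COMM_WORLD, [dolfinx.common.TimingType.wall])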
OK, so we could (automatically) just call this "null" partitioner when running in serial. It knocks off about 25% of the time: on my Mac, it goes down from about 11s to 8s.
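At the Python level, that automatic switch could be a small guard around the call above (a sketch reusing serial_partitioner and the imports from the previous snippet; a proper fix would presumably live in the mesh-creation code itself):

# Only inject the null partitioner when there is a single MPI rank;
# otherwise fall back to create_unit_cube's default graph partitioner.
kwargs = {"partitioner": serial_partitioner} if MPI.COMM_WORLD.size == 1 else {}
mesh = create_unit_cube(MPI.COMM_WORLD, 100, 100, 100,
                        cell_type=CellType.tetrahedron, **kwargs)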
However, if we look at the timings with dolfinx.list_timings, we see:
Compute local part of mesh dual graph | 1 2.886924 2.886924
Topology: create | 1 3.307674 3.307674
The local dual graph is still computed because it is used for reordering. This probably didn't happen in old DOLFIN, which is why it is so fast at creating a simple mesh. I really wonder if we shouldn't just close this issue as "won't fix"...