Simplify and inline more FE shape computations
The 2D and 3D shape computations are massive switch statements on the element type and the order. In that form, it likely does not make sense to inline them for (potential) perf gains.
But we could switch (or, during a transition period at least, duplicate) the signature from

```cpp
template <FEFamily T>
Real fe_lagrange_2D_shape(const ElemType type,
                          const Elem * elem,
                          const Order order,
                          const unsigned int i,
                          const Point & p)
```

to

```cpp
template <FEFamily T, Order order, ElemType type>
inline
Real fe_2D_shape(const Elem * elem,
                 const unsigned int i,
                 const Point & p)
```
and get rid of most of the switch statements. I don't think it would be much more code overall. If we templated on `i` as well, we could get rid of all the switches. Potentially cool for vectorizing and GPUs.
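To illustrate the contrast, here is a minimal sketch of the two styles. The names (`fe_2D_shape_runtime`, the enums) and the reduced argument list are illustrative stand-ins, not the actual libMesh signatures:

```cpp
#include <cassert>
#include <stdexcept>

enum ElemType { QUAD4, TRI3 };
enum Order { FIRST, SECOND };

// Runtime dispatch: one switch per call, hard to inline.
double fe_2D_shape_runtime(ElemType type, Order order, unsigned int i,
                           double xi, double eta)
{
  if (type == QUAD4 && order == FIRST)
    switch (i) { // bilinear shape functions on the reference square [-1,1]^2
      case 0: return 0.25 * (1 - xi) * (1 - eta);
      case 1: return 0.25 * (1 + xi) * (1 - eta);
      case 2: return 0.25 * (1 + xi) * (1 + eta);
      case 3: return 0.25 * (1 - xi) * (1 + eta);
    }
  throw std::logic_error("unsupported type/order/i combination");
}

// Templated version: the body for each (type, order) pair is straight-line
// code the compiler can inline at the call site.
template <ElemType type, Order order>
double fe_2D_shape(unsigned int i, double xi, double eta);

template <>
inline double fe_2D_shape<QUAD4, FIRST>(unsigned int i, double xi, double eta)
{
  const double sx = (i == 1 || i == 2) ? 1.0 : -1.0;
  const double sy = (i >= 2) ? 1.0 : -1.0;
  return 0.25 * (1 + sx * xi) * (1 + sy * eta);
}
```

Each `(type, order)` specialization carries only its own arithmetic, so the total code need not grow much even as the runtime switches disappear.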
The potential gains downstream in MOOSE are about 1-2% in 2D (for now; other costs are going down, so that 1-2% might go up), possibly more in 3D, and potentially less at higher order.
```
flat flat% sum% cum cum%
9.12s 22.57% 22.57% 9.35s 23.14% tcmalloc::CentralFreeList::Populate
1.87s 4.63% 27.20% 2.24s 5.54% MatSetValues_SeqAIJ
1.69s 4.18% 31.38% 39.78s 98.44% [libsasl2.2.dylib]
1.61s 3.98% 35.36% 2.56s 6.34% MooseMesh::cacheInfo
1.42s 3.51% 38.88% 2.75s 6.81% libMesh::BoundaryInfo::boundary_ids
1.07s 2.65% 41.52% 1.07s 2.65% libMesh::fe_lagrange_1D_linear_shape
0.99s 2.45% 43.97% 0.99s 2.45% hypre_BoomerAMGRelaxHybridGaussSeidel_core
0.91s 2.25% 46.23% 0.97s 2.40% libMesh::Elem::which_child_am_i
0.85s 2.10% 48.33% 1.48s 3.66% MooseVariableData::computeValuesInternal
0.61s 1.51% 49.84% 0.61s 1.51% libMesh::FEMap::compute_single_point_map
0.60s 1.48% 51.32% 0.60s 1.48% libMesh::H1FETransformation::map_dphi
0.55s 1.36% 52.68% 0.60s 1.48% libMesh::Elem::contains_vertex_of
0.54s 1.34% 54.02% 1.61s 3.98% (anonymous namespace)::fe_lagrange_2D_shape
0.54s 1.34% 55.36% 0.99s 2.45% Kernel::computeResidual
0.47s 1.16% 56.52% 0.47s 1.16% libMesh::Face::dim
0.46s 1.14% 57.66% 0.46s 1.14% MooseMesh::getNodeBlockIds
0.40s 0.99% 58.65% 1.14s 2.82% libMesh::FEMap::compute_affine_map
```
The 3.98% = 2.65% from the 1D shape calc plus 1.34% from the 2D one, which is just the switch statement (there's nothing else there).
Note that MOOSE has already duplicated the Lagrange shapes code for AD templating, I think. But if we template here, we could get rid of that with just an extra argument.
Getting rid of the ElemType argument would break a half dozen of our internal calls (the ones that pass nullptr for the Elem) right now. Those are calls that I'd like to deprecate (they use the function signature that doesn't work for many non-Lagrange basis types) but I don't seem to have marked them deprecated yet; we'll need a transition period here.
Getting rid of the Order argument I don't understand. The order of the Lagrange variable is not required to be the same as the elem->default_order(), or anything else about the element. libMesh doesn't support superparametric Lagrange variables, but we support (and frequently rely on) subparametric Lagrange. Even our simple Stokes flow examples will call both fe_lagrange_2D_shape(elem->type(), elem, SECOND, 0, p) and fe_lagrange_2D_shape(elem->type(), elem, FIRST, 0, p) in the same computation for the same elem and p and expect to get two different results.
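To make the subparametric point concrete, here is a minimal 1D analog (illustrative; the real code lives in `fe_lagrange_1D_linear_shape` and friends): the same reference point yields different shape values at FIRST and SECOND order, so the order genuinely cannot be inferred from the element.

```cpp
#include <cassert>
#include <cmath>

// Lagrange shape function i on the 1D reference element [-1,1].
// order 1: nodes at -1, +1; order 2: nodes at -1, +1, 0 (endpoints first).
double lagrange_1D_shape(int order, unsigned int i, double xi)
{
  if (order == 1)
    return (i == 0) ? 0.5 * (1 - xi) : 0.5 * (1 + xi);
  // order == 2 (quadratic)
  switch (i) {
    case 0:  return 0.5 * xi * (xi - 1);
    case 1:  return 0.5 * xi * (xi + 1);
    default: return 1 - xi * xi;       // midpoint node
  }
}
```

At `xi = 0.5`, shape 0 is `0.25` at first order but `-0.125` at second order: two different, both correct, answers for the same element and point.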
> Getting rid of the Order argument I don't understand.
It's not gone gone; it would just become a template argument instead of a function parameter. But you're right that in some places we would still end up with a switch statement on the order, because we need more than one order.
Oh, I'm an idiot - I was looking at the argument signature and didn't notice the change in the template signature.
How does this get rid of any switch statements, though? With the current code in fe_lagrange_shape_2D.C it would just take away one switch statement in each of those functions and replace each one with half a dozen switch statements in the places where those functions are called.
> How does this get rid of any switch statements, though? With the current code in fe_lagrange_shape_2D.C it would just take away one switch statement in each of those functions and replace each one with half a dozen switch statements in the places where those functions are called.
You're not wrong. It only completely gets rid of the switches if we know in those places what the order is at compile time. So yeah, not a good idea unless we start templating MOOSE variables on the order. ElemType has the same problem: we'd have to be calling these routines from the element rather than from the FE to know the ElemType at compile time, but then we'd incur the vtable cost when calling the element routine.
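One middle ground worth noting (a sketch, not an existing libMesh/MOOSE API): hoist the runtime switch out of the hot loop by dispatching once per batch to a templated kernel, so that inside the loop the order is a compile-time constant even though the caller only knows it at runtime.

```cpp
#include <cassert>
#include <vector>

// Templated kernel: 'order' is a compile-time constant, so the ternary
// folds away and the loop body is branch-free and vectorizable.
template <int order>
double sum_shape0(const std::vector<double> & pts)
{
  double sum = 0;
  for (double xi : pts)
    sum += (order == 1) ? 0.5 * (1 - xi)        // linear shape 0
                        : 0.5 * xi * (xi - 1);  // quadratic shape 0
  return sum;
}

// One runtime dispatch per batch of points, instead of one per point.
double sum_shape0(int order, const std::vector<double> & pts)
{
  switch (order) {
    case 1:  return sum_shape0<1>(pts);
    case 2:  return sum_shape0<2>(pts);
    default: return 0.0;
  }
}
```

This doesn't eliminate the switch, but it moves it from per-quadrature-point to per-batch, which is most of the win when batches are large.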
For now, this is what we know in the FE: the dimension and the family.

```cpp
template <unsigned int Dim, FEFamily T>
class FE : public FEGenericBase<typename FEOutputType<T>::type>
```
Additional thing I tried:
- Removing the error messages in 2D is not enough to get inlining, unlike in the 1D linear case. But after removing all the other cases (keeping just QUAD4 + first order), we do get the inline we want and it disappears from profiling, "seemingly" not even transferred to the callers:
```
Dropped 381 nodes (cum <= 0.21s)
Showing top 50 nodes out of 224
flat flat% sum% cum cum%
9.42s 22.33% 22.33% 9.67s 22.92% tcmalloc::CentralFreeList::Populate
2.45s 5.81% 28.13% 3.23s 7.66% MooseMesh::cacheInfo
2.08s 4.93% 33.06% 2.30s 5.45% MatSetValues_SeqAIJ
1.75s 4.15% 37.21% 41.58s 98.55% [libsasl2.2.dylib]
1.33s 3.15% 40.37% 2.82s 6.68% libMesh::BoundaryInfo::boundary_ids
1.08s 2.56% 42.92% 1.08s 2.56% hypre_BoomerAMGRelaxHybridGaussSeidel_core
1.08s 2.56% 45.48% 1.11s 2.63% libMesh::Elem::which_child_am_i
0.89s 2.11% 47.59% 1.39s 3.29% MooseVariableData::computeValuesInternal
0.71s 1.68% 49.28% 0.71s 1.68% libMesh::Face::dim
0.68s 1.61% 50.89% 11.33s 26.85% tc_new
0.67s 1.59% 52.48% 0.85s 2.01% libMesh::Elem::contains_vertex_of
0.66s 1.56% 54.04% 1.14s 2.70% Kernel::computeResidual
0.49s 1.16% 55.20% 0.49s 1.16% libMesh::H1FETransformation::map_dphi
0.48s 1.14% 56.34% 0.48s 1.14% libMesh::FEMap::compute_single_point_map
0.45s 1.07% 57.41% 0.45s 1.07% MooseMesh::getNodeBlockIds
0.41s 0.97% 58.38% 0.76s 1.80% libMesh::DofMap::_dof_indices
0.41s 0.97% 59.35% 1.03s 2.44% libMesh::FEMap::compute_affine_map
0.41s 0.97% 60.32% 0.52s 1.23% libMesh::Predicates::level::operator()
0.40s 0.95% 61.27% 1s 2.37% libMesh::UnstructuredMesh::find_neighbors
0.40s 0.95% 62.22% 0.77s 1.83% std::__1::__hash_table::__emplace_unique_key_args
0.39s 0.92% 63.14% 0.46s 1.09% PackedCache::TryGet
0.37s 0.88% 64.02% 1.86s 4.41% tc_delete
0.35s 0.83% 64.85% 0.35s 0.83% PetscSortIntWithDataArray
0.34s 0.81% 65.66% 0.34s 0.81% PetscSFSetUpRanks
0.34s 0.81% 66.46% 0.44s 1.04% libMesh::FE::all_shapes
0.30s 0.71% 67.17% 2.64s 6.26% variant_filter_iterator::Pred::operator()
0.27s 0.64% 67.81% 0.27s 0.64% std::__1::__cxx_atomic_load[abi:ne180100]
0.26s 0.62% 68.43% 0.26s 0.62% Diffusion::computeQpResidual
0.25s 0.59% 69.02% 0.64s 1.52% Assembly::cacheResidualBlock
0.23s 0.55% 69.57% 0.23s 0.55% hypre_BoomerAMGBuildInterp.omp_outlined.5
0.22s 0.52% 70.09% 2.94s 6.97% Assembly::reinitFE
0.22s 0.52% 70.61% 3.68s 8.72% MooseMesh::nodeToActiveSemilocalElemMap
0.21s 0.5% 71.11% 0.24s 0.57% libMesh::(anonymous namespace)::lagrange_compute_constraints
0.20s 0.47% 71.58% 0.26s 0.62% MooseMesh::updateActiveSemiLocalNodeRange
0.20s 0.47% 72.05% 15.29s 36.24% ThreadedElementLoopBase::operator()
0.18s 0.43% 72.48% 0.28s 0.66% Kernel::computeJacobian
0.18s 0.43% 72.91% 0.30s 0.71% libMesh::System::local_dof_indices
0.16s 0.38% 73.29% 0.31s 0.73% Assembly::prepareResidual
0.15s 0.36% 73.64% 0.38s 0.9% Assembly::cacheJacobianBlock
0.15s 0.36% 74.00% 0.43s 1.02% MooseVariableDataBase::fetchDoFValues
0.14s 0.33% 74.33% 1.45s 3.44% libMesh::ElemInternal::find_point_neighbors
0.12s 0.28% 74.61% 0.24s 0.57% tcmalloc::SLL_Push
```
For 20 runs of the simple diffusion problem, we drop another second in runtime. Average runtime: 40.530512 seconds. Standard deviation: 0.626605 seconds.
> ElemType same problem.
Maybe we should have the "1 elem type per subdomain" limitation to help us loop over elements with a fixed element type.
> Maybe we should have the "1 elem type per subdomain" limitation
Definitely not. Elem types are a feature of the mesh. Subdomains are a feature of the domain. A problem should be able to define its domain before even thinking about how it might eventually be meshed.
> to help us loop over elements with a fixed element type
This, though, has been a vague wishlist item of mine for forever. It might be a great idea to cache ranges of elements sorted to be "identical" in the important respects (subdomain + type + p-level, at least), then iterate over those ranges when assembling to maximize what we can vectorize and minimize how many times we have to invalidate various downstream caches of e.g. FE data. In the past I've worried that this might be more trouble than benefit, but if the benefit extends to GPUs and not just caches and SIMD then maybe it's time.
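A minimal sketch of that range-caching idea, assuming a stand-in `ElemRec` whose fields mirror `subdomain_id()`, `type()`, and `p_level()` (not real libMesh API):

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <utility>
#include <vector>

// Stand-in for an element; real code would hold Elem pointers.
struct ElemRec { int subdomain; int type; int p_level; int id; };

static std::tuple<int, int, int> key_of(const ElemRec & e)
{ return {e.subdomain, e.type, e.p_level}; }

// Sort elements so that "identical" ones (same subdomain + type + p-level)
// are adjacent, then return the [begin, end) index ranges of each group.
// Assembly could then reinit FE data once per range instead of per element.
std::vector<std::pair<std::size_t, std::size_t>>
homogeneous_ranges(std::vector<ElemRec> & elems)
{
  std::stable_sort(elems.begin(), elems.end(),
                   [](const ElemRec & a, const ElemRec & b)
                   { return key_of(a) < key_of(b); });

  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  for (std::size_t b = 0; b < elems.size();) {
    std::size_t e = b;
    while (e < elems.size() && key_of(elems[e]) == key_of(elems[b]))
      ++e;
    ranges.emplace_back(b, e);
    b = e;
  }
  return ranges;
}
```

The stable sort keeps the original order within each homogeneous group, which matters if the original numbering encodes locality.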
Where is the element numbering code in libMesh? I was thinking that just sorting by subdomain would already achieve something significant in MOOSE, by skipping the "reinitSubdomain" calls. I did not even know about FE data caches.
> Where is the element numbering code in libMesh?
Everywhere!!!
For some file formats we manually try to match the numbering in the file, but mostly we just add ids as we add new elements (which also matches some file formats because they use contiguous numbering and list data sequentially). Mesh generation code can also set numbers manually, which we often do, whether to reduce the amount of communication we'd otherwise need for something like an extrusion of a distributed mesh or to avoid conflicts between id numbers when stitching meshes.
Then, with DistributedMesh we'll renumber after repartitioning (because contiguous numbers on each processor give us a little extra efficiency in the data structure), and on any mesh we'll renumber after adaptive coarsening (without reordering) to fill in gaps left by coarsened-away elements.
> I was thinking already sorting by subdomain would achieve something significant in MOOSE, skipping the "reinitSubdomain" calls. I did not even know about FE data caches
This would actually be much much quicker and simpler to test than my "add caches for presorted vectors of element pointers" idea. Sort by subdomain first, then by element type within each subdomain? That might be a free speedup for a lot of runs. It might also break a lot of code that's not robust to renumbering but still doesn't disallow renumbering; I'd bet there's at least a little code in MOOSE that only avoids renumbering indirectly by not getting run distributed or with adaptivity.
Seems like if we renumber during the simulation, we need the numbering schemes to be implemented in libMesh (not in a MOOSE mesh generator).
> Sort by subdomain first, then by element type within each subdomain?
Yes, something like that, and possibly keeping proximity in mind, like most initial numberings would.
Yeah, this would need to be an option in ReplicatedMesh::renumber_nodes_and_elements and DistributedMesh::renumber_nodes_and_elements.
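As a sketch of what such an option might compute (illustrative only; the real hook would live in `renumber_nodes_and_elements`, and `ERec` stands in for an element): a stable sort by (subdomain, type) assigns the new ordering while preserving the original, typically proximity-aware, order within each bucket.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in element record; a real implementation would permute Elem ids.
struct ERec { int subdomain; int type; int old_id; };

// Renumbering order: subdomain first, then element type. std::stable_sort
// preserves the original (often proximity-aware) order within each group.
void sort_for_renumbering(std::vector<ERec> & elems)
{
  std::stable_sort(elems.begin(), elems.end(),
                   [](const ERec & a, const ERec & b)
                   {
                     if (a.subdomain != b.subdomain)
                       return a.subdomain < b.subdomain;
                     return a.type < b.type;
                   });
}
```

After this sort, an element loop would cross each subdomain boundary exactly once, so per-subdomain reinit work runs once per group rather than potentially once per element.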