libmesh Shorten, simplify and generalise code for Nédélec finite elements

Hi Roy (@roystgnr), Alex (@lindsayad),

The overarching idea here is to generalise (some of) the code for arbitrary order finite elements. I'm aware this would be a mammoth task, but I thought we could start with some small steps to better grasp the range of changes that'd be needed.

Since we are working on adding support for 2nd order Nédélec elements of the first kind, I thought instead of just hard-coding the various properties (e.g. the no. of dofs) we could try to generalise them. In the same spirit, for the shape functions, since they are orientation-dependent, I tried to avoid writing them twice as we had done thus far because even though that's not a lot for 1st order, trust me, it is for 2nd order. Here, it'd also be interesting to experiment with obtaining the shape functions from the dof definitions, although this would (a bit more fundamentally) change the way we think about assembly (we'd need to store the shape functions a-priori for a given element type and order perhaps the first time we see such an element).

Lastly, I'd like to understand what n_dofs_per_elem is supposed to return, and in particular its relationship with the middle node on TRI7, QUAD9 or HEX27 elements. For 2nd order Nédélec, we do have some dofs that are associated with the element, so where should they go? The middle node, or the element?

Cheers, -Nuno

Apr 23 '24 17:04 nmnobre

Job Coverage on b377c8b wanted to post the following:

Coverage

	196a0a	#3840 b377c8
	Total	Total	+/-	New
Rate	62.69%	62.70%	+0.01%	80.17%
Hits	69280	69242	-38	97
Misses	41226	41194	-32	24

Diff coverage report

Full coverage report

Warnings

New new line coverage rate 80.17% is less than the suggested 90.0%

This comment will be updated on new commits.

Apr 24 '24 10:04 moosebuild

I thought instead of just hard-coding the various properties (e.g. the no. of dofs) we could try to generalise them.

This is usually a good idea and a pretty easy one; see e.g. fe_hierarchic.C for a decent example of it.

Heavily refactoring shape function definitions is also usually a good idea and usually not a pretty easy one; see e.g. fe_hierarchic_shape_3D.C for an it-was-the-best-I-could-do example of it.

we'd need to store the shape functions a-priori for a given element type and order perhaps the first time we see such an element

With Lagrange shape functions we cache results in between elements, but (unless the elements that are just shifts of each other, where we cache everything for everything) it doesn't quite save as much as you'd think; you don't have to recompute in master space each time but the physical space gradients have some cost.

n_dofs_per_elem is intended to return the number of DoFs which are stored on the Elem. If you have an element with a middle node and you've got a nodal basis like LAGRANGE or BERNSTEIN then the node is where any "bubble" shape functions belong and that's where we store them; if not then we usually just put them on the Elem; there's no big difference one way or another.

Apr 26 '24 19:04 roystgnr

This is usually a good idea and a pretty easy one; see e.g. fe_hierarchic.C for a decent example of it.

Perfect, my code was already very similar to that, but I've refactored to follow the same style.

Heavily refactoring shape function definitions is also usually a good idea and usually not a pretty easy one; see e.g. fe_hierarchic_shape_3D.C for an it-was-the-best-I-could-do example of it.

Okay, let's leave that for later then. Let's focus on fe_nedelec.one.C for now, and in particular on nedelec_one_nodal_soln(). I got a bit confused with the logic surrounding order and total_order, and fe_type (previously used for n_sf) and p_refined_fe_type (used for vis_fe), could you please confirm the changes I made are sound?

if not then we usually just put them on the Elem; there's no big difference one way or another.

Good, stored on the element then.

Apr 29 '24 12:04 nmnobre

Sorry for the late review; this all looks like obvious improvements to me (assuming the higher-o cases are correct; I can't wait until we have that working!).

I'm going to be out shortly, but I should find time to merge this late today regardless unless anyone objects first.

Jun 10 '24 14:06 roystgnr

Hi Roy,

Thanks a lot for finding the time to review this. Spoiler: we have 2nd order implementations on both QUADs and TRIs working already.

There's two things I'd like to ask before you merge:

If you are happy with 0f0ca51, ideally n_dofs would move to fe.C and we could remove the specialisation for most FE families. Off the top of your head, are there any FE families (or any other reason) this couldn't happen?
In the same vein, could you have a close look at nedelec_one_nodal_soln()? Except for the checks which might be particular to specific FE families, the method we're using to evaluate the solution at the nodes is pretty general... I understand for nodal variables this might not be optimal (performance-wise) but it'd certainly still work (i.e. it'd still compute the correct result). Also, what's currently happening at AMR interfaces? Do we need special provisions there?

Jun 10 '24 14:06 nmnobre

If you are happy with https://github.com/libMesh/libmesh/commit/0f0ca51fcdf7e695c0c13c3a2f2a3ef4884c662c

I'm fine with it, but I actually wouldn't have gone in that direction myself.

are there any FE families (or any other reason) this couldn't happen?

There are no families for which this couldn't happen, but there are two reasons why IMHO it shouldn't:

We need to benchmark it first, in a worst-case (first-order LAGRANGE on HEX27?) scenario. We've done a lot of micro-optimization over the years, at least for the LAGRANGE case, and I'd hate to only find out a year down the road that switching
We get just a little bit more error checking in METHOD=dbg, in DofMap::dof_indices IIRC, where we extend a vector of indices n_dofs_per_elem and n_dofs_per_node at a time, but then at the end assert that we got n_dofs out of it. This has caught bugs in new FE types and new specializations of FE types in the past, but if we replaced n_dofs with something that sums n_dofs_per_elem and n_dofs_per_node at a time then we're just replacing that verification with a tautology. I didn't say anything about that commit because I'm willing to grand that you don't need the extra bug-checking, but I'm afraid I need it.

In the same vein, could you have a close look at nedelec_one_nodal_soln()? Except for the checks which might be particular to specific FE families, the method we're using to evaluate the solution at the nodes is pretty general... I understand for nodal variables this might not be optimal (performance-wise) but it'd certainly still work

I didn't see any problems, but I fear my disdain for nodal_soln() might have me subconsciously not looking as hard as I should have been; IIRC I wrote RATIONAL_BERNSTEIN nodal_soln() with a bug not too long ago. I wouldn't worry at all about performance there, though, it gets called once per solution output (e.g. once per timestep) when we're decimating for visualization, and because of that whole "decimating" thing it should never be getting called from anything like a residual assembly that might happen more frequently.

Also, what's currently happening at AMR interfaces?

When we see a non-conforming interface, we call FE<dim,NEDELEC_ONE>::compute_constraints() to calculate the constraint equations needed to make the solution conforming on that interface, and that ends up throwing libmesh_not_implemented()

Do we need special provisions there?

I'm afraid so. We have a generic compute_proj_constraints that looks at FE continuity type, and then either returns (DISCONTINUOUS) or calculates the constraints via local L2 (C_ZERO, SIDE_DISCONTINUOUS) or H1 (C_ONE) projections, but nobody's added support for H_DIV or H_CURL there yet.

Jun 10 '24 18:06 roystgnr

There are no families for which this couldn't happen, but there are two reasons why IMHO it shouldn't:

We need to benchmark it first, in a worst-case (first-order LAGRANGE on HEX27?) scenario. We've done a lot of micro-optimization over the years, at least for the LAGRANGE case, and I'd hate to only find out a year down the road that switching

We get just a little bit more error checking in METHOD=dbg, in DofMap::dof_indices IIRC, where we extend a vector of indices n_dofs_per_elem and n_dofs_per_node at a time, but then at the end assert that we got n_dofs out of it. This has caught bugs in new FE types and new specializations of FE types in the past, but if we replaced n_dofs with something that sums n_dofs_per_elem and n_dofs_per_node at a time then we're just replacing that verification with a tautology. I didn't say anything about that commit because I'm willing to grand that you don't need the extra bug-checking, but I'm afraid I need it.

These are exactly the two points I was worried about: 1) performance and 2) that we lose a potential check just like what you described (even though I couldn't find it). Before adding that commit, I actually checked myself that adding the element and node dofs produced the expression I was writing down under n_dofs. I'll recover that by dropping the commit. In any case, my intent was always to either replace n_dofs either for all or for no FE families, not just for Nédélec. Consistency should be queen. :)

Jun 10 '24 23:06 nmnobre

I wouldn't worry at all about performance there, though, it gets called once per solution output (e.g. once per timestep) when we're decimating for visualization

Perfect, so are we saying we can move this to fe.C and call that same method for all families (except for any specific checks like element type)? - hopefully, by applying this to all other families, CI would also flag any obvious mistakes we haven't spotted?

and that ends up throwing libmesh_not_implemented()

Hm, that's good. That means we can proceed without AMR support until we need it? 😇

Jun 10 '24 23:06 nmnobre

so are we saying we can move this to fe.C and call that same method for all families (except for any specific checks like element type)? - hopefully, by applying this to all other families, CI would also flag any obvious mistakes we haven't spotted?

We definitely still want to allow individual families to override the default, and we'll want to keep the existing per-family code for cases where there's a property that makes for simple+cheap reimplementation (e.g. LAGRANGE or SCALAR), or where there's a serious performance issue (e.g. RATIONAL_*), or where there's an issue in even defining a nodal solution sensibly (SIDE_HIERARCHIC) ... but yeah, I think at least 80% of our FE families probably ought to be calling the same code here. Let's save that for a separate PR, though.

That means we can proceed without AMR support until we need it?

Sadly, but yes. :sweat_smile:

Jun 11 '24 14:06 roystgnr

For completion and future reference.

I compiled libMesh with MOOSE's default configuration plus --enable-perflog. Then ran the 2d Nédélec one example, i.e. vector_fe/ex3, with grid_size = 100, 5 times with for i in {1..5}; do ./example-opt element_type=TRI7 -pc_type jacobi | grep "| FE " -A2; done.

BEFORE:

| FE                                                                                                              |
|   compute_shape_functions()        121600     0.0368      0.000000    0.0368      0.000000    0.50     0.50     |
|   init_shape_functions()           121600     0.0386      0.000000    0.0386      0.000000    0.52     0.52     |
| FE                                                                                                              |
|   compute_shape_functions()        121600     0.0382      0.000000    0.0382      0.000000    0.51     0.51     |
|   init_shape_functions()           121600     0.0386      0.000000    0.0386      0.000000    0.52     0.52     |
| FE                                                                                                              |
|   compute_shape_functions()        121600     0.0379      0.000000    0.0379      0.000000    0.51     0.51     |
|   init_shape_functions()           121600     0.0384      0.000000    0.0384      0.000000    0.52     0.52     |
| FE                                                                                                              |
|   compute_shape_functions()        121600     0.0377      0.000000    0.0377      0.000000    0.51     0.51     |
|   init_shape_functions()           121600     0.0382      0.000000    0.0382      0.000000    0.52     0.52     |
| FE                                                                                                              |
|   compute_shape_functions()        121600     0.0382      0.000000    0.0382      0.000000    0.52     0.52     |
|   init_shape_functions()           121600     0.0386      0.000000    0.0386      0.000000    0.52     0.52     |

AFTER:

| FE                                                                                                              |
|   compute_shape_functions()        121600     0.0423      0.000000    0.0423      0.000000    0.57     0.57     |
|   init_shape_functions()           121600     0.0396      0.000000    0.0396      0.000000    0.53     0.53     |
| FE                                                                                                              |
|   compute_shape_functions()        121600     0.0424      0.000000    0.0424      0.000000    0.57     0.57     |
|   init_shape_functions()           121600     0.0387      0.000000    0.0387      0.000000    0.52     0.52     |
| FE                                                                                                              |
|   compute_shape_functions()        121600     0.0429      0.000000    0.0429      0.000000    0.58     0.58     |
|   init_shape_functions()           121600     0.0394      0.000000    0.0394      0.000000    0.53     0.53     |
| FE                                                                                                              |
|   compute_shape_functions()        121600     0.0436      0.000000    0.0436      0.000000    0.59     0.59     |
|   init_shape_functions()           121600     0.0396      0.000000    0.0396      0.000000    0.53     0.53     |
| FE                                                                                                              |
|   compute_shape_functions()        121600     0.0428      0.000000    0.0428      0.000000    0.58     0.58     |
|   init_shape_functions()           121600     0.0401      0.000000    0.0401      0.000000    0.54     0.54     |

Same, but with grid_size = 150...

BEFORE:

| FE                                                                                                              |
|   compute_shape_functions()        272400     0.0859      0.000000    0.0859      0.000000    0.41     0.41     |
|   init_shape_functions()           272400     0.0851      0.000000    0.0851      0.000000    0.41     0.41     |
| FE                                                                                                              |
|   compute_shape_functions()        272400     0.0845      0.000000    0.0845      0.000000    0.40     0.40     |
|   init_shape_functions()           272400     0.0857      0.000000    0.0857      0.000000    0.41     0.41     |
| FE                                                                                                              |
|   compute_shape_functions()        272400     0.0857      0.000000    0.0857      0.000000    0.41     0.41     |
|   init_shape_functions()           272400     0.0859      0.000000    0.0859      0.000000    0.41     0.41     |
| FE                                                                                                              |
|   compute_shape_functions()        272400     0.0857      0.000000    0.0857      0.000000    0.41     0.41     |
|   init_shape_functions()           272400     0.0859      0.000000    0.0859      0.000000    0.41     0.41     |
| FE                                                                                                              |
|   compute_shape_functions()        272400     0.0854      0.000000    0.0854      0.000000    0.41     0.41     |
|   init_shape_functions()           272400     0.0862      0.000000    0.0862      0.000000    0.41     0.41     |

AFTER:

| FE                                                                                                              |
|   compute_shape_functions()        272400     0.0964      0.000000    0.0964      0.000000    0.46     0.46     |
|   init_shape_functions()           272400     0.0880      0.000000    0.0880      0.000000    0.42     0.42     |
| FE                                                                                                              |
|   compute_shape_functions()        272400     0.0959      0.000000    0.0959      0.000000    0.46     0.46     |
|   init_shape_functions()           272400     0.0862      0.000000    0.0862      0.000000    0.41     0.41     |
| FE                                                                                                              |
|   compute_shape_functions()        272400     0.0949      0.000000    0.0949      0.000000    0.45     0.45     |
|   init_shape_functions()           272400     0.0867      0.000000    0.0867      0.000000    0.41     0.41     |
| FE                                                                                                              |
|   compute_shape_functions()        272400     0.0966      0.000000    0.0966      0.000000    0.46     0.46     |
|   init_shape_functions()           272400     0.0893      0.000000    0.0893      0.000000    0.43     0.43     |
| FE                                                                                                              |
|   compute_shape_functions()        272400     0.0965      0.000000    0.0965      0.000000    0.46     0.46     |
|   init_shape_functions()           272400     0.0880      0.000000    0.0880      0.000000    0.42     0.42     |

Jun 12 '24 21:06 nmnobre

That's not a huge difference but it is much more than I'd expected. Init might be 1% slower or that might be my imagination, but compute looks 15% slower? IMHO that's worth it for more general and simpler code, but it's at least debatable.

Jun 12 '24 21:06 roystgnr

Yeah.... (sigh)... a bit disappointing... comparing the fastest times, init is roughly 1% slower and compute is about 15% slower for the smaller grid size and 12% slower for the larger grid size I just added. I'm inclined to agree with you that it's worth it as, at least for this potato problem, assembly is <0.5% of the total time...

Jun 12 '24 22:06 nmnobre