`Assembly::reinitElemAndNeighbor` was running inefficiently
I use `moose/test/tests/dgkernels/3d_diffusion_dg/3d_diffusion_dg.i` to test the DG module. First, with the DGKernels enabled, the perf_log output is:
**Moose Test Performance: Alive time=1.4056, Active time=1.37504**

| Event | nCalls | Total Time w/o Sub | Avg Time w/o Sub | Total Time With Sub | Avg Time With Sub | % of Active Time w/o Sub | % of Active Time With Sub |
|---|---|---|---|---|---|---|---|
| **Application** | | | | | | | |
| Full Runtime | 1 | 0.0004 | 0.000441 | 1.3750 | 1.375049 | 0.03 | 100.00 |
| **Execution** | | | | | | | |
| computeAuxiliaryKernels() | 2 | 0.0000 | 0.000000 | 0.0000 | 0.000000 | 0.00 | 0.00 |
| computeControls() | 2 | 0.0000 | 0.000001 | 0.0000 | 0.000001 | 0.00 | 0.00 |
| computeDiracContributions() | 37 | 0.0002 | 0.000007 | 0.0002 | 0.000007 | 0.02 | 0.02 |
| computeUserObjects() | 4 | 0.0011 | 0.000270 | 0.0011 | 0.000270 | 0.08 | 0.08 |
| compute_jacobian() | 2 | 0.0832 | 0.041596 | 0.0832 | 0.041597 | 6.05 | 6.05 |
| compute_residual() | 35 | 1.2410 | 0.035457 | 1.2412 | 0.035464 | 90.25 | 90.27 |
| solve() | 1 | 0.0042 | 0.004165 | 1.3286 | 1.328598 | 0.30 | 96.62 |
Next, disable the `DGKernels` block:
```
#[DGKernels]
#  active = 'dg_diff'
#  [./dg_diff]
#    type = DGDiffusion
#    variable = u
#    epsilon = -1
#    sigma = 6
#  [../]
#[]
```
Then print the perf_log again:
**Moose Test Performance: Alive time=0.102042, Active time=0.071381**

| Event | nCalls | Total Time w/o Sub | Avg Time w/o Sub | Total Time With Sub | Avg Time With Sub | % of Active Time w/o Sub | % of Active Time With Sub |
|---|---|---|---|---|---|---|---|
| **Application** | | | | | | | |
| Full Runtime | 1 | 0.0005 | 0.000450 | 0.0714 | 0.071385 | 0.63 | 100.01 |
| **Execution** | | | | | | | |
| computeAuxiliaryKernels() | 2 | 0.0000 | 0.000000 | 0.0000 | 0.000000 | 0.00 | 0.00 |
| computeControls() | 2 | 0.0000 | 0.000000 | 0.0000 | 0.000000 | 0.00 | 0.00 |
| computeDiracContributions() | 7 | 0.0000 | 0.000005 | 0.0000 | 0.000005 | 0.05 | 0.05 |
| computeUserObjects() | 4 | 0.0011 | 0.000269 | 0.0011 | 0.000269 | 1.51 | 1.51 |
| compute_jacobian() | 1 | 0.0058 | 0.005840 | 0.0058 | 0.005841 | 8.18 | 8.18 |
| compute_residual() | 6 | 0.0268 | 0.004474 | 0.0269 | 0.004480 | 37.61 | 37.66 |
| solve() | 1 | 0.0013 | 0.001310 | 0.0340 | 0.034032 | 1.84 | 47.68 |
Comparing the per-call cost of `compute_residual()` in the two cases:

- DGKernel on: 0.035457 s
- DGKernel off: 0.004474 s

A roughly 10x cost here seems unusual.
Tracking down where the time goes, the culprit is `_assembly[tid]->reinitElemAndNeighbor(elem, side, neighbor, neighbor_side);` in `FEProblem::reinitNeighbor`. Please take a look.
It's not "inefficient"... it does a LOT of work and gets called A LOT (every side of every element in the mesh).
Basic breakdown of what it does:
1. Evaluate a qrule on the side of this element.
2. Evaluate all of the shape functions for every variable at those quadrature points.
3. Evaluate all of the variables and gradients at those quadrature points.
4. Map the reference points on this element into physical space.
5. Map the physical-space points into the reference space on the neighbor element.
6. Evaluate all of the shape functions for every variable at those reference-space points in the neighbor.
7. Evaluate all of the variables and gradients at those reference-space points.
That is a LOT of work. And it will be happening 6 times for every hex in the mesh (each side...). You showed that the simulation takes 10x longer when using DG... that sounds almost exactly right.
DG is expensive! Lots of work to do!
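For concreteness, here is a minimal sketch of that sequence written with libMesh's public FE API, in the spirit of libMesh's DG example. The function `reinit_face_pair` and its arguments are illustrative stand-ins, not MOOSE's actual `Assembly` code.

```cpp
// Sketch (not MOOSE source): the element/neighbor face-evaluation pattern that
// reinitElemAndNeighbor has to perform, expressed with plain libMesh calls.
// `elem`, `neighbor`, `side`, `dim` and `fe_type` are assumed to be supplied
// by the caller (e.g. from a loop over active elements and their sides).
#include "libmesh/elem.h"
#include "libmesh/fe.h"
#include "libmesh/fe_interface.h"
#include "libmesh/quadrature_gauss.h"

using namespace libMesh;

void reinit_face_pair(const Elem * elem,
                      const Elem * neighbor,
                      unsigned int side,
                      unsigned int dim,
                      const FEType & fe_type)
{
  // Step 1: a quadrature rule on the side of this element
  QGauss qface(dim - 1, fe_type.default_quadrature_order());

  // Steps 2-4: shape functions on the element side; reinit also maps the
  // reference quadrature points into physical space (get_xyz()).
  std::unique_ptr<FEBase> fe_face = FEBase::build(dim, fe_type);
  fe_face->attach_quadrature_rule(&qface);
  const std::vector<std::vector<Real>> & phi_face = fe_face->get_phi();
  const std::vector<Point> & qface_points = fe_face->get_xyz();
  fe_face->reinit(elem, side);

  // Step 5: map the physical quadrature points back into the neighbor's
  // reference space.
  std::vector<Point> neighbor_ref_points;
  FEInterface::inverse_map(dim, fe_type, neighbor, qface_points, neighbor_ref_points);

  // Steps 6-7: shape functions on the neighbor, evaluated at exactly those
  // mapped reference points (variable values/gradients then follow from the
  // shape functions and the DoFs).
  std::unique_ptr<FEBase> fe_neighbor_face = FEBase::build(dim, fe_type);
  const std::vector<std::vector<Real>> & phi_neighbor = fe_neighbor_face->get_phi();
  fe_neighbor_face->reinit(neighbor, &neighbor_ref_points);

  // phi_face and phi_neighbor are now aligned point-by-point, which is what a
  // DGKernel needs to assemble flux terms on this side.
  (void)phi_face;
  (void)phi_neighbor;
}
```

Doing this for every side of every element is where the DG overhead comes from.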
I am wondering how much time we could save if we could get rid of steps 4 and 5.
There is no getting rid of it... for multiple reasons. There is no other way to know where to evaluate the shape functions on the neighbor side.
Also: Don't forget about mesh adaptivity... that makes things insanely complicated... but it's all dealt with by this system.
Mapping from the unique quadrature points on the shared side of two adjoining elements is simpler than mapping from one side to the other, I think. With mesh adaptivity we need the prolongation and restriction, which can probably be done efficiently based on the pattern of local refinement. The current design is not optimal, IMHO.
No - quadrature points on the side of an element are almost never a subset of the quadrature points in the element! With Gauss they never are!
With mesh adaptivity you need more than that - you have to handle integration of the DG terms on the "pieces" of sides that have hanging nodes. The current scheme handles that perfectly.
As far as I know there is no better way to do this with an unstructured, adaptive mesh. We do not know where the quadrature points are... or how they fit into the reference space of the neighbor.
I'm willing to hear a proposal though.
> On Tue, Apr 19, 2016 at 1:02 PM Yaqi [email protected] wrote:
>
> Typically quadrature points on a side of an element is just a subset of the quadrature points in the element. If this is the case we do not need to do the expensive mapping. With mesh adaptivity, we need the prolongation and restriction, which can probably be done efficiently based on the patten of local refinement. The current design is not optimal IMHO.
Ah - I was responding from email - now that I come to Github I see that you've changed your text a bit. Sneaky! 😄
Yes, I realized what I said was wrong immediately after I hit the comment button, hehe. Actually, I just had an idea: getting the solution DoFs on a side from the DoFs of the element is much easier, especially with nodal shape functions. So we can obtain these two sets of DoFs first and then evaluate the values at the quadrature points of the side.
Getting the dofs is a small part of the work. The big chunk of the work is evaluating the FE shape functions on both sides... and there's no way around that.
@lw4992 - What exactly do you want to see done here? For any one given problem there are a lot of optimizations that can be made for that one problem. It's difficult, however, to come up with algorithms that will work across a wide variety of problems. As Derek says, what we have in there is necessary for the general case: a completely unstructured problem with refinement and all the bells and whistles. If you have a specific problem you want to work on, you may be able to find ways to speed things up, and if you ignore everyone else's interests you can probably throw away large parts of the framework to really make things scream. We have a guy working on DG in our group and he's taking a hard look at a lot of the things we do. It's definitely not easy to make it general and fast.
The real way to make an impact on the speed of DG is not by changing MOOSE, but by optimizing the FE stuff in libMesh: https://github.com/libMesh/libmesh/issues/890
That is currently ongoing work and could have large impacts on the speed of DG (as DG is very dependent on the various FE mapping mechanisms and FE evaluation).
I would say that there is currently nothing to be done in MOOSE that could make this faster and this ticket can be closed.
Thanks for your patience. DG is very dependent on FE mapping, as Derek says. Especially when the mesh is adaptive and deforming, the element and neighbor must be re-mapped at the quadrature points of the interior sides. There is no free lunch. I hope this issue can stay open together with libMesh/libmesh#890; they are the same issue.
So @yidongxiainl ran his DG stuff through a profiler and we found out that MOOSE is computing the neighbor element volume even though it is not really being used. That computation is quite expensive, since the actual code builds a temporary FE object, puts a qrule on it, and computes the volume. I have a patch that makes this computation optional, i.e. users subscribe to it via an API call if they really need it.
This way we saved about 47% of the runtime on one of our test cases.
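For context, here is a minimal sketch, in plain libMesh calls, of what such a quadrature-based volume computation involves. The helper name `quadrature_volume` is illustrative only; this is not the actual MOOSE code, and the opt-in API call mentioned above is not shown.

```cpp
// Sketch: computing an element's volume via quadrature. Building a temporary
// FE object and quadrature rule and calling reinit() per element/neighbor pair
// is the expensive part, since every reinit redoes the reference-to-physical
// mapping -- wasted work if nobody consumes the volume.
#include "libmesh/elem.h"
#include "libmesh/fe.h"
#include "libmesh/quadrature_gauss.h"

using namespace libMesh;

Real quadrature_volume(const Elem * neighbor, unsigned int dim, const FEType & fe_type)
{
  std::unique_ptr<FEBase> fe = FEBase::build(dim, fe_type);
  QGauss qrule(dim, fe_type.default_quadrature_order());
  fe->attach_quadrature_rule(&qrule);
  const std::vector<Real> & JxW = fe->get_JxW();
  fe->reinit(neighbor);

  Real vol = 0;
  for (const Real jxw : JxW)
    vol += jxw; // volume = sum of Jacobian-weighted quadrature weights

  return vol;
}
```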
On my laptop, with `3d_diffusion_dg.i`:

| version | time [secs] |
|---|---|
| old version | 1.3 |
| not computing neighbor volume at all | 0.47 |
| computing neighbor volume using member variables | 0.67 |
Seems like an improvement ;-)
I guess I was tired of arguing at that time. But FE can and should be evaluated only once on the side! You get the DoFs of the current element and of the neighbor for the shared side, and then you do one FE initialization on the side with the two sets of DoFs. Adaptation and deformation are not an issue, because an element follows a fixed pattern when it is refined, which keeps getting the DoFs simple. The fact that you are not doing it does not mean it cannot be done. This design almost makes the DG stuff in MOOSE useless.
@friedmud Updated the timing table with numbers produced by 1bf23cd
@YaqiWang My optimization has nothing to do with DoFs and sides...
@andrsd Yes, I know. I ran DG diffusion long ago and it was a hundred times slower than CG diffusion; then @friedmud removed one volume evaluation on the current element, which brought the factor down to about 10. This is indeed a useful improvement. Considering that the average node-to-element connectivity is about 4 in 2D, we are making DG comparable (a CPU-time factor of less than 10) with CG. We probably cannot squeeze out more CPU time without a redesign. DG can support high polynomial orders easily, while we only have quadratic with CG. It is unfortunate that we do not support DG very well.
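As a rough sanity check on that factor of 4 (an illustrative count, not taken from the thread): on a structured $N \times N$ quad mesh with bilinear elements,

$$\underbrace{(N+1)^2}_{\text{CG DoFs}} \quad\text{vs.}\quad \underbrace{4N^2}_{\text{DG DoFs}}, \qquad \frac{4N^2}{(N+1)^2} \to 4 \ \text{as } N \to \infty,$$

so the discontinuous discretization carries roughly four times as many DoFs as the continuous one, before any of the extra face work is counted.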
> But FE can and should be only evaluated once on the side!
I don't understand this assertion. Even for an `L2_LAGRANGE` basis, the local node numbers relevant to computation of the FE values on the face are different between the "element" and the "neighbor". I suppose that you could do something like `Elem::local_node` to pair FE evaluations from the element side with local dof indices on the neighbor side. But this seems only applicable to `L2_LAGRANGE`; it wouldn't extend to `MONOMIAL`. I'd need @roystgnr to comment on `L2_HIERARCHIC` or `XYZ`.
Surprised that @lindsayad is commenting here. I think I was suggesting that getting the dofs on a side from the dofs of its element can typically be done very efficiently. For `L2_LAGRANGE`, for example, we can just pick the dofs. I know that for some hierarchical shape functions defined on nodes, sides, and the interior, we can do the pick as well. Another thing we will have to deal with is the side orientation: in 1D the side is a node, so we do not need to worry about orientation; in 2D the side is an edge and the orientation is simple as well; in 3D the orientation will be a lot more complicated. I was hoping to have operations like `std::vector<Real> get_side_dofs(const Elem * elem, unsigned int side_id, unsigned int var_id, unsigned int sys_id, NumericVector & sol)` and `rotate_dofs(const Elem * side_elem, int rotation_index, std::vector<Real> & dofs)`.
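A minimal sketch of what the proposed `get_side_dofs()` could look like for a nodal Lagrange basis. The helper is the proposal above, not an existing libMesh or MOOSE API; the signature is simplified (it takes a `DofMap` and a solution vector instead of a system id), it ignores `rotate_dofs()` and the orientation question, and it assumes `DofMap::dof_indices` returns one DoF per node in local-node order (true for first-order `LAGRANGE`; treated here as an assumption for `L2_LAGRANGE`).

```cpp
// Sketch only: "pick" the DoFs of a variable that live on one side of an
// element, for a nodal (L2_)LAGRANGE basis with one DoF per node.
#include "libmesh/dof_map.h"
#include "libmesh/elem.h"
#include "libmesh/numeric_vector.h"

using namespace libMesh;

std::vector<Real> get_side_dofs(const Elem * elem,
                                unsigned int side_id,
                                unsigned int var_id,
                                const DofMap & dof_map,
                                const NumericVector<Number> & sol)
{
  // DoF indices for this variable on this element, assumed to follow the
  // element's local node numbering (one DoF per node).
  std::vector<dof_id_type> dof_indices;
  dof_map.dof_indices(elem, dof_indices, var_id);

  std::vector<Real> side_dofs;
  for (unsigned int n = 0; n < elem->n_nodes(); ++n)
    if (elem->is_node_on_side(n, side_id))
      side_dofs.push_back(libmesh_real(sol(dof_indices[n])));

  return side_dofs;
}
```

Even granting this, the open questions above (side orientation in 3D and non-nodal bases such as `MONOMIAL`) remain.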
With `L2_HIERARCHIC` the shape functions between an element and its neighbor on the shared side are permutations of each other; with `XYZ` the situation is as bad as with `MONOMIAL`.