DoFHandler index handling
I plan to make a rather fundamental change in the way indices are stored in DoFHandler. The ultimate goal is to make the index storage in DoFHandler compatible with what we have identified as highly useful in the MatrixFree vector access classes, i.e., direct access to vector entries and ideally vectorized access. In other words, we would like to avoid copying many things in the DoFInfo object of MatrixFree and also make it more natural to implement matrix-free multigrid transfer. This topic is open for discussion and I plan to hear some opinions before actually implementing the changes proposed here. I hope that we find a design that will be suitable for, say, the next 5-10 years and the respective hardware with state-of-the-art performance.
The main characteristics and changes would be as follows:
- Indices are stored in MPI-local index space. DoFHandler will keep a `Utilities::MPI::Partitioner` object to translate between the local index space and the global index space; `DoFAccessor::get_dof_indices` will do this translation on the fly (a sketch follows after this list).
- The current implementation stores the indices twice, once in the various `internal::DoFHandler::DoFLevel<dim>` (for lines, quads, hexes) and `std::vector<types::global_dof_index>` (for vertices) objects (this is the format for initializing the degrees of freedom). The storage is done inside the DoFObject class. Then there is the fast-access path provided by `DoFLevel<dim>::cell_dof_indices_cache` that we always use at the end of the day. The plan is to only create the level stuff for the distribution of degrees of freedom and later only keep the 'cache'-type variable that stores per-cell data. (The thing to look at is what should happen when we do something like `cell->face(f)->get_dof_indices()`.)
- Native support for the hp case. (In other words, remove `hp::DoFHandler`.)
- No degrees of freedom are stored on artificial cells (whereas we now have `invalid_dof_index` entries there). I currently favor realizing this by assuming `FE_Nothing` for the purpose of the index creation at `distribute_dofs`.
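To make the first point concrete, here is a minimal sketch of what the on-the-fly translation could look like, assuming the DoFHandler stores MPI-local indices next to a `Utilities::MPI::Partitioner` (the helper function is my illustration, not an existing interface):

```cpp
#include <deal.II/base/partitioner.h>
#include <deal.II/base/types.h>

#include <vector>

using namespace dealii;

// Illustrative helper: translate the MPI-local indices stored for a cell
// back to global indices, as get_dof_indices() would do on the fly.
std::vector<types::global_dof_index>
to_global(const Utilities::MPI::Partitioner &partitioner,
          const std::vector<unsigned int>   &local_indices)
{
  std::vector<types::global_dof_index> global_indices(local_indices.size());
  for (unsigned int i = 0; i < local_indices.size(); ++i)
    // local_to_global() covers both the locally owned range and the
    // ghost indices known to the partitioner
    global_indices[i] = partitioner.local_to_global(local_indices[i]);
  return global_indices;
}
```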
In addition to these design decisions, there are a few optimization options that I would really like to include:
- Storage of indices is separated by 'component' or 'block'. The rationale is to make reading vectors easier by operating on a scalar FE (or an FESystem with only one base element). The standard case would keep indices intermingled in the usual FESystem setting but we would allow reading from one base element inside an FESystem.
- To allow for tight communication in a vector depending on the application, we would group the local indices in four groups (regarding how the local numbers are ordered):
  i. locally owned indices
  ii. indices needed for working with locally owned cells
  iii. indices on two cells adjacent to 'locally owned faces' (i.e., the DG face integrals you would do on a certain processor)
  iv. all other ghost indices not covered by i-iii
  All this is handled inside `MPI::Partitioner`, so no need to manually fiddle around with the details except at the place where we create this data.
- Index storage is lexicographic (data grouped along one 'line' at a time) rather than hierarchical (first all vertices, then all lines, then all quads, then all hexes).
- Compression of indices on a per-cell basis. In DG, for example, one generally keeps all DoFs on a cell contiguous and only one number needs to be stored. For continuous elements, compression down to 2^dim indices per cell (in the standard orientation case) or even 2^dim indices per parent cell (keeping the indices for the children) for the MG levels > 0 can be conceived. Since the possibility for compression depends on the numbering applied by the user, we would have an enum with the selected compressions and store this per cell (or per patch of cells, see below); a sketch follows after this list.
- Option to not assign and carry constrained degrees of freedom at hanging nodes and periodic boundaries. The rationale is that only the coarse side carries degrees of freedom and the hanging node constraints are an interpolation matrix from the fine to the coarse side that is applied before writing or after reading the solution vector with `local_dof_indices`. (This should even be possible in the hp-adaptive case but I have not really thought about the details yet.) This change is probably the most intrusive part of my plan because it would change the number of degrees of freedom returned by the DoFHandler. Moreover, the `get_dof_indices` call would return some 'strange' indices belonging to the parent face of the neighbor. When extracting cell data from a vector, the function `DoFCellAccessor::get_dof_values` would need to apply the transformation matrix. The advantage of removing these constraints is that we get fewer degrees of freedom throughout the whole program (without the drawback of the approach where the compression is done at the stage of the solution of the linear system, which needs to copy matrices & vectors and is discouraged because of index confusion) and one can skip `ConstraintMatrix::distribute` for the hanging node case. Note that I do not propose to skip ConstraintMatrix because we would still want to keep different boundary conditions on the same DoFHandler. But the natural constraints (periodicity, hanging nodes) would be automatically fulfilled for all vectors associated with the DoFHandler.
- Array-of-struct-of-array storage layout where the indices to DoFs for several cells are at the innermost level. The number of cells would be determined at configuration time (say the size of the largest vectorized array on a CPU, the number of so-called lanes, or 64/128 on a GPU, depending on the desired balance). This storage scheme would be useful for vectorized computations and on GPUs where one reads multiple elements at a time (gather/scatter instructions). To be useful, it would need to be combined with appropriate numbering of degrees of freedom because the standard numbering puts indices from different elements too far apart in memory, and gather/scatter is typically only useful when you access significantly fewer cache lines (say 3) than you have lanes (say 16). The nice thing is that this approach is compatible with hanging nodes (tested on a GPU implementation where the penalty from hanging nodes is really low). But it makes reading more complicated in case the access is not in terms of the vectorized access pattern, say for face integrals. (I guess one would need to look at the final implementation to see if it is worth the pain.)
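To illustrate the per-cell compression idea from the list above, a hedged sketch (all names below are made up for illustration; the real design would be decided in the respective PR):

```cpp
#include <vector>

// Hypothetical per-cell compression tag, stored once per cell or patch.
enum class IndexCompression : unsigned char
{
  contiguous,   // DG case: all DoFs of the cell consecutive, store 1 number
  vertex_based, // continuous case: 2^dim indices per cell suffice
  uncompressed  // arbitrary user numbering: store all dofs_per_cell indices
};

struct CompressedCellIndices
{
  IndexCompression          compression;
  std::vector<unsigned int> data; // 1, 2^dim, or dofs_per_cell entries

  // Expand the 'contiguous' representation into the full index list;
  // only valid if compression == IndexCompression::contiguous.
  std::vector<unsigned int>
  expand_contiguous(const unsigned int dofs_per_cell) const
  {
    std::vector<unsigned int> indices(dofs_per_cell);
    for (unsigned int i = 0; i < dofs_per_cell; ++i)
      indices[i] = data[0] + i; // first index plus offset
    return indices;
  }
};
```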
Are there any things someone else would like to include? Any problems that we could meet when implementing these changes?
Many good ideas! I don't see a problem in general, but I can think of a couple of things to consider:
- Functional style. It is worth considering a more functional style (objects are generally read-only) to allow for easier multi-threading. Renumbering DoFs might return a renumbered copy of the DoFHandler for example.
- Triangulation? Your suggestions sound like they would work without any changes to the Triangulation. Maybe we should consider looking at Triangulation first? We could get rid of artificial cells altogether, for example. I think we made a long list of things that we wanted to change there. Things I can remember right now: separation of dim and structdim, transparent handling of parallel meshes, Views to iterate over certain subsets.
- Flexibility of container. I have been wondering for a while how difficult it would be to allow a DoFHandler to only act on a part of a Triangulation or maybe on a specific codim of a Triangulation. This would allow for easy domain decomposition-like setups to solve multi-physics problems. You can use the same feature to get rid of artificial cells, for example.
- Multigrid. I wonder if it makes sense to keep the current structure or not. I was discussing with Guido whether it makes sense to have one DoFHandler per level instead of a single DoFHandler for the whole hierarchy. This would simplify assembly, system setup, etc. For that to work one would need the option to distribute DoFs on a certain level (instead of active cells) and then some way to translate between DoFHandlers.
- Do you want to continue to support anisotropic refinement? I think it is a valid question to consider how many people are using this feature. Supporting this in combination with hp is certainly a headache.
- Talk with @Rombur about the linear algebra rework. I am not sure about the backwards compatibility story. Maybe you can only support the new vector format?
- FE flexibility. Does it make sense to allow storing an arbitrary number of unknowns for each object (vertex, face, ...)? We currently have a fixed number of unknowns per object on each cell, which does not allow for some exotic elements (something like a MAC scheme comes to mind).
Your suggestions make a lot of sense.
Triangulation? Your suggestions sound like they would work without any changes to the Triangulation.
I know that we want to change some things in the triangulation but I think that most of the things I mention above are somewhat orthogonal to the grid. Sure, what I plan for artificial cells is not, but that is so tiny a part compared to the other changes that I don't think it will duplicate lots of work.
Functional style. It is worth considering a more functional style (objects are generally read-only) to allow for easier multi-threading.
With these plans the DoFHandler object would become some sort of container that holds a few 'functional-style' objects that each hold the actual information regarding degrees of freedom. So the implementation I had in mind actually goes in that direction but does not (yet) break the current model. If we wanted to break the current model, my work would really help in that direction. The style I have in mind is the following (a rough sketch follows the list):
- Each base element of an FESystem (well, it's actually the leaves that steer it) gets its own 'DoFIndexCache' container.
- Apply the same numbering of cells for all base elements and you can extract the information about the compound system easily.
- The active level gets one, and each level gets one in case one does multigrid.
- For multigrid, there would be some cross links between children and parents.
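A rough data-layout sketch of this idea (only the 'DoFIndexCache' name comes from the discussion; the members are my illustrative assumptions):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container holding the per-cell index information of one
// base element of an FESystem, on one (active or multigrid) level.
struct DoFIndexCache
{
  // row_starts[c] .. row_starts[c+1] delimit the MPI-local indices of
  // cell c inside dof_indices (one sentinel entry past the end)
  std::vector<std::size_t>  row_starts;
  std::vector<unsigned int> dof_indices;
};

// The DoFHandler as a container of such objects: one cache per base
// element on the active level, plus one set per level for multigrid,
// where the cross links between children and parents would be added.
struct DoFHandlerSketch
{
  std::vector<DoFIndexCache>              active_level; // per base element
  std::vector<std::vector<DoFIndexCache>> mg_levels;
};
```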
It seems that our views are not too far apart.
Anisotropic refinement
I personally never use it because it cannot be combined with p4est and you can only do DG with it (you cannot build hanging node constraints on 3D anisotropically refined meshes). I see some theoretical value in it and there might very well be people who use it. But it would be low on my personal (and my students') priority list.
Flexibility of container. I have been wondering for a while how difficult it would be to allow a DoFHandler to only act on a part of a Triangulation or maybe on a specific codim of a Triangulation.
That would be a cool feature. While not my principal goal, I don't think it would be difficult. To realize such a structure, we would need to associate DoFHandler objects with an iterator over parts of the mesh (it would be some vector of 'objects' that define those; @luca-heltai and @guidokanschat have plans for a new grid interface and associated data structures). With respect to my plans, I have already been thinking about data structures that are initialized this way, for the initial implementation simply using the non-artificial active cells (or non-artificial level cells).
FE flexibility. Does it make sense to allow storing an arbitrary number of unknowns for each object (vertex, face, ...)?
If you can do hp-adaptivity then it is a minor step to allow for things like MAC, I assume. Of course the question is about the efficiency losses if you allow for full flexibility. My approach would be to have patches of cells carry the same information (this gives you the benefit of nice base+offset data access patterns for the bulk of accesses) but allow for different properties on different patches using some more unstructured/indirect addressing mode (see the sketch below).
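A minimal sketch of the base+offset pattern under these assumptions (all names are illustrative):

```cpp
// Cells within one patch share the same layout, so the bulk of accesses
// is plain base + offset arithmetic; patch-specific properties go through
// one level of indirection (property_index).
struct Patch
{
  unsigned int first_dof;      // base: first MPI-local index of the patch
  unsigned int dofs_per_cell;  // identical for all cells of the patch
  unsigned int property_index; // indirect lookup for per-patch data
};

inline unsigned int
dof_index(const Patch       &patch,
          const unsigned int cell_within_patch,
          const unsigned int dof_within_cell)
{
  return patch.first_dof + cell_within_patch * patch.dofs_per_cell +
         dof_within_cell;
}
```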
Anisotropic refinement
I personally never use it.
Same here for similar reasons. I also don't believe in anisotropic refinement. ;-) The question is if you want to support it (I have no idea how difficult that is going to be) or if we decide that we want to rip it out.
If you can do hp-adaptivity then it is a minor step to allow for things like MAC, I assume.
Maybe the MAC scheme is not a good enough example, because you can implement it without distributing a different number of unknowns on different faces by combining both velocities. Just to be clear: I was referring to FEs with x unknowns on face 1 and y unknowns on face 2 in a single cell.
I was referring to FEs with x unknowns on face 1 and y unknowns on face 2 in a single cell.
Ah I see, I didn't think of it this way. I agree that the DoFHandler should support these types of elements. Some unstructured/arbitrary numbering inside the elements would be the approach to use. That would fit with some more 'arbitrary' finite element class since current FiniteElement classes use a simple dpo (dofs per object) vector.
Just a clarification: Even though I like to use the additional structure in problems where we can (specialized structures to gain in efficiency) I certainly do not want to undercut flexibility where it exists and is useful.
Martin -- nice plan, I'm looking forward to the implementation!
The main characteristics and changes would be as follows:
- Indices are stored in MPI-local index space. DoFHandler will keep a |Utilities::MPI::Partitioner| object to translate between the local index space and the global index space. |DoFAccessor::get_dof_indices| will do this translation on the fly.
I like this scheme. But how will you deal with the situation where the local DoFs don't form a contiguous range? That's the common case for parallel computations with multiple blocks.
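To spell out the situation the question refers to: after, e.g., a component-wise renumbering of a two-block system in parallel, each process owns one contiguous range per block, so the union of its locally owned DoFs is not a single interval (the sizes below are made-up example numbers):

```cpp
#include <deal.II/base/index_set.h>

using namespace dealii;

void example()
{
  const unsigned int n_velocity = 1000, n_pressure = 200;
  IndexSet locally_owned(n_velocity + n_pressure);

  // process 0 owns the first half of each block:
  locally_owned.add_range(0, n_velocity / 2);           // velocity part
  locally_owned.add_range(n_velocity,
                          n_velocity + n_pressure / 2); // pressure part
  // two disjoint intervals -> no single [local_start, local_end) range
}
```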
- The current implementation stores the indices twice, once in the various |internal::DoFHandler::DoFLevel<dim>| (for lines, quads, hexes) and |std::vector<types::global_dof_index>| (for vertices) objects (this is the format for initializing the degrees of freedom). The storage is done inside the DoFObject class. Then there is the fast-access path provided by |DoFLevel<dim>::cell_dof_indices_cache| that we always use at the end of the day. The plan is to only create the level stuff for the distribution of degrees of freedom and later only keep the 'cache'-type variable that stores per-cell data. (The thing to look at is what should happen when we do something like cell->face(f)->get_dof_indices().)
That's clever: use the existing data structures and algorithms only to build the enumeration, then copy into the caches and throw away the former. I used to think that the caches cost too much memory, but I've come around to only caring about problems like Stokes in 3d any more, and there the matrix is so immensely expensive that everything else doesn't matter. For simple problems, who cares about memory -- you can solve them to better accuracy today than anyone needs.
- Native support for the hp case. (In other words, remove |hp::DoFHandler|.)
That's going to be interesting and challenging. I wanted to implement the distributed hp case in the near future. I think it would be interesting to see first whether the `hp::DoFHandler` is slower than the `::DoFHandler`. I always thought that it is, but I don't have any actual numbers for, say, step-6 or step-22.
The tricky part of combining these features is that currently the two classes have functions with the same name that return different things. Specifically, `get_fe()` is a problem. We eventually need to resolve this, for example by adding `get_fe_collection()` to `::DoFHandler`, renaming `hp::DoFHandler::get_fe()` to `hp::DoFHandler::get_fe_collection()`, and letting `hp::DoFHandler::get_fe()` return something only if the object really only works with a single finite element. This of course could be done long before your work, in preparation of later work.
- Storage of indices is separated by 'component' or 'block'. The rationale is to make reading vectors easier by operating on a scalar FE (or an FESystem with only one base element). The standard case would keep indices intermingled in the usual FESystem setting but we would allow reading from one base element inside an FESystem.
Do you plan to make this required? I wouldn't like this because it precludes users from choosing whatever renumbering of DoFs they want to choose.
- Compression of indices on a per-cell basis. In DG, for example, one generally keeps all DoFs on a cell contiguous and only one number needs to be stored.
Is this worth it? It requires a number of if-elses, while the memory savings are likely small compared to even the local (dense) matrix.
- Option to not assign and carry constrained degrees of freedom at hanging nodes and periodic boundaries.
I've given this considerable thought over the years, and I don't like it. To me, the finite element method draws a significant part of its power from the fact that you do the same thing on every cell, without having to look to other cells (neighbors, etc). This makes things efficient because you hardly ever have an if-statement in any local operation. I consider that a very deliberate feature of the method and I'm reluctant to give it up.
You'll also run into trouble in 3d where you get constraints from cells that touch at an edge but that are not face neighbors. These cases are difficult to resolve if you only work on a cell and need to look around.
I've reviewed a number of very large patches recently and found it extraordinarily difficult to make progress with them. It also led to long delays for the patch author because I simply didn't have the time to spend a half or full day on the patch. Once revisions to the dozens (in some cases >100) of comments were made, we had to repeat the cycle, and since it had been a few weeks since the last round, I basically had to start reading everything from scratch again.
I think however you want to approach all of this, it would be easiest if you could structure it as a sequence of individual pull requests that each morph the code into the right direction a little bit, rather than the ONE BIG PATCH that does it all. This approach may overall take a bit more time (because at times you have to wait for one patch to get approved before you can really get started on the next piece) but it reduces the strain on everyone to get this all through peer review. It also gives you early feedback on design decisions that may unexpectedly turn out to be controversial.
I think however you want to approach all of this, it would be easiest if you could structure it as a sequence of individual pull requests that each morph the code into the right direction a little bit, rather than the ONE BIG PATCH that does it all.
I will certainly try to do that. There are a number of things that can be done independently, such as getting rid of the DoFLevels after initialization (and getting everything through the cache), the hp capabilities, or the localized indices.
The tricky part of combining these features is that currently the two classes [`DoFHandler` and `hp::DoFHandler`] have functions with the same name that return different things. Specifically, `get_fe()` is a problem. We eventually need to resolve this, for example by adding `get_fe_collection()` to `::DoFHandler`, renaming `hp::DoFHandler::get_fe()` to `hp::DoFHandler::get_fe_collection()`, and letting `hp::DoFHandler::get_fe()` return something only if the object really only works with a single finite element. This of course could be done long before your work, in preparation of later work.
I think we agree on introducing `DoFHandler::get_fe_collection` to return the whole collection and making `DoFHandler::get_fe` return the zeroth element in the collection (to get the single element of a non-hp element). But for backward compatibility we don't necessarily need to change anything in `hp::DoFHandler`, I think. We could do as we did when getting rid of MGDoFHandler: when `DoFHandler` is ready, we derive `hp::DoFHandler` from DoFHandler and define `hp::DoFHandler::get_fe()` via `DoFHandler::get_fe_collection` (in the MGDoFHandler case we defined `distribute_dofs` as comprising both `distribute_dofs` and `distribute_mg_dofs` from the base DoFHandler). I have not checked whether the accessors pose a problem, though.
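A sketch of the interface agreed on here, written as a stand-in class rather than the real `DoFHandler`:

```cpp
#include <deal.II/hp/fe_collection.h>

using namespace dealii;

// Stand-in illustration of the agreed-upon accessors, assuming the class
// stores an hp::FECollection internally.
template <int dim>
class DoFHandlerSketch
{
public:
  // return the whole collection (hp and non-hp case alike)
  const hp::FECollection<dim> &get_fe_collection() const
  {
    return fe_collection;
  }

  // return the zeroth element: the single element of a non-hp handler
  const FiniteElement<dim> &get_fe() const
  {
    return fe_collection[0];
  }

private:
  hp::FECollection<dim> fe_collection;
};
```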
Indices are stored in MPI-local index space. DoFHandler will keep a |Utilities::MPI::Partitioner| object to translate between the local index space and the global index space. |DoFAccessor::get_dof_indices| will do this translation on the fly.
I like this scheme. But how will you deal with the situation where the local DoFs don't form a contiguous range? That's the common case for parallel computations with multiple blocks.
This is why I would split the index part into different objects and combine them for `get_dof_indices`. Since I will follow your suggestion of splitting the project into several PRs, we can analyze this in isolation (and at an early stage) to see if it is worth the pain, also in relation to your other comment with respect to obligatory splitting into blocks.
Compression of indices on a per-cell basis. In DG, for example, one generally keeps all DoFs on a cell contiguous and only one number needs to be stored.
Is this worth it? It requires a number of if-elses, while the memory savings are likely small compared to even the local (dense) matrix.
For matrix-based computations, I would agree. The interesting effect is that you can do vectorized read operations from/to vectors when you know that you get a contiguous block. I have come to the point where I don't bother about if-else statements in terms of performance (maybe code complexity, but that has to be evaluated when the patch is on the table). I think that if/else statements (aka conditional branches) that are easily predicted and can be vectorized over cost you less performance than memory accesses nowadays. I don't think future hardware will change that. This particular check would only be evaluated once per cell (or patch of cells), and we have plenty of if-else statements in loops that are evaluated thousands of times more often (think of FEValues::shape_value, for example).
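As an illustration of the contiguous fast path, a sketch under the assumption of a per-cell contiguity flag (not an existing interface; the function and parameter names are made up):

```cpp
#include <deal.II/lac/la_parallel_vector.h>

using namespace dealii;

void read_cell_values(const LinearAlgebra::distributed::Vector<double> &vec,
                      const bool          is_contiguous,
                      const unsigned int *dof_indices, // MPI-local indices
                      const unsigned int  dofs_per_cell,
                      double             *local_values)
{
  if (is_contiguous) // checked once per cell
    {
      const unsigned int start = dof_indices[0];
      for (unsigned int i = 0; i < dofs_per_cell; ++i)
        local_values[i] = vec.local_element(start + i); // vectorizable copy
    }
  else
    {
      for (unsigned int i = 0; i < dofs_per_cell; ++i)
        local_values[i] = vec.local_element(dof_indices[i]); // gather
    }
}
```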
Option to not assign and carry constrained degrees of freedom at hanging nodes and periodic boundaries.
I've given this considerable thought over the years, and I don't like it. To me, the finite element method draws a significant part of its power from the fact that you do the same thing on every cell, without having to look to other cells (neighbors, etc). This makes things efficient because you hardly ever have an if-statement in any local operation. I consider that a very deliberate feature of the method and I'm reluctant to give it up.
The nice thing is that you wouldn't look into the neighbor and actually reduce the number of if/else statements by keeping an index to a transformation matrix that transforms the local vector or matrix. Note that currently we check for each row/column if we need to resolve a constraint. But let's not discuss this in more detail before there is a specific patch (and this is one of the late patches in this endeavor). I have to think about all the cases that need to be considered, therefore I appreciate your comment regarding the edge in 3D.
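A minimal sketch of the transformation-matrix idea (illustrative names; the real design would come with the patch): each cell stores an index to a transformation matrix C (usually the identity), and reading applies u_cell = C * u_read once per cell instead of per-entry constraint checks.

```cpp
#include <deal.II/lac/full_matrix.h>
#include <deal.II/lac/vector.h>

using namespace dealii;

void apply_cell_transformation(const FullMatrix<double> &C, // interpolation
                               const Vector<double>     &values_read,
                               Vector<double>           &cell_values)
{
  // one dense matrix-vector product per constrained cell; unconstrained
  // cells use the identity and skip this step entirely
  C.vmult(cell_values, values_read);
}
```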
On 12/25/2015 05:49 AM, Martin Kronbichler wrote:
Indices are stored in MPI-local index space. DoFHandler will keep a |Utilities::MPI::Partitioner| object to translate between the local index space and the global index space. |DoFAccessor::get_dof_indices| will do this translation on the fly. I like this scheme. But how will you deal with the situation where the local DoFs don't form a contiguous range? That's the common case for parallel computations with multiple blocks.
This is why I would split the index part into different objects and combine them for |get_dof_indices|. Since I will follow your suggestion of splitting the project into several PRs we can analyze this in isolation (and at an early stage) to see if it is worth the pain, also in relation to your other comment with respect to obligatory splitting into blocks.
Yes, that sounds good. I'm looking forward to seeing how you go about this. Feel free to open an issue to first discuss the design for this before spending your time on actual code.
Compression of indices on a per-cell basis. In DG, for example, one generally keeps all DoFs on a cell contiguous and only one number needs to be stored. Is this worth it? It requires a number of if-elses, while the memory savings are likely small compared to even the local (dense) matrix.
For matrix-based computations, I would agree. The interesting effect is that you can do vectorized read operations from/to vectors when you know that you get a contiguous block. I have come to the point where I don't bother about if-else statements in terms of performance (maybe code complexity, but that has to be evaluated when the patch is on the table).
That's a fair point. To be more specific, I don't care about if-else statements inside `get_dof_indices()` or `get_dof_values()`. I would care very much if they propagated into user codes. I think you are not actually suggesting the latter, though.
Option to not assign and carry constrained degrees of freedom at hanging nodes and periodic boundaries. I've given this considerable thought over the years, and I don't like it. To me, the finite element method draws a significant part of its power from the fact that you do /the same thing on every cell/, without having to look to other cells (neighbors, etc). This makes things efficient because you hardly ever have an if-statement in any local operation. I consider that a very deliberate feature of the method and I'm reluctant to give it up.
The nice thing is that you wouldn't look into the neighbor and actually reduce the number of if/else statements by keeping an index to a transformation matrix that transforms the local vector or matrix. Note that currently we check for each row/column if we need to resolve a constraint. But let's not discuss this in more detail before there is a specific patch (and this is one of the late patches in this endeavor). I have to think about all the cases that need to be considered, therefore I appreciate your comment regarding the edge in 3D.
OK. Same here, feel free to open an issue specific to this before writing code, if you want early feedback.
We should get this on our agenda again, but we won't do it for the 9.3 release.
I think we should try to get some solution ready for the next release. With 6.5 years passed since the issue was opened, I think it is time to reconsider our options and reflect on what has currently been achieved:
- We have unified the hp capabilities into the standard `DoFHandler` structure.
- We have unified more of the multigrid capabilities, even though the index storage and access via `get_dof_indices`/`get_mg_dof_indices` is still quite different internally.
- We have the ability to run global-coarsening multigrid, where we can have multiple `DoFHandler` objects on differently partitioned meshes, and can even do `hp` adaptivity.
- We have considerably expanded the capabilities of the `MatrixFree` framework, where we have added most of the specializations regarding performance listed in the top post.
- However, one thing that has turned out to be more important than anticipated is the cost of ghost exchange in matrix-free computations. Despite a direct-array index access via `MatrixFree`/`Utilities::MPI::Partitioner`, that cost is non-negligible, because memory transfers are expensive. While some multiple-layer access into indices as anticipated (locally owned, those needed for cell integrals on owned cells, those needed for face integrals, all locally relevant indices) might still be attractive, I believe it belongs into the matrix-free infrastructure and not the general `DoFHandler`.
- One recurring topic is whether the generic `DoFHandler` should include an `IndexSet` for the `locally_relevant_dofs`, or a `Utilities::MPI::Partitioner` with the relevant dofs, as this is queried in multiple places and might offer some benefit. It is, however, not as important as the items listed below.
With the work in #13947 and preparatory work on the triangulation via #13924 (and some follow-up PRs that I plan), I believe that one of my main bullet points, the fact that `DoFHandler` stores information twice, once in the level objects and once in the `cell_dof_indices_cache`, might be resolved, but differently from what I anticipated. When I combine all improvements I have made in the mentioned PRs and some additional local changes, the access via the levels, bypassing the cache, now takes around 3x the instruction count of the dof indices cache. To give numbers, the dof indices cache spends 6 x86-64 instructions per copied index (one for the load, one for the store, one each for incrementing the two pointers, one `cmp`, and one branch to continue the `for` loop), whereas I counted 19 on my branch. While this might sound like a significant slow-down, I think it is not: if we want to use those global indices to write into a vector, the index translation from global to MPI-local indices takes about the same count, possibly more, depending on the finite element. So in most cases the increased cost will be in the noise. And all cases where the `dof_indices` access is critical are better covered by the `MatrixFree` infrastructure, as MPI-local indices are vital to get any kind of performance. In other words, I believe it makes sense to duplicate that information not in the generic `DoFHandler`, but for the use cases via the `MatrixFree`/`DoFInfo` path, which duplicates indices already today. (This includes multigrid, where we currently duplicate things, but that is a side project.)
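For reference, the kind of loop behind the 6-instruction count quoted above:

```cpp
#include <deal.II/base/types.h>

using namespace dealii;

// Per iteration: one load, one store, two pointer increments, one cmp,
// and one branch -- the six x86-64 instructions mentioned above.
void copy_indices(const types::global_dof_index *src,
                  const types::global_dof_index *const end,
                  types::global_dof_index       *dst)
{
  while (src != end)
    *dst++ = *src++;
}
```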
I will work towards adding all performance improvements to the level objects in `DoFHandler` in a few upcoming PRs and link them to this issue, and then we can decide about deleting the `cell_dof_indices_cache`, which would then only consume memory without real functionality. If we have benchmarks suggesting the opposite, i.e., showing that we really should keep the cache, we can still benefit from the work through faster initialization.
Let me close this update with a comment on a few of the other topics I mentioned in my original post:
- I think periodic boundary conditions can and should be resolved in the generic `DoFHandler` index distribution process: we should give the user the option to disable it, but by default I would just eliminate those numbers during construction. I think we have most infrastructure in place to identify them easily when we set up the levels. Of course, this is only realistic when we have simple transformations; otherwise we would fall back to the old case.
- Hanging nodes and other constraints are difficult to eliminate. With the experience of how long it has taken to equip `MatrixFree` with the fast hanging-nodes algorithm, and how challenging it would be to automatically resolve the constraints in both directions as we access the vectors in `get_interpolated_dof_values` or write into vectors/matrices, I believe this would be too much in generic codes. However, `MatrixFree` does have most facilities in place to avoid adding those unknowns to the vectors for routines that would care (say, at the level of linear solvers). With #13628, we can move all constrained DoFs to the end of the locally owned DoFs. From that point, the most feasible strategy is to create a `LinearAlgebra::distributed::Vector::View` or something similar that runs all vector updates, dot products, and similar operations only on the principal part, avoiding any work on the rest (see the sketch after this list). So I see this now as a question limited to `MatrixFree` and multigrid.
- Compression of indices, lexicographic numbering, and all similar capabilities belong to the `MatrixFree` framework, which can extract benefits from them, but not to the generic `DoFHandler`. Not surprisingly, we support the case of array-of-struct-of-array storage and similar concepts there, and might expand more on those in the future. Also, the `DoFInfo` setup is per multigrid level and more or less a separate container, as suggested by https://github.com/dealii/dealii/issues/2000#issuecomment-166305652.
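A hedged sketch of the 'view' idea from the second item (the class below is a placeholder of my own; no such class exists in deal.II at this point):

```cpp
#include <deal.II/lac/la_parallel_vector.h>

using namespace dealii;

// With all constrained DoFs renumbered to the end of the locally owned
// range (#13628), linear-solver operations can run on the leading
// n_principal entries only.
class PrincipalPartView
{
public:
  PrincipalPartView(LinearAlgebra::distributed::Vector<double> &vec,
                    const unsigned int                          n_principal)
    : vec(vec)
    , n_principal(n_principal)
  {}

  // process-local part of a dot product over the unconstrained DoFs;
  // the MPI sum over all processes is omitted for brevity
  double local_dot(const PrincipalPartView &other) const
  {
    double sum = 0.;
    for (unsigned int i = 0; i < n_principal; ++i)
      sum += vec.local_element(i) * other.vec.local_element(i);
    return sum;
  }

private:
  LinearAlgebra::distributed::Vector<double> &vec;
  const unsigned int                          n_principal;
};
```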
Overall, I think we have converged to a pretty good solution. Given the text here, I believe that we should aim for the performance improvements suggested above, see whether we can then eliminate the `cell_dof_indices_cache` at that stage (or have good reasons not to do so), implement the periodic boundary condition fix (or have good reasons not to do so), and then close this issue. I hope we can achieve that in the next two months.
Thinking about this issue now again, I think #14068 has completed this long-standing issue. I think the periodicity constraints can be kept in `MatrixFree` as well, so issue #15224 is all that is left from this one here.