This PR implements a generic version of the "domain decomposition" cell system/topology that allows for load-balanced grids and repartitioning.

The load balancing is implemented in an external library (librepa), which this PR makes an optional dependency of ESPResSo. Additionally, a module called "GenericDD" is compiled as a shared library that implements the new cell system; Espresso-Core depends on it. If librepa is not present, the module's functions are simply compiled to stubs that give an error. The Python interface for cell_system now offers a "set_generic_dd", analogous to the setters of the other cell systems. The interface functionality for generic_dd is implemented in a separate Python file, generic_dd. The testsuite now also exercises generic_dd in several smaller tests (collision_detection, pairs, random_pairs), and an additional test checks that the new cell system, with its different grid types and repartitionings, yields the same energy in a simple NVE setting as ESPResSo's default "domain decomposition" cell system.

Example: With these changes, it is possible to do:

s = espressomd.system.System(box_l=...)
# Setup system...
dd = s.cell_system.set_generic_dd("kd_tree", use_verlet_lists=True)
# "kd_tree" is one of the grids that librepa offers. Note "set_generic_dd" returns an object that conveniently allows you to repartition

load_metric = dd.metric("npart")
while not done:
    s.integrator.run(1000)
    # If the maximum number of particles on any process divided by the average is greater than 1.1
    if load_metric.pimbalance() > 1.1:
        dd.repart(load_metric)
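
The imbalance criterion used in the loop above can be illustrated in isolation. The following is a minimal sketch, assuming the imbalance is the maximum per-process load divided by the average load, as the comment in the example describes. The function name `imbalance` and the sample particle counts are hypothetical and are not part of the librepa or ESPResSo API.

```python
def imbalance(loads):
    """Return max(loads) / mean(loads) for a list of per-process loads."""
    if not loads:
        raise ValueError("need at least one process load")
    average = sum(loads) / len(loads)
    if average == 0:
        # No load anywhere: treat this as perfectly balanced.
        return 1.0
    return max(loads) / average


# Hypothetical particle counts on four MPI ranks.
balanced = [250, 250, 250, 250]
skewed = [400, 200, 200, 200]

print(imbalance(balanced))  # 1.0
print(imbalance(skewed))    # 1.6
```

With a threshold of 1.1, the first distribution would be left alone, while the second would trigger a repartitioning.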

Limitations:

Only MD is supported; couplings that require ESPResSo's default decomposition are not possible (this might currently be a hard failure that is not caught in the code)
Currently, only fully periodic simulation boxes are supported
... probably more ...

Description of changes:

Implement a new cell system "generic_dd"
Change cells.[ch]pp to properly dispatch to this cell system
Add python and script interfaces for generic_dd
Add generic_dd cell system to several existing tests

Missing:

Documentation in the user's guide on the usage of generic_dd

Suggestions and feedback welcome.

Apr 14 '20 11:04 hirschsn

This looks very good, I'll have a look at how to deal with the limitations and review the rest. But I will only have time to give it a proper look next week, so bear with me.

Apr 14 '20 11:04 fweik

@fweik Sure, take your time.

Apr 14 '20 11:04 hirschsn

Note to self: The wait_any fix needs some work. Newer boost::mpi versions handle nonblocking communication differently and, thus, for newer boost versions waitany.hpp does not compile.

Apr 20 '20 14:04 hirschsn

@hirschsn I'm still looking into this. But there are other changes for the cell systems which improve encapsulation, and these will need to be merged before this. Will keep you posted...

Apr 28 '20 13:04 fweik

@hirschsn I had a first look, and I think there is one point in the design that we should consider. I think it would be better to trigger the repartitioning via the resort. This has the advantage that the resort is called regularly during the simulation (e.g. when the particles have moved a certain distance), so your DD could decide internally what to do, e.g. decide based on the metric every 100 invocations, or do nothing (only manual repart) and force a repart on a global resort (those typically occur only if there are new particles or other major changes). Resort can be directly triggered from the interface, and this is basically what the AtomDecomposition does. What do you think?
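
To make the idea concrete, here is a minimal sketch of a resort-triggered policy, not the actual implementation. It assumes that the cell system calls a hook on every resort and tells it whether the resort is global. The class `ResortTriggeredRepartitioner`, its parameters, and the two callbacks are hypothetical names used only for illustration.

```python
class ResortTriggeredRepartitioner:
    """Decide on each resort whether to repartition (hypothetical sketch)."""

    def __init__(self, imbalance_fn, repart_fn, check_every=100, threshold=1.1):
        self.imbalance_fn = imbalance_fn  # returns the current load imbalance
        self.repart_fn = repart_fn        # performs the repartitioning
        self.check_every = check_every    # evaluate the metric every N resorts
        self.threshold = threshold        # repartition above this imbalance
        self.n_resorts = 0

    def on_resort(self, global_resort=False):
        """Called on every resort; returns True if a repartition happened."""
        self.n_resorts += 1
        if global_resort:
            # Global resorts follow major changes, so always repartition.
            self.repart_fn()
            return True
        due = self.n_resorts % self.check_every == 0
        if due and self.imbalance_fn() > self.threshold:
            self.repart_fn()
            return True
        return False


# Usage with stand-in callbacks: a constant imbalance of 1.3.
calls = []
policy = ResortTriggeredRepartitioner(lambda: 1.3, lambda: calls.append("repart"))
for _ in range(250):
    policy.on_resort()
policy.on_resort(global_resort=True)
print(len(calls))  # 3
```

Setting `check_every` to a very large value would correspond to the "only manual repart" mode, in which only global resorts force a repartitioning.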

May 06 '20 08:05 fweik

The idea behind triggering it manually is that I (read: anyone :D) can test different strategies with this interface and, in fact, implement them in Python in the simulation script. I agree that this might not be what mere users of load balancing want.

At some point in the near future I also wanted to offer automatic capabilities, which is exactly what you are describing. Different automatic strategies could be implemented locally in generic_dd or elsewhere. The hook into resort, however, is worth considering right now.

Do you see any problems with also offering manual repart capabilities in addition to, let's say, something like this (conceptually):

system.cell_system.set_generic_dd(..., auto_loadbalancing="npart");

Resort can be directly triggered from the interface, and this is basically what the AtomDecomposition does.

Could you elaborate? I don't get what you want to tell me. :)

May 06 '20 08:05 hirschsn

Codecov Report

Merging #3662 into python will decrease coverage by 0%. The diff coverage is 31%.

@@           Coverage Diff           @@
##           python   #3662    +/-   ##
=======================================
- Coverage      88%     87%    -1%     
=======================================
  Files         524     532     +8     
  Lines       23471   23782   +311     
=======================================
+ Hits        20658   20742    +84     
- Misses       2813    3040   +227

Impacted Files                                     Coverage Δ
src/core/CellStructure.hpp                         100% <ø> (ø)
src/core/communication.cpp                         91% <0%> (-4%)
src/core/generic-dd/metric.cpp                     0% <0%> (ø)
src/core/generic-dd/metric.hpp                     0% <0%> (ø)
src/core/ghosts.hpp                                100% <ø> (ø)
src/script_interface/generic_dd/si_generic_dd.hpp  0% <0%> (ø)
src/script_interface/generic_dd/si_metric.hpp      0% <0%> (ø)
src/core/generic-dd/generic_dd.cpp                 7% <7%> (ø)
src/core/ghosts.cpp                                82% <12%> (-18%)
src/core/cells.cpp                                 82% <25%> (-6%)
... and 16 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 8f105e3...a088aae.

May 06 '20 08:05 codecov[bot]

Resort can be directly triggered from the interface, and this is basically what the AtomDecomposition does.

I just wanted to say that you'd still have the possibility to call it manually, but I guess you can also directly do that via the python binding of generic_dd.

Do you see any problems with also offering manual repart capabilities

No, I think that's fine.

system.cell_system.set_generic_dd(..., auto_loadbalancing="npart");

is about what I had in mind.

As you are saying, this can probably also be addressed later. The test failures are due to the wait_any issue you described earlier, I suppose?

May 06 '20 08:05 fweik

Test failures: Yes, I will take care of wait_any today. Also, my [[noreturn]] failed on older compilers because errexit was not [[noreturn]]. However, the osx tests do not seem to like making errexit [[noreturn]].

I am currently looking into the failing test cases and will ping you once I'm done.

May 06 '20 08:05 hirschsn

Also, my [[noreturn]] failed on older compilers because errexit was not [[noreturn]]. However, the osx tests do not seem to like making errexit [[noreturn]].

Just a quick note: the Clang 6 jobs have been removed recently in favor of Clang 9. The osx-cuda job was removed. For AppleClang 9 on osx, I'm not sure why there's an error, it should support attributes.

May 06 '20 15:05 jngrad

AppleClang 9

That is somewhere between Clang 6 and Clang 7 if I remember correctly. AppleClang's version numbers match the Xcode major version number, not the Clang major version number.

However, even Clang 6 should have supported [[noreturn]], which was introduced in C++11.

May 06 '20 15:05 mkuron

@jngrad @mkuron You're right. This was actually a linker error. [[noreturn]] works fine.

May 07 '20 11:05 hirschsn

Implement load-balancing for MD

Codecov Report