Implement load-balancing for MD
This PR implements a generic version of the "domain decomposition" cell system/topology that allows for load-balanced grids and repartitioning.
The load-balancing is implemented in an external library (librepa), which this PR makes an optional dependency of ESPResSo. Additionally, a module called "GenericDD" is compiled as a shared library that Espresso-Core depends on. This shared library implements the new cell system. If librepa is not present, the module is simply compiled to stubs that give an error.
The Python interface for cell_system now offers a "set_generic_dd" method, analogous to the setters of the other cell systems. The interface functionality for generic_dd is implemented in a separate Python file, generic_dd.
The testsuite now also tests generic_dd in several smaller tests (collision_detection, pairs, random_pairs). An additional test checks that the new cell system, with its different grid types and repartitionings, gets the same energy in a simple NVE setting as ESPResSo's default "domain decomposition" cell system.
Example: With these changes, it is possible to do:
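The NVE comparison test described above boils down to checking two energy time series against each other within a tolerance. A minimal sketch of such a check follows; the function name `energies_match` and the tolerance value are assumptions made for illustration and are not taken from the PR's test code.

```python
import math


def energies_match(reference, candidate, rel_tol=1e-7):
    """Return True if both energy series agree sample by sample.

    `reference` would come from the default domain decomposition and
    `candidate` from generic_dd. The tolerance is an assumed value.
    """
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol)
        for r, c in zip(reference, candidate)
    )
```

A relative tolerance is used here because the summation order of pair energies may differ between decompositions, so bitwise equality is not something this sketch expects.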
import espressomd
s = espressomd.system.System(box_l=...)
# Setup system...
dd = s.cell_system.set_generic_dd("kd_tree", use_verlet_lists=True)
# "kd_tree" is one of the grids that librepa offers. Note that "set_generic_dd"
# returns an object that conveniently allows you to repartition.
load_metric = dd.metric("npart")
while not done:
    s.integrator.run(1000)
    # Repartition if the maximum number of particles on any process divided by the average is greater than 1.1
    if load_metric.pimbalance() > 1.1:
        dd.repart(load_metric)
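For reference, the imbalance criterion used in the loop above (maximum load on any process divided by the average load) can be written out in a few lines of plain Python. This is only a sketch of the arithmetic; the helper names `imbalance` and `needs_repartition` are hypothetical and are not part of librepa or ESPResSo.

```python
def imbalance(loads):
    """Return max(loads) / mean(loads); 1.0 means perfectly balanced."""
    if not loads:
        raise ValueError("loads must not be empty")
    average = sum(loads) / len(loads)
    if average == 0:
        return 1.0  # no load anywhere: treat as balanced
    return max(loads) / average


def needs_repartition(loads, threshold=1.1):
    """Apply the same 1.1 threshold as in the example loop."""
    return imbalance(loads) > threshold


# Hypothetical particle counts on four processes.
counts = [1200, 950, 1000, 850]
print(imbalance(counts))          # 1.2
print(needs_repartition(counts))  # True
```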
Limitations:
- MD only: no coupling is possible that requires ESPResSo's default decompositions (this might currently be a hard failure that is not caught in the code)
- Currently, only fully periodic simulation boxes are supported
- ... probably more ...
Description of changes:
- Implement a new cell system "generic_dd"
- Change cells.[ch]pp to properly dispatch to this cell system
- Add python and script interfaces for generic_dd
- Add generic_dd cell system to several existing tests
Missing:
- Documentation in users guide about usage of generic_dd
Suggestions and feedback welcome.
This looks very good. I'll have a look at how to deal with the limitations and review the rest, but I will only have time to give it a proper look next week, so bear with me.
@fweik Sure, take your time.
Note to self: The wait_any fix needs some work. Newer boost::mpi versions handle nonblocking communication differently and, thus, for newer boost versions waitany.hpp does not compile.
@hirschsn I'm still looking into this. But there are other changes for the cell systems which improve encapsulation, and these will need to be merged before this. Will keep you posted...
@hirschsn I had a first look, and I think there is one point in the design that we should consider. I think it would be better to trigger the repartitioning via the resort. This has the advantage that the resort is called regularly during the simulation (e.g. when the particles have moved a certain distance). Your DD could then decide internally what to do, e.g. decide based on the metric every 100 invocations, or do nothing (only manual repart) and force a repart on a global resort (those typically occur only if there are new particles or other major changes). Resort can be directly triggered from the interface, and this is basically what the AtomDecomposition does. What do you think?
The idea behind triggering it manually is that I (read: anyone :D) can test different strategies with this interface and, in fact, implement them in Python in the simulation script. I agree that this might not be what mere users of load-balancing want.
At some point in the near future I also wanted to offer automatic capabilities, which is exactly what you are describing. Different automatic strategies could be implemented locally in generic_dd or elsewhere. The hook into resort, however, is worth considering right now.
Do you see any problems with also offering manual repart capabilities in addition to, let's say, something like this (conceptually):
system.cell_system.set_generic_dd(..., auto_loadbalancing="npart");
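To make the resort-hook idea from this discussion concrete, here is a rough sketch of what such an automatic policy might look like: it checks the metric every N resort invocations and forces a repartitioning on a global resort. The class `AutoRepartitionPolicy`, its parameters, and the callbacks are all hypothetical; nothing here reflects an existing ESPResSo or librepa API.

```python
class AutoRepartitionPolicy:
    """Decide on each resort whether to repartition (hypothetical helper)."""

    def __init__(self, imbalance_fn, repart_fn, check_every=100, threshold=1.1):
        self.imbalance_fn = imbalance_fn  # returns the current imbalance
        self.repart_fn = repart_fn        # performs the repartitioning
        self.check_every = check_every
        self.threshold = threshold
        self.calls = 0

    def on_resort(self, global_resort=False):
        """Return True if a repartitioning was triggered."""
        self.calls += 1
        due = self.calls % self.check_every == 0
        if global_resort or (due and self.imbalance_fn() > self.threshold):
            self.repart_fn()
            return True
        return False


# Stand-ins for the real metric and repartitioning calls.
state = {"imbalance": 1.3, "reparts": 0}


def fake_imbalance():
    return state["imbalance"]


def fake_repart():
    state["reparts"] += 1
    state["imbalance"] = 1.0  # assume repartitioning restores balance


policy = AutoRepartitionPolicy(fake_imbalance, fake_repart, check_every=100)
for _ in range(250):
    policy.on_resort()
print(state["reparts"])  # 1
```

Keeping the policy separate from the decomposition would leave manual repartitioning from the simulation script untouched, which is the flexibility asked for above.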
> Resort can be directly triggered from the interface, and this is basically what the AtomDecomposition does.
Could you elaborate? I don't get what you want to tell me. :)
Codecov Report
Merging #3662 into python will decrease coverage by 0%. The diff coverage is 31%.
@@ Coverage Diff @@
## python #3662 +/- ##
=======================================
- Coverage 88% 87% -1%
=======================================
Files 524 532 +8
Lines 23471 23782 +311
=======================================
+ Hits 20658 20742 +84
- Misses 2813 3040 +227
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/core/CellStructure.hpp | 100% <ø> (ø) | |
| src/core/communication.cpp | 91% <0%> (-4%) | :arrow_down: |
| src/core/generic-dd/metric.cpp | 0% <0%> (ø) | |
| src/core/generic-dd/metric.hpp | 0% <0%> (ø) | |
| src/core/ghosts.hpp | 100% <ø> (ø) | |
| src/script_interface/generic_dd/si_generic_dd.hpp | 0% <0%> (ø) | |
| src/script_interface/generic_dd/si_metric.hpp | 0% <0%> (ø) | |
| src/core/generic-dd/generic_dd.cpp | 7% <7%> (ø) | |
| src/core/ghosts.cpp | 82% <12%> (-18%) | :arrow_down: |
| src/core/cells.cpp | 82% <25%> (-6%) | :arrow_down: |
| ... and 16 more | | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 8f105e3...a088aae.
> Resort can be directly triggered from the interface, and this is basically what the AtomDecomposition does.
I just wanted to say that you'd still have the possibility to call it manually, but I guess you can also directly do that via the python binding of generic_dd.
> Do you see any problems with also offering manual repart capabilities
No, I think that's fine.
> system.cell_system.set_generic_dd(..., auto_loadbalancing="npart");
is about what I had in mind.
As you are saying, this can probably also be addressed later. The test failures are due to the wait_any issue you described earlier, I suppose?
Test failures: Yes, I will take care of wait_any today. Also, my [[noreturn]] failed on older compilers because errexit was not [[noreturn]]. However, the osx tests do not seem to like making errexit [[noreturn]].
I am currently looking into the failing test cases and will ping you once I'm done.
> Also, my [[noreturn]] failed on older compilers because errexit was not [[noreturn]]. However, the osx tests do not seem to like making errexit [[noreturn]].
Just a quick note: the Clang 6 jobs have been removed recently in favor of Clang 9. The osx-cuda job was removed. For AppleClang 9 on osx, I'm not sure why there's an error, it should support attributes.
> AppleClang 9
That is somewhere between Clang 6 and Clang 7 if I remember correctly. AppleClang's version numbers match the Xcode major version number, not the Clang major version number.
However, even Clang 6 should have supported [[noreturn]], which was introduced in C++11.
@jngrad @mkuron You're right. This was actually a linker error. Noreturn works fine.