Interface Walberla LB
Current status:
- All LB features except the pressure tensor are in
- The pressure tensor can be merged as soon as the corresponding PRs in lbmpy and Walberla are merged
- Electrokinetics is WIP and tracked in a separate ticket
- Documentation is not yet adapted
- GPU support and support for switching to single precision on the CPU are not yet done.
The current plan is to merge this after the release of 4.2
Codecov Report
Merging #2701 into python will decrease coverage by 7%. The diff coverage is 38%.
```diff
@@           Coverage Diff            @@
##           python   #2701     +/-   ##
=========================================
- Coverage      89%     81%       -8%
=========================================
  Files         557     559        +2
  Lines       24326   25963     +1637
=========================================
- Hits        21698   21100      -598
- Misses       2628    4863     +2235
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/core/MpiCallbacks.hpp | 97% <ø> (ø) | |
| src/core/communication.hpp | 100% <ø> (ø) | |
| src/core/electrostatics_magnetostatics/coulomb.cpp | 79% <ø> (-1%) | :arrow_down: |
| ...d_based_algorithms/FluctuatingMRT_LatticeModel.cpp | 0% <0%> (ø) | |
| ...rid_based_algorithms/FluctuatingMRT_LatticeModel.h | 0% <0%> (ø) | |
| ...based_algorithms/LbWalberlaD3Q19FluctuatingMRT.hpp | 0% <0%> (ø) | |
| ...ore/grid_based_algorithms/lb_particle_coupling.hpp | 100% <ø> (ø) | |
| .../grid_based_algorithms/lbboundaries/LBBoundary.hpp | 73% <ø> (-27%) | :arrow_down: |
| src/core/grid_based_algorithms/philox_rand.h | 0% <0%> (ø) | |
| src/core/grid_based_algorithms/lb_interface.cpp | 36% <45%> (-35%) | :arrow_down: |
| ... and 54 more | | |
Continue to review the full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95a9464...379b1d3. Read the comment docs.
The implementation status of Espresso's LB interface is tracked here: https://github.com/espressomd/espresso/wiki/Walberla_Integartion
@fweik, this could be a starting point for integrating boundary support. Particle coupling only works on a single core for now, if skin > 0.
The LbWalberla class has

```cpp
get_node_velocity_at_boundary(const Utils::Vector3i &node) const;
bool set_node_velocity_at_boundary(const Utils::Vector3i node, const Utils::Vector3d v);
bool remove_node_from_boundary(const Utils::Vector3i &node);
```

all of which are parallel calls. Velocities are in LB units, indices are global.
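A hypothetical usage sketch (assuming an LbWalberla instance named lb; the node index and velocity are made-up values):

```cpp
// All three calls are parallel, i.e. they must be made on all MPI ranks;
// the node index is global and velocities are in LB units.
Utils::Vector3i const node{0, 5, 5};
Utils::Vector3d const v{0.01, 0.0, 0.0};
lb.set_node_velocity_at_boundary(node, v);                  // mark as boundary
auto const v_read = lb.get_node_velocity_at_boundary(node); // read back
lb.remove_node_from_boundary(node);                         // unmark again
```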
@mkuron, there still seems to be an issue with the ghost communication. Could you please take a look at the constructor of LbWalberla? The intention was to have a ghost layer of size int(skin/agrid +1) and to ghost-communicate the full set of populations. Then, velocity interpolation of particles outside the box domain and on ghosts should work. Unfortunately, that is not what happens. The lb_momentum_conservation test cannot be run on more than one node.
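For reference, a sketch of the intended ghost-layer count described above (variable names assumed, not the actual constructor code):

```cpp
// Number of ghost layers needed to cover the Verlet skin, so velocity
// interpolation for particles slightly outside the local domain can be
// served from ghost-layer data.
int ghost_layers_for_skin(double skin, double agrid) {
  return static_cast<int>(skin / agrid + 1.0);
}
```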
The communication looks fine. Are you sure it's not the domain decomposition where the error is coming from? If PBCs work, then the communication is also working.
@mkuron, I investigated the NaN proliferation from Walberla's UBB() further. It only appears when nodes are marked as boundary at one of the domain boundaries. As far as I can tell, this case is not covered by Walberla's testing of the UBB; there, SimpleUBB is always used for channels.
I extracted the channel test to src/core/unit_tests/LbWalberla.cpp (boundary_test_shear) to get rid of potential Espresso interference. Could you please take a look at the constructor of the LbWalberla class in src/core/grid_based_algorithms/LbWalberla.cpp and at the LbWalberla::LB_Boundary_handling class in src/core/grid_based_algorithms/LbWalberla.hpp? Maybe a boundary-related communication or the like is still missing? The UBB uses ghost-layer fields. Do we have to add that to the communication in the LbWalberla constructor?
https://github.com/RudolfWeeber/espresso/tree/walberla/src/core/grid_based_algorithms
> when nodes are marked as boundary at one of the domain boundaries
You need to consistently mark cells on both sides of the PBC yourself. Communication cannot take care of that for you. So when you flag a boundary at x=0, you need to do the same at x=L.
On Thu, Aug 15, 2019 at 04:21:07AM -0700, Michael Kuron wrote:
> when nodes are marked as boundary at one of the domain boundaries

> You need to consistently mark cells on both sides of the PBC yourself. Communication cannot take care of that for you. So when you flag a boundary at x=0, you need to do the same at x=L.

Where L is the grid dimension and the index runs from 0 to L-1?
Exactly. So your ghost cells would be at x=-1 and x=L.
OK, so it's not pretty, but it works. Since the opposing ghost cells can be on a different MPI rank, all ranks now go through the full LB grid and also mark boundary cells shifted by a full lattice size, in all combinations of coordinates.
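For concreteness, a sketch of that marking loop (not the actual implementation; std::array stands in for Utils::Vector3i, and set_boundary_flag is a hypothetical callback that ignores cells outside the local domain and its ghost layer):

```cpp
#include <array>
#include <functional>

using Node = std::array<int, 3>;

// For every boundary cell, also flag its images shifted by +/- one full
// lattice size along each axis, so the ghost layers on the opposite side
// of the box see a consistent boundary flag.
void mark_with_periodic_images(
    Node const &node, Node const &grid_dim,
    std::function<void(Node const &)> const &set_boundary_flag) {
  for (int sx : {-1, 0, 1})
    for (int sy : {-1, 0, 1})
      for (int sz : {-1, 0, 1})
        set_boundary_flag({node[0] + sx * grid_dim[0],
                           node[1] + sy * grid_dim[1],
                           node[2] + sz * grid_dim[2]});
}
```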
In the steady-state case, the shear profiles now match (testsuite/python/lb_shear.py with 500 integration steps). The time evolution is not yet correct. Maybe the viscosity used in LbWalberla::LbWalberla has to be converted to lattice units.
With the viscosity converted to lattice units, the time-dependent shear profile is now reproduced.
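For reference, the conversion follows from dimensional analysis, since kinematic viscosity carries units of length²/time (variable names assumed):

```cpp
// Convert a kinematic viscosity from MD units to lattice units: multiply
// by the LB time step and divide by the grid spacing squared.
double viscosity_to_lattice_units(double nu_md, double tau, double agrid) {
  return nu_md * tau / (agrid * agrid);
}
```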
So, the boundary force from Walberla for the Couette flow along the shear direction matches the expected result (lb_shear.py).
@mkuron, the UBB also reports a force perpendicular to the wall. The Walberla test BoundaryForceCouette does not check this. What is the expected value? It is of similar magnitude to the equilibrium pressure times the surface area, but off by a factor of not quite 2-3, depending on parameters.
Equilibrium pressure is: p_eq = DENS * AGRID^2 / TIME_STEP^2 / 3
agrid, density, and viscosity are all != 1, but setting them to 1 doesn't help, so I don't think it is a unit-conversion problem.
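For comparison, a sketch of the quantity the perpendicular force is being compared against (hypothetical names; the wall is assumed to span the full x-y cross-section of the box):

```cpp
// Equilibrium pressure p_eq = rho * c_s^2 with c_s^2 = (agrid/tau)^2 / 3,
// multiplied by the wall area.
double expected_perpendicular_force(double dens, double agrid, double tau,
                                    double box_l_x, double box_l_y) {
  double const p_eq = dens * agrid * agrid / (tau * tau) / 3.0;
  return p_eq * box_l_x * box_l_y;
}
```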
@RudolfWeeber you seem to have forgotten to commit grid_based_algorithms/lbboundaries/LBBoundary.cpp
Added it.
Current state (to my understanding)
Working:
- Basic LB setup and most property getters/setters
- parallelization up to 2 cores (Walberla enumerates MPI ranks differently)
- Velocity boundary conditions validated via onset of Couette flow
- partial support for forces on boundaries (tangential force on boundaries for Couette flow)
- standard particle coupling (there are slight differences between what Walberla and Espresso do with regard to the forces applied to the LB)
- inertialess tracer coupling
- LB electrohydrodynamics
Not working:
- ENGINE (probably affected by the same bug as ENGINE for Espresso's CPU LB)
- setting bulk viscosity independently
- non-equal time steps for MD and Walberla LB
- Thermalization
This is the current state as per the test suite. Thermalization is waiting for the type erasure work. That will come tomorrow, if all goes well.
Walberla is currently limited to 2 MPI ranks.
Working:
- lb_buoyancy_force.py passes on 2 cores
- lb_poiseuille.py passes on 2 cores
- lb_thermo_virtual.py works on 2 cores
- lb_walberla.py passes on 2 cores, but is probably obsolete by now
- linear_momentum_lb.py passes on 2 cores
- lb_electrohydrodynamics.py
Working on 1 core:
- lb_boundary.py On 2 nodes, the head node doesn't see the correct boundary flags for boundaries stored on the 2nd node. MPI ranks should probably only answer if the site is in their local domain (see the sketch after this list).
- lb_boundary_volume_force.py passes on 1 core. Wrong values on 2 cores. Issue with the boundary force reduction?
Partial:
- lb.py passes on 2 cores with the thermalization tests disabled
- lb_momentum_conservation.py Worse results on 2 nodes than on 1. Likely a bug. It is generally unclear what tolerance is acceptable; a unit test for viscous coupling is needed.
- lb_shear.py The velocity profile is correct on 2 cores, but the stress tensor is wrong on 1 and 2 cores (#3464)
Broken:
- engine_lb.py Broken. Slight mismatch. This should be looked at after the unit testing for particle coupling is done. Note that the total momentum used in the test now contains dt * F/2 for all point forces applied to the LB in the previous time step. This is a behaviour change.
- lb_boundary_velocity.py Broken. Not entirely clear why it ever worked, though.
- lb_interpolation.py Broken (not examined)
- lb_poiseuille_cylinder.py Broken, but the values appear to be proportional
- lb_stokes_sphere.py Results off by some 10 percent. Either an issue with the different boundary handling or with the different LB model
- virtual_sites_tracers_walberla.py Broken. Likely, the case of inertialess tracers with no LB active is not handled correctly
Waiting for thermalization:
- lb_thermostat.py Waiting for thermalization
- lb_density.py Needs thermalization. Waiting for the type erasure work
- lb.py thermalization part
Other:
- lb_streaming.py Not applicable until the same LB model as in Espresso is used.
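Regarding the lb_boundary.py item above, a sketch of the suggested reply pattern (hypothetical helper names; std::optional stands in for whatever optional type the interface uses):

```cpp
#include <array>
#include <optional>

using Node = std::array<int, 3>;

// Hypothetical stand-ins for the real domain-decomposition query and the
// local boundary-flag lookup.
bool node_in_local_domain(Node const &node);
bool local_boundary_flag(Node const &node);

// Only the rank whose domain contains the site answers; the head node can
// then pick the single non-empty reply after gathering.
std::optional<bool> get_node_is_boundary(Node const &node) {
  if (!node_in_local_domain(node))
    return std::nullopt; // this rank does not own the site
  return local_boundary_flag(node);
}
```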
@itischler, I merged your changes moving the interpolation to Espresso. Thanks.
I applied 2 corrections:
- Removed an extra factor of 8 in get_velocity_at_pos()
- The offset of the LB lattice is agrid/2, i.e. 1/2 in lattice units. I changed that (from 0) in the calls to the B-spline interpolation.
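For concreteness, the index mapping implied by that offset (a sketch; LB node i sits at position (i + 1/2) * agrid):

```cpp
#include <cmath>

// With cell centers at (i + 1/2) * agrid, the lowest node entering the
// interpolation stencil for a given position is found by subtracting the
// half-cell offset before flooring.
int lower_interpolation_node(double pos, double agrid) {
  return static_cast<int>(std::floor(pos / agrid - 0.5));
}
```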
I also added exceptions for the case that interpolation source nodes are inaccessible.
With these changes, the code works in the middle of the box (lb_momentum_conservation.py). But as soon as the particle gets closer than agrid/2 to a box boundary, the total momentum derails. The lb_momentum_conservation.py test prints the momentum over time and a message once the boundary gets close.
Would you be willing to investigate this further? Preferably by mimicking the components of the particle coupling in the unit test, or by making the machinery in lb_particle_coupling.cpp usable in the unit test (if it isn't already).
There are two possible causes:
- velocity interpolation (e.g., outdated info in the PDF field ghost layer)
- loss or double-counting of applied forces. If F_c is the coupling force in a time step, then sum_i f_i should equal F_c, where i runs over all NON-GHOST lattice sites on all MPI ranks. To my understanding, the two are only equal for the entire system, not per node, so an MPI reduction would be needed in the unit test.
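A sketch of that reduction (names assumed; each rank sums only its non-ghost sites):

```cpp
#include <functional>
#include <numeric>
#include <vector>

#include <boost/mpi/collectives.hpp>
#include <boost/mpi/communicator.hpp>

// Sum one force component over the local NON-GHOST lattice sites, then
// reduce over all MPI ranks; the global total should match the coupling
// force F_c applied during the time step.
double total_applied_force(std::vector<double> const &local_site_forces) {
  double const local_sum =
      std::accumulate(local_site_forces.begin(), local_site_forces.end(), 0.0);
  boost::mpi::communicator world;
  double global_sum = 0.0;
  boost::mpi::all_reduce(world, local_sum, global_sum, std::plus<double>());
  return global_sum; // compare against F_c within a tolerance
}
```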
@RudolfWeeber the error in test lb_shear is random, with all the characteristics of a dangling pointer. I could reproduce it three times in the Docker container on coyote10, but not systematically, only with 2 MPI ranks, and only the first time make check_python was executed. Reproducing the error with a debug build of espresso yielded the following backtrace:
[1][ERROR ]-----(18.519 sec) Assertion failed!
[1] File: /home/espresso/espresso/build/walberla-prefix/src/walberla/src/field/allocation/FieldAllocator.h:149
[1] Expression: referenceCounts_.find(mem) != referenceCounts_.end()
[1]
[1]
[1] Fatal error came from /home/espresso/espresso/build/walberla-prefix/src/walberla/src/core/debug/CheckFunctions.cpp:41
[1] Aborting now ...
[1]
[1] Stack backtrace:
[1] Backtrace:
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::debug::printStacktrace(std::ostream&)
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::Abort::defaultAbort(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool)
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::Abort::abort(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::debug::check_functions_detail::ExitHandler::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
[1] /home/espresso/espresso/build/src/core/EspressoCore.so void walberla::debug::check_functions_detail::check<walberla::debug::check_functions_detail::ExitHandler>(char const*, char const*, int, walberla::debug::check_functions_detail::ExitHandler)
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::field::FieldAllocator<double>::decrementReferenceCount(double*)
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::field::Field<double, 19ul>::~Field()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::field::GhostLayerField<double, 19ul>::~GhostLayerField()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::lbm::PdfField<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >::~PdfField()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::lbm::PdfField<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >::~PdfField()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::domain_decomposition::internal::BlockData::Data<walberla::lbm::PdfField<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> > >::~Data()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::domain_decomposition::internal::BlockData::Data<walberla::lbm::PdfField<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> > >::~Data()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::domain_decomposition::IBlock::~IBlock()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::blockforest::BlockForest::~BlockForest()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so walberla::blockforest::StructuredBlockForest::~StructuredBlockForest()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so std::__shared_ptr<walberla::blockforest::StructuredBlockForest, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr()
[1] /home/espresso/espresso/build/src/core/EspressoCore.so std::shared_ptr<walberla::blockforest::StructuredBlockForest>::~shared_ptr()
[1]
[1]
[1]
[1] (from: /home/espresso/espresso/build/walberla-prefix/src/walberla/src/core/debug/CheckFunctions.cpp:41)
I couldn't reproduce it again after that, so I haven't got a GDB backtrace.
Finally managed to catch the exception in GDB using
/usr/bin/mpiexec -n 1 ./pypresso --gdb testsuite/python/lb_shear.py : -n 1 ./pypresso testsuite/python/lb_shear.py
However, the backtrace is completely different:
#0 0x00007f2b29d42ced in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00007f2b2ae676c8 in walberla::domain_decomposition::internal::BlockData::thrower<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> > (ptr=0x2179df0) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/domain_decomposition/IBlock.h:158
#2 0x00007f2b2ae27ba9 in walberla::domain_decomposition::internal::BlockData::get<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> > (this=0x2179270) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/domain_decomposition/IBlock.h:101
#3 0x00007f2b2ae1dea4 in walberla::domain_decomposition::internal::BlockData::get<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> > (this=0x2179270) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/domain_decomposition/IBlock.h:117
#4 0x00007f2b2ae11ed7 in walberla::domain_decomposition::IBlock::getData<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> > (this=0x2178df0, index=...) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/domain_decomposition/IBlock.h:342
#5 0x00007f2b2ae04f7d in walberla::domain_decomposition::IBlock::getData<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> > (this=0x2178df0, index=...) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/domain_decomposition/IBlock.h:355
#6 0x00007f2b2ae6d87b in walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >::configure (this=0x20d8498, block=...) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/lbm/lattice_model/ForceModel.h:573
warning: RTTI symbol not found for class 'walberla::blockforest::StructuredBlockForest'
#7 0x00007f2b2ae67e5c in walberla::lbm::LatticeModelBase<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2>::configure (this=0x20d8468, block=..., sbs=...) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/lbm/lattice_model/LatticeModelBase.h:102
#8 0x00007f2b2ae61cc8 in walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >::allocateDispatch (this=0x20d8430, block=0x2178df0, _initialize=true, initialDensity=1) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/lbm/field/AddToStorage.h:136
#9 0x00007f2b2ae508e8 in walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >::allocate (this=0x20d8430, block=0x2178df0) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/lbm/field/AddToStorage.h:93
#10 0x00007f2b2ae50600 in walberla::field::BlockDataHandling<walberla::lbm::PdfField<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >, false>::initialize (this=0x20d8430, block=0x2178df0) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/field/blockforest/BlockDataHandling.h:54
#11 0x00007f2b2ae4fce8 in walberla::blockforest::internal::BlockDataHandlingHelper<walberla::lbm::PdfField<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> > >::initialize (this=0x2179b70, block=0x2178df0) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/blockforest/BlockDataHandling.h:126
#12 0x00007f2b2aecee8d in walberla::domain_decomposition::BlockStorage::addBlockData(walberla::selectable::SetSelectableObject<std::shared_ptr<walberla::domain_decomposition::internal::BlockDataHandlingWrapper>, walberla::uid::UID<walberla::uid::suidgenerator::S> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/espresso/es3/espresso/build/src/core/EspressoCore.so
#13 0x00007f2b2adc730f in walberla::blockforest::BlockForest::addBlockData (this=0x2178a70, dataHandling=..., identifier="pdf field") at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/blockforest/BlockForest.h:364
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >, std::allocator<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> > >, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >, std::allocator<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> > >, (__gnu_cxx::_Lock_policy)2>'
#14 0x00007f2b2adea479 in walberla::blockforest::BlockForest::addBlockData<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> > > (this=0x2178a70, dataHandling=std::shared_ptr<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1> >, 2> >> (use count 3, weak count 0) = {...}, identifier="pdf field", requiredSelectors=..., incompatibleSelectors=...) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/blockforest/BlockForest.h:867
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >, std::allocator<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> > >, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >, std::allocator<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> > >, (__gnu_cxx::_Lock_policy)2>'
#15 0x00007f2b2ade06de in walberla::blockforest::StructuredBlockForest::addBlockData<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> > > (this=0x20f32d0, dataHandling=std::shared_ptr<walberla::lbm::internal::PdfFieldHandling<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1> >, 2> >> (use count 3, weak count 0) = {...}, identifier="pdf field", requiredSelectors=..., incompatibleSelectors=...) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/blockforest/StructuredBlockForest.h:139
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<walberla::blockforest::StructuredBlockForest, std::allocator<walberla::blockforest::StructuredBlockForest>, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<walberla::blockforest::StructuredBlockForest, std::allocator<walberla::blockforest::StructuredBlockForest>, (__gnu_cxx::_Lock_policy)2>'
#16 0x00007f2b2add8380 in walberla::lbm::addPdfFieldToStorage<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2>, walberla::blockforest::StructuredBlockForest> (blocks=std::shared_ptr<walberla::blockforest::StructuredBlockForest> (use count 2, weak count 3) = {...}, identifier="pdf field", latticeModel=warning: RTTI symbol not found for class 'walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2>'
..., initialVelocity=..., initialDensity=1, ghostLayers=2, layout=@0x7ffc109d0380: walberla::field::zyxf, requiredSelectors=..., incompatibleSelectors=...) at /home/espresso/es3/espresso/build/walberla-prefix/src/walberla/src/lbm/field/AddToStorage.h:212
#17 0x00007f2b2add0faa in walberla::LbWalberla<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >::setup_with_valid_lattice_model (this=0x7ffc109d0d50) at /home/espresso/es3/espresso/src/core/grid_based_algorithms/LbWalberla_impl.hpp:344
#18 0x00007f2b2adcad58 in walberla::LbWalberlaD3Q19TRT::LbWalberlaD3Q19TRT (this=0x7ffc109d0d50, viscosity=0.28888888888888892, density=0.49679999999999991, agrid=0.59999999999999998, tau=0.02, box_dimensions=..., node_grid=..., n_ghost_layers=2) at /home/espresso/es3/espresso/src/core/grid_based_algorithms/LbWalberlaD3Q19TRT.hpp:17
#19 0x00007f2b2adc2905 in init_lb_walberla (viscosity=0.28888888888888892, density=0.49679999999999991, agrid=0.59999999999999998, tau=0.02, box_dimensions=..., node_grid=..., skin=0.23999999999999999) at /home/espresso/es3/espresso/src/core/grid_based_algorithms/lb_walberla_instance.cpp:46
#20 0x00007f2b2add281c in Communication::MpiCallbacks::call_all<double, double, double, double, Utils::Vector<double, 3ul> const&, Utils::Vector<int, 3ul> const&, double, double&, double, double&, double&, Utils::Vector<double, 3ul> const&, Utils::Vector<int, 3ul>&, double&> (this=0x20d7300, fp=0x7f2b2adc2856 <init_lb_walberla(double, double, double, double, Utils::Vector<double, 3ul> const&, Utils::Vector<int, 3ul> const&, double)>, args#0=@0x7ffc109d0fd8: 0.28888888888888892, args#1=@0x7ffc109d0fe0: 0.49679999999999991, args#2=@0x7ffc109d0fc8: 0.59999999999999998, args#3=@0x7ffc109d0fc0: 0.02, args#4=..., args#5=..., args#6=@0x7f2b2b435a70: 0.23999999999999999) at /home/espresso/es3/espresso/src/core/MpiCallbacks.hpp:573
#21 0x00007f2b2adc2c16 in mpi_init_lb_walberla (viscosity=0.28888888888888892, density=2.2999999999999998, agrid=0.59999999999999998, tau=0.02) at /home/espresso/es3/espresso/src/core/grid_based_algorithms/lb_walberla_instance.cpp:60
#22 0x00007f2b09888d8f in __pyx_pf_10espressomd_2lb_15LBFluidWalberla_10_activate_method (__pyx_v_self=0x7f2b18037b08) at /home/espresso/es3/espresso/build/src/python/espressomd/lb.cpp:7529
#23 0x00007f2b098882be in __pyx_pw_10espressomd_2lb_15LBFluidWalberla_11_activate_method (__pyx_v_self=0x7f2b18037b08, unused=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/lb.cpp:7434
#24 0x00007f2b2823c3f6 in __Pyx_CyFunction_CallMethod (func=0x7f2b231b3df0, self=0x7f2b18037b08, arg=0x7f2b2e7bd048, kw=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/script_interface.cpp:15940
#25 0x00007f2b2823c704 in __Pyx_CyFunction_CallAsMethod (func=0x7f2b231b3df0, args=0x7f2b18038550, kw=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/script_interface.cpp:15992
#26 0x00007f2b09ae0dc2 in __Pyx_PyObject_Call (func=0x7f2b231b3df0, arg=0x7f2b18038550, kw=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/actors.cpp:9403
#27 0x00007f2b09ae1043 in __Pyx__PyObject_CallOneArg (func=0x7f2b231b3df0, arg=0x7f2b18037b08) at /home/espresso/es3/espresso/build/src/python/espressomd/actors.cpp:9442
#28 0x00007f2b09ae116a in __Pyx_PyObject_CallOneArg (func=0x7f2b231b3df0, arg=0x7f2b18037b08) at /home/espresso/es3/espresso/build/src/python/espressomd/actors.cpp:9461
#29 0x00007f2b09ac76f3 in __pyx_pf_10espressomd_6actors_5Actor_6_activate (__pyx_v_self=0x7f2b18037b08) at /home/espresso/es3/espresso/build/src/python/espressomd/actors.cpp:2491
#30 0x00007f2b09ac661e in __pyx_pw_10espressomd_6actors_5Actor_7_activate (__pyx_v_self=0x7f2b18037b08, unused=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/actors.cpp:2265
#31 0x00007f2b2823c3f6 in __Pyx_CyFunction_CallMethod (func=0x7f2b231ab1b8, self=0x7f2b18037b08, arg=0x7f2b2e7bd048, kw=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/script_interface.cpp:15940
#32 0x00007f2b2823c704 in __Pyx_CyFunction_CallAsMethod (func=0x7f2b231ab1b8, args=0x7f2b18038400, kw=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/script_interface.cpp:15992
#33 0x00007f2b09ae0dc2 in __Pyx_PyObject_Call (func=0x7f2b231ab1b8, arg=0x7f2b18038400, kw=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/actors.cpp:9403
#34 0x00007f2b09ae1043 in __Pyx__PyObject_CallOneArg (func=0x7f2b231ab1b8, arg=0x7f2b18037b08) at /home/espresso/es3/espresso/build/src/python/espressomd/actors.cpp:9442
#35 0x00007f2b09ae116a in __Pyx_PyObject_CallOneArg (func=0x7f2b231ab1b8, arg=0x7f2b18037b08) at /home/espresso/es3/espresso/build/src/python/espressomd/actors.cpp:9461
#36 0x00007f2b09ad3ea9 in __pyx_pf_10espressomd_6actors_6Actors_4add (__pyx_self=0x7f2b231ac048, __pyx_v_self=0x7f2b2d345e10, __pyx_v_actor=0x7f2b18037b08) at /home/espresso/es3/espresso/build/src/python/espressomd/actors.cpp:6033
#37 0x00007f2b09ad3a2d in __pyx_pw_10espressomd_6actors_6Actors_5add (__pyx_self=0x7f2b231ac048, __pyx_args=0x7f2b180906c8, __pyx_kwds=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/actors.cpp:5965
#38 0x00007f2b2823c385 in __Pyx_CyFunction_CallMethod (func=0x7f2b231ac048, self=0x7f2b231ac048, arg=0x7f2b180906c8, kw=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/script_interface.cpp:15935
#39 0x00007f2b2823c5f6 in __Pyx_CyFunction_Call (func=0x7f2b231ac048, arg=0x7f2b180906c8, kw=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/script_interface.cpp:15974
#40 0x00007f2b2823c759 in __Pyx_CyFunction_CallAsMethod (func=0x7f2b231ac048, args=0x7f2b180906c8, kw=0x0) at /home/espresso/es3/espresso/build/src/python/espressomd/script_interface.cpp:15995
I cut at frame 40 because frames 41 to 106 are in the Python executable and don't have debug info. If you need the full GDB log, it's in gdb-bt.log
@RudolfWeeber I can break on the assertion line, but can't do much from there: with continue, GDB will exit on the assertion; with catch throw followed by continue, GDB will catch the irrelevant __cxa_throw in the next loop. If I set a breakpoint on the body of the assertion macro before the MPI abort (shown below), GDB won't actually catch it.
(gdb) set breakpoint pending on
break /home/espresso/es4/espresso/build/walberla-prefix/src/walberla/src/core/debug/CheckFunctions.cpp:41
run
(gdb) No symbol table is loaded. Use the "file" command.
Breakpoint 1 (/home/espresso/es4/espresso/build/walberla-prefix/src/walberla/src/core/debug/CheckFunctions.cpp:41) pending.
(gdb) warning: Error disabling address space randomization: Operation not permitted
Starting program: /usr/bin/python3 testsuite/python/lb_shear.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fda97198700 (LWP 47878)]
[New Thread 0x7fda96997700 (LWP 47879)]
[1][ERROR ]-----(18.393 sec) Assertion failed!
[1] File: /home/espresso/es4/espresso/build/walberla-prefix/src/walberla/src/field/allocation/FieldAllocator.h:149
[1] Expression: referenceCounts_.find(mem) != referenceCounts_.end()
...
[1] (from: /home/espresso/es4/espresso/build/walberla-prefix/src/walberla/src/core/debug/CheckFunctions.cpp:41)
The assertion is raised on rank 1 almost every time; no wonder I couldn't catch it from a GDB session on rank 0. When starting 2 GDB sessions, mpiexec automatically passes 'quit' to the extra GDB process. I cannot start two GDB sessions in separate windows with xterm -e because Docker has no GUI. I tried starting two GDB sessions in a terminal multiplexer with mpirun -np 2 screen -AdmS mpi ./pypresso --gdb="-ex run" testsuite/python/lb_shear.py, but got an ompi_mpi_init: ompi_rte_init failed error in Docker (and outside of Docker too). Running out of ideas.
It is apparently possible to forward X out of a container:
http://fabiorehm.com/blog/2014/09/11/running-gui-apps-with-docker/
Unable to forward an X window from a Docker container through an SSH connection despite considerable effort. The issue is reproducible on the institute machines, although much less frequently.
To simplify things, I added std::raise(SIGABRT); (from #include <csignal>) at line 40 of walberla/src/core/debug/CheckFunctions.cpp, right before WALBERLA_ABORT(). On Linux, GDB stops at the abort signal without the need to set up a catch throw or a breakpoint. Next, simply run GDB in X windows, many times, until the error triggers:
mpirun -np 2 xterm -fa 'Monospace' -fs 13 -e ./pypresso --gdb="-ex run" testsuite/python/lb_shear.py
GDB backtrace:
#0 __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00001555515e7017 in walberla::debug::check_functions_detail::ExitHandler::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /work/jgrad/es-walberla/espresso/build/src/core/EspressoCore.so
#2 0x00001555514f9128 in walberla::debug::check_functions_detail::check<walberla::debug::check_functions_detail::ExitHandler> (
expression=0x15555174d048 "referenceCounts_.find(mem) != referenceCounts_.end()",
filename=0x15555174cfe0 "/work/jgrad/es-walberla/espresso/build/walberla-prefix/src/walberla/src/field/allocation/FieldAllocator.h", line=149, failFunc=...)
at /work/jgrad/es-walberla/espresso/build/walberla-prefix/src/walberla/src/core/debug/CheckFunctions.impl.h:288
#3 0x000015555156f2ca in walberla::field::FieldAllocator<walberla::math::Vector3<double> >::decrementReferenceCount (this=0xe87ea0, mem=0x1375730)
at /work/jgrad/es-walberla/espresso/build/walberla-prefix/src/walberla/src/field/allocation/FieldAllocator.h:149
#4 0x000015555156e7b2 in walberla::field::Field<walberla::math::Vector3<double>, 1ul>::~Field (this=0x13568f0, __in_chrg=<optimized out>)
at /work/jgrad/es-walberla/espresso/build/walberla-prefix/src/walberla/src/field/Field.impl.h:324
#5 0x0000155551589114 in walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul>::~GhostLayerField (this=0x13568f0, __in_chrg=<optimized out>)
at /work/jgrad/es-walberla/espresso/build/walberla-prefix/src/walberla/src/field/GhostLayerField.h:88
#6 0x0000155551589130 in walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul>::~GhostLayerField (this=0x13568f0, __in_chrg=<optimized out>)
at /work/jgrad/es-walberla/espresso/build/walberla-prefix/src/walberla/src/field/GhostLayerField.h:88
#7 0x00001555515aaa66 in walberla::domain_decomposition::internal::BlockData::Data<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >::~Data
(this=0x13a60b0, __in_chrg=<optimized out>) at /work/jgrad/es-walberla/espresso/build/walberla-prefix/src/walberla/src/domain_decomposition/IBlock.h:61
#8 0x00001555515aaa8e in walberla::domain_decomposition::internal::BlockData::Data<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >::~Data
(this=0x13a60b0, __in_chrg=<optimized out>) at /work/jgrad/es-walberla/espresso/build/walberla-prefix/src/walberla/src/domain_decomposition/IBlock.h:61
#9 0x00001555516000ac in walberla::domain_decomposition::IBlock::~IBlock() () from /work/jgrad/es-walberla/espresso/build/src/core/EspressoCore.so
#10 0x0000155551310cc6 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x129e930) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#11 0x00001555516547e8 in walberla::blockforest::BlockForest::~BlockForest() () from /work/jgrad/es-walberla/espresso/build/src/core/EspressoCore.so
#12 0x0000155551630961 in walberla::blockforest::StructuredBlockForest::~StructuredBlockForest() () from /work/jgrad/es-walberla/espresso/build/src/core/EspressoCore.so
#13 0x0000155551310cc6 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x127f8b0) at /usr/include/c++/7/bits/shared_ptr_base.h:154
#14 0x0000155551310c81 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x13574a8, __in_chrg=<optimized out>)
at /usr/include/c++/7/bits/shared_ptr_base.h:684
#15 0x00001555514f7b94 in std::__shared_ptr<walberla::blockforest::StructuredBlockForest, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x13574a0,
__in_chrg=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:1123
#16 0x00001555514f7bb0 in std::shared_ptr<walberla::blockforest::StructuredBlockForest>::~shared_ptr (this=0x13574a0, __in_chrg=<optimized out>)
at /usr/include/c++/7/bits/shared_ptr.h:93
#17 0x00001555514f7d1a in walberla::LbWalberla<walberla::lbm::D3Q19<walberla::lbm::collision_model::TRT, false, walberla::lbm::force_model::GuoField<walberla::field::GhostLayerField<walberla::math::Vector3<double>, 1ul> >, 2> >::~LbWalberla (this=0x1357420, __in_chrg=<optimized out>)
at /work/jgrad/es-walberla/espresso/src/core/grid_based_algorithms/LbWalberla_impl.hpp:770
#18 0x0000155551507da4 in walberla::LbWalberlaD3Q19TRT::~LbWalberlaD3Q19TRT (this=0x1357420, __in_chrg=<optimized out>)
at /work/jgrad/es-walberla/espresso/src/core/grid_based_algorithms/LbWalberlaD3Q19TRT.hpp:9
#19 0x0000155551507dc0 in walberla::LbWalberlaD3Q19TRT::~LbWalberlaD3Q19TRT (this=0x1357420, __in_chrg=<optimized out>)
at /work/jgrad/es-walberla/espresso/src/core/grid_based_algorithms/LbWalberlaD3Q19TRT.hpp:9
#20 0x0000155551507cfe in std::default_delete<LbWalberlaBase>::operator() (this=0x155551b682f0 <(anonymous namespace)::lb_walberla_instance>, __ptr=0x1357420)
at /usr/include/c++/7/bits/unique_ptr.h:78
#21 0x00001555514ff25b in std::unique_ptr<LbWalberlaBase, std::default_delete<LbWalberlaBase> >::~unique_ptr (
this=0x155551b682f0 <(anonymous namespace)::lb_walberla_instance>, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/unique_ptr.h:268
#22 0x0000155554f7e041 in __run_exit_handlers (status=0, listp=0x155555326718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true,
run_dtors=run_dtors@entry=true) at exit.c:108
#23 0x0000155554f7e13a in __GI_exit (status=<optimized out>) at exit.c:139
#24 0x00000000006384f7 in Py_Exit (sts=sts@entry=0) at ../Python/pylifecycle.c:1565
#25 0x00000000006385c0 in handle_system_exit () at ../Python/pythonrun.c:626
#26 0x00000000006385ec in PyErr_PrintEx () at ../Python/pythonrun.c:636
#27 0x0000000000638ab3 in PyErr_Print () at ../Python/pythonrun.c:532
#28 PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:425
#29 0x0000000000638c65 in PyRun_AnyFileExFlags () at ../Python/pythonrun.c:81
#30 0x0000000000639631 in run_file (p_cf=0x7fffffffdb7c, filename=<optimized out>, fp=<optimized out>) at ../Modules/main.c:340
#31 Py_Main () at ../Modules/main.c:810
#32 0x00000000004b0f40 in main (argc=2, argv=0x7fffffffdd78) at ../Programs/python.c:69
Following @mkuron's advice, I added bookkeeping printf statements to every memory allocation and deallocation in walberla/src/field/allocation/FieldAllocator.h to track down which object got its reference counter decremented too far. The output for both threads is attached in decref-thread0.log (where the error occurred) and decref-thread1.log (which hung). Each function call prints the full function signature, the pointer, the call to deallocate() if it happened, and the reference counter value after the increment/decrement (in decrementReferenceCount(T*) I actually print that value before and after the decrement, prefixed with a - resp. + sign, because this is where the assertion error is triggered). If anything is unclear, or if you need to reproduce the logs locally, the code diff is available in decref-printf-patch.txt.
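The gist of the instrumentation, as a standalone model (the actual diff is the attached decref-printf-patch.txt; names follow the walberla allocator):

```cpp
#include <cassert>
#include <cstdio>
#include <map>

// Model of the instrumented decrementReferenceCount(): print the counter
// before and after the decrement, so a decrement on an unknown or
// already-freed pointer shows up in the log right before the assertion.
static std::map<void *, unsigned long> referenceCounts_;

bool decrementReferenceCount(void *mem) {
  std::printf("decrementReferenceCount\nmem=%p\n", mem);
  auto const it = referenceCounts_.find(mem);
  assert(it != referenceCounts_.end()); // the assertion that fires
  std::printf("-referenceCounts_[%p] = %lu\n", mem, it->second);
  --(it->second);
  std::printf("+referenceCounts_[%p] = %lu\n", mem, it->second);
  return it->second == 0; // caller deallocates when this returns true
}
```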
The relevant part of the first log is this:
T* walberla::field::FieldAllocator<T>::allocate(walberla::uint_t, walberla::uint_t, walberla::uint_t, walberla::uint_t, walberla::uint_t&, walberla::uint_t&, walberla::uint_t&) [with T = walberla::math::Vector3<double>; walberla::uint_t = long unsigned int]
mem=0x13400d0
referenceCounts_[0x13400d0] = 1
... more instructions ...
bool walberla::field::FieldAllocator<T>::decrementReferenceCount(T*) [with T = walberla::math::Vector3<double>]
mem=0x13400d0
-referenceCounts_[0x13400d0] = 0
rank 0: ExitHandler::operator()
The assertion catches the incorrect reference count. However, it seems impossible to free the memory twice, because the deallocate() call is guarded by refCount == 0, which is skipped if the reference counter is negative, so there shouldn't be a segfault.
In any case, the address 0x13400d0 doesn't appear anywhere else in that log, nor in the other log, so it's unclear to me how its reference counter got decremented to 0. The referenceCounts_ member isn't accessed in other C++ files. Could it be that during garbage collection, parts of the std::map<T*, uint_t> referenceCounts_ get accidentally overwritten with zeros?
@jngrad, thanks for testing. I would have liked to also see backtraces of the two points that print mem=0x13400d0, but I already have a suspicion what might be happening.
In https://i10git.cs.fau.de/walberla/walberla/-/blob/master/src/field/allocation/FieldAllocator.h#L201, the reference counts map is declared as a static member variable of the allocator. In https://i10git.cs.fau.de/walberla/walberla/-/blob/master/src/field/allocation/FieldAllocator.h#L206, it is defined. Perhaps the referenceCounts_ symbol ends up in multiple shared object files, and Espresso's linker flags somehow suppress the duplicate-symbol warning? Could you please print out the memory address of referenceCounts_ every time you print the reference count, to check whether it changes?
An alternative cause could be that the destruction order is wrong, though I see no reason why it should be. The reference count map is constructed when the shared object that creates the field is loaded. The field is constructed after the blockforest has been constructed and is destroyed as part of the blockforest's destruction. Only when the shared object is unloaded should the reference count map be destroyed.
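A minimal illustration of the suspected setup (paraphrasing the linked walberla header, not quoting it):

```cpp
#include <map>

// A static member declared and defined in a header is instantiated in
// every translation unit that includes it. If two shared objects end up
// with separate copies of referenceCounts_, an increment in one copy is
// invisible to the decrement in the other.
template <typename T> class FieldAllocator {
protected:
  static std::map<T *, unsigned long> referenceCounts_; // declaration
};

template <typename T>
std::map<T *, unsigned long> FieldAllocator<T>::referenceCounts_ =
    std::map<T *, unsigned long>(); // definition, lives in the header
```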
@mkuron I'll try that. @fweik had the same suspicion.
@RudolfWeeber we probably should also resolve any merge conflicts ASAP. Otherwise we will have a (citing @jngrad) "merge-party".
@mkuron @fweik I couldn't reproduce the error when walberla::debug::printStacktrace() prints the stack trace. Printing the std::map address reveals no collision, i.e. the map address is the same in both threads, and memory is freed only in the thread where the allocation occurred. The output is in thread0.log, thread1.log. The relevant part is:
T* walberla::field::FieldAllocator<T>::allocate(walberla::uint_t) [with T = double; walberla::uint_t = long unsigned int]
mem=0x13d1670
referenceCounts_[0x13d1670] = 1 (std::map at 0x155551b689c0)
bool walberla::field::FieldAllocator<T>::decrementReferenceCount(T*) [with T = double]
mem=0x13d1670
-referenceCounts_[0x13d1670] = 1 (std::map at 0x155551b689c0)
+referenceCounts_[0x13d1670] = 0 (std::map at 0x155551b689c0)
deallocate()
...
T* walberla::field::FieldAllocator<T>::allocate(walberla::uint_t) [with T = double; walberla::uint_t = long unsigned int]
mem=0x13d1670
referenceCounts_[0x13d1670] = 1 (std::map at 0x155551b689c0)
bool walberla::field::FieldAllocator<T>::decrementReferenceCount(T*) [with T = double]
mem=0x13d1670
-referenceCounts_[0x13d1670] = 0 (std::map at 0x155551b689c0)
Thread 1 "python3" received signal SIGABRT, Aborted
OK, but this would mean that there is a halo reduction (of the force density) anyway?
No, I don’t think so. Each particle within the halo region of a node is coupled exactly once. The distribution-function communication in Walberla needs to make sure that the total amount arrives on the correct cells, across all nodes.
Let’s discuss in person, once I’m back.