GEOS
MPI issue using Surface Generator
Bug description
Running this test case, which needs this mesh file, with more than one process, e.g. `mpirun -np 2 geosx -x 2`, the code hangs forever. If I interrupt it with Ctrl+C, I get:
Frame 1: cxx_utilities::handler(int, int, int)
Frame 2:
Frame 3: opal_progress
Frame 4: ompi_request_default_wait
Frame 5: ompi_coll_base_sendrecv_actual
Frame 6: ompi_coll_base_allgather_intra_two_procs
Frame 7: MPI_Allgather
Frame 8: void geosx::MpiWrapper::allGather<long>(long, LvArray::Array<long, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer>&, ompi_communicator_t*)
Frame 9: geosx::CommunicationTools::AssignNewGlobalIndices(geosx::ObjectManagerBase&, std::set<long, std::less<long>, std::allocator<long> > const&)
Frame 10: geosx::SurfaceGenerator::SeparationDriver(geosx::DomainPartition*, geosx::MeshLevel*, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, int, int, bool, double)
Frame 11: geosx::SurfaceGenerator::SolverStep(double const&, double const&, int, geosx::DomainPartition*)
Frame 12: geosx::SurfaceGenerator::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 13: geosx::EventBase::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 14: geosx::EventManager::Run(geosx::dataRepository::Group*)
Frame 15: geosx::ProblemManager::RunSimulation()
Frame 16: main
Frame 17: __libc_start_main
Frame 18: _start
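The hang itself is typical of a mismatched collective: `MPI_Allgather` blocks until every rank in the communicator enters the call, so if the node splitting happens only on a subset of ranks and the others never reach `AssignNewGlobalIndices`, the ranks that did call it wait forever. Below is a purely serial sketch of what such a collective step computes (the function name is illustrative, not the actual GEOSX API): each rank contributes its count of newly created objects, and its first new global index is derived from the counts of the lower ranks.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical serial stand-in for the allgather-based index assignment:
// counts[rank] is the number of new objects created on that rank, and each
// rank's first new global index is the prefix sum of the lower ranks' counts,
// offset past the largest pre-existing global index.
std::vector< long > firstNewGlobalIndex( std::vector< long > const & newObjectCounts,
                                         long maxExistingGlobalIndex )
{
  std::vector< long > firstIndex( newObjectCounts.size() );
  long offset = maxExistingGlobalIndex + 1;
  for( std::size_t rank = 0; rank < newObjectCounts.size(); ++rank )
  {
    firstIndex[ rank ] = offset;
    offset += newObjectCounts[ rank ];
  }
  return firstIndex;
}
```

Every rank must participate even when its count is zero, which is why a divergent code path on one rank deadlocks all of them inside the collective.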
At the moment I'm using branch #799, but the `SurfaceGenerator` kernel is the same as in `develop`.
Platform:
- Machine: Ubuntu 18.04
- Compiler: gcc 7.4.0
- Cmake: 3.10.2
Note: the extensions are `xml` for the main input and `msh` for the mesh; GitHub forced me to upload them as `txt`.
I created a simpler version that should just duplicate the nodes along a fracture of an unstructured mesh. Because the `SurfaceGenerator` calls `TwoPointFluxApproximation`, which requires the `pressure` field to be defined:
https://github.com/GEOSX/GEOSX/blob/886e9107d0e2a34b9616bfabeee85a59cc95634d/src/coreComponents/finiteVolume/TwoPointFluxApproximation.cpp#L295
any run of `geosx` with this input has to fail with this error:
** StackTrace of 13 frames **
Frame 1: cxx_utilities::handler(int, int, int)
Frame 2: cxx_utilities::handler1(int)
Frame 3: LvArray::Array<double, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer>& geosx::dataRepository::Group::getReference<LvArray::Array<double, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
Frame 4: LvArray::Array<double, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer>& geosx::dataRepository::Group::getReference<LvArray::Array<double, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer> >(char const*)
Frame 5: geosx::TwoPointFluxApproximation::addToFractureStencil(geosx::DomainPartition&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
Frame 6: geosx::SurfaceGenerator::SolverStep(double const&, double const&, int, geosx::DomainPartition*)
Frame 7: geosx::SurfaceGenerator::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 8: geosx::EventBase::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 9: geosx::EventManager::Run(geosx::dataRepository::Group*)
Frame 10: geosx::ProblemManager::RunSimulation()
Frame 11: main
Frame 12: __libc_start_main
Frame 13: _start
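The abort above is just a failed field lookup: `Group::getReference` is asked for the `pressure` wrapper, which was never registered because no flow solver is present. A minimal, purely illustrative sketch of that failure mode (this is not the GEOSX data repository itself; all names here are hypothetical):

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy repository of named fields. Like the real data repository, a lookup of
// an unregistered field cannot return anything useful, so it throws; the run
// then dies in the same way the stack trace above shows.
struct FieldRepository
{
  std::map< std::string, std::vector< double > > fields;

  std::vector< double > & getReference( std::string const & name )
  {
    auto const it = fields.find( name );
    if( it == fields.end() )
    {
      throw std::runtime_error( "field '" + name + "' is not registered" );
    }
    return it->second;
  }
};
```

So any input that triggers `addToFractureStencil` without a solver that registers `pressure` hits this path.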
Nevertheless, with `mpirun -np 3 geosx -x 3 -i file` I get:
** StackTrace of 11 frames **
Frame 1: cxx_utilities::handler(int, int, int)
Frame 2: cxx_utilities::handler1(int)
Frame 3: geosx::verifyGhostingConsistency(geosx::ObjectManagerBase const&, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> > const&)
Frame 4: geosx::CommunicationTools::FindGhosts(geosx::MeshLevel&, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, bool)
Frame 5: geosx::DomainPartition::SetupCommunications(bool)
Frame 6: geosx::ProblemManager::InitializePostSubGroups(geosx::dataRepository::Group*)
Frame 7: geosx::dataRepository::Group::Initialize(geosx::dataRepository::Group*)
Frame 8: geosx::ProblemManager::ProblemSetup()
Frame 9: main
Frame 10: __libc_start_main
Frame 11: _start
To be more precise, I prepared this test case, which defines the `pressure` field, in such a way that the simulation can reach the end. Running with the configuration `mpirun -np 3 geosx -x 3 -i file.xml`, I get:
Rank 1: Expected to send 0 non local ghosts to rank 2 but sending 8
***** ERROR
***** LOCATION: /home/franc90/code/geosx/GEOSX/src/coreComponents/mpiCommunications/CommunicationTools.cpp:526
***** Controlling expression (should be false): error
***** Rank 1: Encountered a ghosting inconsistency in nodeManager
Rank 2: Expected to send 0 non local ghosts to rank 0 but sending 4
Rank 2: Expected to send 0 non local ghosts to rank 1 but sending 8
***** ERROR
***** LOCATION: /home/franc90/code/geosx/GEOSX/src/coreComponents/mpiCommunications/CommunicationTools.cpp:526
***** Controlling expression (should be false): error
***** Rank 2: Encountered a ghosting inconsistency in nodeManager
Received signal 1: Hangup
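The check that fires here is a cross-rank sanity test: for each (sender, receiver) pair, the number of ghosts the sender plans to send must equal the number the receiver expects. A minimal serial sketch of that kind of comparison (hypothetical names and data layout, not the actual `verifyGhostingConsistency` implementation):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Counts keyed by an ordered (sender rank, receiver rank) pair.
using PairCounts = std::map< std::pair< int, int >, long >;

// Compare what each sender intends to send against what each receiver
// expects; any mismatch is a ghosting inconsistency like the one logged above.
std::vector< std::string > findGhostingMismatches( PairCounts const & sending,
                                                   PairCounts const & expecting )
{
  std::vector< std::string > errors;
  for( auto const & [ pair, sent ] : sending )
  {
    long const expected = expecting.count( pair ) ? expecting.at( pair ) : 0;
    if( sent != expected )
    {
      errors.push_back( "Rank " + std::to_string( pair.first ) +
                        ": Expected to send " + std::to_string( expected ) +
                        " non local ghosts to rank " + std::to_string( pair.second ) +
                        " but sending " + std::to_string( sent ) );
    }
  }
  return errors;
}
```

In the log above, rank 1 is sending 8 ghosts to rank 2 that rank 2 never asked for, which is exactly this kind of mismatch.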
I really don't know what the cause could be, but... could it be something similar to #663?
I realized that the issue is not related to the Surface Generator. The problem can be reproduced even without the `SurfaceGenerator` step. Running `mpirun -np 3 geosx -i file` with this pair of xml and msh files, I get:
***** ERROR
***** LOCATION: /home/franc90/code/geosx/GEOSX/src/coreComponents/mpiCommunications/CommunicationTools.cpp:526
***** Controlling expression (should be false): error
***** Rank 2: Encountered a ghosting inconsistency in nodeManager
Rank 1: Expected to send 0 non local ghosts to rank 2 but sending 7
***** ERROR
***** LOCATION: /home/franc90/code/geosx/GEOSX/src/coreComponents/mpiCommunications/CommunicationTools.cpp:526
***** Controlling expression (should be false): error
***** Rank 1: Encountered a ghosting inconsistency in nodeManager
Received signal 1: Hangup
** StackTrace of 10 frames **
Frame 1: cxx_utilities::handler(int, int, int)
Frame 2: geosx::verifyGhostingConsistency(geosx::ObjectManagerBase const&, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> > const&)
Frame 3: geosx::CommunicationTools::FindGhosts(geosx::MeshLevel&, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, bool)
Frame 4: geosx::DomainPartition::SetupCommunications(bool)
Frame 5: geosx::ProblemManager::InitializePostSubGroups(geosx::dataRepository::Group*)
Frame 6: geosx::dataRepository::Group::Initialize(geosx::dataRepository::Group*)
Frame 7: geosx::ProblemManager::ProblemSetup()
Frame 8: main
Frame 9: __libc_start_main
Frame 10: _start
The partition is this:
Any idea on the possible problem? It seems to be related to how GEOSX handles the partitioning of an unstructured mesh.
@af1990 I have a PR I hope to finish up today or tomorrow but then I'll look into it. This seems very similar to #633.
@af1990 I have some good news and some bad news. The good news is that I fixed the error you were getting related to #633. The bad news is that this error was a false positive and I almost certainly didn't fix the issue you are having with the Surface Generator.
Working with this mesh and 2 processes, I realized that the fracture nodes are properly split only on one process (rank 0), while rank 1 sees the nodes on the interface between the two processes as not duplicated.
In this figure, there is the ghost rank for all the elements,
while here there is the ghost rank for the fracture. The highlighted nodes are the problem: they lie on the interface between rank 0 and rank 1, but only rank 0 sees them as duplicated (so the fracture is open), while rank 1 still sees them as not duplicated (so the fracture is closed).
This creates an inconsistency between ranks and is not physically correct (the whole fracture is open, except for the two right-most and left-most edges).
The fracture is created as a pre-step before the simulation and never changes.
@rrsettgast, am I using the SurfaceGenerator in the wrong way? Have you ever observed something similar?
@af1990 I will have to take a look. This looks like a pretty substantial bug, but I have seen this case work previously. Perhaps we made a change and did not have coverage for this case. Can you send me your input file?
Yes, that's the file. It's a simple flow simulation that, with the `develop` branch and `mpirun -np 2`, produces:
Rank 0: 0 3 37 465 81 1639 1676 1711 1654
Rank 0: 1 38 3 81 467 1655 1713 1676 1639
Rank 0: 2 82 466 80 7 1677 1640 1675 1712
Rank 0: 3 468 82 7 83 1714 1678 1640 1677
Rank 0: 4 25 24 439 441 1642 1687 1685 1641
Rank 0: 6 26 25 441 443 1643 1689 1687 1642
Rank 0: 7 27 26 443 445 1644 1691 1689 1643
Rank 0: 8 28 27 445 447 1645 1693 1691 1644
Rank 0: 9 29 28 447 449 1646 1695 1693 1645
Rank 0: 10 30 29 449 451 1647 1697 1695 1646
Rank 0: 11 31 30 451 453 1648 1699 1697 1647
Rank 0: 12 32 31 453 455 1649 1701 1699 1648
Rank 0: 13 33 32 455 457 1650 1703 1701 1649
Rank 0: 14 34 33 457 459 1651 1705 1703 1650
Rank 0: 15 35 34 459 461 1652 1707 1705 1651
Rank 0: 16 36 35 461 463 1653 1709 1707 1652
Rank 0: 17 37 36 463 465 1654 1711 1709 1653
Rank 0: 18 39 38 467 469 1656 1715 1713 1655
Rank 0: 19 40 39 469 471 1657 1717 1715 1656
Rank 0: 20 41 40 471 473 1658 1719 1717 1657
Rank 0: 21 42 41 473 475 1659 1721 1719 1658
Rank 0: 22 43 42 475 477 1660 1723 1721 1659
Rank 0: 23 44 43 477 479 1661 1725 1723 1660
Rank 0: 24 4 44 479 65 4 65 1725 1661
Rank 0: 25 442 440 67 68 1688 1663 1662 1686
Rank 0: 27 444 442 68 69 1690 1664 1663 1688
Rank 0: 28 446 444 69 70 1692 1665 1664 1690
Rank 0: 29 448 446 70 71 1694 1666 1665 1692
Rank 0: 30 450 448 71 72 1696 1667 1666 1694
Rank 0: 31 452 450 72 73 1698 1668 1667 1696
Rank 0: 32 454 452 73 74 1700 1669 1668 1698
Rank 0: 33 456 454 74 75 1702 1670 1669 1700
Rank 0: 34 458 456 75 76 1704 1671 1670 1702
Rank 0: 35 460 458 76 77 1706 1672 1671 1704
Rank 0: 36 462 460 77 78 1708 1673 1672 1706
Rank 0: 37 464 462 78 79 1710 1674 1673 1708
Rank 0: 38 466 464 79 80 1712 1675 1674 1710
Rank 0: 39 81 465 466 82 1676 1677 1712 1711
Rank 0: 40 467 81 82 468 1713 1714 1677 1676
Rank 0: 41 470 468 83 84 1716 1679 1678 1714
Rank 0: 42 472 470 84 85 1718 1680 1679 1716
Rank 0: 43 474 472 85 86 1720 1681 1680 1718
Rank 0: 44 476 474 86 87 1722 1682 1681 1720
Rank 0: 45 478 476 87 88 1724 1683 1682 1722
Rank 0: 46 480 478 88 89 1726 1684 1683 1724
Rank 0: 47 66 480 89 6 66 6 1684 1726
Rank 0: 48 441 439 440 442 1687 1688 1686 1685
Rank 0: 50 443 441 442 444 1689 1690 1688 1687
Rank 0: 51 445 443 444 446 1691 1692 1690 1689
Rank 0: 52 447 445 446 448 1693 1694 1692 1691
Rank 0: 53 449 447 448 450 1695 1696 1694 1693
Rank 0: 54 451 449 450 452 1697 1698 1696 1695
Rank 0: 55 453 451 452 454 1699 1700 1698 1697
Rank 0: 56 455 453 454 456 1701 1702 1700 1699
Rank 0: 57 457 455 456 458 1703 1704 1702 1701
Rank 0: 58 459 457 458 460 1705 1706 1704 1703
Rank 0: 59 461 459 460 462 1707 1708 1706 1705
Rank 0: 60 463 461 462 464 1709 1710 1708 1707
Rank 0: 61 465 463 464 466 1711 1712 1710 1709
Rank 0: 62 469 467 468 470 1715 1716 1714 1713
Rank 0: 63 471 469 470 472 1717 1718 1716 1715
Rank 0: 64 473 471 472 474 1719 1720 1718 1717
Rank 0: 65 475 473 474 476 1721 1722 1720 1719
Rank 0: 66 477 475 476 478 1723 1724 1722 1721
Rank 0: 67 479 477 478 480 1725 1726 1724 1723
Rank 0: 68 65 479 480 66 65 66 1726 1725
and
Rank 1: 0 4 37 405 67 1654 1679 1702 1662
Rank 1: 1 38 4 67 407 1663 1704 1679 1654
Rank 1: 2 68 406 66 7 1680 1655 1678 1703
Rank 1: 3 408 68 7 69 1705 1681 1655 1680
Rank 1: 4 31 3 58 393 1656 1690 58 3
Rank 1: 5 32 31 393 395 1657 1692 1690 1656
Rank 1: 6 33 32 395 397 1658 1694 1692 1657
Rank 1: 7 34 33 397 399 1659 1696 1694 1658
Rank 1: 8 35 34 399 401 1660 1698 1696 1659
Rank 1: 9 36 35 401 403 1661 1700 1698 1660
Rank 1: 10 37 36 403 405 1662 1702 1700 1661
Rank 1: 11 39 38 407 409 1664 1706 1704 1663
Rank 1: 12 40 39 409 411 1665 1708 1706 1664
Rank 1: 13 41 40 411 413 1666 1710 1708 1665
Rank 1: 14 42 41 413 415 1667 1712 1710 1666
Rank 1: 15 43 42 415 417 1668 1714 1712 1667
Rank 1: 16 44 43 417 419 1669 1716 1714 1668
Rank 1: 17 45 44 419 421 1670 1718 1716 1669
Rank 1: 18 46 45 421 423 1671 1720 1718 1670
Rank 1: 19 1320 46 423 1358 1722 1728 423 46 <----
Rank 1: 20 394 59 6 60 1691 1672 6 59
Rank 1: 21 396 394 60 61 1693 1673 1672 1691
Rank 1: 22 398 396 61 62 1695 1674 1673 1693
Rank 1: 23 400 398 62 63 1697 1675 1674 1695
Rank 1: 24 402 400 63 64 1699 1676 1675 1697
Rank 1: 25 404 402 64 65 1701 1677 1676 1699
Rank 1: 26 406 404 65 66 1703 1678 1677 1701
Rank 1: 27 67 405 406 68 1679 1680 1703 1702
Rank 1: 28 407 67 68 408 1704 1705 1680 1679
Rank 1: 29 410 408 69 70 1707 1682 1681 1705
Rank 1: 30 412 410 70 71 1709 1683 1682 1707
Rank 1: 31 414 412 71 72 1711 1684 1683 1709
Rank 1: 32 416 414 72 73 1713 1685 1684 1711
Rank 1: 33 418 416 73 74 1715 1686 1685 1713
Rank 1: 34 420 418 74 75 1717 1687 1686 1715
Rank 1: 35 422 420 75 76 1719 1688 1687 1717
Rank 1: 36 424 422 76 77 1721 1689 1688 1719
Rank 1: 37 1359 424 77 1321 1729 1725 77 424 <----
Rank 1: 38 393 58 59 394 1690 1691 59 58
Rank 1: 39 395 393 394 396 1692 1693 1691 1690
Rank 1: 40 397 395 396 398 1694 1695 1693 1692
Rank 1: 41 399 397 398 400 1696 1697 1695 1694
Rank 1: 42 401 399 400 402 1698 1699 1697 1696
Rank 1: 43 403 401 402 404 1700 1701 1699 1698
Rank 1: 44 405 403 404 406 1702 1703 1701 1700
Rank 1: 45 409 407 408 410 1706 1707 1705 1704
Rank 1: 46 411 409 410 412 1708 1709 1707 1706
Rank 1: 47 413 411 412 414 1710 1711 1709 1708
Rank 1: 48 415 413 414 416 1712 1713 1711 1710
Rank 1: 49 417 415 416 418 1714 1715 1713 1712
Rank 1: 50 419 417 418 420 1716 1717 1715 1714
Rank 1: 51 421 419 420 422 1718 1719 1717 1716
Rank 1: 52 423 421 422 424 1720 1721 1719 1718
Rank 1: 53 1358 423 424 1359 1728 1729 424 423 <----
where the first number is the element index, while the following 8 are the nodes of the top/bottom faces. You can see that elements 24, 47 and 68 on rank 0, and 4, 20 and 38 on rank 1, have the same nodes on both top and bottom faces, and that's right!
The problem is that elements 19, 37 and 53 on rank 1 (marked with `<----`) also do not have split nodes!
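The manual inspection above can be automated: an element whose top face reuses any node of its bottom face has not been split there. A small sketch of such a diagnostic, with illustrative types (the real GEOSX element-to-node maps are laid out differently):

```cpp
#include <array>
#include <vector>

// One row of the dump above: an element index plus the four nodes of its
// bottom face and the four nodes of its top face.
struct ElementFaces
{
  int index;
  std::array< long, 4 > bottom;
  std::array< long, 4 > top;
};

// Flag elements where the fracture is still closed, i.e. the top face shares
// at least one node index with the bottom face.
std::vector< int > findUnsplitElements( std::vector< ElementFaces > const & elems )
{
  std::vector< int > unsplit;
  for( auto const & e : elems )
  {
    bool shared = false;
    for( long const b : e.bottom )
    {
      for( long const t : e.top )
      {
        if( b == t )
        {
          shared = true;
        }
      }
    }
    if( shared )
    {
      unsplit.push_back( e.index );
    }
  }
  return unsplit;
}
```

Run over the rank 1 table above, this flags exactly elements 19, 37 and 53 (plus the fracture-edge elements that are legitimately closed), matching the `<----` markers.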
This is solved by #864. Nevertheless ... there's another problem (with the same settings):
***** ERROR
***** LOCATION: /home/franc90/code/geosx/GEOSX/src/coreComponents/managers/ObjectManagerBase.hpp:492
***** Controlling expression (should be false): !allValuesMapped
***** Rank 2: some values of unmappedIndices were not used
Received signal 1: Hangup
** StackTrace of 12 frames **
Frame 1: cxx_utilities::handler(int, int, int)
Frame 2: void geosx::ObjectManagerBase::FixUpDownMaps<geosx::InterObjectRelation<LvArray::ArrayOfArrays<long, long> > >(geosx::InterObjectRelation<LvArray::ArrayOfArrays<long, long> >&, geosx::mapBase<long, LvArray::Array<long long, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer>, std::integral_constant<bool, true> >&, bool)
Frame 3: geosx::FaceManager::FixUpDownMaps(bool)
Frame 4: geosx::ParallelTopologyChange::SynchronizeTopologyChange(geosx::MeshLevel*, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, geosx::ModifiedObjectLists&, geosx::ModifiedObjectLists&, int)
Frame 5: geosx::SurfaceGenerator::SeparationDriver(geosx::DomainPartition*, geosx::MeshLevel*, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, int, int, bool, double)
Frame 6: geosx::SurfaceGenerator::SolverStep(double const&, double const&, int, geosx::DomainPartition*)
Frame 7: geosx::SurfaceGenerator::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 8: geosx::EventBase::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 9: geosx::EventManager::Run(geosx::dataRepository::Group*)
Frame 10: main
Frame 11: __libc_start_main
Frame 12: _start
====
@af1990 What is the status of this issue?
For unstructured grids, the parallel surface generator still has problems, such as inconsistent node splitting across ranks.
@rrsettgast @joshua-white - I think that I'm running into this issue as well. I have a number of external meshes that conform to one or more fault surfaces. I'm getting the same message as @andrea-franceschini when I try to split the mesh in parallel runs. This is one of the simple meshes that I'm testing, which has a 45 degree fault cutting through it:
Any thoughts on how to address this? I've attached the example xml file and mesh here (test.zip). I get the error with the arguments `-x 2 -y 2 -z 2`.
@rrsettgast - I've created a 2x2x2 mesh with a vertical fault that shows the same behavior (see attached). The mesh appears to be split correctly if there is a partition aligned with the surface (`-x 2`, `-y 2`, or `-z 2`). However, if a partition corner is on the surface (`-x 2 -y 2`, `-x 2 -z 2`, or `-y 2 -z 2`), then we get the error:
***** ERROR
***** LOCATION: /usr/WS2/sherman/GEOSX/src/coreComponents/mesh/ObjectManagerBase.hpp:1024
***** Controlling expression (should be false): !allValuesMapped
***** Rank 2: some values of unmappedIndices were not used
** StackTrace of 10 frames **
Frame 0: geosx::EdgeManager::fixUpDownMaps(bool)
Frame 1: geosx::ParallelTopologyChange::synchronizeTopologyChange(geosx::MeshLevel*, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, geosx::ModifiedObjectLists&, geosx::ModifiedObjectLists&, int)
Frame 2: geosx::SurfaceGenerator::separationDriver(geosx::DomainPartition&, geosx::MeshLevel&, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, int, int, bool, double)
Frame 3: geosx::SurfaceGenerator::solverStep(double const&, double const&, int, geosx::DomainPartition&)
Frame 4: geosx::SurfaceGenerator::execute(double, double, int, int, double, geosx::DomainPartition&)
Frame 5: geosx::EventBase::execute(double, double, int, int, double, geosx::DomainPartition&)
Frame 6: geosx::EventManager::run(geosx::DomainPartition&)
Frame 7: geosx::GeosxState::run()
Frame 8: main
Frame 9: __libc_start_main
Frame 10: /g/g17/sherman/GEOS/geosx/GEOSX/[email protected]/bin/geosx
=====
Note: the problem doesn't run with 8 partitions (`-x 2 -y 2 -z 2`) due to an error that occurs in PAMELA for such a small mesh.
@joshua-white @andrea-franceschini @cssherman @CusiniM @herve-gross This old issue has never been fixed. Recently, I hit the same roadblock when running the single-fracture compression problem with the Lagrangian contact solver. In this case an external mesh is used, and it seems that the `SurfaceGenerator` is incompatible with an unstructured mesh (`PAMELAMeshGenerator`) when running with multiple ranks.
By plotting both the silo and vtk output of shear displacement and comparing with the analytical solution, the same issue is observed for the case run with 2 ranks, which confirms that it is not related to the output format. Moreover, the anomaly happens at the partition boundary, which suggests that the parallel surface generator does not work properly with an unstructured mesh in parallel.
@jhuang2601 - Agreed. I tried to look at this with the example I've included above, and am suspicious that it is an issue with Metis partitioning (I couldn't nail anything down though). I'm curious to see if the vtk mesh generator will be subject to the same issue...
Since this mesh is read through PAMELA, the requested partition layout is irrelevant, so the `-x 2 -y 2` arguments are ignored. Metis partitions the problem however it sees fit. In this case, the Metis partitions are:
So it is a wonky partition...but we should still be able to handle it. I suspect the ghosting is incorrect.
It's great that you got a tiny reproducer. It's always been clear that there were issues with the ghost cells, but with this mesh it's surely easier to debug 🤞 🎉 🌮 If you want, I can try to help you with the debugging.