framework icon indicating copy to clipboard operation
framework copied to clipboard

test hydro5_4proc_3sd_4proc stuck in MPI_Barrier

Open guignont opened this issue 2 years ago • 3 comments

Hello, The test hydro5_4proc_3sd_4proc does not finish and all the MPI (intelMPI) processes are stuck in MPI_Barrier from MPIParallelSuperMng destructor.

Backtrace: (gdb) bt #0 0x00007f19195c75d8 in MPIDI_SHMGR_release_generic (opcode=447039912, mpir_comm=0x7fff2b79f788, root=1, localbuf=0x1b2, count=515801232, datatype=33046036, errflag=0x7fff2b7b2b70, knomial_factor=4, algo_type=MPIDI_SHMGR_ALGO_KNOMIAL) at ../../src/mpid/ch4/src/intel/ch4_shm_coll_templates.h:231 #1 0x00007f19195bca27 in MPIDI_SHMGR_Release (comm=0x7f191aa549a8 <PVAR_TIMER_idle+8>, errflag=0x7fff2b79f788, algo_type=MPIDI_SHMGR_ALGO_KNOMIAL, radix=434) at ../../src/mpid/ch4/src/intel/ch4_shm_coll.c:2609 #2 0x00007f1919533444 in MPIDI_Barrier_intra_composition_zeta (comm_ptr=, errflag=, ch4_algo_parameters_container=) at ../../src/mpid/ch4/src/intel/ch4_coll_impl.h:319 #3 MPID_Barrier_invoke (comm=0x7f191aa549a8 <PVAR_TIMER_idle+8>, errflag=0x7fff2b79f788, ch4_algo_parameters_container=0x1) at ../../src/mpid/ch4/src/intel/autoreg_ch4_coll.h:56 #4 0x00007f1919509621 in MPIDI_coll_invoke (coll_sig=0x7f191aa549a8 <PVAR_TIMER_idle+8>, container=0x7fff2b79f788, req=0x1) at ../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3138 #5 0x00007f19194ed6aa in MPIDI_coll_select (coll_sig=0x7f191aa549a8 <PVAR_TIMER_idle+8>, req=0x7fff2b79f788) at ../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:130 #6 0x00007f19195d6d45 in MPID_Barrier (comm=, errflag=) at ../../src/mpid/ch4/src/intel/ch4_coll.h:31 #7 MPIR_Barrier (comm_ptr=0x7f191aa549a8 <PVAR_TIMER_idle+8>, errflag=0x7fff2b79f788) at ../../src/mpi/coll/intel/coll_impl.c:349 #8 0x00007f19194cd7ed in PMPI_Barrier (comm=447039912) at ../../src/mpi/coll/barrier/barrier.c:266 #9 0x00007f191df29b88 in Arcane::MpiParallelSuperMng::~MpiParallelSuperMng (this=0xbb7ac0, __in_chrg=) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/impl/Communicator.h:139 #10 0x00007f191df29bd9 in Arcane::MpiParallelSuperMng::~MpiParallelSuperMng (this=0xbb7ac0, __in_chrg=) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/impl/Communicator.h:123 #11 0x00007f191affd84a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0xc0dd50) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/impl/Ref.h:161 #12 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0xc0dd50) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/impl/Ref.h:161 #13 0x00007f191afefa2e in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=, __in_chrg=) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/impl/Ref.h:1154 #14 std::__shared_ptr<Arcane::IParallelSuperMng, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=, __in_chrg=) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/impl/Ref.h:1154 #15 std::__shared_ptr<Arcane::IParallelSuperMng, (__gnu_cxx::_Lock_policy)2>::reset (this=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/impl/Ref.h:1272 #16 Arccore::Ref<Arcane::IParallelSuperMng, 0>::reset (this=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/impl/CheckedPointer.h:250 #17 Arcane::Application::~Application (this=0xb9efd0, __in_chrg=) at /work/guignont/work/ArcaneV3/GIT/framework/arccore/src/concurrency/arccore/concurrency/ArcaneMain.h:189 #18 0x00007f191afeff09 in Arcane::Application::~Application (this=0xb9efd0, __in_chrg=) at /work/guignont/work/ArcaneV3/GIT/framework/arccore/src/concurrency/arccore/concurrency/ArcaneMain.h:159 #19 0x00007f191b027597 in Arcane::ArcaneMain::~ArcaneMain (this=0xb7e6f0, __in_chrg=) at /work/guignont/work/ArcaneV3/GIT/framework/arccore/src/collections/arccore/collections/ArcaneMain.h:1204 #20 0x00007f191b0613dd in Arcane::ArcaneMainBatch::~ArcaneMainBatch (this=0xb7e6f0, __in_chrg=) at /soft/irsrvsoft1/expl/eb/r11/el_8-x86_64/easybuild/software/GCCcore/11.2.0/include/c++/11.2.0/bits/IFunctor.h:558 #21 Arcane::ArcaneMainBatch::~ArcaneMainBatch (this=0xb7e6f0, __in_chrg=) at /work/guignont/work/ArcaneV3/GIT/framework/arccore/src/collections/arccore/collections/ParallelReplication.h:271 #22 0x00007f191b02836a in Arcane::ArcaneMainExecInfo::finalize (this=this@entry=0x7fff2b7b2cb0) at /work/guignont/work/ArcaneV3/GIT/framework/arccore/src/collections/arccore/collections/ArcaneMain.h:483 #23 0x00007f191b02a446 in Arcane::ArcaneMain::_arcaneMain (app_info=..., factory=) at /work/guignont/work/ArcaneV3/GIT/framework/arccore/src/collections/arccore/collections/ArcaneMain.h:518 #24 0x00007f191b02a52a in Arcane::ArcaneMain::arcaneMain (app_info=..., factory=0xb7fe90) at /work/guignont/work/ArcaneV3/GIT/framework/arccore/src/collections/arccore/collections/ArcaneMain.h:899 #25 0x00007f191b02a5d9 in Arcane::ArcaneMain::run () at /work/guignont/work/ArcaneV3/GIT/framework/arccore/src/collections/arccore/collections/ArcaneMain.h:976 #26 0x00007f191eabf1d5 in Arcane::ArcaneLauncher::run () at /soft/irsrvsoft1/expl/eb/r11/el_8-x86_64/easybuild/software/GCCcore/11.2.0/include/c++/11.2.0/bits/Array.h:114 #27 0x0000000000401e3a in _mainHelper (argc=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, argv=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /work/guignont/work/ArcaneV3/GIT/framework/cmake-build-release/_common/build_all/arcane/src/arcane/tests/std_function.h:86 #28 0x0000000000401ec9 in operator() (__closure=) at /work/guignont/work/ArcaneV3/GIT/framework/cmake-build-release/_common/build_all/arcane/src/arcane/tests/std_function.h:97 #29 std::__invoke_impl<void, main(int, char**)::<lambda()>&> (__f=...) at /work/guignont/work/ArcaneV3/GIT/framework/cmake-build-release/_common/build_all/arcane/src/arcane/tests/char_traits.h:61 #30 std::__invoke_r<void, main(int, char**)::<lambda()>&> (__fn=...) at /work/guignont/work/ArcaneV3/GIT/framework/cmake-build-release/_common/build_all/arcane/src/arcane/tests/char_traits.h:111 #31 std::_Function_handler<void(), main(int, char**)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/tests/ArcaneTestMain.cc:291 --Type <RET> for more, q to quit, c to continue without paging-- #32 0x00007f19188178bf in std::function<void ()>::operator()() const (this=this@entry=0x7fff2b7b2e50) at /soft/irsrvsoft1/expl/eb/r11/el_8-x86_64/easybuild/software/GCCcore/11.2.0/include/c++/11.2.0/bits/Exception.h:556 #33 Arcane::arcaneCallFunctionAndCatchException(std::function<void ()>) (function=...) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/utils/Exception.cc:93 #34 0x0000000000401a63 in main (argc=, argv=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /work/guignont/work/ArcaneV3/GIT/framework/arcane/src/arcane/tests/ArcaneTestMain.cc:236

guignont avatar Aug 07 '23 13:08 guignont

Hello Thomas.

Yes this problem had been identified and this test is not run in the CI for IFPEN. But I think this problem specific to IntelMPI because this test runs fine with OpenMPI and MPICH.

I will try to reproduce your problem but I need to know the version of IntelMPI you are using.

grospelliergilles avatar Aug 08 '23 07:08 grospelliergilles

Hello, the Intel MPI version is 2021.4.0 (gimkl2021b toolchain). FYI the test works with version 2018.3.222 (gimkl2018b toolchain).

A+ Thomas

guignont avatar Aug 09 '23 08:08 guignont

Thanks. I will try to reproduce this problem

grospelliergilles avatar Aug 09 '23 11:08 grospelliergilles