ch4/hcoll: fix call hcoll_do_progress

Open hzhou opened this issue 1 year ago • 1 comments

Pull Request Description

Previously we added a vci parameter to progress hooks, but we neglected to update one of the two calls to hcoll_do_progress.
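
As a rough illustration of the kind of mismatch described above, here is a minimal, self-contained sketch. Every name in it (demo_do_progress, demo_progress_loop, demo_rte_progress, DEMO_NUM_VCIS) is a hypothetical stand-in and not the actual MPICH or hcoll signature; it only shows a hook that takes a vci argument and two call sites that both have to pass it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type for this sketch; NOT the real MPICH signature. */
typedef int (*progress_hook_fn)(int vci, int *made_progress);

#define DEMO_NUM_VCIS 4         /* assumed VCI count, for illustration only */

static int demo_poll_count[DEMO_NUM_VCIS];

/* Hook body: records one poll on the given VCI and reports progress. */
int demo_do_progress(int vci, int *made_progress)
{
    assert(vci >= 0 && vci < DEMO_NUM_VCIS);
    demo_poll_count[vci]++;
    *made_progress = 1;
    return 0;
}

/* Call site 1: a generic progress loop that forwards its vci to the hook. */
int demo_progress_loop(progress_hook_fn hook, int vci)
{
    int made_progress = 0;
    return hook(vci, &made_progress);
}

/* Call site 2: a runtime callback with no vci of its own. Once the hook
 * gains a vci parameter, this direct call must supply one as well
 * (VCI 0 is an assumption made for this sketch). */
int demo_rte_progress(void)
{
    int made_progress = 0;
    return demo_do_progress(0, &made_progress);
}
```

The point of the sketch is only that a signature change has to reach every direct caller, including ones that bypass the generic hook loop.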

[skip warnings]

Author Checklist

  • [x] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form "module: short description". Commit message explains what's in the commit.
  • [ ] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

Jun 30 '24 20:06 hzhou

test:mpich/custom netmod: ch4:ucx config: hcoll

Hangs during init:

Thread 1 "cpi" received signal SIGINT, Interrupt.
0x00007ffff5d13e93 in progress () at src/mpid/common/hcoll/hcoll_rte.c:61
61  }
#0  0x00007ffff5d13e93 in progress () at src/mpid/common/hcoll/hcoll_rte.c:61
#1  0x00007ffff5658161 in wait_completion ()
   from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#2  0x00007ffff55c713b in comm_allreduce_hcolrte_generic ()
   from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#3  0x00007ffff55c7a14 in comm_allreduce_hcolrte ()
   from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#4  0x00007ffff565bc7a in hcoll_get_context_from_cache ()
   from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#5  0x00007ffff56582f5 in hcoll_create_context ()
   from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#6  0x00007ffff5d0f166 in hcoll_comm_create (
    comm_ptr=comm_ptr@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>,
    param=param@entry=0x0) at src/mpid/common/hcoll/hcoll_init.c:158
#7  0x00007ffff5cc5a09 in MPIDI_UCX_mpi_comm_commit_pre_hook (
    comm=comm@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>)
    at src/mpid/ch4/netmod/ucx/ucx_comm.c:19
#8  0x00007ffff5ccb6ea in MPID_Comm_commit_pre_hook (
    comm=comm@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>)
    at src/mpid/ch4/src/ch4_comm.c:197
#9  0x00007ffff5c4f63d in MPIR_Comm_commit_internal (
    comm=comm@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>)
    at src/mpi/comm/commutil.c:584
#10 0x00007ffff5c55ea8 in MPIR_Comm_commit (
    comm=0x7ffff7f5ec40 <MPIR_Comm_builtin>) at src/mpi/comm/commutil.c:799
#11 0x00007ffff5c45cf5 in MPIR_init_comm_world ()
    at src/mpi/comm/builtin_comms.c:33
#12 0x00007ffff5c84ca5 in MPII_Init_thread (argc=argc@entry=0x7fffffffda3c,
    argv=argv@entry=0x7fffffffda30, user_required=<optimized out>,
    provided=provided@entry=0x7fffffffd9cc,
    p_session_ptr=p_session_ptr@entry=0x0) at src/mpi/init/mpir_init.c:267
#13 0x00007ffff5c8538a in MPIR_Init_impl (argc=argc@entry=0x7fffffffda3c,
    argv=argv@entry=0x7fffffffda30) at src/mpi/init/mpir_init.c:136
#14 0x00007ffff5b0411c in internal_Init (argv=0x7fffffffda30,
    argc=0x7fffffffda3c) at src/binding/c/c_binding.c:49972
#15 PMPI_Init (argc=0x7fffffffda3c, argv=0x7fffffffda30)
    at src/binding/c/c_binding.c:50023
#16 0x0000555555555347 in main ()

Aug 09 '24 18:08 hzhou

test:mpich/custom netmod: ch4:ucx config: hcoll

Jun 23 '25 15:06 hzhou