mpich
ch4/hcoll: fix call hcoll_do_progress
Pull Request Description
Previously we added a vci parameter to progress hooks. We neglected to update one of the two calls to hcoll_do_progress.
[skip warnings]
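The kind of call-site update described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not MPICH's code: every `sketch_` name, the VCI count, and the hook signature are hypothetical stand-ins. It only shows a hook that takes the VCI as an explicit first argument and a caller that passes it.

```c
#include <stddef.h>

/* Hypothetical number of virtual communication interfaces (VCIs). */
#define SKETCH_NUM_VCIS 4

/* Hypothetical per-VCI poll counters, so each call has a visible effect. */
static int sketch_polls[SKETCH_NUM_VCIS];

/* Progress hook in the newer style: the caller names the VCI to poll.
 * Illustrative stand-in only, not the real hcoll_do_progress. */
static int sketch_do_progress(int vci, int *made_progress)
{
    if (vci < 0 || vci >= SKETCH_NUM_VCIS || made_progress == NULL)
        return -1;              /* reject an invalid VCI or output pointer */
    sketch_polls[vci]++;        /* record that this VCI was polled */
    *made_progress = 1;         /* report progress to the caller */
    return 0;
}

/* Call site updated for the new signature: the VCI is passed explicitly
 * (VCI 0 is an assumption made for this sketch). */
static int sketch_rte_progress(void)
{
    int made_progress = 0;
    int ret = sketch_do_progress(0, &made_progress);
    return (ret == 0) ? made_progress : 0;
}
```

Each call to `sketch_rte_progress()` polls VCI 0 once and returns 1 when the hook reports progress.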
Author Checklist
- [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [x] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
- [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
- [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
test:mpich/custom netmod: ch4:ucx config: hcoll
Hangs during init:
Thread 1 "cpi" received signal SIGINT, Interrupt.
0x00007ffff5d13e93 in progress () at src/mpid/common/hcoll/hcoll_rte.c:61
61 }
#0 0x00007ffff5d13e93 in progress () at src/mpid/common/hcoll/hcoll_rte.c:61
#1 0x00007ffff5658161 in wait_completion ()
from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#2 0x00007ffff55c713b in comm_allreduce_hcolrte_generic ()
from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#3 0x00007ffff55c7a14 in comm_allreduce_hcolrte ()
from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#4 0x00007ffff565bc7a in hcoll_get_context_from_cache ()
from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#5 0x00007ffff56582f5 in hcoll_create_context ()
from /nfs/gce/projects/pmrs/opt/hpcx-v2.17.1-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/hcoll/lib/libhcoll.so.1
#6 0x00007ffff5d0f166 in hcoll_comm_create (
comm_ptr=comm_ptr@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>,
param=param@entry=0x0) at src/mpid/common/hcoll/hcoll_init.c:158
#7 0x00007ffff5cc5a09 in MPIDI_UCX_mpi_comm_commit_pre_hook (
comm=comm@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>)
at src/mpid/ch4/netmod/ucx/ucx_comm.c:19
#8 0x00007ffff5ccb6ea in MPID_Comm_commit_pre_hook (
comm=comm@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>)
at src/mpid/ch4/src/ch4_comm.c:197
#9 0x00007ffff5c4f63d in MPIR_Comm_commit_internal (
comm=comm@entry=0x7ffff7f5ec40 <MPIR_Comm_builtin>)
at src/mpi/comm/commutil.c:584
#10 0x00007ffff5c55ea8 in MPIR_Comm_commit (
comm=0x7ffff7f5ec40 <MPIR_Comm_builtin>) at src/mpi/comm/commutil.c:799
#11 0x00007ffff5c45cf5 in MPIR_init_comm_world ()
at src/mpi/comm/builtin_comms.c:33
#12 0x00007ffff5c84ca5 in MPII_Init_thread (argc=argc@entry=0x7fffffffda3c,
argv=argv@entry=0x7fffffffda30, user_required=<optimized out>,
provided=provided@entry=0x7fffffffd9cc,
p_session_ptr=p_session_ptr@entry=0x0) at src/mpi/init/mpir_init.c:267
#13 0x00007ffff5c8538a in MPIR_Init_impl (argc=argc@entry=0x7fffffffda3c,
argv=argv@entry=0x7fffffffda30) at src/mpi/init/mpir_init.c:136
#14 0x00007ffff5b0411c in internal_Init (argv=0x7fffffffda30,
argc=0x7fffffffda3c) at src/binding/c/c_binding.c:49972
#15 PMPI_Init (argc=0x7fffffffda3c, argv=0x7fffffffda30)
at src/binding/c/c_binding.c:50023
#16 0x0000555555555347 in main ()
test:mpich/custom netmod: ch4:ucx config: hcoll