bug: ucx hang at finalize with 72 processes on a single node

Open yfguo opened this issue 6 years ago • 2 comments

The following failure occurs consistently with the CH4-UCX build. The test is currently marked as xfail.

  ---
  Directory: ./datatype
  File: darray_pack
  Num-procs: 72
  Timeout: 360
  Date: "Wed Feb 27 17:02:48 2019"
  ...
## Test output (expected 'No Errors'):
## [bb68:25247:0:25563] ucp_proxy_ep.c:215  Assertion `proxy_ep->uct_ep != ((void *)0)' failed
## ==== backtrace ====
##     0  /home/autotest/software/ucx/lib/libucs.so.0(ucs_fatal_error+0x104) [0x2aeda392b364]
##     1  /home/autotest/software/ucx/lib/libucp.so.0(+0x15f0c) [0x2aeda36c7f0c]
##     2  /home/autotest/software/ucx/lib/libucp.so.0(+0x3e7ef) [0x2aeda36f07ef]
##     3  /home/autotest/software/ucx/lib/libucp.so.0(ucp_wireup_msg_progress+0x75) [0x2aeda36f3235]
##     4  /home/autotest/software/ucx/lib/libucp.so.0(+0x42035) [0x2aeda36f4035]
##     5  /home/autotest/software/ucx/lib/libucp.so.0(+0x423de) [0x2aeda36f43de]
##     6  /home/autotest/software/ucx/lib/libucp.so.0(+0x433e0) [0x2aeda36f53e0]
##     7  /home/autotest/software/ucx/lib/libuct.so.0(+0x3d1fd) [0x2aeda4aea1fd]
##     8  /home/autotest/software/ucx/lib/libuct.so.0(uct_ud_ep_process_rx+0x200) [0x2aeda4aead40]
##     9  /home/autotest/software/ucx/lib/libuct.so.0(+0x420af) [0x2aeda4aef0af]
##    10  /home/autotest/software/ucx/lib/libuct.so.0(+0x39511) [0x2aeda4ae6511]
##    11  /home/autotest/software/ucx/lib/libuct.so.0(+0x39597) [0x2aeda4ae6597]
##    12  /home/autotest/software/ucx/lib/libucs.so.0(+0xc8fb) [0x2aeda391e8fb]
##    13  /home/autotest/software/ucx/lib/libucs.so.0(ucs_async_dispatch_handlers+0x28) [0x2aeda391f768]
##    14  /home/autotest/software/ucx/lib/libucs.so.0(ucs_async_dispatch_timerq+0xc6) [0x2aeda391f8a6]
##    15  /home/autotest/software/ucx/lib/libucs.so.0(+0x10383) [0x2aeda3922383]
##    16  /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x2aeda405de9a]
##    17  /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x2aeda308b38d]
## ===================
##  No Errors
## 
## ===================================================================================
## =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
## =   PID 25247 RUNNING AT ib64-18
## =   EXIT CODE: 134
## =   CLEANING UP REMAINING PROCESSES
## =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
## ===================================================================================
## YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
## This typically refers to a problem with your application.
## Please see the FAQ page for debugging suggestions

yfguo avatar Feb 28 '19 01:02 yfguo

This test still fails with UCX, but the error symptom has changed. It is now a sporadic TIMEOUT while waiting for the requests returned by ucp_disconnect_nb.

The failure has nothing to do with the darray_pack part; it depends only on the number of processes, 72 in this case. I could reproduce the issue with an empty MPI_Init/MPI_Finalize program.

There are 72 disconnect requests in each process. Most of them complete successfully, but often one process is left hanging, usually stuck at 1 / 72.

With 64 processes, I ran 100 rounds without issues.
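Since the hang is sporadic, counting outcomes over many rounds is the practical way to compare 72 against 64 processes. Below is a minimal sketch of such a driver, not the harness actually used here. The `run_rounds` helper is hypothetical, and the `mpiexec` launcher and `./darray_pack` binary names in the trailing comment are assumptions. The 360-second timeout is taken from the test metadata above.

```python
import subprocess
import sys


def run_rounds(cmd, rounds, timeout_s):
    """Run `cmd` `rounds` times and return (ok, failed, timed_out) counts.

    A round counts as timed out when the command is still running after
    `timeout_s` seconds, which is how a hang at finalize would show up.
    """
    ok = failed = timed_out = 0
    for i in range(rounds):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the launcher process before re-raising.
            timed_out += 1
            print(f"round {i}: timed out after {timeout_s}s", file=sys.stderr)
            continue
        if result.returncode == 0:
            ok += 1
        else:
            failed += 1
            print(f"round {i}: exit code {result.returncode}", file=sys.stderr)
    return ok, failed, timed_out


# Example call (launcher and binary names are assumptions):
# print(run_rounds(["mpiexec", "-n", "72", "./darray_pack"], rounds=100, timeout_s=360))
```

Note that killing the launcher on timeout may leave stray ranks behind, so it is worth checking for leftover processes between runs.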

hzhou avatar Jun 05 '21 05:06 hzhou

Occasionally I get assertion errors:

[41] [pmrs-gpu-240-02:17509:0:17509]   callbackq.c:539  Assertion `idx < priv->num_fast_elems' failed
[0]  No Errors
[41]
[41] /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/datastruct/callbackq.c: [ ucs_callbackq_remove_safe() ]
[41]       ...
[41]       536     } else {
[41]       537         UCS_STATIC_ASSERT(UCS_CALLBACKQ_FAST_MAX <= 64);
[41]       538         ucs_assert(idx < priv->num_fast_elems);
[41] ==>   539         priv->fast_remove_mask |= UCS_BIT(idx);
[41]       540         cbq->fast_elems[idx].id = UCS_CALLBACKQ_ID_NULL; /* for assertion */
[41]       541         ucs_callbackq_enable_proxy(cbq);
[41]       542     }
[41]
[41] ==== backtrace (tid:  17509) ====
[41]  0 0x00000000000590f5 ucs_debug_print_backtrace()  /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/debug/debug.c:656
[41]  1 0x00000000000508e6 ucs_callbackq_remove_safe()  /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/datastruct/callbackq.c:539
[41]  2 0x00000000000508e6 ucs_callbackq_remove_safe()  /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/datastruct/callbackq.c:542
[41]  3 0x0000000000018c8e uct_ib_device_async_event_unregister()  /home/zhouh/pmrs/hzhou/modules/ucx/src/uct/ib/base/ib_device.c:316
[41]  4 0x000000000002858f uct_rc_ep_cleanup_qp_done()  /home/zhouh/pmrs/hzhou/modules/ucx/src/uct/ib/rc/base/rc_ep.c:126
[41]  5 0x000000000002858f uct_rc_ep_cleanup_qp_done()  /home/zhouh/pmrs/hzhou/modules/ucx/src/uct/ib/rc/base/rc_ep.c:129
[41]  6 0x0000000000017ff0 uct_ib_device_async_event_proxy()  /home/zhouh/pmrs/hzhou/modules/ucx/src/uct/ib/base/ib_device.c:203
[41]  7 0x000000000004ef4b ucs_callbackq_slow_proxy()  /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/datastruct/callbackq.c:400
[41]  8 0x0000000000036eba ucs_callbackq_dispatch()  /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/datastruct/callbackq.h:211
[41]  9 0x0000000000036eba uct_worker_progress()  /home/zhouh/pmrs/hzhou/modules/ucx/src/uct/api/uct.h:2435
[41] 10 0x0000000000036eba ucp_worker_progress()  /home/zhouh/pmrs/hzhou/modules/ucx/src/ucp/core/ucp_worker.c:2405
[41] 11 0x000000000047c881 MPIDI_UCX_mpi_finalize_hook()  /home/zhouh/temp/mpich-5332/src/mpid/ch4/netmod/ucx/ucx_init.c:399
[41] 12 0x00000000004829f8 MPID_Finalize()  /home/zhouh/temp/mpich-5332/src/mpid/ch4/src/ch4_init.c:566
[41] 13 0x0000000000448abc MPII_Finalize()  /home/zhouh/temp/mpich-5332/src/mpi/init/mpir_init.c:341
[41] 14 0x00000000002c8021 internal_Finalize()  /home/zhouh/temp/mpich-5332/src/binding/c/init/finalize.c:38
[41] 15 0x00000000002c8021 PMPI_Finalize()  /home/zhouh/temp/mpich-5332/src/binding/c/init/finalize.c:105
[41] 16 0x00000000004029d2 MTest_Finalize()  /home/zhouh/temp/mpich-5332/test/mpi/datatype/../util/mtest.c:208
[41] 17 0x0000000000401b85 main()  /home/zhouh/temp/mpich-5332/test/mpi/datatype/darray_pack.c:55
[41] 18 0x0000000000022555 __libc_start_main()  ???:0
[41] 19 0x0000000000401bea _start()  ???:0
[41] =================================

hzhou avatar Jun 05 '21 14:06 hzhou