bug: ucx hang at finalize with 72 processes on a single node
The following failure occurs consistently with the CH4-UCX build. The test is currently marked as xfail.
---
Directory: ./datatype
File: darray_pack
Num-procs: 72
Timeout: 360
Date: "Wed Feb 27 17:02:48 2019"
...
## Test output (expected 'No Errors'):
## [bb68:25247:0:25563] ucp_proxy_ep.c:215 Assertion `proxy_ep->uct_ep != ((void *)0)' failed
## ==== backtrace ====
## 0 /home/autotest/software/ucx/lib/libucs.so.0(ucs_fatal_error+0x104) [0x2aeda392b364]
## 1 /home/autotest/software/ucx/lib/libucp.so.0(+0x15f0c) [0x2aeda36c7f0c]
## 2 /home/autotest/software/ucx/lib/libucp.so.0(+0x3e7ef) [0x2aeda36f07ef]
## 3 /home/autotest/software/ucx/lib/libucp.so.0(ucp_wireup_msg_progress+0x75) [0x2aeda36f3235]
## 4 /home/autotest/software/ucx/lib/libucp.so.0(+0x42035) [0x2aeda36f4035]
## 5 /home/autotest/software/ucx/lib/libucp.so.0(+0x423de) [0x2aeda36f43de]
## 6 /home/autotest/software/ucx/lib/libucp.so.0(+0x433e0) [0x2aeda36f53e0]
## 7 /home/autotest/software/ucx/lib/libuct.so.0(+0x3d1fd) [0x2aeda4aea1fd]
## 8 /home/autotest/software/ucx/lib/libuct.so.0(uct_ud_ep_process_rx+0x200) [0x2aeda4aead40]
## 9 /home/autotest/software/ucx/lib/libuct.so.0(+0x420af) [0x2aeda4aef0af]
## 10 /home/autotest/software/ucx/lib/libuct.so.0(+0x39511) [0x2aeda4ae6511]
## 11 /home/autotest/software/ucx/lib/libuct.so.0(+0x39597) [0x2aeda4ae6597]
## 12 /home/autotest/software/ucx/lib/libucs.so.0(+0xc8fb) [0x2aeda391e8fb]
## 13 /home/autotest/software/ucx/lib/libucs.so.0(ucs_async_dispatch_handlers+0x28) [0x2aeda391f768]
## 14 /home/autotest/software/ucx/lib/libucs.so.0(ucs_async_dispatch_timerq+0xc6) [0x2aeda391f8a6]
## 15 /home/autotest/software/ucx/lib/libucs.so.0(+0x10383) [0x2aeda3922383]
## 16 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x2aeda405de9a]
## 17 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x2aeda308b38d]
## ===================
## No Errors
##
## ===================================================================================
## = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
## = PID 25247 RUNNING AT ib64-18
## = EXIT CODE: 134
## = CLEANING UP REMAINING PROCESSES
## = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
## ===================================================================================
## YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
## This typically refers to a problem with your application.
## Please see the FAQ page for debugging suggestions
This test still fails with UCX, but the symptom has changed: it is now a sporadic timeout while waiting for the requests returned by ucp_disconnect_nb.
The failure has nothing to do with the darray_pack test itself; it depends only on the number of processes, 72 in this case. I could reproduce the issue with a program that calls only MPI_Init and MPI_Finalize.
Each process has 72 disconnect requests. Most of them complete successfully, but often one process is left hanging, typically at 1 / 72.
With 64 processes, I ran 100 rounds without issues.
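To make a hang like this easier to diagnose, the wait on the disconnect requests could be bounded and made to report what is still outstanding. The following is only a minimal sketch of that idea, not MPICH's actual finalize code: `drain_with_limit`, `count_pending`, the two callback types, and the `toy_*` helpers are all hypothetical names. In a real UCX netmod the callbacks would presumably wrap `ucp_worker_progress` and a request status check, but that mapping is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callbacks: one drives the progress engine once, the other
 * reports whether request `idx` has completed (nonzero means done). */
typedef void (*progress_fn)(void *ctx);
typedef int (*is_done_fn)(void *ctx, int idx);

/* Count requests that are not yet done and record the first such index
 * in *first_pending (-1 if every request has completed). */
int count_pending(int n, is_done_fn is_done, void *ctx, int *first_pending)
{
    int pending = 0;
    *first_pending = -1;
    for (int i = 0; i < n; i++) {
        if (!is_done(ctx, i)) {
            if (pending == 0)
                *first_pending = i;
            pending++;
        }
    }
    return pending;
}

/* Call progress at most max_rounds times, stopping early once all n
 * requests are done. Returns the number of requests still pending. */
int drain_with_limit(int n, is_done_fn is_done, progress_fn progress,
                     void *ctx, long max_rounds, int *first_pending)
{
    assert(n >= 0 && is_done != NULL && progress != NULL && first_pending != NULL);
    int pending = count_pending(n, is_done, ctx, first_pending);
    for (long round = 0; pending > 0 && round < max_rounds; round++) {
        progress(ctx);
        pending = count_pending(n, is_done, ctx, first_pending);
    }
    return pending;
}

/* Toy model used only to exercise the loop: request i completes once
 * progress has run more than i times, except `stuck`, which never does. */
struct toy_ctx {
    long ticks;
    int stuck;
};

void toy_progress(void *ctx)
{
    ((struct toy_ctx *) ctx)->ticks++;
}

int toy_is_done(void *ctx, int idx)
{
    struct toy_ctx *t = ctx;
    return idx != t->stuck && t->ticks > idx;
}
```

A caller could log the returned count and `first_pending` before aborting, so that a run which would otherwise sit until the 360-second harness timeout instead reports immediately which peer's disconnect never completed.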
Occasionally I get assertion errors:
[41] [pmrs-gpu-240-02:17509:0:17509] callbackq.c:539 Assertion `idx < priv->num_fast_elems' failed
[0] No Errors
[41]
[41] /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/datastruct/callbackq.c: [ ucs_callbackq_remove_safe() ]
[41] ...
[41] 536 } else {
[41] 537 UCS_STATIC_ASSERT(UCS_CALLBACKQ_FAST_MAX <= 64);
[41] 538 ucs_assert(idx < priv->num_fast_elems);
[41] ==> 539 priv->fast_remove_mask |= UCS_BIT(idx);
[41] 540 cbq->fast_elems[idx].id = UCS_CALLBACKQ_ID_NULL; /* for assertion */
[41] 541 ucs_callbackq_enable_proxy(cbq);
[41] 542 }
[41]
[41] ==== backtrace (tid: 17509) ====
[41] 0 0x00000000000590f5 ucs_debug_print_backtrace() /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/debug/debug.c:656
[41] 1 0x00000000000508e6 ucs_callbackq_remove_safe() /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/datastruct/callbackq.c:539
[41] 2 0x00000000000508e6 ucs_callbackq_remove_safe() /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/datastruct/callbackq.c:542
[41] 3 0x0000000000018c8e uct_ib_device_async_event_unregister() /home/zhouh/pmrs/hzhou/modules/ucx/src/uct/ib/base/ib_device.c:316
[41] 4 0x000000000002858f uct_rc_ep_cleanup_qp_done() /home/zhouh/pmrs/hzhou/modules/ucx/src/uct/ib/rc/base/rc_ep.c:126
[41] 5 0x000000000002858f uct_rc_ep_cleanup_qp_done() /home/zhouh/pmrs/hzhou/modules/ucx/src/uct/ib/rc/base/rc_ep.c:129
[41] 6 0x0000000000017ff0 uct_ib_device_async_event_proxy() /home/zhouh/pmrs/hzhou/modules/ucx/src/uct/ib/base/ib_device.c:203
[41] 7 0x000000000004ef4b ucs_callbackq_slow_proxy() /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/datastruct/callbackq.c:400
[41] 8 0x0000000000036eba ucs_callbackq_dispatch() /home/zhouh/pmrs/hzhou/modules/ucx/src/ucs/datastruct/callbackq.h:211
[41] 9 0x0000000000036eba uct_worker_progress() /home/zhouh/pmrs/hzhou/modules/ucx/src/uct/api/uct.h:2435
[41] 10 0x0000000000036eba ucp_worker_progress() /home/zhouh/pmrs/hzhou/modules/ucx/src/ucp/core/ucp_worker.c:2405
[41] 11 0x000000000047c881 MPIDI_UCX_mpi_finalize_hook() /home/zhouh/temp/mpich-5332/src/mpid/ch4/netmod/ucx/ucx_init.c:399
[41] 12 0x00000000004829f8 MPID_Finalize() /home/zhouh/temp/mpich-5332/src/mpid/ch4/src/ch4_init.c:566
[41] 13 0x0000000000448abc MPII_Finalize() /home/zhouh/temp/mpich-5332/src/mpi/init/mpir_init.c:341
[41] 14 0x00000000002c8021 internal_Finalize() /home/zhouh/temp/mpich-5332/src/binding/c/init/finalize.c:38
[41] 15 0x00000000002c8021 PMPI_Finalize() /home/zhouh/temp/mpich-5332/src/binding/c/init/finalize.c:105
[41] 16 0x00000000004029d2 MTest_Finalize() /home/zhouh/temp/mpich-5332/test/mpi/datatype/../util/mtest.c:208
[41] 17 0x0000000000401b85 main() /home/zhouh/temp/mpich-5332/test/mpi/datatype/darray_pack.c:55
[41] 18 0x0000000000022555 __libc_start_main() ???:0
[41] 19 0x0000000000401bea _start() ???:0
[41] =================================