HCOLL fails to save coll handlers and ends in Segfault when used with HAN
I'm trying to run coll/han with coll/hcoll as a backend, but I hit the following error on both main and 5.0.x on Hawk (ConnectX-6 fabric):
mpirun -N 4 -n 16 --mca coll_han_priority 100 --mca coll_adapt_priority 0 --mca coll_hcoll_enable 1 --mca coll_tuned_priority 10 --mca coll_hcoll_priority 80 ~/src/osu-benchmarks/osu-micro-benchmarks-5.6.2/build/mpi/collective/osu_reduce
# OSU MPI Reduce Latency Test v5.6.2
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
[r37c4t7n4:44848] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t7n4:44846] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t7n4:44845] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t7n4:44847] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n3:52753] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n4:64546] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n2:136568] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n3:52752] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n4:64544] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n3:52751] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n4:64547] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n3:52754] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n4:64545] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n2:136569] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n2:136567] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n2:136566] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
4 6.28 0.53 51.10 1000
8 6.13 0.54 48.90 1000
16 12.88 0.56 74.71 1000
32 12.92 0.58 74.65 1000
64 12.93 0.61 74.62 1000
128 13.18 0.65 75.69 1000
256 13.42 0.62 77.71 1000
512 14.28 0.67 83.06 1000
1024 14.93 0.81 85.44 1000
2048 17.71 0.94 100.48 1000
4096 39.10 4.10 227.27 1000
8192 50.44 4.67 294.51 1000
16384 64.91 6.21 375.91 1000
32768 100.64 9.80 583.28 1000
65536 171.64 22.55 966.53 100
131072 199.31 34.74 997.37 100
262144 259.61 68.76 927.96 100
524288 588.19 144.79 1817.62 100
1048576 1988.30 1510.59 3094.54 100
The benchmark itself completes, but at the end of the run (inside ompi_mpi_finalize, per the backtrace) I get a segfault:
==== backtrace (tid: 136568) ====
0 libucs.so.0(ucs_handle_error+0x254) [0x7fe976256594]
1 libucs.so.0(+0x2d777) [0x7fe976256777]
2 libucs.so.0(+0x2da4e) [0x7fe976256a4e]
3 /lib64/libpthread.so.0(+0x12b20) [0x7fe977b53b20]
4 /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_update_context_cache_on_group_destruction+0x9e) [0x7fe9774ff84e]
5 /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_context_free+0x148) [0x7fe9774fd4c8]
6 libmpi.so.80(+0x1102d3) [0x7fe9787a02d3]
7 libmpi.so.80(+0x678f9) [0x7fe9786f78f9]
8 libmpi.so.80(ompi_attr_delete_all+0x173) [0x7fe9786f9293]
9 libmpi.so.80(ompi_comm_free+0x3c) [0x7fe9786fc5ec]
10 libmpi.so.80(+0x150246) [0x7fe9787e0246]
11 libmpi.so.80(mca_coll_base_comm_unselect+0x1d79) [0x7fe97878a379]
12 libmpi.so.80(+0x69cbc) [0x7fe9786f9cbc]
13 libmpi.so.80(+0x6a2c9) [0x7fe9786fa2c9]
14 libopen-pal.so.80(opal_finalize_cleanup_domain+0x4a) [0x7fe978b1292a]
15 libopen-pal.so.80(opal_finalize+0x3f) [0x7fe978b12a9f]
16 libmpi.so.80(ompi_rte_finalize+0x13a) [0x7fe9787281fa]
17 libmpi.so.80(+0x9df6c) [0x7fe97872df6c]
18 libmpi.so.80(ompi_mpi_instance_finalize+0xc5) [0x7fe97872f595]
19 libmpi.so.80(ompi_mpi_finalize+0x163) [0x7fe978723f73]
20 osu_reduce() [0x402716]
21 /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fe97779f493]
22 osu_reduce() [0x40294e]
=================================
It looks like coll/hcoll passes a corrupted or invalid context to hcoll_context_free.
This might have the same cause as #9885 (?)
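To make the suspicion above concrete, here is a minimal, self-contained sketch of the general failure pattern I have in mind: an enable step that fails partway through, followed by a communicator-destruction callback that still tries to free the context. This is not Open MPI or HCOLL code. Every name in it (`toy_module_t`, `toy_save_handlers`, `toy_module_enable`, `toy_attr_delete`) is hypothetical, and I have not confirmed that this is what actually happens in coll/hcoll. The guarded variant shown here releases the context on the error path and makes the delete callback a no-op when no valid context exists.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins only; none of these are Open MPI or HCOLL symbols. */
typedef struct {
    int *context;  /* stands in for a library context handle */
    int  enabled;  /* set to 1 only if enable ran to completion */
} toy_module_t;

/* Stand-in for the step that fails in the log above; failure is injected. */
static int toy_save_handlers(int should_fail)
{
    return should_fail ? -1 : 0;
}

/* Enable the module. On failure, release the context and clear the pointer
 * so that later cleanup cannot see a stale or half-initialized handle. */
static int toy_module_enable(toy_module_t *m, int should_fail)
{
    m->enabled = 0;
    m->context = malloc(sizeof *m->context);
    if (m->context == NULL) {
        return -1;
    }
    *m->context = 42;

    if (toy_save_handlers(should_fail) != 0) {
        free(m->context);
        m->context = NULL;
        return -1;
    }

    m->enabled = 1;
    return 0;
}

/* Stand-in for an attribute delete callback run at communicator destruction.
 * Returns 1 if a context was freed, 0 if there was nothing valid to free. */
static int toy_attr_delete(toy_module_t *m)
{
    if (m->context == NULL) {
        return 0;
    }
    free(m->context);
    m->context = NULL;
    m->enabled = 0;
    return 1;
}
```

With this guard, calling `toy_attr_delete` after a failed `toy_module_enable` returns 0 instead of touching freed or uninitialized memory. Whether the real fix belongs in coll/hcoll, in coll/han, or in HCOLL itself is something I can't tell from the backtrace alone.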