NA OFI: ofi+gni __gnix_rma_copy_chained_get_data() SEGV

Open soumagne opened this issue 6 years ago • 4 comments

Attached are 4 stack traces of this dump. The config is a little bit different than the previous one in that we have 4 DeltaFS compaction threads instead of 1 (so total of 11 threads instead of 8).

Traces 1, 2, and 4 all crash in __gnix_rma_copy_chained_get_data() with a SEGV. Trace 3 dies with an assert due to this error:

"Ignoring CQ event as the op is completed."

All these runs use 64 KNL compute nodes on LANL Trinitite.

trace1.txt trace2.txt trace3.txt trace4.txt

soumagne avatar Mar 27 '19 15:03 soumagne

Would it be possible to build a debug version of libfabric on trinitite and then get a core file(s)? I can take a look and see what might be going on. This is in a part of the gni provider that deals with unaligned GETs.

hppritcha avatar Apr 10 '19 15:04 hppritcha

Yes, I can do this on TT. Unfortunately my LANL account accidentally expired and I'm waiting for them to re-enable it so that my cryptocard access works once again. I will do this once I can actually log in to TT again!

chuckcranor avatar Apr 10 '19 17:04 chuckcranor

Howard: I've got those core files on trinitite now. I will email you details.

chuckcranor avatar May 07 '19 23:05 chuckcranor

Hi Jerome, I saw a similar issue with HEPnOS on Theta (KNL) while running on 128 nodes: 16 server nodes with 2 server processes per node, and 112 client nodes with 2 client processes per node.

The top of the stack looks similar to this issue, but I am not sure if the root cause is the same. I do not get the message "Ignoring CQ event as the op is completed."

Please find the stack trace below

-1.2.11-3.12.1.x86_64
(gdb) bt full
#0 0x00002aaaab5cfb94 in __gnix_rma_copy_chained_get_data () from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#1 0x00002aaaab5d28a3 in __gnix_rma_txd_complete () from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#2 0x00002aaaab5c981e in __nic_tx_progress () from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#3 0x00002aaaab5cae2d in _gnix_nic_progress () from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#4 0x00002aaaab5ceb32 in _gnix_prog_progress () from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#5 0x00002aaaab5972a6 in __gnix_cq_sreadfrom.isra.0 () from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#6 0x00002aaaaac93fd3 in fi_cq_readfrom (src_addr=0x2aaaf23ffac0, count=16, buf=0x2aaaf23ffb40, cq=0x5df870) at /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/include/rdma/fi_eq.h:400
No locals.
#7 na_ofi_cq_read (max_count=16, context=0x6ac370, actual_count=, src_err_addrlen=, src_err_addr=, src_addrs=0x2aaaf23ffac0, cq_events=0x2aaaf23ffb40) at /home/sramesh/MOCHI/mercury/src/na/na_ofi.c:2576
cq_err = {op_context = 0x6d7930, flags = 1034, len = 335, buf = 0x0, data = 0, tag = 4294967339, olen = 0, err = 99, prov_errno = 0, err_data = 0x2aaaf23ffa40, err_data_size = 48}
ret = NA_SUCCESS
cq_hdl = 0x5df870
rc =
cq_hdl =
cq_err =
ret =
rc =
func = "na_ofi_cq_read"
na_ofi_op_id =
na_ofi_op_id =

srini009 avatar Oct 13 '20 21:10 srini009