NA OFI: ofi+gni __gnix_rma_copy_chained_get_data() SEGV
Attached are 4 stack traces from this dump. The config differs slightly from the previous one: we have 4 DeltaFS compaction threads instead of 1 (so a total of 11 threads instead of 8).
Traces 1, 2, and 4 all crash in __gnix_rma_copy_chained_get_data() with a SEGV. Trace 3 dies with an assert due to this error:
"Ignoring CQ event as the op is completed."
All these runs use 64 KNL compute nodes on LANL Trinitite.
Would it be possible to build a debug version of libfabric on Trinitite and then get one or more core files? I can take a look and see what might be going on. This is in a part of the gni provider that deals with unaligned GETs.
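For readers unfamiliar with the term "unaligned GET", here is a minimal sketch of the arithmetic involved. It assumes the GET path requires 4-byte-aligned addresses and lengths, so an unaligned request has to be widened and the extra head and tail bytes dropped when the data is copied back to the user buffer. This is not the gni provider's code: `GET_ALIGN`, `struct aligned_get`, and `align_get` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: GETs must be 4-byte aligned in both address and length. */
#define GET_ALIGN 4u

/* Hypothetical description of a widened, aligned GET. */
struct aligned_get {
    uint64_t start; /* requested address rounded down to GET_ALIGN */
    size_t   len;   /* widened length, a multiple of GET_ALIGN */
    size_t   head;  /* extra bytes fetched before the requested range */
    size_t   tail;  /* extra bytes fetched after the requested range */
};

/* Widen the range [addr, addr + len) to GET_ALIGN boundaries. */
static struct aligned_get align_get(uint64_t addr, size_t len)
{
    const uint64_t mask = (uint64_t)GET_ALIGN - 1;
    struct aligned_get g;
    uint64_t end, aligned_end;

    assert(len > 0);

    end = addr + len;
    aligned_end = (end + mask) & ~mask;  /* round the end up */

    g.start = addr & ~mask;              /* round the start down */
    g.head  = (size_t)(addr - g.start);
    g.tail  = (size_t)(aligned_end - end);
    g.len   = (size_t)(aligned_end - g.start);
    return g;
}
```

For example, a 6-byte request at address 0x1003 widens to a 12-byte request at 0x1000 with 3 head bytes and 3 tail bytes. A copy step that runs after the transfer completes has to place only the middle 6 bytes in the user buffer, which is the kind of work a function named like `__gnix_rma_copy_chained_get_data()` would plausibly be doing when it crashes.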
Yes, I can do this on TT. Unfortunately my LANL account accidentally expired and I'm waiting for them to re-enable it so that my cryptocard access works again. I will do this once I can log in to TT again!
Howard: I've got those core files on trinitite now. I will email you details.
Hi Jerome, I saw a similar issue with HEPnOS on Theta (KNL) while running on 128 nodes: 16 server nodes with 2 server processes per node, and 112 client nodes with 2 client processes per node.
The top of the stack looks similar to this issue, but I am not sure if the root cause is the same. I do not get the message "Ignoring CQ event as the op is completed."
Please find the stack trace below:
-1.2.11-3.12.1.x86_64
(gdb) bt full
#0 0x00002aaaab5cfb94 in __gnix_rma_copy_chained_get_data ()
from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#1 0x00002aaaab5d28a3 in __gnix_rma_txd_complete ()
from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#2 0x00002aaaab5c981e in __nic_tx_progress ()
from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#3 0x00002aaaab5cae2d in _gnix_nic_progress ()
from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#4 0x00002aaaab5ceb32 in _gnix_prog_progress ()
from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#5 0x00002aaaab5972a6 in __gnix_cq_sreadfrom.isra.0 ()
from /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/lib/libfabric.so.1
No symbol table info available.
#6 0x00002aaaaac93fd3 in fi_cq_readfrom (src_addr=0x2aaaf23ffac0, count=16, buf=0x2aaaf23ffb40, cq=0x5df870)
at /gpfs/mira-home/sramesh/spack/opt/spack/cray-cnl7-haswell/gcc-7.3.0/libfabric-1.11.0-ialcbck7e2s44xcouak42d5fgvbjgf2k/include/rdma/fi_eq.h:400
No locals.
#7 na_ofi_cq_read (max_count=16, context=0x6ac370, actual_count=