Using SST with RDMA dataplane on OLCF Frontier
Describe the bug
We are currently trying to run this test case, https://github.com/SCOREC/pcms/blob/bcb4fc45c6304ce33f603b94a336a8f81a239dc7/test/test_proxy_coupling.cpp, using SST with RDMA or another high-performance dataplane, but are hitting what appear to be network-level initialization errors (details below).
Running with the BP4 engine completes without error.
To Reproduce
At this point we don't have a minimal reproducer. If there is a hello-world SST example with two executables, we would be happy to run that for testing. Something like the sketch below is what we have in mind.
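A minimal sketch, assuming the standard adios2 C++ API; the stream name, array sizes, and step count are arbitrary placeholders:

```cpp
// ---- sst_hello_write.cpp (writer executable) ----
#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nproc = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("writer");
    io.SetEngine("SST"); // select the SST engine explicitly

    const std::size_t n = 8; // elements per rank
    std::vector<double> data(n, static_cast<double>(rank));
    auto var = io.DefineVariable<double>(
        "data", {static_cast<std::size_t>(nproc) * n},
        {static_cast<std::size_t>(rank) * n}, {n});

    adios2::Engine writer = io.Open("hello_sst", adios2::Mode::Write);
    for (int step = 0; step < 5; ++step)
    {
        writer.BeginStep();
        writer.Put(var, data.data());
        writer.EndStep(); // publishes this timestep to connected readers
    }
    writer.Close();
    MPI_Finalize();
    return 0;
}

// ---- sst_hello_read.cpp (reader executable) ----
#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("reader");
    io.SetEngine("SST");

    adios2::Engine reader = io.Open("hello_sst", adios2::Mode::Read);
    while (reader.BeginStep() == adios2::StepStatus::OK)
    {
        auto var = io.InquireVariable<double>("data");
        std::vector<double> data;
        reader.Get(var, data); // deferred; completed by EndStep below
        reader.EndStep();
    }
    reader.Close();
    MPI_Finalize();
    return 0;
}
```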
Expected behavior
SST based communication without errors.
Platform:
- OS/Platform: OLCF Frontier
- Build:
modules for build and runtime (in addition to the defaults):

```
module load rocm  # 6.2.4
module load ninja # needed for adios2 - see configure script
```
list of all modules loaded at build and runtime:

```
$ module li

Currently Loaded Modules:
  1) craype-x86-trento
  2) libfabric/1.22.0
  3) craype-network-ofi
  4) perftools-base/24.11.0
  5) xpmem/2.11.3-1.3_gdbda01a1eb3d
  6) cray-pmi/6.1.15
  7) cce/18.0.1
  8) craype/2.7.33
  9) cray-dsmml/0.3.0
 10) cray-mpich/8.1.31
 11) cray-libsci/24.11.0
 12) PrgEnv-cray/8.6.0
 13) Core/25.03
 14) tmux/3.4
 15) darshan-runtime/3.4.6-mpi (E4S)
 16) hsi/default
 17) lfs-wrapper/0.0.1
 18) DefApps
 19) rocm/6.2.4
 20) ninja/1.12.1
```
Where:
E4S: Extreme-scale Scientific Software Stack (E4S), https://e4s.io/index.html
cmake config command for adios2 master @ 2463c56e0:

```
# ninja is needed to avoid the error discussed here:
# https://github.com/ornladios/ADIOS2/issues/4597
cmake -S $SOURCE_DIR/ADIOS2 -B $BUILD_DIR/adios2 \
  -G Ninja \
  -DCMAKE_INSTALL_PREFIX=$BUILD_DIR/adios2/install \
  -DADIOS2_USE_CUDA=OFF \
  -DCMAKE_CXX_COMPILER=CC \
  -DCMAKE_C_COMPILER=cc \
  -DCMAKE_Fortran_COMPILER=ftn \
  -DADIOS2_USE_Fortran:STRING=ON
```
srun flags
Based on the SST tutorial page (https://adios2.readthedocs.io/en/latest/tutorials/sst.html#sst-engine-example), we are including the srun network flags (--network=single_node_vni,job_vni).
runtime errors
We see the following output at runtime from our three job steps (two 'client' apps and one 'server' app):
client0.log
ADIOS2 SST Engine waiting for contact information file proxy_couple_xgc_delta_f_s2c to be created
terminate called after throwing an instance of 'std::runtime_error'
what(): [Tue Sep 09 15:53:11 2025] [ADIOS2 EXCEPTION] <Engine> <SstReader> <SstReader> : SstReader did not find active Writer contact info in file "proxy_couple_xgc_delta_f_s2c.sst". Timeout or non-current SST contact file?
client1.log and server.log
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '21181,21180').
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Selecting DataPlane "rdma" (preferred) for use
Reader found no CXI auth key
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Reader found no CXI auth key
none of the usable system fabrics are supported high speed interfaces (verbs, gni, psm2.) To use a compatible fabric that is being ignored (probably sockets), set the environment variable FABRIC_IFACE to the interface name. Check the output of fi_info to troubleshoot this message.
Could not find a valid transport fabric.
Additional context: None
Following up ...
From offline discussions, the following points were raised:
> Do you have `libfabric/1.22.0` in your module list?

Yes, see the module list in the OP.
> When you compile ADIOS, do you see `Libfabric support for the HPE CXI provider: TRUE`? Or, at the end of cmake, look for the message `Possible RDMA DataPlanes for SST: fabric MPI`.

Yes:

```
$ grep HPE adios2_cmake.log
-- Libfabric support for the HPE CXI provider: TRUE
$ grep Possible adios2_cmake.log
Possible RDMA DataPlanes for SST: fabric MPI
```

And for the record, here is the end of the adios2 cmake configure output:
```
ADIOS2 build configuration:
  ADIOS Version: 2.10.0.733
  C++ standard : 14
  C++ Compiler : CrayClang 18.0.1 CrayPrgEnv /opt/cray/pe/craype/2.7.33/bin/CC
  Fortran Compiler : Cray 18.0.1 CrayPrgEnv /opt/cray/pe/craype/2.7.33/bin/ftn

  Installation prefix: /lustre/orion/csc679/scratch/cwsmith/pcmsDev/builds/adios2/install
        bin: bin
        lib: lib64
    include: include
      cmake: lib64/cmake/adios2
     python: local/lib64/python3.12/site-packages

  Features:
    Library Type: shared
    Build Type:   Release
    Testing:  OFF
    Examples: OFF
    Build Options:
      DataMan          : ON
      DataSpaces       : OFF
      HDF5             : OFF
      HDF5_VOL         : OFF
      MHS              : ON
      SST              : ON
      Fortran          : ON
      MPI              : ON
      Python           : OFF
      PIP              : OFF
      BigWhoop         : OFF
      Blosc2           : OFF
      BZip2            : ON
      LIBPRESSIO       : OFF
      MGARD            : OFF
      MGARD_MDR        : OFF
      PNG              : ON
      SZ               : OFF
      ZFP              : OFF
      DAOS             : OFF
      IME              : OFF
      O_DIRECT         : ON
      Sodium           : ON
      Catalyst         : OFF
      SysVShMem        : ON
      UCX              : OFF
      ZeroMQ           : ON
      Profiling        : ON
      Endian_Reverse   : OFF
      Derived_Variable : OFF
      AWSSDK           : OFF
      XRootD           : OFF
      GPU_Support      : OFF
      CUDA             : OFF
      Kokkos           : OFF
      Kokkos_CUDA      : OFF
      Kokkos_HIP       : OFF
      Kokkos_SYCL      : OFF
      Campaign         : ON
      KVCACHE          : OFF
  Possible RDMA DataPlanes for SST: fabric MPI
-- Configuring done (6.1s)
-- Generating done (0.6s)
```
> Another option is trying to use the MPI dataplane. Can you make sure all parties initialize MPI with threading (see code below)? That is surely one reason for the "MPI_THREAD_MULTIPLE not supported" messages. I am not sure if that will allow selecting the MPI dataplane, but it's a step towards that.
>
> ```c
> // MPI_THREAD_MULTIPLE is only required if you enable the SST MPI_DP
> MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provide);
> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> MPI_Comm_size(MPI_COMM_WORLD, &size);
> ```
Update: Adding the call to `MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provide);` results in the log output below. With this and `export FABRIC_IFACE=cxi0` set, the behavior (hang after the communication rounds) is the same as the run with just `export FABRIC_IFACE=cxi0` set.

```
mpi thread level: 3 // the contents of the 'provide' arg to MPI_Init_thread
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '45529,45527').
... removed repeats
Writer found CXI auth key: 45527 5
mpi_dp priority=0 since: Heuristics determined poor compatibility.Writer found CXI auth key: 45527 5
... removed repeats
Opening Stream "proxy_couple_xgc_total_f_s2c"
Writer stream params are:
Param - RegistrationMethod=File
Param - RendezvousReaderCount=1
... snip
```
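As an aside, here is a self-contained version of that threaded initialization with a check on the granted level (the check itself is standard MPI practice, added here for illustration; it is not required by ADIOS2):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    int provided = 0;
    // Request full thread support; the MPI library may grant less.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
    {
        // SST's mpi_dp reports "priority=-1 ... MPI_THREAD_MULTIPLE not
        // supported by MPI" when this is the case.
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE not granted (level %d)\n",
                     provided);
    }
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // ... construct adios2::ADIOS(MPI_COMM_WORLD) and open SST streams ...
    MPI_Finalize();
    return 0;
}
```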
> Can you set the FABRIC_IFACE environment variable to "cxi0" and run with the RDMA data plane? This shouldn't be necessary, but I think there's a new bug and this may be a workaround.
Adding `export FABRIC_IFACE=cxi0` to the slurm job script results in a hang at the end of the code (location not pinpointed yet) after the requested 10 communication rounds. The errors listed in the OP are no longer in the logs (`Could not find a valid transport fabric.` from client1.log and server.log, and `Timeout or non-current SST contact file?` from client0.log). Excerpts from the new run logs are below.
server.log
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '40354,40353').
... removed repeats
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Selecting DataPlane "rdma", priority 100 for use
Writer found CXI auth key: 40353 5
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Writer found CXI auth key: 40353 5
.... removed repeats
Opening Stream "proxy_couple_xgc_total_f_s2c"
Writer stream params are:
Param - RegistrationMethod=File
Param - RendezvousReaderCount=1
Param - QueueLimit=0 (unlimited)
Param - QueueFullPolicy=Block
Param - StepDistributionMode=StepsAllToAll
Param - DataTransport=rdma
Param - ControlTransport=sockets
Param - NetworkInterface=(default)
Param - ControlInterface=(default to NetworkInterface if applicable)
Param - DataInterface=(default to NetworkInterface if applicable)
Param - CompressionMethod=None
Param - CPCommPattern=Min
Param - MarshalMethod=BP5
Param - FirstTimestepPrecious=False
Param - IsRowMajor=1 (not user settable)
Param - OpenTimeoutSecs=60 (seconds)
Param - SpeculativePreloadMode=Auto
Param - SpecAutoNodeThreshold=1
Param - ControlModule=select
Writer is using Minimum Connection Communication pattern (min)
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '40354,40353').
... removed repeats
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Selecting DataPlane "rdma", priority 100 for use
Writer found CXI auth key: 40353 5
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Writer found CXI auth key: 40353 5
... removed repeats
Opening Stream "proxy_couple_xgc_delta_f_s2c"
Writer stream params are:
Param - RegistrationMethod=File
Param - RendezvousReaderCount=1
Param - QueueLimit=0 (unlimited)
Param - QueueFullPolicy=Block
Param - StepDistributionMode=StepsAllToAll
Param - DataTransport=rdma
Param - ControlTransport=sockets
Param - NetworkInterface=(default)
Param - ControlInterface=(default to NetworkInterface if applicable)
Param - DataInterface=(default to NetworkInterface if applicable)
Param - CompressionMethod=None
Param - CPCommPattern=Min
Param - MarshalMethod=BP5
Param - FirstTimestepPrecious=False
Param - IsRowMajor=1 (not user settable)
Param - OpenTimeoutSecs=60 (seconds)
Param - SpeculativePreloadMode=Auto
Param - SpecAutoNodeThreshold=1
Param - ControlModule=select
....
Writer is using Minimum Connection Communication pattern (min)
Stream "proxy_couple_xgc_total_f_s2c" (0x3fa0060) summary info:
Duration (secs) = 6.8171
Timesteps Created = 14
Timesteps Delivered = 14
client0.log
very similar to server.log with different integers at the end of the SLINGSHOT_VNIS outputs:
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '40352,40353').
and the following output at the end
Stream "proxy_couple_xgc_delta_f_s2c" (0x485edb0) summary info:
Duration (secs) = 5.10266
Timestep Metadata Received = 14
Timesteps Consumed = 14
MetadataBytesReceived = 4752 (4.6 kB)
DataBytesReceived = 0 (0 bytes)
PreloadBytesReceived = 0 (0 bytes)
PreloadTimestepsReceived = 0
AverageReadRankFanIn = 2.7
Stream "proxy_couple_xgc_delta_f_c2s" (0x3fabc40) summary info:
Duration (secs) = 5.20468
Timesteps Created = 13
Timesteps Delivered = 13
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3736761.0 ON frontier03966 CANCELLED AT 2025-09-12T10:51:46 DUE TO TIME LIMIT ***
client1.log
same as client0.log, but with
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '40355,40353').
and the following at the end:
Stream "proxy_couple_xgc_total_f_s2c" (0x3fcf620) summary info:
Duration (secs) = 5.57707
Timestep Metadata Received = 14
Timesteps Consumed = 14
MetadataBytesReceived = 3568 (3.5 kB)
DataBytesReceived = 0 (0 bytes)
PreloadBytesReceived = 0 (0 bytes)
PreloadTimestepsReceived = 0
AverageReadRankFanIn = 2.7
Stream "proxy_couple_xgc_total_f_c2s" (0x3f35520) summary info:
Duration (secs) = 5.701
Timesteps Created = 12
Timesteps Delivered = 12
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3736761.2 ON frontier04602 CANCELLED AT 2025-09-12T10:51:46 DUE TO TIME LIMIT ***
Would you please retry a run on Frontier setting the environment variable FABRIC_IFACE="cxi0"? We've identified an issue and this may be a workaround until we get it fully resolved.
Also, the ADIOS executables TestCommonWrite and TestCommonRead, created if you build with testing enabled, make a relatively simple test case. We have to run with multiple nodes on Frontier so that Slingshot gets initialized, but a 4-node job with a simple script like:

```
export FABRIC_IFACE=cxi0
srun -n 2 -N 2 ${BIN}/TestCommonWrite SST tmp : -n 2 -N 2 ${BIN}/TestCommonRead SST tmp
```

should work, with BIN defined appropriately for the executable location. You shouldn't have to specify the IFACE, but we've got to get that fixed.
> Would you please retry a run on Frontier setting the environment variable FABRIC_IFACE="cxi0"? We've identified an issue and this may be a workaround until we get it fully resolved.
Yes, thank you. This is discussed in the comment above: search for "FABRIC_IFACE environment".
Sure, I'll give the adios test executables a shot.
The MPI_Init_thread results were added to the comment above: search for "MPI_Init_thread".
For what it is worth, I tried forcing use of the MPI dataplane via `DataTransport=MPI` for SST and initializing MPI with threading enabled (as discussed above), but hit the following output:
mpi_dp priority=0 since: Heuristics determined poor compatibility.Selecting DataPlane "rdma" (preferred) for use
followed by a crash.
Is running with the MPI dataplane on Frontier worth pursuing? If so, I could dig into this a bit more.
The "heuristics" involved here partially involve checking to see if the MPI in use is MPICH. Many non-MPICH MPIs have the required interfaces (like MPI_Open_port), but they don't actually work. It might be worth forcing the use of the MPI data plane, to see if it has a working interface even if not MPICH. However, if setting DataTransport=MPI is insufficient to make that happen, I'm not sure what else you can do at runtime, or at configure time. @vicentebolea ?
Setting DataTransport=MPI should be sufficient since setting that ignores the priority number.
Note that when the heuristic fails, it only sets the priority level of the MPI DP to 0, which is the baseline priority relative to the other DPs.
Confirming what Vicente said about `DataTransport=mpi` being sufficient. I just tried it, and both the selection and the operation worked (for the TestCommonWrite/Read test case):
DP Writer 0 (0x3b605c0): RDMA Dataplane evaluating viability, returning priority 10
DP Writer 0 (0x3b605c0): mpi_dp priority=0 since: Heuristics determined poor compatibility.DP Writer 0 (0x3b605c0): Prefered dataplane name is "mpi"
DP Writer 0 (0x3b605c0): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x3b605c0): Considering DataPlane "rdma" for possible use, priority is 10
DP Writer 0 (0x3b605c0): Considering DataPlane "mpi" for possible use, priority is 0
DP Writer 0 (0x3b605c0): Selecting DataPlane "mpi" (preferred) for use
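For anyone following along, forcing the dataplane from the C++ API looks like the sketch below; the IO and stream names are placeholders, while `SetEngine("SST")` and `DataTransport=mpi` are the settings confirmed above.

```cpp
#include <adios2.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    // The MPI dataplane requires MPI_THREAD_MULTIPLE (see earlier comments).
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("coupling"); // placeholder IO name
    io.SetEngine("SST");
    // Forcing a specific dataplane bypasses the priority heuristic.
    io.SetParameter("DataTransport", "mpi");

    adios2::Engine writer = io.Open("demo_stream", adios2::Mode::Write);
    writer.Close();
    MPI_Finalize();
    return 0;
}
```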
If I use

```c
adios2_set_parameter(Run.ioH, "DataTransport", "WAN");
adios2_set_parameter(Run.ioH, "ControlTransport", "enet");
```

do I need to use MPI_THREAD_MULTIPLE? Thanks!
@halehawk you do not need MPI_THREAD_MULTIPLE if you use a dataplane other than MPI
@vicentebolea great, thank you! But I kept getting a write-close error, such as the following:

```
RuntimeError: [Tue Oct 28 21:32:44 2025] [ADIOS2 EXCEPTION] <Engine> <SstReader> <BP5PerformGets> : Writer failed before returning data.
```
Not sure about your error, but it is definitely not using the mpi_dp:
DP Writer 31 (0x25655f0): mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.DP Writer 31 (0x25655f0): RDMA Dataplane unloading
@vicentebolea Thanks! How can I debug the problem? Also, I'd like to try `srun -n 2 -N 2 ${BIN}/TestCommonWrite SST tmp : -n 2 -N 2 ${BIN}/TestCommonRead SST tmp`. How can I pass in SST WAN enet?
Hi. So, the "Writer failed before returning data" error is from the EVPath (WAN, sockets) dataplane. You'll get that if the writer side exits before the reader gets the data. Generally it means that the writer side has died in some uncontrolled shutdown, so that's something to investigate; the reason isn't clear from the logs. However, I would not try to force the enet control plane unless necessary, as the default should suffice. Are there issues if you run the simple srun test you mention above?
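For intuition, a sketch of the reader-side window where this error surfaces (names are placeholders): `Get` calls are deferred by default, and the data is actually fetched from the writer during `EndStep`/`PerformGets`, so a writer that dies before serving that fetch produces the error above.

```cpp
#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("reader");
    io.SetEngine("SST");

    adios2::Engine reader = io.Open("stream", adios2::Mode::Read);
    while (reader.BeginStep() == adios2::StepStatus::OK)
    {
        auto var = io.InquireVariable<double>("data");
        std::vector<double> buf;
        reader.Get(var, buf); // deferred: no data has moved yet
        // The fetch happens inside EndStep; if the writer has already
        // died, this is where "Writer failed before returning data"
        // is raised.
        reader.EndStep();
    }
    reader.Close();
    MPI_Finalize();
    return 0;
}
```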
@eisenhauer I just got a chance to run the TestCommon example using the commands:

```
mpiexec -n 2 ./TestCommonWrite SST tmp &> temp1 &
mpiexec -n 2 ./TestCommonRead SST tmp &> temp2
```

It looks like they are working, so I will try not to enforce enet control.
I didn't enforce the enet control, and my program now works. But I have another issue: my program ran 3 simulation IO steps (comprising 39 adios steps) and died due to the writer side closing. I figured it was caused by running out of memory. I looked at /proc/self/status and got the following:
| rank | VmPeak (KB) | VmSize (KB) | VmRSS (KB) | VmData (KB) |
|------|-------------|-------------|------------|-------------|
| 0    | 64696872    | 63585764    | 9607376    | 18463360    |
| 19   | 64977632    | 63534752    | 9520740    | 18414921    |
Do you know why VmData is almost twice VmRSS? I set `QueueLimit=13`, which is the number of adios2 steps in one simulation IO step. Can I cut down the VmData size? Thanks!
Sorry for the delay in responding. Looking at the logs, the QueueLimit isn't coming into play here. Generally the writer-side queue can build up if the reader isn't reading timesteps as fast as the writer is producing them. But I see in the logs that the readers are releasing timestep 29 before the writer is registering timestep 30 (the last one in your logs). That's not a guarantee that there wasn't a large queue sometime in the middle of the run (I haven't looked), but at least we know there's no big queue of data at the end that might be causing you to run out of memory. It looks like maybe the writer failed while at least some of the ranks were still in EndStep for TS 30, because I see no indication that metadata for 30 was ever sent to the readers. Unfortunately the logs don't really give me any insight into why the writer died. Any possibility of looking at cores? Running with a parallel debugger? Anything like that?
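For reference, the queue behavior discussed above is controlled through SST engine parameters; a minimal sketch of setting them (the IO and stream names are placeholders; `QueueLimit` and `QueueFullPolicy` appear in the stream-parameter logs earlier in this thread):

```cpp
#include <adios2.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("sim_output"); // placeholder IO name
    io.SetEngine("SST");
    // Buffer at most 13 unconsumed timesteps on the writer side; with
    // QueueFullPolicy=Block the writer blocks in EndStep when the queue
    // is full instead of discarding timesteps.
    io.SetParameter("QueueLimit", "13");
    io.SetParameter("QueueFullPolicy", "Block");

    adios2::Engine writer = io.Open("sim_stream", adios2::Mode::Write);
    writer.Close();
    MPI_Finalize();
    return 0;
}
```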
Do you have any suggestions on the debugger?
Just looking for something that might pop up and tell you if you got a segfault or something. I'd look at ddt or valgrind4hpc on Frontier. ddt would be more awkward for that at scale because you've got to be connected to it, but valgrind would slow down the entire job looking for any memory problems, so it's a bit of a tradeoff. You might also set `ulimit -c unlimited` in the job to see if you get core dumps from the writer when it dies.
How about building with address sanitizer?
> How about building with address sanitizer?
Also a good idea. Building the app and ADIOS with -fsanitize=address wouldn't be nearly as slow as valgrind. (However, my lazy tendencies usually lead me to prefer valgrind unless it's going to run for hours...)
Thanks for the tip! I will try.
— Reply to this email directly, view it on GitHub https://github.com/ornladios/ADIOS2/issues/4630#issuecomment-3553011668, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACAPEFGEQBL3LYVD4DXVXUT35R5MBAVCNFSM6AAAAACGLIH4OWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTKNJTGAYTCNRWHA . You are receiving this because you were mentioned.Message ID: @.***>