Using SST with RDMA dataplane on OLCF Frontier
Describe the bug
We are currently trying to run this test case, https://github.com/SCOREC/pcms/blob/bcb4fc45c6304ce33f603b94a336a8f81a239dc7/test/test_proxy_coupling.cpp, using SST with RDMA or another high-performance dataplane, but are hitting what appear to be network-level initialization errors (details below).
Running with the BP4 engine completes without error.
To Reproduce
At this point we don't have a minimal reproducer. If there is a hello-world SST example with two executables, we would be happy to run that for testing. Something like the sketch below is what we have in mind.
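A minimal sketch, assuming the standard adios2 C++ API; the stream name, array sizes, and step count are arbitrary placeholders:

```cpp
// ---- sst_hello_write.cpp (writer executable) ----
#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nproc = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("writer");
    io.SetEngine("SST"); // select the SST engine explicitly

    const std::size_t n = 8; // elements per rank
    std::vector<double> data(n, static_cast<double>(rank));
    auto var = io.DefineVariable<double>(
        "data", {static_cast<std::size_t>(nproc) * n},
        {static_cast<std::size_t>(rank) * n}, {n});

    adios2::Engine writer = io.Open("hello_sst", adios2::Mode::Write);
    for (int step = 0; step < 5; ++step)
    {
        writer.BeginStep();
        writer.Put(var, data.data());
        writer.EndStep(); // publishes this timestep to connected readers
    }
    writer.Close();
    MPI_Finalize();
    return 0;
}

// ---- sst_hello_read.cpp (reader executable) ----
#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("reader");
    io.SetEngine("SST");

    adios2::Engine reader = io.Open("hello_sst", adios2::Mode::Read);
    while (reader.BeginStep() == adios2::StepStatus::OK)
    {
        auto var = io.InquireVariable<double>("data");
        std::vector<double> data;
        reader.Get(var, data); // deferred; completed by EndStep below
        reader.EndStep();
    }
    reader.Close();
    MPI_Finalize();
    return 0;
}
```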
Expected behavior
SST based communication without errors.
Platform:
- OS/Platform: OLCF Frontier
- Build:
modules for build and runtime (in addition to the defaults):

```
module load rocm  # 6.2.4
module load ninja # needed for adios2 - see configure script
```
list of all modules loaded at build and runtime:

```
$ module li

Currently Loaded Modules:
  1) craype-x86-trento
  2) libfabric/1.22.0
  3) craype-network-ofi
  4) perftools-base/24.11.0
  5) xpmem/2.11.3-1.3_gdbda01a1eb3d
  6) cray-pmi/6.1.15
  7) cce/18.0.1
  8) craype/2.7.33
  9) cray-dsmml/0.3.0
 10) cray-mpich/8.1.31
 11) cray-libsci/24.11.0
 12) PrgEnv-cray/8.6.0
 13) Core/25.03
 14) tmux/3.4
 15) darshan-runtime/3.4.6-mpi (E4S)
 16) hsi/default
 17) lfs-wrapper/0.0.1
 18) DefApps
 19) rocm/6.2.4
 20) ninja/1.12.1
```
Where:
E4S: Extreme-scale Scientific Software Stack (E4S), https://e4s.io/index.html
cmake config command for adios2 master @ 2463c56e0:

```
# ninja is needed to avoid the error discussed here:
# https://github.com/ornladios/ADIOS2/issues/4597
cmake -S $SOURCE_DIR/ADIOS2 -B $BUILD_DIR/adios2 \
  -G Ninja \
  -DCMAKE_INSTALL_PREFIX=$BUILD_DIR/adios2/install \
  -DADIOS2_USE_CUDA=OFF \
  -DCMAKE_CXX_COMPILER=CC \
  -DCMAKE_C_COMPILER=cc \
  -DCMAKE_Fortran_COMPILER=ftn \
  -DADIOS2_USE_Fortran:STRING=ON
```
srun flags
Based on the SST tutorial page (https://adios2.readthedocs.io/en/latest/tutorials/sst.html#sst-engine-example), we are including the srun network flags (--network=single_node_vni,job_vni).
runtime errors
We see the following output at runtime from our three job steps (two 'client' apps and one 'server' app):
client0.log
ADIOS2 SST Engine waiting for contact information file proxy_couple_xgc_delta_f_s2c to be created
terminate called after throwing an instance of 'std::runtime_error'
what(): [Tue Sep 09 15:53:11 2025] [ADIOS2 EXCEPTION] <Engine> <SstReader> <SstReader> : SstReader did not find active Writer contact info in file "proxy_couple_xgc_delta_f_s2c.sst". Timeout or non-current SST contact file?
client1.log and server.log
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '21181,21180').
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Selecting DataPlane "rdma" (preferred) for use
Reader found no CXI auth key
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Reader found no CXI auth key
none of the usable system fabrics are supported high speed interfaces (verbs, gni, psm2.) To use a compatible fabric that is being ignored (probably sockets), set the environment variable FABRIC_IFACE to the interface name. Check the output of fi_info to troubleshoot this message.
Could not find a valid transport fabric.
Additional context: None
Following up ...
From offline discussions, the following points were raised:
> Do you have `libfabric/1.22.0` in your module list?

Yes, see the module list in the OP.
> When you compile ADIOS, do you see `Libfabric support for the HPE CXI provider: TRUE`? Or, at the end of cmake, look for the message `Possible RDMA DataPlanes for SST: fabric MPI`.

Yes:

```
$ grep HPE adios2_cmake.log
-- Libfabric support for the HPE CXI provider: TRUE
$ grep Possible adios2_cmake.log
Possible RDMA DataPlanes for SST: fabric MPI
```

And for the record, here is the end of the adios2 cmake configure output:
```
ADIOS2 build configuration:
  ADIOS Version: 2.10.0.733
  C++ standard : 14
  C++ Compiler : CrayClang 18.0.1 CrayPrgEnv /opt/cray/pe/craype/2.7.33/bin/CC
  Fortran Compiler : Cray 18.0.1 CrayPrgEnv /opt/cray/pe/craype/2.7.33/bin/ftn

  Installation prefix: /lustre/orion/csc679/scratch/cwsmith/pcmsDev/builds/adios2/install
        bin: bin
        lib: lib64
    include: include
      cmake: lib64/cmake/adios2
     python: local/lib64/python3.12/site-packages

  Features:
    Library Type: shared
    Build Type:   Release
    Testing:  OFF
    Examples: OFF
    Build Options:
      DataMan          : ON
      DataSpaces       : OFF
      HDF5             : OFF
      HDF5_VOL         : OFF
      MHS              : ON
      SST              : ON
      Fortran          : ON
      MPI              : ON
      Python           : OFF
      PIP              : OFF
      BigWhoop         : OFF
      Blosc2           : OFF
      BZip2            : ON
      LIBPRESSIO       : OFF
      MGARD            : OFF
      MGARD_MDR        : OFF
      PNG              : ON
      SZ               : OFF
      ZFP              : OFF
      DAOS             : OFF
      IME              : OFF
      O_DIRECT         : ON
      Sodium           : ON
      Catalyst         : OFF
      SysVShMem        : ON
      UCX              : OFF
      ZeroMQ           : ON
      Profiling        : ON
      Endian_Reverse   : OFF
      Derived_Variable : OFF
      AWSSDK           : OFF
      XRootD           : OFF
      GPU_Support      : OFF
      CUDA             : OFF
      Kokkos           : OFF
      Kokkos_CUDA      : OFF
      Kokkos_HIP       : OFF
      Kokkos_SYCL      : OFF
      Campaign         : ON
      KVCACHE          : OFF
  Possible RDMA DataPlanes for SST: fabric MPI
-- Configuring done (6.1s)
-- Generating done (0.6s)
```
> Another option is trying to use the MPI dataplane. Can you make sure all parties initialize MPI with threading (see code below)? That is surely one reason for the "MPI_THREAD_MULTIPLE not supported" messages. I am not sure if that will allow selecting the MPI dataplane, but it's a step towards that.
>
> ```c
> // MPI_THREAD_MULTIPLE is only required if you enable the SST MPI_DP
> MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provide);
> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> MPI_Comm_size(MPI_COMM_WORLD, &size);
> ```
Update: Adding the call to `MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provide);` results in the log output below. With this and `export FABRIC_IFACE=cxi0` set, the behavior (hang after the communication rounds) is the same as the run with just `export FABRIC_IFACE=cxi0` set.

```
mpi thread level: 3 // the contents of the 'provide' arg to MPI_Init_thread
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '45529,45527').
... removed repeats
Writer found CXI auth key: 45527 5
mpi_dp priority=0 since: Heuristics determined poor compatibility.Writer found CXI auth key: 45527 5
... removed repeats
Opening Stream "proxy_couple_xgc_total_f_s2c"
Writer stream params are:
Param - RegistrationMethod=File
Param - RendezvousReaderCount=1
... snip
```
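As an aside, here is a self-contained version of that threaded initialization with a check on the granted level (the check itself is standard MPI practice, added here for illustration; it is not required by ADIOS2):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    int provided = 0;
    // Request full thread support; the MPI library may grant less.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
    {
        // SST's mpi_dp reports "priority=-1 ... MPI_THREAD_MULTIPLE not
        // supported by MPI" when this is the case.
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE not granted (level %d)\n",
                     provided);
    }
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // ... construct adios2::ADIOS(MPI_COMM_WORLD) and open SST streams ...
    MPI_Finalize();
    return 0;
}
```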
> Can you set the FABRIC_IFACE environment variable to "cxi0" and run with the RDMA data plane? This shouldn't be necessary, but I think there's a new bug and this may be a workaround.
Adding `export FABRIC_IFACE=cxi0` to the slurm job script results in a hang at the end of the code (location not pinpointed yet) after the requested 10 communication rounds. The errors listed in the OP are no longer in the logs (`Could not find a valid transport fabric.` from client1.log and server.log, and `Timeout or non-current SST contact file?` from client0.log). Excerpts from the new run logs are below.
server.log
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '40354,40353').
... removed repeats
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Selecting DataPlane "rdma", priority 100 for use
Writer found CXI auth key: 40353 5
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Writer found CXI auth key: 40353 5
.... removed repeats
Opening Stream "proxy_couple_xgc_total_f_s2c"
Writer stream params are:
Param - RegistrationMethod=File
Param - RendezvousReaderCount=1
Param - QueueLimit=0 (unlimited)
Param - QueueFullPolicy=Block
Param - StepDistributionMode=StepsAllToAll
Param - DataTransport=rdma
Param - ControlTransport=sockets
Param - NetworkInterface=(default)
Param - ControlInterface=(default to NetworkInterface if applicable)
Param - DataInterface=(default to NetworkInterface if applicable)
Param - CompressionMethod=None
Param - CPCommPattern=Min
Param - MarshalMethod=BP5
Param - FirstTimestepPrecious=False
Param - IsRowMajor=1 (not user settable)
Param - OpenTimeoutSecs=60 (seconds)
Param - SpeculativePreloadMode=Auto
Param - SpecAutoNodeThreshold=1
Param - ControlModule=select
Writer is using Minimum Connection Communication pattern (min)
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '40354,40353').
... removed repeats
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Selecting DataPlane "rdma", priority 100 for use
Writer found CXI auth key: 40353 5
mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.Writer found CXI auth key: 40353 5
... removed repeats
Opening Stream "proxy_couple_xgc_delta_f_s2c"
Writer stream params are:
Param - RegistrationMethod=File
Param - RendezvousReaderCount=1
Param - QueueLimit=0 (unlimited)
Param - QueueFullPolicy=Block
Param - StepDistributionMode=StepsAllToAll
Param - DataTransport=rdma
Param - ControlTransport=sockets
Param - NetworkInterface=(default)
Param - ControlInterface=(default to NetworkInterface if applicable)
Param - DataInterface=(default to NetworkInterface if applicable)
Param - CompressionMethod=None
Param - CPCommPattern=Min
Param - MarshalMethod=BP5
Param - FirstTimestepPrecious=False
Param - IsRowMajor=1 (not user settable)
Param - OpenTimeoutSecs=60 (seconds)
Param - SpeculativePreloadMode=Auto
Param - SpecAutoNodeThreshold=1
Param - ControlModule=select
....
Writer is using Minimum Connection Communication pattern (min)
Stream "proxy_couple_xgc_total_f_s2c" (0x3fa0060) summary info:
Duration (secs) = 6.8171
Timesteps Created = 14
Timesteps Delivered = 14
client0.log
very similar to server.log with different integers at the end of the SLINGSHOT_VNIS outputs:
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '40352,40353').
and the following output at the end
Stream "proxy_couple_xgc_delta_f_s2c" (0x485edb0) summary info:
Duration (secs) = 5.10266
Timestep Metadata Received = 14
Timesteps Consumed = 14
MetadataBytesReceived = 4752 (4.6 kB)
DataBytesReceived = 0 (0 bytes)
PreloadBytesReceived = 0 (0 bytes)
PreloadTimestepsReceived = 0
AverageReadRankFanIn = 2.7
Stream "proxy_couple_xgc_delta_f_c2s" (0x3fabc40) summary info:
Duration (secs) = 5.20468
Timesteps Created = 13
Timesteps Delivered = 13
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3736761.0 ON frontier03966 CANCELLED AT 2025-09-12T10:51:46 DUE TO TIME LIMIT ***
client1.log
same as client0.log, but with
RDMA Dataplane trying to check for an available CXI provider since environment variable SLINGSHOT_VNIS is defined (value: '40355,40353').
and the following at the end:
Stream "proxy_couple_xgc_total_f_s2c" (0x3fcf620) summary info:
Duration (secs) = 5.57707
Timestep Metadata Received = 14
Timesteps Consumed = 14
MetadataBytesReceived = 3568 (3.5 kB)
DataBytesReceived = 0 (0 bytes)
PreloadBytesReceived = 0 (0 bytes)
PreloadTimestepsReceived = 0
AverageReadRankFanIn = 2.7
Stream "proxy_couple_xgc_total_f_c2s" (0x3f35520) summary info:
Duration (secs) = 5.701
Timesteps Created = 12
Timesteps Delivered = 12
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3736761.2 ON frontier04602 CANCELLED AT 2025-09-12T10:51:46 DUE TO TIME LIMIT ***
Would you please retry a run on Frontier setting the environment variable FABRIC_IFACE="cxi0"? We've identified an issue and this may be a workaround until we get it fully resolved.
Also, the ADIOS executables TestCommonWrite and TestCommonRead, created if you build with testing enabled, make a relatively simple test case. We have to run with multiple nodes on Frontier so that Slingshot gets initialized, but a 4-node job with a simple script like:

```
export FABRIC_IFACE=cxi0
srun -n 2 -N 2 ${BIN}/TestCommonWrite SST tmp : -n 2 -N 2 ${BIN}/TestCommonRead SST tmp
```

should work, with BIN defined appropriately for the executable location. You shouldn't have to specify the IFACE, but we've got to get that fixed.
> Would you please retry a run on Frontier setting the environment variable FABRIC_IFACE="cxi0"? We've identified an issue and this may be a workaround until we get it fully resolved.
Yes, thank you. This is discussed in the comment above: search for "FABRIC_IFACE environment".
Sure, I'll give the adios test executables a shot.
The MPI_Init_thread results were added to the comment above: search for "MPI_Init_thread".
For what it is worth, I tried forcing use of the MPI dataplane via `DataTransport=MPI` for SST and initializing MPI with threading enabled (as discussed above), but hit the following output:
mpi_dp priority=0 since: Heuristics determined poor compatibility.Selecting DataPlane "rdma" (preferred) for use
followed by a crash.
Is running with the MPI dataplane on Frontier worth pursuing? If so, I could dig into this a bit more.
The "heuristics" involved here partially involve checking to see if the MPI in use is MPICH. Many non-MPICH MPIs have the required interfaces (like MPI_Open_port), but they don't actually work. It might be worth forcing the use of the MPI data plane, to see if it has a working interface even if not MPICH. However, if setting DataTransport=MPI is insufficient to make that happen, I'm not sure what else you can do at runtime, or at configure time. @vicentebolea ?
Setting DataTransport=MPI should be sufficient since setting that ignores the priority number.
Note that when the heuristic fails, it only sets the priority level of the MPI DP to 0, which is the baseline priority relative to the other DPs.
Confirming what Vicente said about `DataTransport=mpi` being sufficient. I just tried it, and both the selection and the operation worked (for the TestCommonWrite/Read test case):
DP Writer 0 (0x3b605c0): RDMA Dataplane evaluating viability, returning priority 10
DP Writer 0 (0x3b605c0): mpi_dp priority=0 since: Heuristics determined poor compatibility.DP Writer 0 (0x3b605c0): Prefered dataplane name is "mpi"
DP Writer 0 (0x3b605c0): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x3b605c0): Considering DataPlane "rdma" for possible use, priority is 10
DP Writer 0 (0x3b605c0): Considering DataPlane "mpi" for possible use, priority is 0
DP Writer 0 (0x3b605c0): Selecting DataPlane "mpi" (preferred) for use
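For anyone following along, forcing the dataplane from the C++ API looks like the sketch below; the IO and stream names are placeholders, while `SetEngine("SST")` and `DataTransport=mpi` are the settings confirmed above.

```cpp
#include <adios2.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    // The MPI dataplane requires MPI_THREAD_MULTIPLE (see earlier comments).
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("coupling"); // placeholder IO name
    io.SetEngine("SST");
    // Forcing a specific dataplane bypasses the priority heuristic.
    io.SetParameter("DataTransport", "mpi");

    adios2::Engine writer = io.Open("demo_stream", adios2::Mode::Write);
    writer.Close();
    MPI_Finalize();
    return 0;
}
```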
If I use

```c
adios2_set_parameter(Run.ioH, "DataTransport", "WAN");
adios2_set_parameter(Run.ioH, "ControlTransport", "enet");
```

do I need to use MPI_THREAD_MULTIPLE? Thanks!
@halehawk you do not need MPI_THREAD_MULTIPLE if you use a dataplane other than MPI
@vicentebolea great, thank you! But I kept getting a write-close error, such as the following:

```
RuntimeError: [Tue Oct 28 21:32:44 2025] [ADIOS2 EXCEPTION] <Engine> <SstReader> <BP5PerformGets> : Writer failed before returning data.
```
Not sure about your error, but it is definitely not using the mpi_dp:
DP Writer 31 (0x25655f0): mpi_dp priority=-1 since: MPI_THREAD_MULTIPLE not supported by MPI.DP Writer 31 (0x25655f0): RDMA Dataplane unloading
@vicentebolea Thanks! How can I debug the problem? Also, I'd like to try `srun -n 2 -N 2 ${BIN}/TestCommonWrite SST tmp : -n 2 -N 2 ${BIN}/TestCommonRead SST tmp`. How can I pass in SST WAN enet?
Hi. So, the "Writer failed before returning data" error is from the EVPath (WAN, sockets) dataplane. You'll get that if the writer side exits before the reader gets the data. Generally it means that the writer side has died in some uncontrolled shutdown, so that's something to investigate; the reason isn't clear from the logs. However, I would not try to force the enet control plane unless necessary, as the default should suffice. Are there issues if you run the simple srun test you mention above?
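For intuition, a sketch of the reader-side window where this error surfaces (names are placeholders): `Get` calls are deferred by default, and the data is actually fetched from the writer during `EndStep`/`PerformGets`, so a writer that dies before serving that fetch produces the error above.

```cpp
#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("reader");
    io.SetEngine("SST");

    adios2::Engine reader = io.Open("stream", adios2::Mode::Read);
    while (reader.BeginStep() == adios2::StepStatus::OK)
    {
        auto var = io.InquireVariable<double>("data");
        std::vector<double> buf;
        reader.Get(var, buf); // deferred: no data has moved yet
        // The fetch happens inside EndStep; if the writer has already
        // died, this is where "Writer failed before returning data"
        // is raised.
        reader.EndStep();
    }
    reader.Close();
    MPI_Finalize();
    return 0;
}
```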
@eisenhauer I just got a chance to run the TestCommon example using the commands:

```
mpiexec -n 2 ./TestCommonWrite SST tmp &> temp1 &
mpiexec -n 2 ./TestCommonRead SST tmp &> temp2
```

It looks like they are working, so I will try not to enforce enet control.
I didn't enforce the enet control, and my program now works. But I have another issue: my program ran 3 simulation IO steps (comprising 39 adios steps) and died due to the writer side closing. I figured it was caused by running out of memory. I looked at /proc/self/status and got the following:
| rank | VmPeak (KB) | VmSize (KB) | VmRSS (KB) | VmData (KB) |
|------|-------------|-------------|------------|-------------|
| 0    | 64696872    | 63585764    | 9607376    | 18463360    |
| 19   | 64977632    | 63534752    | 9520740    | 18414921    |
Do you know why VmData is almost twice VmRSS? I set `QueueLimit=13`, which is the number of adios2 steps in one simulation IO step. Can I cut down the VmData size? Thanks!
Sorry for the delay in responding. Looking at the logs, the QueueLimit isn't coming into play here. Generally the writer-side queue can build up if the reader isn't reading timesteps as fast as the writer is producing them. But I see in the logs that the readers are releasing timestep 29 before the writer is registering timestep 30 (the last one in your logs). That's not a guarantee that there wasn't a large queue sometime in the middle of the run (I haven't looked), but at least we know there's no big queue of data at the end that might be causing you to run out of memory. It looks like maybe the writer failed while at least some of the ranks were still in EndStep for TS 30, because I see no indication that metadata for 30 was ever sent to the readers. Unfortunately the logs don't really give me any insight into why the writer died. Any possibility of looking at cores? Running with a parallel debugger? Anything like that?
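For reference, the queue behavior discussed above is controlled through SST engine parameters; a minimal sketch of setting them (the IO and stream names are placeholders; `QueueLimit` and `QueueFullPolicy` appear in the stream-parameter logs earlier in this thread):

```cpp
#include <adios2.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("sim_output"); // placeholder IO name
    io.SetEngine("SST");
    // Buffer at most 13 unconsumed timesteps on the writer side; with
    // QueueFullPolicy=Block the writer blocks in EndStep when the queue
    // is full instead of discarding timesteps.
    io.SetParameter("QueueLimit", "13");
    io.SetParameter("QueueFullPolicy", "Block");

    adios2::Engine writer = io.Open("sim_stream", adios2::Mode::Write);
    writer.Close();
    MPI_Finalize();
    return 0;
}
```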
Do you have any suggestions on the debugger?
Just looking for something that might pop up and tell you if you got a segfault or something. I'd look at ddt or valgrind4hpc on Frontier. ddt would be more awkward for that at scale because you've got to be connected to it, but valgrind would slow down the entire job looking for any memory problems, so it's a bit of a tradeoff. You might also set `ulimit -c unlimited` in the job to see if you get core dumps from the writer when it dies.
How about building with address sanitizer?
> How about building with address sanitizer?
Also a good idea. Building the app and ADIOS with -fsanitize=address wouldn't be nearly as slow as valgrind. (However, my lazy tendencies usually lead me to prefer valgrind unless it's going to run for hours...)
Thanks for the tip! I will try.
— Reply to this email directly, view it on GitHub https://github.com/ornladios/ADIOS2/issues/4630#issuecomment-3553011668, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACAPEFGEQBL3LYVD4DXVXUT35R5MBAVCNFSM6AAAAACGLIH4OWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTKNJTGAYTCNRWHA . You are receiving this because you were mentioned.Message ID: @.***>