ADIOS2
Struggling to set `RoundRobin` parameter with SST
Hello,
I am trying to set the `RoundRobin` step distribution mode in my SST writer, but the default `AllToAll` mode always seems to be used no matter how I set the parameter.
Extra context: We started investigating the use of this library in another Adios discussion here.
To Reproduce: We have set up a minimal environment for you. In summary, we have N clients, each of which is a writer, and M server processes, each of which is a reader. We use SST as the engine, and AllToAll communication runs successfully.
However, when I try to set `StepDistributionMode` to `RoundRobin`, nothing changes. All M servers receive all steps.
We tried to set the parameter using a variety of methods:
adios = adios2.ADIOS(comm=comm)
io = adios.DeclareIO("writerIO")
io.SetEngine("SST")
print(f"Setting distribution mode to {args.step_mode}")
io.SetParameters({"StepDistributionMode": args.step_mode})
# io.SetParameter("StepDistributionMode", args.step_mode)
But neither of these methods changes the behavior of the writer.
Here you can clone the minimal working repository at https://gitlab.inria.fr/mschoule/adios2-melissa-simple-demo
And to test it you can run:
python3 launcher.py --server_np=2 --n_client=4 --client_np=2 --n_step=100 --thread_data --step_mode RoundRobin
or
python3 launcher.py --server_np=2 --n_client=4 --client_np=2 --n_step=100 --thread_data --step_mode AllToAll
The timesteps collected by the M servers are saved to `time_step_<rank>.json` in the top directory. As you will see, the same output is produced for both modes, meaning all M server processes got all steps from all simulations.
Expected behavior: RoundRobin should follow the documented description from the ADIOS documentation:
“RoundRobin”, each step is delivered only to a single reader, determined in a round-robin fashion based upon the number or readers who have opened the stream at the time the step is submitted.
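To make the difference between the two modes concrete, here is a small pure-Python model of how steps would be assigned to readers. It is only an illustration of the documented behaviour, not ADIOS2 code: the `distribute` function and its arguments are hypothetical, and it assumes the set of connected readers stays fixed for the whole run.

```python
from collections import defaultdict


def distribute(n_steps, n_readers, mode):
    """Toy model: map each reader index to the list of steps it receives."""
    received = defaultdict(list)
    for step in range(n_steps):
        if mode == "AllToAll":
            targets = range(n_readers)      # every reader gets every step
        elif mode == "RoundRobin":
            targets = [step % n_readers]    # exactly one reader per step
        else:
            raise ValueError(f"unknown mode: {mode}")
        for reader in targets:
            received[reader].append(step)
    return dict(received)


print(distribute(4, 2, "AllToAll"))    # {0: [0, 1, 2, 3], 1: [0, 1, 2, 3]}
print(distribute(4, 2, "RoundRobin"))  # {0: [0, 2], 1: [1, 3]}
```

With two readers, the expected RoundRobin result is that each `time_step_<rank>.json` holds roughly half of the steps rather than all of them.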
Desktop (please complete the following information):
- OS/Platform: Ubuntu 22.04
- Build: compiled from source
Thanks for the report. Can you please run with the environment variable "SstVerbose" set to a numeric value of 2 or more? That should let us know if SST is seeing the parameter, output will be something like this:
eisen@Endor build % export SstVerbose=2
eisen@Endor build % bin/TestCommonWrite sst tmp StepDistributionMode=RoundRobin
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CommonWriteTest
[ RUN ] CommonWriteTest.ADIOS2CommonWrite
Nx is set to 10 on Rank 0
Selecting DataPlane "evpath", priority 1 for use
Opening Stream "tmp"
Writer stream params are:
Param - RegistrationMethod=File
Param - RendezvousReaderCount=1
Param - QueueLimit=0 (unlimited)
Param - QueueFullPolicy=Block
Param - StepDistributionMode=StepsRoundRobin
Param - DataTransport=evpath
Param - ControlTransport=sockets
Param - NetworkInterface=(default)
Param - ControlInterface=(default to NetworkInterface if applicable)
Param - DataInterface=(default to NetworkInterface if applicable)
Param - CompressionMethod=None
Param - CPCommPattern=Min
Param - MarshalMethod=BP5
Param - FirstTimestepPrecious=False
Param - IsRowMajor=1 (not user settable)
Param - OpenTimeoutSecs=60 (seconds)
Param - SpeculativePreloadMode=Auto
Param - SpecAutoNodeThreshold=1
Param - ControlModule=select
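When the writer and reader processes are started from a Python launcher rather than from a shell, the same variable can be placed in the child environment. This is a minimal sketch, assuming the processes are started with `subprocess`; the helper name `sst_verbose_env` is made up for illustration, and the command in the comment is abbreviated from the one in the report.

```python
import os


def sst_verbose_env(level, base_env=None):
    """Return a copy of the environment with SstVerbose set to `level`."""
    env = dict(os.environ if base_env is None else base_env)
    env["SstVerbose"] = str(level)  # environment values must be strings
    return env


env = sst_verbose_env(2)
# Hypothetical usage with the launcher from the report:
# import subprocess
# subprocess.run(["python3", "launcher.py", "--step_mode", "RoundRobin"],
#                env=env, check=True)
```

Building a copy keeps the parent process's own environment unchanged, so verbosity can differ between writer and reader processes.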
Thanks for the tip, here is the output:
Opening Reader Stream.
Writer stream params are:
Param - RegistrationMethod=File
Param - RendezvousReaderCount=1
Param - QueueLimit=0 (unlimited)
Param - QueueFullPolicy=Block
Param - StepDistributionMode=StepsRoundRobin
Param - DataTransport=evpath
Param - ControlTransport=sockets
Param - NetworkInterface=(default)
Param - ControlInterface=(default to NetworkInterface if applicable)
Param - DataInterface=(default to NetworkInterface if applicable)
Param - CompressionMethod=None
Param - CPCommPattern=Min
Param - MarshalMethod=BP5
Param - FirstTimestepPrecious=False
Param - IsRowMajor=1 (not user settable)
Param - OpenTimeoutSecs=60 (seconds)
Param - SpeculativePreloadMode=Auto
Param - SpecAutoNodeThreshold=1
Param - ControlModule=select
Reader stream params are:
Param - RegistrationMethod=File
Param - DataTransport=evpath
Param - ControlTransport=sockets
Param - NetworkInterface=(default)
Param - ControlInterface=(default to NetworkInterface if applicable)
Param - DataInterface=(default to NetworkInterface if applicable)
Param - AlwaysProvideLatestTimestep=False
Param - OpenTimeoutSecs=1 (seconds)
Param - SpeculativePreloadMode=Auto
Param - SpecAutoNodeThreshold=1
Param - ControlModule=select
Which seems to indicate that, in fact, RoundRobin is set on the backend.
Interesting... I just checked to see that our CI test that covers RoundRobin distribution is still working, and it seems to be. You might kick that SstVerbose parameter up to '4', which should get you more detailed information about timestep distribution. Probably only necessary to do that on the writer side. Here's what a portion of the output looks like for our CI test, you can see the Round Robin distribution info and where each step was sent:
Writer 0 (0x15af1d620): Sending TimestepMetadata for timestep 5 (ref count 1), one to each reader
Writer 0 (0x15af1d620): Round Robin Distribution, step sent to reader 2
Writer 0 (0x15af1d620): Sent timestep 5 to reader cohort 2
Writer 0 (0x15af1d620): ADDING timestep 5 to sent list for reader cohort 2, READER 0x600002053400, reference count is now 2
Writer 0 (0x15af1d620): PRELOADMODE for timestep 5 non-default for reader , active at timestep 0, mode 1
DP Writer 0 (0x15af1d620): Per reader registration for timestep 5, preload mode 1
DP Writer 0 (0x15af1d620): Sending Speculative Preload messages, reader 0x600001b44900, timestep 5
Writer 0 (0x15af1d620): Removing dead entries
Writer 0 (0x15af1d620): QueueMaintenance complete
Writer 0 (0x15af1d620): Reader sent timestep list 0x600000c44210, trying to release 5
Writer 0 (0x15af1d620): Writer tagging timestep 3 as expired
DP Writer 0 (0x15af1d620): Releasing timestep 3
Writer 0 (0x15af1d620): Removing dead entries
Writer 0 (0x15af1d620): Remove queue Entries removing Timestep 3 (exp 1, Prec 0, Ref 0), Count now 2
Writer 0 (0x15af1d620): QueueMaintenance complete
DP Writer 0 (0x15af1d620): ProvideTimestep, registering timestep 6, data 0x15b046e00, fprint 41070373fd07d306
Writer 0 (0x15af1d620): Removing dead entries
Writer 0 (0x15af1d620): QueueMaintenance complete
Writer 0 (0x15af1d620): Sending TimestepMetadata for timestep 6 (ref count 1), one to each reader
Writer 0 (0x15af1d620): Round Robin Distribution, step sent to reader 0
Writer 0 (0x15af1d620): Sent timestep 6 to reader cohort 0
Writer 0 (0x15af1d620): ADDING timestep 6 to sent list for reader cohort 0, READER 0x600002053200, reference count is now 2
Writer 0 (0x15af1d620): PRELOADMODE for timestep 6 non-default for reader , active at timestep 0, mode 1
DP Writer 0 (0x15af1d620): Per reader registration for timestep 6, preload mode 1
DP Writer 0 (0x15af1d620): Sending Speculative Preload messages, reader 0x600001b44840, timestep 6
Writer 0 (0x15af1d620): Removing dead entries
Writer 0 (0x15af1d620): QueueMaintenance complete
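For long runs it can be easier to tally the distribution from the verbose log than to read it by eye. The sketch below assumes the writer log contains `Sent timestep N to reader cohort M` messages in the form shown above; that wording is taken from this thread and may differ between ADIOS2 versions. The function name `steps_per_reader` is hypothetical.

```python
import re
from collections import defaultdict

# Matches the "Sent timestep N to reader cohort M" messages seen in the log above.
SENT_RE = re.compile(r"Sent timestep (\d+) to reader cohort (\d+)")


def steps_per_reader(log_text):
    """Group timestep numbers by the reader cohort they were sent to."""
    per_reader = defaultdict(list)
    for step, cohort in SENT_RE.findall(log_text):
        per_reader[int(cohort)].append(int(step))
    return dict(per_reader)


sample = """\
Writer 0 (0x15af1d620): Round Robin Distribution, step sent to reader 2
Writer 0 (0x15af1d620): Sent timestep 5 to reader cohort 2
Writer 0 (0x15af1d620): Round Robin Distribution, step sent to reader 0
Writer 0 (0x15af1d620): Sent timestep 6 to reader cohort 0
"""
print(steps_per_reader(sample))  # {2: [5], 0: [6]}
```

A result with a single key, such as `{0: [0, 1, 2]}`, means the writer only ever saw one reader cohort, so round-robin had nothing to rotate over.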
Ok, I increased the writer verbosity as you suggested, and it produced the output below. I notice that I only have `sent to reader 0`, where you have `sent to reader 2` and `sent to reader 0` (I also only have cohort 0, while you have cohort 1 and cohort 2). This suggests that I may not be initializing the readers correctly. Do you have a minimal working example of initializing a round robin reader configuration?
Writer 0 (0x55715c02fcf0): Sst set to use sockets as a Control Transport
DP Writer 0 (0x55715c02fcf0): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x55715c02fcf0): Selecting DataPlane "evpath", priority 1 for use
Writer 0 (0x55715c02fcf0): Opening Stream "melissa.sid-0"
Writer 0 (0x55715c02fcf0): Writer stream params are:
Param - RegistrationMethod=File
Param - RendezvousReaderCount=1
Param - QueueLimit=0 (unlimited)
Param - QueueFullPolicy=Block
Param - StepDistributionMode=StepsRoundRobin
Param - DataTransport=evpath
Param - ControlTransport=sockets
Param - NetworkInterface=(default)
Param - ControlInterface=(default to NetworkInterface if applicable)
Param - DataInterface=(default to NetworkInterface if applicable)
Param - CompressionMethod=None
Param - CPCommPattern=Min
Param - MarshalMethod=BP5
Param - FirstTimestepPrecious=False
Param - IsRowMajor=1 (not user settable)
Param - OpenTimeoutSecs=60 (seconds)
Param - SpeculativePreloadMode=Auto
Param - SpecAutoNodeThreshold=1
Param - ControlModule=select
Writer 0 (0x55715c02fcf0): Stream "melissa.sid-0" waiting for 1 readers
Writer 0 (0x55715c02fcf0): Beginning writer-side reader open protocol
Writer 0 (0x55715c02fcf0): Finish writer-side reader open protocol for reader 0x55715c0488a0, reader ready response pending
Writer 0 (0x55715c02fcf0): (PID cb87, TID 7fbee9fed000) Waiting for Reader ready on WSR 0x55715c0488a0.
Writer 0 (0x55715c02fcf0): Reader Activate message received for Stream 0x55715c0488a0. Setting state to Established.
Writer 0 (0x55715c02fcf0): Parent stream reader count is now 1.
Writer 0 (0x55715c02fcf0): Reader ready on WSR 0x55715c0488a0, Stream established, Starting 0 LastProvided 0.
Writer 0 (0x55715c02fcf0): Finish opening Stream "melissa.sid-0"
DP Writer 0 (0x55715c02fcf0): ProvideTimestep, registering timestep 0, data 0x55715c04e970, fprint 5222418180845f8
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Sending TimestepMetadata for timestep 0 (ref count 1), one to each reader
Writer 0 (0x55715c02fcf0): Round Robin Distribution, step sent to reader 0
Writer 0 (0x55715c02fcf0): Sent timestep 0 to reader cohort 0
Writer 0 (0x55715c02fcf0): ADDING timestep 0 to sent list for reader cohort 0, READER 0x55715c0488a0, reference count is now 2
DP Writer 0 (0x55715c02fcf0): Per reader registration for timestep 0, preload mode 0
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
DP Writer 0 (0x55715c02fcf0): ProvideTimestep, registering timestep 1, data 0x55715c0568d0, fprint 5222418180845f8
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Sending TimestepMetadata for timestep 1 (ref count 1), one to each reader
Writer 0 (0x55715c02fcf0): Round Robin Distribution, step sent to reader 0
Writer 0 (0x55715c02fcf0): Sent timestep 1 to reader cohort 0
Writer 0 (0x55715c02fcf0): ADDING timestep 1 to sent list for reader cohort 0, READER 0x55715c0488a0, reference count is now 2
DP Writer 0 (0x55715c02fcf0): Per reader registration for timestep 1, preload mode 0
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
DP Writer 0 (0x55715c02fcf0): ProvideTimestep, registering timestep 2, data 0x55715c056e40, fprint 5222418180845f8
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Sending TimestepMetadata for timestep 2 (ref count 1), one to each reader
Writer 0 (0x55715c02fcf0): Round Robin Distribution, step sent to reader 0
Writer 0 (0x55715c02fcf0): Sent timestep 2 to reader cohort 0
Writer 0 (0x55715c02fcf0): ADDING timestep 2 to sent list for reader cohort 0, READER 0x55715c0488a0, reference count is now 2
DP Writer 0 (0x55715c02fcf0): Per reader registration for timestep 2, preload mode 0
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): SstWriterClose, Sending Close at Timestep 2, one to each reader
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Waiting for timesteps to be released in WriterClose
Writer 0 (0x55715c02fcf0): Reader sent timestep list 0x55715c04f910, trying to release 0
Writer 0 (0x55715c02fcf0): Writer tagging timestep 0 as expired
DP Writer 0 (0x55715c02fcf0): Releasing timestep 0
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): Remove queue Entries removing Timestep 0 (exp 1, Prec 0, Ref 0), Count now 2
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Waiting for timesteps to be released in WriterClose
Writer 0 (0x55715c02fcf0): Reader sent timestep list 0x55715c04e7b0, trying to release 1
Writer 0 (0x55715c02fcf0): Writer tagging timestep 1 as expired
DP Writer 0 (0x55715c02fcf0): Releasing timestep 1
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): Remove queue Entries removing Timestep 1 (exp 1, Prec 0, Ref 0), Count now 1
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Waiting for timesteps to be released in WriterClose
Writer 0 (0x55715c02fcf0): Reader sent timestep list 0x55715c04e4d0, trying to release 2
Writer 0 (0x55715c02fcf0): Writer tagging timestep 2 as expired
DP Writer 0 (0x55715c02fcf0): Releasing timestep 2
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): Remove queue Entries removing Timestep 2 (exp 1, Prec 0, Ref 0), Count now 0
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0):
Stream "melissa.sid-0" (0x55715c02fcf0) summary info:
Writer 0 (0x55715c02fcf0): Duration (secs) = 0.624963
Writer 0 (0x55715c02fcf0): Timesteps Created = 3
Writer 0 (0x55715c02fcf0): Timesteps Delivered = 3
Writer 0 (0x55715c02fcf0):
Writer 0 (0x55715c02fcf0): All timesteps are released in WriterClose
Writer 0 (0x55715c02fcf0): Destroying stream 0x55715c02fcf0, name melissa.sid-0
Writer 0 (0x55715c02fcf0): Reference count now zero, Destroying process SST info cache
Writer 0 (0x55715c02fcf0): Freeing LastCallList
Writer 0 (0x7ffeb3258700): SstStreamDestroy successful, returning
Follow up question:
When using RoundRobin with ADIOS, do all connected readers still need to call `BeginStep` and `EndStep` for all timesteps? I assume the writer is deciding who gets the data in this case.
I am trying to work around the issue above by running the round robin on the reader side, deciding which reader should read from the writer. But it seems that it still wants all readers to read all timesteps.
I realize this is likely the incorrect workaround, but I am unsure how else to achieve RoundRobin with our current setup. Perhaps our setup is unique or incorrect (although we are following the explicit instructions from https://github.com/ornladios/ADIOS2/discussions/3675#discussioncomment-6372520).
Here, we have created our exact configuration in an MWE for you to test out, in case you'd like to see how we are trying to use Adios2:
https://gitlab.inria.fr/mschoule/adios2-melissa-simple-demo
Ah, we may have a conceptual disconnect. It looks like you just have a single MPI reader application connected to the writer. That reader has multiple ranks, but since ADIOS is designed for communication between MPI applications, it assumes that all the writer/reader ranks in an application act cooperatively. None of SST's distribution modes come into play because there is only one reader application and it gets all the timesteps. Each of the reader's ranks might select different parts of the incoming arrays, but they will all come from the same set of data that the writer ranks created for that timestep. The RoundRobin distribution mode was designed to scatter created timesteps to multiple reader applications. There is a test in ADIOS that does this and you can try it by first running the writer like this:
bin/TestDistributionWrite SST RR.sst RendezvousReaderCount=2 --round_robin
This should wait for two readers to connect to it.
Then start up two separate terminal windows, cd to the same directory and in each one do:
bin/TestDistributionRead SST RR.SST --round_robin
If you have SstVerbose turned on you should see the timesteps alternating WRT which reader application they are delivered to.
Note that I didn't run with MPI above, so we only have a single rank for the writer and each of the two readers. They could each be MPI applications.
Sorry, I haven't had time to go through your demo, and may not yet today. But generally, if you pass an MPI communicator in to ADIOS initialization, then a bunch of things in ADIOS are collective operations. Every rank has to do Open(), BeginStep, EndStep, etc. However, you might get to where you want to be by NOT passing the MPI communicator in to ADIOS. Then each rank will operate completely independently, as if it were its own separate 1-rank application. That may be good or bad depending upon exactly what you're trying to do. (I.e., if you want everything to run sort of in lock-step, this isn't the way.)
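As a rough illustration of that suggestion, the sketch below opens an SST reader without a communicator, so the calling rank behaves like its own 1-rank reader application. It assumes the same CamelCase Python bindings as the writer snippet earlier in this thread (newer ADIOS2 releases expose a different high-level Python API). The function names, the `readerIO` name and the `handle_step` callback are all hypothetical.

```python
def read_all_steps(engine, ok_status, handle_step):
    """Consume steps until BeginStep stops returning `ok_status`."""
    n_read = 0
    while engine.BeginStep() == ok_status:  # blocks until a step arrives
        handle_step(engine)  # e.g. inquire variables and call Get here
        engine.EndStep()
        n_read += 1
    return n_read


def run_independent_reader(stream_name, handle_step):
    """Open an SST reader with no MPI communicator and drain the stream."""
    import adios2  # imported lazily so read_all_steps is usable without it

    adios = adios2.ADIOS()            # no comm: this rank acts as its own application
    io = adios.DeclareIO("readerIO")  # hypothetical IO name
    io.SetEngine("SST")
    engine = io.Open(stream_name, adios2.Mode.Read)
    try:
        return read_all_steps(engine, adios2.StepStatus.OK, handle_step)
    finally:
        engine.Close()
```

Each rank that calls `run_independent_reader` would then count as a separate reader from the writer's point of view, which is the situation the distribution modes apply to.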
Ok, thanks for conveying the internal philosophy of the Adios2 round robin distribution method.
Unfortunately - our application depends on all readers sitting on the same MPI application.
I will try your suggestion of not passing the mpi communicator to adios, thanks for the tip!
Just an update: I have managed to get `RoundRobin` working for our configuration by taking your suggestion and removing the MPI communicator from the reader-side `adios2.ADIOS()` initialization. The client side still takes its own client communicator (since our clients are individual MPI applications).
Now it is "working" in our toy example. But I am wondering, what are the ramifications on the ADIOS backend? You say "if you want everything to run in lock step then it isn't the way." Maybe I misunderstand, but we are still using our own MPI communicator on our side, so we have full control over the lock-step nature of our reading (in case we want/don't want that). So are you referring to something intrinsic to the ADIOS backend? For example, without the communicator, is there some undefined behavior possible in the step distribution on ADIOS' side?
In all cases, thanks a lot for your assistance. I think your previous tips enable us to move away from the toy example and try integration into our software.
WRT what I meant by that comment, I'd go back to ADIOS' origins. It was designed to pass information between timestep-oriented simulation and analysis jobs where the prominent data structures were global arrays decomposed across the writer ranks, with different portions of them consumed by each reader rank. In that context, ADIOS makes sure that the reader ranks are all working on the same timestep at the same time, etc. You just have a bit more of a novel use case, so ADIOS isn't in that role. I don't think there should be any undefined behavior (at least WRT MPI). Hopefully the more defined behavior is also appropriate for your situation. Reader-side ADIOS BeginStep() without a timeout will block until it gets data, which may hold up one of your ranks until it is its turn to get data sent to it (which might in turn hold up your whole application, because your own collective MPI operations might wait for that rank to run again). There is a timeout parameter to BeginStep that you can use to help manage that, but with RoundRobin, data sent to a particular reader is his to consume and won't be available to any other reader. So one reader that didn't do BeginStep for a while could have a queue while another might have run through all his data. Maybe that's not a problem because it just doesn't matter or your outside-of-ADIOS synchronization keeps that sort of thing in check. If it was a problem, you might also consider the `OnDemand` distribution mode, where each writer-side timestep is sent to the next reader that asks for it, rather than to specific readers in sequence as in RoundRobin. But again, it depends upon your use case. Happy to chat more if anything seems weird when you're integrating.
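The trade-off between the two modes can be sketched with a toy model in plain Python. This is not how ADIOS2 implements either mode; it only mimics the behaviour described above, under the simplifying assumptions that all steps are already queued at tick 0 and that each reader needs a fixed number of ticks per step. The `simulate` function and its parameters are hypothetical.

```python
def simulate(mode, n_steps, service_time):
    """Toy model: return (steps handled per reader, tick at which all work is done).

    service_time[r] is the number of ticks reader r needs to process one step.
    """
    n_readers = len(service_time)
    handled = [0] * n_readers
    if mode == "RoundRobin":
        # The writer fixes the owner of every step in advance.
        for step in range(n_steps):
            handled[step % n_readers] += 1
        finish = max(handled[r] * service_time[r] for r in range(n_readers))
    elif mode == "OnDemand":
        # Each step goes to whichever reader asks next, i.e. becomes free first.
        free_at = [0] * n_readers
        for _ in range(n_steps):
            r = min(range(n_readers), key=lambda i: free_at[i])
            free_at[r] += service_time[r]
            handled[r] += 1
        finish = max(free_at)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return handled, finish


# One fast reader (1 tick per step) and one slow reader (3 ticks per step).
print(simulate("RoundRobin", 12, [1, 3]))  # ([6, 6], 18)
print(simulate("OnDemand", 12, [1, 3]))    # ([9, 3], 9)
```

In this model RoundRobin leaves the fast reader idle while the slow one works through its backlog, whereas OnDemand shifts more steps to the fast reader. Whether that matters depends on how the readers are synchronized outside of ADIOS.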