Replace StreamStore and helpers with reduction function
The StreamStore is no longer needed since #1278, except that diagnostic classes still use it for doing a reduction over multiple states for output. I think we need to break apart this functionality since we really don't want the params to be mutable or to keep access to multiple states.
- Add an `EndRunActionInterface` that takes the core params, a core state, and a `Span<CoreState<M> const*>` of all states for performing a reduction in a multithreaded context. The action itself should know whether to do a global reduction or otherwise: probably it should always reduce to `StreamId{0}`. The action should check that `state.stream_id() < all_states.size() && all_states[state.stream_id().get()] == &state`. We probably want to add an MPI communicator to the core params so that we can perform reductions with dynamic parallelism.
- Define an output adapter that can take a state (or aux state data plus memspace?) and write that at the end of the program. This requires lifetime considerations: we either want the state itself to be shared outside of the stepper, or the aux state vec should become a shared pointer. I'm leaning toward the latter...
- Once we do this we can also get rid of the `max_streams` parameter.
- The reduction capability can later be extended so that in addition to begin/end run, we can have begin/end batch: see #809
- With reductions, we may need to be more careful about parallel execution. The `begin_run` action is taken when a `Stepper` is created, which (for celer-sim) may happen separately in warmup and inside the parallel loop, or (for the `accel` interface) will happen during the run manager's `BeginOfRunAction` call. We should make sure the `end_run` action is executed in parallel...
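To make the first bullet concrete, here is a minimal sketch of what an end-of-run reduction could look like. Everything in it is a simplified stand-in rather than existing code: `CoreParams` and `CoreState` are reduced to toy structs, `AllStates` replaces `Span<CoreState<M> const*>`, and `SumEnergyAction` with its `energy_deposition` tally is a hypothetical diagnostic. It only illustrates the consistency check and the "always reduce to stream 0" idea.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for the real core classes (assumptions for illustration)
struct CoreParams
{
};

struct CoreState
{
    std::size_t stream_id;     // index of the stream that owns this state
    double energy_deposition;  // hypothetical per-stream diagnostic tally
};

// Stand-in for Span<CoreState<M> const*>: non-owning pointers to all states
using AllStates = std::vector<CoreState const*>;

// Hypothetical interface: called once per state at the end of the run
class EndRunActionInterface
{
  public:
    virtual ~EndRunActionInterface() = default;
    virtual void end_run(CoreParams const& params,
                         CoreState& state,
                         AllStates const& all_states) const = 0;
};

// Example action: accumulate every stream's tally into the stream-0 state
class SumEnergyAction final : public EndRunActionInterface
{
  public:
    void end_run(CoreParams const&,
                 CoreState& state,
                 AllStates const& all_states) const final
    {
        // The state must be consistent with the list of all states
        assert(state.stream_id < all_states.size());
        assert(all_states[state.stream_id] == &state);

        // Only stream 0 performs the reduction; other streams do nothing
        if (state.stream_id != 0)
            return;

        for (CoreState const* other : all_states)
        {
            if (other != &state)
                state.energy_deposition += other->energy_deposition;
        }
    }
};
```

Note that the action itself stays immutable (`end_run` is `const`): the reduced result lands in the stream-0 state, not in the params.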
This is going to be troublesome for the different ways that we execute across threads. It's easy if we're doing OpenMP and know that all states are going to be starting and stopping at a synchronization point: we can send a vector of state references when each thread is finished but the states are still allocated. However, if we're running through Geant4 MT, the "EndOfRunAction" will be called individually on each thread and then on the "master" thread. But we have to deallocate the state on the original thread, which creates an ordering issue.
Perhaps instead of trying to make our destructor ordering work with Geant4's threading model, we add special cases for anything that has to add Geant4 objects:
- Hit manager
- Navigation states (if using geant geometry)
Now that we have a LocalTransporter, we could have it manage shared pointers to the hit processors, and then give weak pointers (for safety) plus raw pointers (for performance: since the hit processors are only "shared" within a single thread, we don't have to use locking) to the hit manager.
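A rough sketch of that ownership split, assuming toy `HitProcessor`, `HitManager`, and `LocalTransporter` classes that only share names with the real ones (the methods `register_processor`, `has_processor`, and `process` are made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy per-thread hit processor (stand-in, not the real class)
struct HitProcessor
{
    int num_processed = 0;
    void process() { ++num_processed; }
};

// Toy hit manager: one slot per stream, so each thread touches only its own
class HitManager
{
  public:
    explicit HitManager(std::size_t num_streams)
        : weak_(num_streams), raw_(num_streams, nullptr)
    {
    }

    // Called by the thread-local owner when it creates its processor
    void register_processor(std::size_t stream,
                            std::shared_ptr<HitProcessor> const& p)
    {
        weak_[stream] = p;       // for safety checks
        raw_[stream] = p.get();  // for fast, lock-free access
    }

    // Safety query: is the owning transporter still alive?
    bool has_processor(std::size_t stream) const
    {
        return !weak_[stream].expired();
    }

    // Fast path: raw pointer access, checked only in debug builds
    void process(std::size_t stream)
    {
        assert(this->has_processor(stream));
        raw_[stream]->process();
    }

  private:
    std::vector<std::weak_ptr<HitProcessor>> weak_;
    std::vector<HitProcessor*> raw_;
};

// Toy local transporter: sole owner of the thread-local hit processor
class LocalTransporter
{
  public:
    LocalTransporter(HitManager& manager, std::size_t stream)
        : processor_(std::make_shared<HitProcessor>())
    {
        manager.register_processor(stream, processor_);
    }

    HitProcessor const& processor() const { return *processor_; }

  private:
    std::shared_ptr<HitProcessor> processor_;
};
```

With this layout the processor is destroyed together with the local transporter, on the thread that created it, and the manager can detect a stale slot through the weak pointer instead of dereferencing a dangling raw pointer.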
We also need to add an aux state vector interface to StepInterface::process_steps so that step processors can keep stateful data without collection mirrors.
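As a sketch of what that could look like: the types below are simplified stand-ins, not the real signatures. `StepData` is a toy replacement for the step state, `AuxStateVec` is reduced to a vector of owning pointers, and `SimpleCalorimeter` is a hypothetical step processor. The idea is that the processor itself stays immutable and keeps its per-stream tally in the aux state passed to `process_steps`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Base class for per-stream auxiliary data (simplified stand-in)
struct AuxStateInterface
{
    virtual ~AuxStateInterface() = default;
};

// Simplified aux state vector: one owning slot per registered component
class AuxStateVec
{
  public:
    explicit AuxStateVec(std::size_t size) : data_(size) {}
    std::unique_ptr<AuxStateInterface>& at(std::size_t id)
    {
        return data_.at(id);
    }

  private:
    std::vector<std::unique_ptr<AuxStateInterface>> data_;
};

// Toy step data: energy deposited by each track in one step
struct StepData
{
    std::vector<double> energy_deposition;
};

// Hypothetical interface with the extra aux state argument
class StepInterface
{
  public:
    virtual ~StepInterface() = default;
    virtual void process_steps(StepData const& steps, AuxStateVec& aux) const
        = 0;
};

// Per-stream tally owned by the aux state vector
struct CalorimeterState final : AuxStateInterface
{
    double total = 0;
};

// Stateless step processor: all mutable data lives in the aux state
class SimpleCalorimeter final : public StepInterface
{
  public:
    explicit SimpleCalorimeter(std::size_t aux_id) : aux_id_(aux_id) {}

    // Allocate this component's slot in a stream's aux state vector
    void create_state(AuxStateVec& aux) const
    {
        aux.at(aux_id_).reset(new CalorimeterState);
    }

    void process_steps(StepData const& steps, AuxStateVec& aux) const final
    {
        for (double e : steps.energy_deposition)
            this->state(aux).total += e;
    }

    double total(AuxStateVec& aux) const { return this->state(aux).total; }

  private:
    std::size_t aux_id_;

    CalorimeterState& state(AuxStateVec& aux) const
    {
        auto* s = dynamic_cast<CalorimeterState*>(aux.at(aux_id_).get());
        assert(s);
        return *s;
    }
};
```

Each stream would own its own `AuxStateVec`, so two streams calling `process_steps` on the same `SimpleCalorimeter` never share mutable data.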
So the order of this will be:
- Change hit processor ownership so it's managed by the local transporter but a weak pointer is kept by the
- Have the local transporter register the states (or the stepper) with the main "shared params" so that the states can be merged and finalized at once.
- Add a special case in the `LocalTransporter` for deallocating Geant4 geometry states on thread (?)
- Then we can start making more components stateful and gatherable: action timers, calorimeters, etc.
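The registration step in the list above could be sketched roughly as follows. This is only an illustration under assumptions: `SharedParams` and `LocalState` here are toy classes, `action_time` is a made-up gatherable quantity, and `register_state`/`finalize` are hypothetical method names. The point is that shared ownership lets the data outlive the worker thread's transporter, so the merge can happen once on the main thread regardless of destruction order.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Toy thread-local state with one gatherable quantity (hypothetical)
struct LocalState
{
    std::size_t stream_id;
    double action_time;  // e.g. accumulated time from an action timer
};

// Toy shared params: collects states from all threads, merges them once
class SharedParams
{
  public:
    // Called from each worker thread, so registration must be locked
    void register_state(std::shared_ptr<LocalState const> state)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        states_.push_back(std::move(state));
    }

    std::size_t num_registered() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return states_.size();
    }

    // Called once at the end of the run: merge, then release all states
    double finalize()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        double total = 0;
        for (auto const& s : states_)
            total += s->action_time;
        states_.clear();
        return total;
    }

  private:
    mutable std::mutex mutex_;
    std::vector<std::shared_ptr<LocalState const>> states_;
};
```

This deliberately leaves out anything that must be destroyed on its original thread (such as Geant4 geometry states), which is why those would still need the special case in the local transporter.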