Restructure stepper/transporter/state/params
The Stepper should be refactored, since it has grown from a single-purpose "do one step" object into a general interface:
- I don't think it's right that it owns the state, since we have to extract data from it occasionally. We also set `params.set_state(stream_id.get(), step_->sp_state())` in `accel` so we can do data reductions. (In total there are three accessors to get the state.) A decoupled alternative is sketched after this list.
- It's no longer a function-like object, since we have so many other functions attached to it.
- The `step(primaries)` overload is a little odd.
- The stepper owns an action sequence. Although the action times are currently "local", we ought to have one action sequence per Params.
- We should have an "end run" called over all states, and "begin run" should be over all states as well. This is necessary for global reductions.
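To make the ownership concern concrete, here is a minimal sketch of a stepper that borrows the state instead of owning it. All type names are stand-ins for the real Celeritas classes, and the signatures are assumptions, not the actual API:

```cpp
#include <memory>

// Sketch of decoupled ownership: params own the action sequence, the state
// is owned by the caller (one per stream), and the stepper borrows both.
struct CoreParams;  // would own the action sequence (one per Params)
struct CoreState;   // per-stream state, owned outside the stepper
struct StepCounts;  // active/alive/generated track counts

class Stepper
{
  public:
    Stepper(std::shared_ptr<CoreParams const> params, CoreState& state)
        : params_(std::move(params)), state_(&state)
    {
    }

    // Back to a single-purpose "do one step"; primaries are pushed separately
    StepCounts step();
    void push_primaries(/* span of primaries */);

  private:
    std::shared_ptr<CoreParams const> params_;
    CoreState* state_;  // borrowed: no state-extraction accessors needed
};
```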
Current construction order relating to actions:
- Core params input. Some of the components (e.g. physics) add actions.
- Core params, which also adds actions.
- During stepper construction, the action sequence is created
- Then state, which initializes auxiliary data
- Then `begin_run` is called on each state (the whole sequence is sketched below)
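The same order, compressed into an illustrative snippet (each struct is a stub standing in for the real class with that role, not the actual Celeritas API):

```cpp
// Compressed sketch of the construction order above
struct CoreParamsInput {};  // 1. components (e.g. physics) add actions here
struct CoreParams
{
    explicit CoreParams(CoreParamsInput const&) {}  // 2. adds more actions
};
struct ActionSequence
{
    explicit ActionSequence(CoreParams const&) {}  // 3. built during stepper construction
};
struct State
{
    State(CoreParams const&, ActionSequence const&) {}  // 4. initializes aux data
};

void construct_order()
{
    CoreParamsInput input;
    CoreParams params{input};
    ActionSequence actions{params};
    State state{params, actions};
    // 5. begin_run(params, state) would be called here on each state
}
```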
Use cases:
- Geant4 offloading: LocalTransporter registers a pointer to the state with the global parameters
- Celer-sim: Runner creates multiple transporters which are used to step in a parallel OpenMP loop
Inputs:
- Imported data from Geant4
- Callback functions to generate input functions: physics processes
- Options to construct built-in Celeritas actions (e.g. SD callback, calo, diagnostics)
- Callback to generate additional actions, including along-step (if we do an along-step action manager, it should include a mapping of {region, particle} -> along_step, sketched below; that stuff shouldn't be done in the CoreParams)
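For the along-step manager in the last bullet, a minimal sketch of what the {region, particle} mapping could look like (every name here is a placeholder, not a real Celeritas type):

```cpp
#include <map>
#include <memory>
#include <utility>

// {region, particle} -> along-step action lookup, owned by the manager
// rather than by CoreParams
struct AlongStepAction;           // opaque along-step action interface
using RegionId = unsigned int;    // placeholder region identifier
using ParticleId = unsigned int;  // placeholder particle identifier

using AlongStepMap = std::map<std::pair<RegionId, ParticleId>,
                              std::shared_ptr<AlongStepAction>>;
```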
Parallelism/multithread requirements:
- Be able to do parallel (all track slots, streams/CPU threads, MPI processes) operations at beginning, event boundaries (or "batches" of histories in parallel), and end
- Match Geant4 allocation requirements: state creation, use, and destruction must be on same thread
- (??) In thread-independent model, allow different memory management models so that one shared params can be used by different states
During construction
On the [M]ain thread or in [P]arallel:
BeginOfRunAction (main)
- [M] Load import data
- [M] Create global objects that don't have any associated actions (geometry, material, particle)
- [M] Create core params and incorporate any additional user actions
- [M] Create action sequence from core params, which should "finalize" the number of actions and anything action-related
- [M] Create state collection(s) with unallocated per-stream states, based on maximum local threads; potentially have CPU and GPU state collections side by side, incorporate knowledge of parallel MPI processes
BeginOfRunAction (worker)
- [P] Allocate & construct state from core params/action sequence
- [P] (??) State registers itself with state collection or params for parallel operations, or done implicitly as part of construction
- [P] State allocates and constructs auxiliary data (the main/worker split is sketched below)
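A minimal sketch of this main/worker split, with every type and function a placeholder rather than the real Celeritas API:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct CoreParams;    // placeholder: finalized params + action sequence
struct CoreState {};  // placeholder: per-stream state + aux data

struct SharedStates
{
    std::shared_ptr<CoreParams const> params;
    std::vector<std::shared_ptr<CoreState>> states;  // one slot per stream
};

// [M] Main thread: build params/actions and size the (empty) state slots
void begin_run_main(SharedStates& shared, std::size_t max_threads)
{
    // shared.params = build_core_params(...);  // "finalizes" all actions
    shared.states.resize(max_threads);  // slots only; nothing allocated yet
}

// [P] Each worker: allocate its own state and register it for parallel ops
void begin_run_worker(SharedStates& shared, std::size_t thread_id)
{
    shared.states[thread_id] = std::make_shared<CoreState>();
}
```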
Not currently done in Geant4
- [M] Register signal handler on main thread; if called, it sets "abort" flags on all states (note: currently done inside stepper, which is wrong in MT mode)
- [M] Call `begin_run` on all states (not yet implemented)
- [P] Warm up (note: this is done in celer-sim)
Not currently used for anything interesting
- `celeritas::TrackingManagerOffload` via `G4VPhysicsConstructor::ConstructProcess` via `G4RunManagerKernel::InitializePhysics`
- `TrackingManagerOffload::BuildPhysicsTable`, `TrackingManagerOffload::PreparePhysicsTable`
During execution
At runtime we have two different use cases:
- Synchronized events/batches across all threads; this would be for dosimetry, reactor applications, optical maps. We could have a single "event" and distribute it across all states on all threads and all processes.
- Independent events on each thread; this is for Geant4.
BeginOfEventAction (worker)
Asynchronous events:
- [P] Reseed with event ID
- [P] Call `begin_event` on each state independently?
Not currently done
Synchronized event (batch) at beginning:
- [P] Distribute primaries/initializers/generators across states
- [M] Call `begin_event` on all states (?)
PreUserTrackingAction or HandOverOneTrack
- [P] Push a track onto the stack
EndOfEventAction or FlushEvent
During event/batch, repeat (see the loop sketch at the end of this section):
- [P] Step
- [P] Kill active tracks if requested (e.g., user abort)
- Don't accumulate counters as part of the stepper: that should be a step action
Synchronized event (batch) at end:
- [M] Call `end_event` on all states
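The repeat-step loop might be spelled like the following sketch, loosely modeled on celer-sim; the stepper callable, its count type (contextually convertible to bool while tracks remain), and the abort hook are all assumptions:

```cpp
#include <functional>

// Per-event stepping loop: push primaries, then step until no tracks remain
// or an abort is requested (e.g. a signal handler set an abort flag).
template<class Stepper, class Primaries>
void flush_event(Stepper& step,
                 Primaries const& primaries,
                 std::function<bool()> const& abort_requested)
{
    auto counts = step(primaries);  // push primaries and take the first step
    while (counts)                  // true while any tracks remain active
    {
        if (abort_requested())
        {
            break;  // kill active tracks if requested (e.g. user abort)
        }
        counts = step();
    }
}
```

Note that any per-step counter accumulation is deliberately absent here: per the list above, that should be a step action rather than part of the stepper.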
EndOfRunAction (worker)
At end of run:
- [M] Call `end_run` on all states
- [P] Deallocate states
EndOfRunAction (main)
- Free Celeritas objects and memory
TODO:
- Refactor streams so that each state holds a `Stream` object rather than having to redirect into the `Device` object and allocate those streams there
- Eliminate `max_streams` once we get rid of the stream store.
- State counter diagnostic should be part of an action rather than hardcoded into the stepper
- Initializer/generator interface should let us accumulate the number of expected primaries (?)
- Along-step manager
- User-supplied callbacks to generate additional actions; core params setup/initialization should be more consolidated
I'd be grateful for input on this discussion from a parallelism standpoint (@amandalund) and Geant4 mechanics standpoint (@drbenmorgan). I'd like to have a structure that is compatible with all use cases and targeted at the Geant4 use case. The main questions I have are:
- Is it too restrictive to have a `StateVector` where each element corresponds to a stream/CPU thread? That would make it really easy to pass into "parallel reduce" methods at the beginning/end of run to (e.g.) sum energy deposition across threads, or action times, and output them. See the sketch below.
- Where should I put `begin_run` so that it's after all threads have been allocated? Maybe just on the last thread to call `BeginOfRunAction`? (But I assume one thread could call "begin event" before another calls "begin of run"... and we don't want to force a synchronization.) Or do we just try to eliminate the begin-of-run action? (Currently its primary use is "lazy" initialization of params that depend on the number of actions. If we "hardcode" the use of such objects so that they're added after user actions, we could get away with this.)
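A minimal sketch of such a `StateVector`, assuming an opaque per-stream `CoreState` type; the point is that begin/end-of-run reductions become a simple loop over states:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct CoreState;  // provided elsewhere; opaque in this sketch

// One element per stream/CPU thread
class StateVector
{
  public:
    explicit StateVector(std::size_t num_streams) : states_(num_streams) {}

    // Each worker thread inserts its state after allocating it
    void insert(std::size_t stream_id, std::shared_ptr<CoreState> state)
    {
        states_[stream_id] = std::move(state);
    }

    // "Parallel reduce" hook: sum a per-state quantity across all streams,
    // e.g. energy deposition or action times
    template<class T, class GetValue>
    T reduce(T init, GetValue&& get) const
    {
        for (auto const& s : states_)
        {
            if (s)
            {
                init += get(*s);
            }
        }
        return init;
    }

  private:
    std::vector<std::shared_ptr<CoreState>> states_;
};
```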
One possible option could be to make use of G4StateManager and G4VStateDependent to keep track of the begin/end of run/event (caveat emptor: I'd have to dig into the thread-locality semantics of G4StateManager). In short, concrete classes of G4VStateDependent get registered with G4StateManager and are notified every time the Geant4 internal state changes, as defined in G4ApplicationState.hh. By storing the previous state on each call, a concrete G4VStateDependent should be able to reliably detect begin/end of run/event; see for example G4VisStateDependent, and the sketch below.
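A sketch of that approach (the registration in the `G4VStateDependent` base-class constructor and the `G4ApplicationState` enum are real Geant4 mechanics; the specific transition pairs below are assumptions to verify against G4RunManager/G4EventManager):

```cpp
#include <G4ApplicationState.hh>
#include <G4VStateDependent.hh>

// The G4VStateDependent base constructor registers this instance with the
// (thread-local) G4StateManager; storing the previous state on each call
// lets us detect run/event boundaries.
class StateTransitionObserver final : public G4VStateDependent
{
  public:
    G4bool Notify(G4ApplicationState requested) override
    {
        if (prev_ == G4State_Idle && requested == G4State_GeomClosed)
        {
            // begin-of-run (before the user's BeginOfRunAction)
        }
        else if (prev_ == G4State_GeomClosed && requested == G4State_EventProc)
        {
            // begin-of-event
        }
        else if (prev_ == G4State_EventProc && requested == G4State_GeomClosed)
        {
            // end-of-event
        }
        else if (prev_ == G4State_GeomClosed && requested == G4State_Idle)
        {
            // end-of-run (after the user's EndOfRunAction)
        }
        prev_ = requested;
        return true;  // accept the state change
    }

  private:
    G4ApplicationState prev_{G4State_PreInit};
};
```

Since (per the later comment) G4StateManager instances are thread-local, one such observer would be registered per worker thread.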
However, this would change the point at which anything done in Celeritas is done relative to the user actions. For example, the Event processing transitions happen a bit before/after the actions are called:
- https://gitlab.cern.ch/geant4/geant4/-/blob/master/source/event/src/G4EventManager.cc?ref_type=heads#L103
- https://gitlab.cern.ch/geant4/geant4/-/blob/master/source/event/src/G4EventManager.cc?ref_type=heads#L344
This is a bit less certain in the RunManager, but it's safe to say the notification happens before the user begin-of-run action and after the user end-of-run action, and one could distinguish worker threads from the main thread here.
I can dig a bit deeper on the above if useful, but wanted to check I'm answering the right question first! Generally, the less Celeritas relies on user actions the better, but user actions are also the most portable mechanism across Geant4 versions, even if a little user setup is needed to add them in.
I've actually been looking at G4StateManager recently too. The documentation doesn't explain anything about thread locality; however, looking at the G4StateManager implementation, instances are thread-local. So my understanding is that the callback for a given G4VStateDependent instance will be done by the same thread that registered the instance.
I recently updated the documentation to try to better describe how Geant4 initialization works.
The places where we currently use `begin_run` (which takes only the local state and is called locally, at the end of Stepper construction right after the state is allocated) are:
- `OpticalLaunchAction` to defer state creation so that optical actions can be added after construction
- `ExtendFromSecondariesAction` to "warm up" the async alloc for improved profiling
- `SortTracksAction` to check consistency between the state and params action count
- `StatusChecker` to defer initialization of state sizes until runtime (should use aux data with `weak_ptr<ActionRegistry>`)
- `ActionDiagnostic` (same)
I think what we should do is:
- BeginOfRunAction on the main thread sets up SharedParams (what we currently do)
- It also allocates all GPU state(s) into one container. However, at this point we cannot allocate Geant4 objects (e.g., HitProcessor), nor can we store/query thread-local pointers.
- Geant4 state data should always be "lazily" created (sketched below), or we could add another hook for states that need thread-local initialization.
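A sketch of the lazy thread-local creation mentioned in the last bullet; `HitProcessor` is named above, but its construction here is a placeholder:

```cpp
#include <memory>

// Geant4 thread-local state built on first use from the worker thread,
// since it cannot be constructed during main-thread setup
struct HitProcessor
{
    HitProcessor() = default;  // must run on the worker thread that uses it
};

HitProcessor& local_hit_processor()
{
    // One instance per thread, created on first access by that thread
    static thread_local std::unique_ptr<HitProcessor> hp;
    if (!hp)
    {
        hp = std::make_unique<HitProcessor>();
    }
    return *hp;
}
```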
Having the unified initialization in setup::problem (called once on main thread) helps:
- Core params are created
- Diagnostics are added
- Additional user actions are added
- DURING SETUP: we actually construct the state vector and set up aux data (sketched below)
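A compressed sketch of what `setup::problem` could return under this scheme (the function name appears above; everything else is a placeholder):

```cpp
#include <memory>

struct CoreParams;   // placeholder: core params + diagnostics + user actions
struct StateVector;  // the per-stream state container sketched earlier

struct Problem
{
    std::shared_ptr<CoreParams const> params;
    std::shared_ptr<StateVector> states;  // constructed during setup
};

namespace setup
{
// Called once on the main thread
Problem problem(/* imported data, user options */);
}  // namespace setup
```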
For now let's not worry about synchronization at an event level; perhaps that would be better handled by some MPI-aware task library...
Also I just learned about G4Run::Merge. It looks like it's called only from worker threads as they end, and not from the main thread nor in serial mode, so I'm not sure it's useful.
@LSchwiebert This is the issue we have for refactoring the usage of the "stepper", so we can hack it to pieces in the meantime