
Restructure stepper/transporter/state/params

sethrj opened this issue 11 months ago · 7 comments

The Stepper should be refactored since it's grown from a single-purpose "do one step" to a general interface:

  1. I don't think it's right that it owns the state, since we have to extract data from it occasionally. We also call params.set_state(stream_id.get(), step_->sp_state()) in accel so we can do data reductions. (In total there are three accessors to get the state.)
  2. It's no longer a function-like object since we have so many other functions attached to it.
  3. The step(primaries) overload is a little odd.
  4. The stepper owns an action sequence. Although the action times currently are "local", we ought to have one action sequence per Params.
  5. We should have an "end run" called over all states, and "begin run" should be over all states as well. This is necessary for global reductions.
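As a rough illustration of points 1–4, a refactored interface might look like the following sketch. All names and types here are hypothetical stand-ins, not the actual Celeritas API: the stepper borrows a state owned elsewhere (e.g., by a state collection), the action sequence lives with the shared params, and the stepper is function-like again.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch only -- not the real Celeritas API.
struct CoreParams {};  // shared, immutable problem setup

struct CoreState
{
    std::vector<double> track_data;  // stand-in for per-track state
};

// One action sequence per Params rather than per Stepper
struct ActionSequence
{
    void step(CoreParams const&, CoreState& state)
    {
        state.track_data.push_back(1.0);  // pretend we advanced a track
    }
};

// Function-like again: does one step, borrows (does not own) the state
class Stepper
{
  public:
    Stepper(std::shared_ptr<CoreParams const> params,
            std::shared_ptr<ActionSequence> actions,
            CoreState& state)
        : params_(std::move(params)), actions_(std::move(actions)), state_(state)
    {
    }

    void operator()() { actions_->step(*params_, state_); }

  private:
    std::shared_ptr<CoreParams const> params_;
    std::shared_ptr<ActionSequence> actions_;
    CoreState& state_;  // owned by a state collection, not by the stepper
};
```

With this split, accel could hand the state directly to the params for reductions without reaching through the stepper's accessors.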

Current construction order relating to actions:

  1. Core params input. Some of the components (e.g. physics) add actions.
  2. Core params, which also adds actions.
  3. During stepper construction, the action sequence is created
  4. Then the state is created, which initializes auxiliary data
  5. Then begin_run is called on each state

Use cases:

  1. Geant4 offloading: LocalTransporter registers a pointer to the state with the global parameters
  2. Celer-sim: Runner creates multiple transporters which are used to step in a parallel openmp loop

sethrj — Dec 30 '24

Inputs:

  • Imported data from Geant4
  • Callback functions to generate input functions: physics processes
  • Options to construct built-in Celeritas actions (e.g. SD callback, calo, diagnostics)
  • Callback to generate additional actions, including along-step (if we do an along-step action manager, it should include a mapping of {region, particle} -> along_step; that stuff shouldn't be done in the CoreParams)

Parallelism/multithread requirements:

  • Be able to do parallel (all track slots, streams/CPU threads, MPI processes) operations at beginning, event boundaries (or "batches" of histories in parallel), and end
  • Match Geant4 allocation requirements: state creation, use, and destruction must be on same thread
  • (??) In thread-independent model, allow different memory management models so that one shared params can be used by different states

During construction

On the [M]ain thread or in [P]arallel:

BeginOfRunAction (main)

  • [M] Load import data
  • [M] Create global objects that don't have any associated actions (geometry, material, particle)
  • [M] Create core params and incorporate any additional user actions
  • [M] Create action sequence from core params, which should "finalize" the number of actions and anything action-related
  • [M] Create state collection(s) with unallocated per-stream states, based on maximum local threads; potentially have CPU and GPU state collections side by side, incorporate knowledge of parallel MPI processes

BeginOfRunAction (worker)

  • [P] Allocate & construct state from core params/action sequence
  • [P] (??) State registers itself with state collection or params for parallel operations, or done implicitly as part of construction
  • [P] State allocates and constructs auxiliary data
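A minimal model of this main/worker split (hypothetical names, assuming the state-collection approach above): the main thread sizes the collection without allocating anything, and each worker thread allocates and registers its own stream's state, so that allocation, use, and destruction stay on the same thread as Geant4 requires.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical sketch: per-stream states allocated lazily on worker threads
struct StreamState
{
    std::vector<double> tracks;
};

class StateCollection
{
  public:
    // [M] Size the collection up front; nothing is allocated yet
    explicit StateCollection(std::size_t num_streams) : states_(num_streams) {}

    // [P] Each worker allocates and registers its own stream's state
    StreamState& allocate(std::size_t stream_id, std::size_t capacity)
    {
        std::lock_guard<std::mutex> lock{mutex_};
        states_[stream_id] = std::make_unique<StreamState>();
        states_[stream_id]->tracks.resize(capacity);
        return *states_[stream_id];
    }

    // Parallel-reduce hook once all streams exist (e.g., begin_run over all states)
    std::size_t total_capacity() const
    {
        std::size_t total = 0;
        for (auto const& s : states_)
        {
            if (s)
                total += s->tracks.size();
        }
        return total;
    }

  private:
    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<StreamState>> states_;
};
```

The mutex only guards registration; once every worker has allocated, reductions over the collection can run lock-free.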

Not currently done in Geant4

  • [M] Register signal handler on main thread; if called, it sets "abort" flags on all states (note: currently done inside stepper, which is wrong in MT mode)
  • [M] Call begin_run on all states (not yet implemented)
  • [P] Warm up (note: this is done in celer-sim)

Not currently used for anything interesting

  • celeritas::TrackingManagerOffload via G4VPhysicsConstructor::ConstructProcess via G4RunManagerKernel::InitializePhysics
  • TrackingManagerOffload::BuildPhysicsTable, TrackingManagerOffload::PreparePhysicsTable

During execution

At runtime we have two different use cases:

  • Synchronized events/batches across all threads; this would be for dosimetry, reactor applications, and optical maps. We could have a single "event" and distribute it across all states on all threads and all processes.
  • Independent events on each thread; this is for Geant4.

BeginOfEventAction (worker)

Asynchronous events:

  • [P] Reseed with event ID
  • [P] Call begin_event on each state independently?

Not currently done

Synchronized event (batch) at beginning:

  • [P] Distribute primaries/initializers/generators across states
  • [M] Call begin_event on all states (?)

PreUserTrackingAction or HandOverOneTrack

  • [P] Push a track onto the stack

EndOfEventAction or FlushEvent

During event/batch, repeat:

  • [P] Step
  • [P] Kill active tracks if requested (e.g., user abort)
  • Don't accumulate counters as part of the stepper: that should be a step action
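The flush loop above could be sketched as follows (hypothetical names): step until no tracks remain active, with a user abort emptying the active set, and with counting deliberately left to step actions rather than the loop itself.

```cpp
#include <cstddef>

// Hypothetical sketch of the per-event/batch flush loop
struct LocalState
{
    std::size_t num_active{0};
    bool abort_requested{false};
};

// Returns the number of steps taken to drain the state
std::size_t flush(LocalState& state)
{
    std::size_t steps = 0;
    while (state.num_active > 0)
    {
        // [P] Step: here each step simply retires one track
        --state.num_active;
        ++steps;
        // [P] Kill active tracks if requested (e.g., user abort)
        if (state.abort_requested)
            state.num_active = 0;
        // Counters are *not* accumulated here: that belongs to a step action
    }
    return steps;
}
```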

Synchronized event (batch) at end:

  • [M] Call end_event on all states

EndOfRunAction (worker)

At end of run:

  • [M] Call end_run on all states
  • [P] Deallocate states

EndOfRunAction (main)

  • Free Celeritas objects and memory

TODO:

  • Refactor streams so that each state holds a Stream object rather than having to redirect into the Device object and allocate those streams there
  • Eliminate max_streams once we get rid of the stream store.
  • State counter diagnostic should be part of an action rather than hardcoded into the stepper
  • Initializer/generator interface should let us accumulate the number of expected primaries (?)
  • Along-step manager
  • User-supplied callbacks to generate additional actions; core params setup/initialization should be more consolidated

sethrj — Dec 30 '24

I'd be grateful for input on this discussion from a parallelism standpoint (@amandalund) and Geant4 mechanics standpoint (@drbenmorgan). I'd like to have a structure that is compatible with all use cases and targeted at the Geant4 use case. The main questions I have are:

  • Is it too restrictive to have a StateVector where each element corresponds to a stream/CPU thread? That would make it really easy to pass into "parallel reduce" methods at the beginning/end of run to (e.g.) sum energy deposition across threads, or action times, and output them.
  • Where should I put begin_run so that it's after all threads have been allocated? Maybe just on the last thread to call BeginOfRunAction? (But I assume one thread could call "begin event" before another calls "begin of run"... and we don't want to force a synchronization.) Or do we just try to eliminate begin-of-run action? (Currently its primary use is for "lazy" initialization of params that depend on the number of actions. If we "hardcode" the use of such objects so that they're added after user actions, we could get away with this.)
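For the first question, the per-stream layout could look like this sketch (hypothetical names): element i of the vector belongs to stream/CPU thread i, so an end-of-run reduction is just a loop over elements (with an MPI allreduce layered on top if needed).

```cpp
#include <numeric>
#include <vector>

// Hypothetical StateVector: element i belongs to stream/CPU thread i
struct StreamState
{
    double energy_deposition{0};
};

using StateVector = std::vector<StreamState>;

// End-of-run reduction over all streams (an MPI allreduce could follow)
double total_energy_deposition(StateVector const& states)
{
    return std::accumulate(
        states.begin(), states.end(), 0.0, [](double sum, StreamState const& s) {
            return sum + s.energy_deposition;
        });
}
```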

sethrj — Dec 30 '24

> Where should I put begin_run so that it's after all threads have been allocated? Maybe just on the last thread to call BeginOfRunAction? (But I assume one thread could call "begin event" before another calls "begin of run"... and we don't want to force a synchronization.) Or do we just try to eliminate begin-of-run action? (Currently its primary use is for "lazy" initialization of params that depend on the number of actions. If we "hardcode" the use of such objects so that they're added after user actions, we could get away with this.)

One possible option could be to make use of G4StateManager and G4VStateDependent to keep track of the Begin/End of Run/Event (caveat emptor: I'd have to dig into the thread-locality semantics of G4StateManager). In short, concrete subclasses of G4VStateDependent get registered with G4StateManager and are notified every time the Geant4 internal state changes, as defined in G4ApplicationState.hh. By storing the previous state on each call, a concrete G4VStateDependent should be able to reliably detect Begin/EndOf{Run,Event}; see for example G4VisStateDependent.

However, this would change the point at which anything done in Celeritas is done relative to the user actions. For example, the Event processing transitions happen a bit before/after the actions are called:

  • https://gitlab.cern.ch/geant4/geant4/-/blob/master/source/event/src/G4EventManager.cc?ref_type=heads#L103
  • https://gitlab.cern.ch/geant4/geant4/-/blob/master/source/event/src/G4EventManager.cc?ref_type=heads#L344

This is a bit less certain in the RunManager, but it's safe to say the notification happens before the user begin-of-run action and after the user end-of-run action, and one could distinguish worker threads from the main thread here.
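The transition logic could be modeled with this self-contained sketch. It uses stand-in enumerators in place of G4ApplicationState (the real dependent would subclass G4VStateDependent and override Notify): begin-of-run is inferred from an Idle → GeomClosed transition and end-of-run from GeomClosed → Idle, with the previous state stored between calls.

```cpp
// Self-contained model of the G4VStateDependent pattern described above;
// enumerator names mirror (a subset of) G4ApplicationState.
enum class AppState
{
    Idle,
    GeomClosed,
    EventProc
};

class RunBoundaryDetector
{
  public:
    // Mirrors G4VStateDependent::Notify(requestedState): infer run
    // boundaries by comparing against the previously stored state
    void notify(AppState requested)
    {
        if (prev_ == AppState::Idle && requested == AppState::GeomClosed)
            ++begin_runs_;  // before the user BeginOfRunAction
        else if (prev_ == AppState::GeomClosed && requested == AppState::Idle)
            ++end_runs_;  // after the user EndOfRunAction
        prev_ = requested;
    }

    int begin_runs() const { return begin_runs_; }
    int end_runs() const { return end_runs_; }

  private:
    AppState prev_{AppState::Idle};
    int begin_runs_{0};
    int end_runs_{0};
};
```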

I can dig a bit deeper on the above if useful, but wanted to check I'm answering the right question first! Generally, the less Celeritas relies on user actions the better, though user actions are also the most portable mechanism across Geant4 versions, even if a little user setup is needed to add them.

drbenmorgan — Jul 15 '25

I've actually been looking at G4StateManager recently too. The documentation doesn't explain anything about thread locality; however, looking at the G4StateManager implementation, instances are thread-local. So my understanding is that the callback for a given G4VStateDependent instance will be invoked by the same thread that registered the instance.

esseivaju — Jul 15 '25

I recently updated the documentation to try to better describe how Geant4 initialization works.

The places where we currently use begin_run (which takes only the local state and is called locally, at the end of Stepper construction right after the state is allocated) are:

  • OpticalLaunchAction to defer state creation so that optical actions can be added after construction
  • ExtendFromSecondariesAction to "warm up" the async alloc for improved profiling
  • SortTracksAction to check consistency between state and params action count
  • StatusChecker to defer initialization of state sizes until runtime (should use aux data with weak_ptr<ActionRegistry>)
  • ActionDiagnostic (same)

I think what we should do is:

  • BeginOfRunAction on the main thread sets up SharedParams (what we currently do)
  • It also allocates all GPU state(s) into one container. However, at this point we cannot allocate Geant4 objects (e.g., HitProcessor), nor can we store/query thread-local pointers.
  • Geant4 state data should always be "lazily" created, or we could add another hook for states that need thread-local initialization.

Having the unified initialization in setup::problem (called once on main thread) helps:

  • Core params are created
  • Diagnostics are added
  • Additional user actions are added
  • DURING SETUP: we actually construct the state vector, and set up aux data
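Sketched with hypothetical names (not the actual setup::problem signature), that ordering guarantees the action list is frozen before any state exists:

```cpp
#include <functional>
#include <string>
#include <vector>

// Illustrative only: consolidate construction so the action list is frozen
// before any state is created.
struct Problem
{
    std::vector<std::string> actions;  // finalized action names
    std::vector<int> states;           // one entry per stream (stand-in)
};

Problem setup_problem(
    int num_streams,
    std::vector<std::function<std::string()>> const& user_actions)
{
    Problem p;
    // 1. Core params are created, and built-in diagnostics are added
    p.actions = {"along-step", "diagnostic"};
    // 2. Additional user actions are added via callbacks
    for (auto const& make : user_actions)
        p.actions.push_back(make());
    // 3. Only now construct the state vector (and aux data), sized against
    //    the finalized action count
    p.states.resize(num_streams, static_cast<int>(p.actions.size()));
    return p;
}
```

Because the states are built last, "lazy" initialization tricks that wait for the action count (StatusChecker, ActionDiagnostic) would no longer be needed.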

For now let's not worry about synchronization at an event level; perhaps that would be better handled by some MPI-aware task library...

sethrj — Sep 26 '25

Also I just learned about G4Run::Merge. It looks like it's called only from worker threads as they end, and not from the main thread nor in serial mode, so I'm not sure it's useful.

sethrj — Oct 31 '25

@LSchwiebert This is the issue we have for refactoring the usage of the "stepper", so we can hack it to pieces in the meantime

sethrj — Nov 04 '25