Asynchronously copy primaries on push

Open sethrj opened this issue 3 months ago • 1 comments

The primary push mechanism with the stepper takes a Span, resizes a collection, and copies primaries to the GPU. The copy is issued on a stream but is still usually synchronous, since the span points to unpinned memory. This causes a substantial slowdown when generating lots of relatively low-energy primaries from Geant4 (see https://github.com/celeritas-project/celeritas/discussions/1941 ). Maybe we want a double buffer or ring buffer in pinned memory with head/tail pointers that copies a chunk at a time (perhaps using cudaLaunchHostFunc to update a host pointer to the last successfully copied block after the async copy) and blocks when the buffer is full (head == tail + 1, modulo the buffer size). We could make this class reusable for primaries and hits.
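To make the head/tail bookkeeping concrete, here is a rough host-only sketch of what the index logic might look like. Everything in it is an assumption for illustration: `ChunkRing`, `try_acquire`, and `release_oldest` are hypothetical names, not existing Celeritas classes, and the pinned allocation and the actual copies are left out. It assumes a single producer thread and a single stream whose callbacks complete in order.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical sketch (not Celeritas code): tracks which fixed-size chunks of
// a staging buffer are in flight. One slot is always left unused so that the
// "full" state (head == tail + 1, modulo size) differs from "empty".
class ChunkRing
{
  public:
    explicit ChunkRing(std::size_t num_chunks) : size_(num_chunks)
    {
        assert(size_ >= 2);
    }

    // Full when the oldest in-flight chunk sits right after the write index
    bool full() const
    {
        return head_.load(std::memory_order_acquire) == this->next(tail_);
    }

    // Producer side: claim the next chunk to fill, or return false if full.
    // A blocking variant would wait here until a chunk is released.
    bool try_acquire(std::size_t& chunk)
    {
        if (this->full())
            return false;
        chunk = tail_;
        tail_ = this->next(tail_);
        return true;
    }

    // Completion side: mark the oldest in-flight chunk as copied. This is
    // what a host callback enqueued after the async copy would call.
    void release_oldest()
    {
        std::size_t h = head_.load(std::memory_order_relaxed);
        head_.store(this->next(h), std::memory_order_release);
    }

    // Number of chunks claimed but not yet released (producer thread only)
    std::size_t in_flight() const
    {
        std::size_t h = head_.load(std::memory_order_acquire);
        return (tail_ + size_ - h) % size_;
    }

  private:
    std::size_t next(std::size_t i) const { return (i + 1) % size_; }

    std::size_t size_;
    std::atomic<std::size_t> head_{0};  // oldest chunk still being copied
    std::size_t tail_{0};  // next chunk to fill; touched only by the producer
};
```

In a real version, each successful `try_acquire` would presumably be followed by filling that chunk of a pinned allocation, a `cudaMemcpyAsync` on the stream, and a `cudaLaunchHostFunc` on the same stream whose callback calls `release_oldest`. `head_` is atomic here because that callback runs on a different host thread than the producer.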

Sep 17 '25 11:09 sethrj

The optical offload will have a similar push mechanism for copying generator distribution data to the GPU.

Oct 29 '25 17:10 amandalund