Asynchronously copy primaries on push
The primary push mechanism in the stepper takes a Span, resizes a collection, and copies the primaries to the GPU. The copy is issued on a stream, but it usually still executes synchronously because the span points to unpinned memory. This causes a substantial slowdown when Geant4 generates lots of primaries with relatively low energy (see https://github.com/celeritas-project/celeritas/discussions/1941 ).
Maybe we want a double buffer or ring buffer that uses pinned memory with head/tail pointers, copies a chunk at a time (perhaps using cudaLaunchHostFunc after the async copy to update a host pointer to the last successfully copied block), and blocks if head == tail + 1 (buffer full). We could make this class reusable for primaries and hits.
The optical offload will have a similar push mechanism for copying generator distribution data to the GPU.
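To make the head/tail bookkeeping of the proposed ring buffer concrete, here is a minimal host-only sketch in plain C++. Everything in it is hypothetical: `StagingRing`, `try_push`, and `mark_copied` are illustrative names, not existing Celeritas classes or functions. It assumes `head` is the producer's write index, so "full" is written as head + 1 == tail modulo the storage size, which mirrors the condition above with the roles of the two indices swapped. The storage is an ordinary `std::vector`; a real implementation would presumably use pinned host memory and would need atomics or other synchronization, since the completion callback runs on a different thread than the producer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: none of these names exist in Celeritas.
// Fixed-capacity, single-producer staging ring. In a real implementation the
// storage would be pinned host memory, each staged chunk would be sent with an
// asynchronous copy, and `mark_copied` would be called from a host callback
// (e.g. one enqueued with cudaLaunchHostFunc) after that copy completes.
template<class T>
class StagingRing
{
  public:
    // One extra slot is allocated so "full" and "empty" are distinguishable
    explicit StagingRing(std::size_t capacity) : storage_(capacity + 1) {}

    // Number of elements staged but not yet confirmed as copied
    std::size_t size() const
    {
        return (head_ + storage_.size() - tail_) % storage_.size();
    }

    bool empty() const { return head_ == tail_; }
    bool full() const { return (head_ + 1) % storage_.size() == tail_; }

    // Stage up to `count` items and return how many were accepted.
    // A blocking variant would wait here until the callback frees space.
    std::size_t try_push(T const* items, std::size_t count)
    {
        std::size_t accepted = 0;
        while (accepted < count && !full())
        {
            storage_[head_] = items[accepted++];
            head_ = (head_ + 1) % storage_.size();
        }
        return accepted;
    }

    // Release the oldest `count` slots, as the completion callback would
    void mark_copied(std::size_t count)
    {
        assert(count <= size());
        tail_ = (tail_ + count) % storage_.size();
    }

  private:
    std::vector<T> storage_;
    std::size_t head_{0};  // next slot to write
    std::size_t tail_{0};  // oldest slot not yet copied
};
```

Because the class is templated on the element type, the same bookkeeping could in principle serve primaries, hits, and the optical generator distributions; whether a single shared class is the right design is left open here.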