celeritas Optimize track vector data layout for particle types

I started sketching out a possible implementation of #1312 (still a WIP). This partitiions the last num_new_tracks initializers by charged/neutral and initializes neutral and charged tracks at opposite ends of the track vector.

For now I've disabled the immediate initialization of. a secondary in the parent's track slot (since that results in quite a bit more mixing) and the copying of the parent's geometry state (allowing this would require partitioning the parent IDs as well, which we could do). FWIW I didn't see a ton of speedup from those geometry optimizations anyway (maybe a few percent at most, in cms2018).

Jul 14 '24 23:07 amandalund

I might test this today... also the test segfaults are caught by clang asan:

[ RUN      ] PartitionDataTest.init_primaries_host
=================================================================
==2683==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7f277607cc58 at pc 0x00000038c3b0 bp 0x7ffc76ad7e70 sp 0x7ffc76ad7e68
READ of size 8 at 0x7f277607cc58 thread T0

Jul 26 '24 16:07 sethrj

@sethrj @esseivaju I think this is ready for a look and some testing/profiling now. I haven't done much myself yet, but in some quick initial comparisons saw ~25% improvement on testem3 and ~5% on cms2018 with celer-sim. With merge_events off I saw only a 15% improvement with testem3 and a performance degradation with cms; I haven't looked at the performance of any of the Geant4-integrated apps yet.

Aug 05 '24 22:08 amandalund

throughput Seems to be barely any bump in performance for Frontier/AMD. @esseivaju if you run the v0.5.0-dev.partitioned branch I'll generate the equivalent plot for perlmutter for the whole run of problems.

Aug 06 '24 12:08 sethrj

I'll try to run this tomorrow

Aug 07 '24 02:08 esseivaju

rel-throughput

Copying the output from the slack channel. We're obviously not yet copying the track sort parameter into the GPU setup for Celeritas. It's very odd that the CPU "sort" performance dips in a couple of cases.

Aug 12 '24 19:08 sethrj

@amandalund I think this is good to go aside from the minor comments and the conflict 😄 I know it's probably been lost in the pile of pull requests 😅

Aug 16 '24 18:08 sethrj

Thanks @sethrj! I'd modified this to try to reuse the parent's geo state when possible for improved performance, but got sidetracked when I added an assertion that the parent's position match the initializer position and several unit tests started failing. Should be able to wrap this up pretty quickly now that those fixes are in.

Aug 16 '24 18:08 amandalund

Alright, I've updated this so we're now partitioning an array of indices (rather than track initializers) which we can also use to access the parent track slot IDs to copy the geometry state. We'll still have to reinitialize the geometry more often than we would when not sorting (we're not doing the in-place initialization, so we can't reuse the state if the track was killed but produced secondaries), but hopefully it should still help some. @esseivaju if you wouldn't mind rerunning the regression problems with the latest changes and @sethrj updating the performance plot again I'd appreciate it!

Aug 17 '24 04:08 amandalund

Rerunning frontier now...

Aug 17 '24 19:08 sethrj

rel-throughput Frontier's seeing a solid 10% speedup with this 😄 nice work!

Aug 19 '24 16:08 sethrj

rel-throughput Perlmutter's also seeing similar speedups for testem3. The Run 2 geometry minus MSC shows that same anomaly...

Aug 19 '24 21:08 sethrj

@esseivaju did you get a chance to re-run develop for the CMS/run2/no-msc case? It would be good to get this in but I'd like to understand the performance regression first.

Aug 28 '24 20:08 sethrj

I also ran it actually (on an A100 + AMD EPYC 7532): rel-throughput

Aug 28 '24 20:08 amandalund

Weird...

Aug 28 '24 20:08 sethrj

It's possible the worse performance for cms+msc could still be the penalty of having to reinitialize the geometry state more often than on develop...

Aug 28 '24 20:08 amandalund

That makes a lot of sense, good thought @amandalund . The cost of a step iteration without MSC or field is relatively low compared to with MSC and field, and the cost of initialization with vecgeom/CMS is much higher than testem3. We could potentially do a better job of serializing the geometry state for fast reconstruction by storing an additional "unique volume ID" (aka VecGeom navindex, or expanded ORANGE volume ID) on the track and defining a new initializer that reconstructs the logical geometry state more efficiently than searching based on the point.

Aug 29 '24 12:08 sethrj

Yeah it definitely seems worthwhile to explore ways of speeding up that initialization.

Aug 29 '24 13:08 amandalund

Sorry @sethrj I only just realized that M means !MSC 😅. Maybe we should invert that?

Aug 31 '24 23:08 amandalund

There's supposed to be a tilde on top; matplotlib didn't render it correctly. Since the results aren't correct without msc I figured it was better to start denoting "not included" rather than "accurate"...

Sep 01 '24 02:09 sethrj

It does seem better to better to mark that some problems are missing key physics than the other way around... though on the other hand $F$ for field and $M$ for msc is much more obvious when glancing at a plot than $F$ for field and $\tilde{M}$ for no msc.

Sep 01 '24 23:09 amandalund

I'm cool with changing this. Maybe better (if we want to go crazy) would be to replace the "no MSC" mode with full coulomb scattering? I have no idea if that'll increase our run times by 1000x though...

Sep 03 '24 12:09 sethrj