Optimize track vector data layout for particle types
I started sketching out a possible implementation of #1312 (still a WIP). This partitiions the last num_new_tracks initializers by charged/neutral and initializes neutral and charged tracks at opposite ends of the track vector.
For now I've disabled the immediate initialization of. a secondary in the parent's track slot (since that results in quite a bit more mixing) and the copying of the parent's geometry state (allowing this would require partitioning the parent IDs as well, which we could do). FWIW I didn't see a ton of speedup from those geometry optimizations anyway (maybe a few percent at most, in cms2018).
I might test this today... also the test segfaults are caught by clang asan:
[ RUN ] PartitionDataTest.init_primaries_host
=================================================================
==2683==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7f277607cc58 at pc 0x00000038c3b0 bp 0x7ffc76ad7e70 sp 0x7ffc76ad7e68
READ of size 8 at 0x7f277607cc58 thread T0
@sethrj @esseivaju I think this is ready for a look and some testing/profiling now. I haven't done much myself yet, but in some quick initial comparisons saw ~25% improvement on testem3 and ~5% on cms2018 with celer-sim. With merge_events off I saw only a 15% improvement with testem3 and a performance degradation with cms; I haven't looked at the performance of any of the Geant4-integrated apps yet.
Seems to be barely any bump in performance for Frontier/AMD. @esseivaju if you run the
v0.5.0-dev.partitioned branch I'll generate the equivalent plot for perlmutter for the whole run of problems.
I'll try to run this tomorrow
Copying the output from the slack channel. We're obviously not yet copying the track sort parameter into the GPU setup for Celeritas. It's very odd that the CPU "sort" performance dips in a couple of cases.
@amandalund I think this is good to go aside from the minor comments and the conflict 😄 I know it's probably been lost in the pile of pull requests 😅
Thanks @sethrj! I'd modified this to try to reuse the parent's geo state when possible for improved performance, but got sidetracked when I added an assertion that the parent's position match the initializer position and several unit tests started failing. Should be able to wrap this up pretty quickly now that those fixes are in.
Alright, I've updated this so we're now partitioning an array of indices (rather than track initializers) which we can also use to access the parent track slot IDs to copy the geometry state. We'll still have to reinitialize the geometry more often than we would when not sorting (we're not doing the in-place initialization, so we can't reuse the state if the track was killed but produced secondaries), but hopefully it should still help some. @esseivaju if you wouldn't mind rerunning the regression problems with the latest changes and @sethrj updating the performance plot again I'd appreciate it!
Rerunning frontier now...
Frontier's seeing a solid 10% speedup with this 😄 nice work!
Perlmutter's also seeing similar speedups for testem3. The Run 2 geometry minus MSC shows that same anomaly...
@esseivaju did you get a chance to re-run develop for the CMS/run2/no-msc case? It would be good to get this in but I'd like to understand the performance regression first.
I also ran it actually (on an A100 + AMD EPYC 7532):
Weird...
It's possible the worse performance for cms+msc could still be the penalty of having to reinitialize the geometry state more often than on develop...
That makes a lot of sense, good thought @amandalund . The cost of a step iteration without MSC or field is relatively low compared to with MSC and field, and the cost of initialization with vecgeom/CMS is much higher than testem3. We could potentially do a better job of serializing the geometry state for fast reconstruction by storing an additional "unique volume ID" (aka VecGeom navindex, or expanded ORANGE volume ID) on the track and defining a new initializer that reconstructs the logical geometry state more efficiently than searching based on the point.
Yeah it definitely seems worthwhile to explore ways of speeding up that initialization.
Sorry @sethrj I only just realized that M means !MSC 😅. Maybe we should invert that?
There's supposed to be a tilde on top; matplotlib didn't render it correctly. Since the results aren't correct without msc I figured it was better to start denoting "not included" rather than "accurate"...
It does seem better to better to mark that some problems are missing key physics than the other way around... though on the other hand $F$ for field and $M$ for msc is much more obvious when glancing at a plot than $F$ for field and $\tilde{M}$ for no msc.
I'm cool with changing this. Maybe better (if we want to go crazy) would be to replace the "no MSC" mode with full coulomb scattering? I have no idea if that'll increase our run times by 1000x though...