
Implement single-track CPU for performance and improve integration

sethrj opened this issue · 6 comments

One of the open questions for our CMS integration is how well the detectors will work if we invert the [track, step] loop to [step, track], as is necessary for GPU. I believe we can, without too much effort, add support for a single-track-slot mode that would give us better CPU performance and better integration characteristics.
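As a minimal sketch of the two loop orderings being discussed (the Track type and do_step function here are placeholders, not Celeritas code):

```cpp
#include <algorithm>
#include <vector>

// Placeholder track state: alive until its remaining steps are exhausted
struct Track
{
    int steps_left = 0;
    bool alive() const { return steps_left > 0; }
};

void do_step(Track& t) { --t.steps_left; }

// [track, step]: traditional CPU ordering; one track's state stays hot in
// cache until that track is finished
void run_track_major(std::vector<Track>& tracks)
{
    for (Track& t : tracks)
    {
        while (t.alive())
        {
            do_step(t);
        }
    }
}

// [step, track]: GPU-style ordering; every "step iteration" sweeps all track
// slots, skipping (masking out) slots whose tracks are already done
void run_step_major(std::vector<Track>& tracks)
{
    auto any_alive = [](std::vector<Track> const& ts) {
        return std::any_of(
            ts.begin(), ts.end(), [](Track const& t) { return t.alive(); });
    };
    while (any_alive(tracks))
    {
        for (Track& t : tracks)
        {
            if (t.alive())
            {
                do_step(t);
            }
        }
    }
}
```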

  1. Add a new memspace for "compact host".
  2. The value type for compact host stores T instead of vector<T>.
  3. CoreStatePtr becomes CoreState<value>* instead of CoreState<reference>* so that we don't have to do additional indirection on each state item (see the storage sketch after this list).
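A rough illustration of items 1 and 2, where MemSpace::compact_host and ItemStorage are hypothetical names for this sketch rather than the actual Celeritas collection machinery:

```cpp
#include <vector>

enum class MemSpace
{
    host,
    device,
    compact_host  // hypothetical new memspace: exactly one track slot
};

// Usual host/device storage: one entry per track slot
template<class T, MemSpace M>
struct ItemStorage
{
    std::vector<T> data;
};

// "Compact host" specialization: a single value stored inline, avoiding the
// per-item heap allocation and the extra indirection on every state access
template<class T>
struct ItemStorage<T, MemSpace::compact_host>
{
    T data;
};
```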

sethrj · Apr 01 '24 17:04

I did some CPU profiling using callgrind/cachegrind with the following setup:

  • System: Perlmutter login node
  • Problem: testem3-orange-field-msc, 1 event, 64 primaries/event
  • Threads: single (OpenMP disabled)
  • Build: CMake RelWithDebInfo, CELERITAS_DEBUG=OFF

The graph below shows the estimated cycles spent in each function, weighted by instruction fetches and by L1 and LL cache misses.
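Roughly, the cost model has the form below; the weights shown are only indicative, following the Valgrind manual's rough guidance of ~10 cycles per L1 miss and up to ~200 per LL miss, not something measured here:

$$\text{estimated cycles} \approx N_\text{fetch} + w_{L1}\,N_{L1\,\text{miss}} + w_{LL}\,N_{LL\,\text{miss}},\qquad w_{L1}\sim 10,\quad w_{LL}\sim 100\text{--}200$$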

[Call graph image: testem3_fm_64p]

I noticed that axpy leads to many instruction cache misses, but that could be because I didn't pass -march/-mtune compiler options.

Looking at the L1 read misses, most of them come from XsCalculator::get calls within XsCalculator::operator().
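A much-simplified sketch of that kind of lookup is below; it is illustrative only (linear interpolation on a uniform log-energy grid), not the actual XsCalculator implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative tabulated cross section: a uniform log-energy grid plus values
struct XsTable
{
    double log_e_min = 0;
    double log_e_max = 1;
    std::vector<double> value;  // at least two tabulated points
};

// Each call makes data-dependent reads of two neighboring table entries (the
// "get"-style accesses). If many other tracks were processed since this table
// was last touched, those cache lines have likely been evicted, which shows
// up as L1 read misses. Assumes energy lies strictly inside the grid.
inline double calc_xs(XsTable const& t, double energy)
{
    double frac = (std::log(energy) - t.log_e_min)
                  / (t.log_e_max - t.log_e_min) * (t.value.size() - 1);
    std::size_t i = static_cast<std::size_t>(frac);
    double f = frac - i;
    return (1 - f) * t.value[i] + f * t.value[i + 1];
}
```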

It'd be interesting to see the cache misses in a multithreaded scenario.

esseivaju · Apr 24 '24 00:04

@esseivaju Is this with one track slot or the usual number (65K)? I guess the reason I wondered about single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot. Since the many-slot case is not really optimal either (in terms of state cache locality and loop iterations skipped due to masking), I wonder whether the call graph would look any different...

sethrj · Apr 24 '24 12:04

This is with 4k track slots.

> single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot,

Do you mean that in the single-thread case, you saw better performance with one track slot?

esseivaju · Apr 24 '24 19:04

OK, 4k track slots is different from our usual regression CPU setting. What does the performance graph look like if you have a single track slot? (Make sure OpenMP is disabled! 😅) Because I would imagine that with a single track slot you'd get better cache performance for the particle state, even though cache performance for the "params" data might go down.

sethrj · Apr 24 '24 19:04

Ok, I have some data with a single track slot. I had to set max_steps=-1, and OpenMP is disabled at build time. Without profiling and just running the regression problem, it takes ~3x longer with one track slot.

[Call graph image: callgrind_estimate_singletrack]

Repeatedly calling ActionSequence::execute has a large overhead because of dynamic_cast and memory deallocation. I haven't located what is being freed, but the free happens exactly 20x per ActionSequence::execute call, so each action is doing it at some point.

Regarding cache efficiency, the single track slot isn't helping that much. Below, I'm showing the L1 cache misses per call to AlongStepUniformMscAction::Execute (aggregated instruction misses plus read/write misses), which is where most cache misses happen.

The first picture is the single-track-slot scenario; the second is 65k track slots. As expected, there are far fewer misses per call since you process one track at a time; however, multiplied by how many times the function has to be called, the total becomes much worse.

[Images: L1 misses per call to AlongStepUniformMscAction::Execute, single track slot (first) and 65k track slots (second)]

In both cases, ~80% of the L1 misses are instruction fetches.

esseivaju · Apr 24 '24 22:04

@esseivaju It looks like the allocation is coming from actions()->label being passed into ScopedProfiling. I'm opening a PR to use string_view for the action labels/descriptions and to delay string allocation in the scoped profiling implementation.
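A rough sketch of that kind of change, with hypothetical stand-ins for the profiling hooks rather than the actual Celeritas code or the PR: take the label as a string_view so callers don't allocate just to pass it, and build a null-terminated std::string only when profiling is actually enabled.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>
#include <string_view>

// Hypothetical stand-ins for the real profiling backend and its enable check
inline bool profiling_enabled()
{
    return std::getenv("ENABLE_PROFILING") != nullptr;
}
inline void begin_profile_range(char const* name)
{
    std::cout << "begin range: " << name << '\n';
}
inline void end_profile_range()
{
    std::cout << "end range\n";
}

class ScopedProfiling
{
  public:
    // string_view avoids constructing a std::string at every call site
    explicit ScopedProfiling(std::string_view label)
    {
        if (profiling_enabled())
        {
            // Allocate only on the profiling-enabled path, not on every
            // action launched by the stepping loop
            std::string name{label};
            begin_profile_range(name.c_str());
            active_ = true;
        }
    }
    ~ScopedProfiling()
    {
        if (active_)
        {
            end_profile_range();
        }
    }

  private:
    bool active_{false};
};
```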

sethrj · Apr 28 '24 13:04