Asynchronous mirror term
Currently, stella uses a sequential approach to operator splitting, where each operator partially advances $g_s$ and passes the result to the next operator in the sequence. This isn't the only way to do operator splitting, though: it can also be done in parallel. Consider a set of operators $\Gamma_i$ and the distribution function $g_s^n$ at timestep $n$. In the parallel approach, every operator acts on the distribution function from the previous timestep,
$$\tilde{g}_s^{n,i} = \Gamma_i g_s^n.$$
We then calculate the increments,
$$\Delta g_s^{n,i} = \tilde{g}_s^{n,i} - g_s^n,$$
and so the distribution function at the next timestep is $g_s^{n+1} = g_s^{n} +\sum_i \Delta g_s^{n,i}$. This approach is still first order accurate, so it should perform similarly to what we're already doing.
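As a minimal sketch of the bookkeeping (stella is Fortran, but Python is used here for brevity; the `operators` list of callables and the array `g_n` are hypothetical names, not stella's actual interfaces):

```python
import numpy as np

def additive_split_step(g_n, operators):
    """Advance one timestep with parallel (additive) operator splitting.

    g_n       : distribution function at timestep n (numpy array)
    operators : list of callables Gamma_i, each mapping g^n to the
                partially advanced field tilde{g}^{n,i}
    """
    # Every operator acts on the *same* g^n, so the applications are independent
    g_tilde = [Gamma_i(g_n) for Gamma_i in operators]

    # Increments: Delta g^{n,i} = tilde{g}^{n,i} - g^n
    increments = [gt - g_n for gt in g_tilde]

    # g^{n+1} = g^n + sum_i Delta g^{n,i}
    return g_n + sum(increments)
```

Because each $\Gamma_i$ only ever sees $g_s^n$, the order in which the operators are evaluated doesn't matter, which is what opens the door to overlapping their communication and computation.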
The advantage of this approach is that, since the operators don't depend on each other, things can happen asynchronously: communication for one operator can be overlapped with computation for another. The idea for a single stella timestep would be as follows (a sketch of the overlap follows the list):
1. Asynchronously start the all-to-all redistribution of $g_s^n$ needed for the mirror term.
2. Calculate the explicit terms.
3. Collect $g_s^n$ in local velocity space (sent in step 1) and calculate the mirror term.
4. Asynchronously send the result of step 3 back (all-to-all redistribution).
5. Calculate parallel streaming.
6. Collect the information transmitted in step 4.
7. Calculate the increments and thus $g_s^{n+1}$.
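To illustrate the intended communication/computation overlap, here is a minimal mpi4py sketch of steps 1-3 using a non-blocking `Ialltoall`; the buffer layout, block size, and the `explicit_terms`/`mirror_term` functions are placeholders, not stella's actual data structures or routines:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
nproc = comm.Get_size()

# Hypothetical per-rank block size for the velocity-space redistribution
block = 1024
g_send = np.ones(nproc * block, dtype=np.complex128)  # stands in for g^n, mirror-term layout
g_recv = np.empty_like(g_send)

def explicit_terms(g_local):
    # Placeholder for the explicit advance; needs no redistributed data
    return 0.5 * g_local

def mirror_term(g_vspace):
    # Placeholder for the mirror advance in local velocity space
    return 0.1 * g_vspace

# Step 1: start the all-to-all redistribution of g^n without blocking
req = comm.Ialltoall(g_send, g_recv)

# Step 2: compute the explicit terms while the redistribution is in flight
dg_explicit = explicit_terms(g_send)

# Step 3: wait for the redistributed g^n to arrive, then advance the mirror term
req.Wait()
dg_mirror = mirror_term(g_recv)
```

The key point is that the explicit terms only need data already resident on the rank, so they can be computed while the redistribution is in flight; the same pattern would apply to steps 4-6 for parallel streaming.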
If this is done correctly, one shouldn't have to wait around for the all-to-all to complete, which is currently one of the bottlenecks of the code. There are two disadvantages, though: the increments now have to be stored, which modestly increases memory requirements, and the flip-flopping of the operator order for second-order accuracy can no longer be used (though this isn't the default behaviour anyway).
I think this approach, along with the shared-memory improvements mentioned in #23, would allow stella to scale to hundreds of thousands of cores. At the very least, it would be interesting to see whether this could mitigate the all-to-all bottleneck.