
Stella MPI + MPI shared memory approach

Open · DenSto opened this issue 4 years ago · 5 comments

Currently stella's scalability is limited by its parallelisation over velocity space alone, relying on a redistribution between a space-local grid and a velocity-local grid. I think the fastest and most straightforward way to achieve more scalability with what's already in stella is to use a hybrid shared memory approach. While this is typically done with MPI + OpenMP, stella already takes advantage of MPI's shared memory framework using windows and mixed communicators, so I think an MPI + MPI approach would be much faster to get to production.
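For anyone unfamiliar with the MPI-3 shared memory framework, here is a minimal sketch of the kind of communicator splitting this relies on. The names (shm_comm, cross_node_comm) are illustrative only, not the actual ones used in stella's utils/mp.fpp:

```fortran
! Minimal sketch (illustrative names, not stella's): split MPI_COMM_WORLD into a
! node-local shared-memory communicator plus a communicator linking ranks across nodes.
program split_comms_sketch
  use mpi
  implicit none
  integer :: ierr, world_rank, shm_comm, shm_rank, cross_node_comm

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

  ! Group together all ranks that can share memory, i.e. those on the same node.
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, shm_comm, ierr)
  call MPI_Comm_rank(shm_comm, shm_rank, ierr)

  ! Ranks with the same intra-node rank form a communicator spanning the nodes;
  ! the shm_rank == 0 members give a "one rank per node" communicator.
  call MPI_Comm_split(MPI_COMM_WORLD, shm_rank, world_rank, cross_node_comm, ierr)

  call MPI_Finalize(ierr)
end program split_comms_sketch
```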

The idea is this: instead of distributing the velocity grid over the number of available cores, distribute it over the number of available nodes. Then, on a node, most of the operations can be parallelized over naky (and, when the time comes, nakx for the nonlinearity). This would require a number of small modifications in a few subroutines, rather than needing to rewrite array sizes, create new layouts and redistributors, etc...
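As a rough illustration of the node-level work sharing, each shared-memory rank could be assigned a contiguous block of the ky index, along these lines (all names here, naky, shm_rank, shm_size, are hypothetical):

```fortran
! Sketch: divide the ky index range among the ranks on one node, so each
! shared-memory rank updates a contiguous block of ky for every (kx, z, vpa, mu)
! the node owns. Names are illustrative only.
subroutine get_ky_bounds(naky, shm_rank, shm_size, iky_lo, iky_hi)
  implicit none
  integer, intent(in)  :: naky, shm_rank, shm_size
  integer, intent(out) :: iky_lo, iky_hi
  integer :: chunk, remainder

  chunk = naky / shm_size
  remainder = mod(naky, shm_size)
  ! Give the first `remainder` ranks one extra ky each.
  iky_lo = shm_rank * chunk + min(shm_rank, remainder) + 1
  iky_hi = iky_lo + chunk - 1
  if (shm_rank < remainder) iky_hi = iky_hi + 1
end subroutine get_ky_bounds
```

Each rank would then loop over iky = iky_lo, iky_hi when updating the node-shared arrays, with a barrier on the shared-memory communicator before any rank reads outside its own block.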

The question that remains is whether this can be exploited for the velocity-local operations as well, such as the mirror term and collisions. In principle the former should be doable, since mu acts like ky there. For collisions this is less clear, unless the vpa and mu operators are always decoupled (as for the Dougherty operator).
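To make the "mu acts like ky" point concrete, here is a minimal sketch of a mirror-like update in which only the vpa direction is coupled; the derivative is a plain centred difference standing in for whatever discretisation stella actually uses, and all names and bounds are hypothetical:

```fortran
! Sketch only: because the mirror term couples points only along vpa, each
! shared-memory rank can take its own slice [imu_lo, imu_hi] of mu (exactly as it
! would take a slice of ky) and differentiate along the full vpa grid with no
! communication. All names are illustrative.
subroutine mirror_sketch(g, dgdv, naky, nvpa, nmu, dvpa, imu_lo, imu_hi)
  implicit none
  integer, intent(in)    :: naky, nvpa, nmu, imu_lo, imu_hi
  real,    intent(in)    :: dvpa
  real,    intent(in)    :: g(naky, nvpa, nmu)
  real,    intent(inout) :: dgdv(naky, nvpa, nmu)
  integer :: iky, iv, imu

  do imu = imu_lo, imu_hi            ! this rank's slice of mu only
     do iv = 2, nvpa - 1
        do iky = 1, naky
           ! centred difference along vpa; interior points only in this sketch
           dgdv(iky, iv, imu) = (g(iky, iv+1, imu) - g(iky, iv-1, imu)) / (2.0 * dvpa)
        end do
     end do
  end do
end subroutine mirror_sketch
```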

DenSto avatar Sep 25 '21 22:09 DenSto

A good place to get started on this is to look at how the MPI shared memory framework is used in response_matrix.fpp. Additionally, one should get familiar with the MPI communicators created in utils/mp.fpp, particularly comm_shared and comm_node.
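The core pattern there is an MPI-3 shared window: one rank per node allocates an array, and every rank on the node maps the same memory through MPI_Win_shared_query. A minimal sketch of that pattern follows; the sizes, kinds and names are illustrative, and stella's own wrappers in mp.fpp and response_matrix.fpp may differ in detail:

```fortran
! Sketch of the MPI-3 shared-memory window pattern: rank 0 of the node provides the
! memory, every rank on the node gets a Fortran pointer to the same segment.
subroutine allocate_shared_array(comm_shared, n, array, win)
  use mpi
  use, intrinsic :: iso_c_binding, only: c_ptr, c_f_pointer
  implicit none
  integer, intent(in)        :: comm_shared, n
  real, pointer, intent(out) :: array(:)
  integer, intent(out)       :: win
  integer :: shm_rank, ierr, disp_unit
  integer(kind=MPI_ADDRESS_KIND) :: win_size
  type(c_ptr) :: baseptr

  call MPI_Comm_rank(comm_shared, shm_rank, ierr)

  disp_unit = 4                      ! bytes per default real (assumed here)
  win_size = 0
  if (shm_rank == 0) win_size = int(n, MPI_ADDRESS_KIND) * disp_unit

  ! Rank 0 of the node allocates the memory; the others contribute a zero-size segment.
  call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, &
                               comm_shared, baseptr, win, ierr)

  ! Everyone queries rank 0's segment and maps it to a Fortran pointer.
  call MPI_Win_shared_query(win, 0, win_size, disp_unit, baseptr, ierr)
  call c_f_pointer(baseptr, array, [n])
end subroutine allocate_shared_array
```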

DenSto avatar Sep 25 '21 22:09 DenSto

One more consideration: due to the stellarator nature of stella, keep in mind that ky is actually the fast index for the distribution function/field arrays, so parallelizing over ky might not be the best idea. The beauty of the shared memory approach, however, is that one is free to parallelize over any of the spatial indices, wherever convenient, and if one decides to transpose the arrays around the spatial indices (i.e. put ky in the last local index), that can be done without any MPI communications, so no all-to-all communications are needed.
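A minimal sketch of such a node-local transpose, assuming both arrays live in a shared window and each rank owns a block [iky_lo, iky_hi] of the ky index (all names here are illustrative):

```fortran
! Sketch of a node-local transpose inside shared memory: each rank copies its own ky
! block from a (ky, kx, z) ordered shared array into a (kx, z, ky) ordered one, so ky
! becomes the last (slowest) index. No MPI messages are needed, only a node barrier.
subroutine transpose_ky_last(g_in, g_out, naky, nakx, nz, iky_lo, iky_hi, comm_shared)
  use mpi
  implicit none
  integer, intent(in)     :: naky, nakx, nz, iky_lo, iky_hi, comm_shared
  complex, intent(in)     :: g_in(naky, nakx, nz)    ! node-shared, ky fastest
  complex, intent(inout)  :: g_out(nakx, nz, naky)   ! node-shared, ky slowest
  integer :: iky, ikx, iz, ierr

  do iky = iky_lo, iky_hi                            ! this rank's ky block only
     do iz = 1, nz
        do ikx = 1, nakx
           g_out(ikx, iz, iky) = g_in(iky, ikx, iz)
        end do
     end do
  end do

  ! Make sure every rank has finished writing before anyone reads g_out.
  call MPI_Barrier(comm_shared, ierr)
end subroutine transpose_ky_last
```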

DenSto avatar Sep 26 '21 10:09 DenSto

In spatial localisation we are including x, y, and zed?

jfparisi avatar Sep 26 '21 13:09 jfparisi

Yes, that's right.

DenSto avatar Sep 26 '21 19:09 DenSto

Some more thoughts:

  • The magnetic and (linear) ExB drifts are straightforward to locally parallelize.
  • The ExB nonlinearity is a little trickier. I think the most straightforward approach is to just Fourier transform the whole spatial domain at once, instead of slice-by-slice in $\theta$. This larger loop should be easier to chop up efficiently. The disadvantage here is that we now need 3 large arrays to store the real-space data, which, due to the 3/2 dealiasing padding in both x and y, are (3/2)² = 2.25x larger than our current spatial arrays. I haven't had any memory issues with stella yet, so we might be OK...
  • The mirror term is again trivial here, as there should always be enough spatial points to loop over.
  • Parallel streaming is the hard part... not sure exactly what to do with the bidiagonal solve yet. What can be done is to implement a locally parallelized back-substitution for the response matrix (this can be done now, and really should be done soon; see the sketch after this list!).
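On that last point, one possible way to share a single back-substitution across the ranks of a node is column-oriented: one rank finalises x(j), then every rank updates its own block of the remaining right-hand side. The sketch below assumes an upper-triangular matrix and a solution vector both held in node-shared memory; the names, the storage, and the factorisation stella actually uses for the response matrix are not assumed here.

```fortran
! Sketch: column-oriented back-substitution for an upper-triangular system U x = b,
! with the rank-1 update of the remaining right-hand side split across the ranks of
! one node. U and x are assumed to live in node-shared memory; each rank owns the
! row block [irow_lo, irow_hi]. Illustration only.
subroutine shared_back_sub(U, x, n, irow_lo, irow_hi, shm_rank, comm_shared)
  use mpi
  implicit none
  integer, intent(in)    :: n, irow_lo, irow_hi, shm_rank, comm_shared
  complex, intent(in)    :: U(n, n)   ! upper triangular, node-shared
  complex, intent(inout) :: x(n)      ! enters as b, leaves as the solution, node-shared
  integer :: j, i, ierr

  do j = n, 1, -1
     if (shm_rank == 0) x(j) = x(j) / U(j, j)       ! one rank finalises x(j)
     call MPI_Barrier(comm_shared, ierr)            ! everyone needs x(j) before updating

     do i = max(irow_lo, 1), min(irow_hi, j - 1)    ! each rank updates its own rows
        x(i) = x(i) - U(i, j) * x(j)
     end do
     call MPI_Barrier(comm_shared, ierr)            ! updates done before the next column
  end do
end subroutine shared_back_sub
```

The two barriers per column would be the price of sharing one solve; the alternative is simply to hand independent response-matrix solves (e.g. one per ky) to different ranks on the node, which needs no synchronisation at all.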

DenSto avatar Jun 06 '22 01:06 DenSto