FLAMEGPU2
Concurrency/Stream Improvements
Once #379 is merged, there are still improvements that can be made to the use of streams to improve performance, both within a single simulation and when used as part of an ensemble.
Some (but not all) of the potential improvements / outstanding todos:
- Use non-default streams in more places
  - [ ] `RandomManager::resizeDeviceArray` / `RandomManager::resize`
  - [ ] `CUDAScanCompaction::zero`
  - [ ] `mapNewRuntimeVariables`
  - [ ] the various `CUDAScatter` methods which are currently just passed the default stream (`0`)
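As a sketch of the pattern (hypothetical names and signatures, not the actual FLAMEGPU2 ones), a method like `zero` would take a caller-provided stream and enqueue its work there instead of implicitly using the default stream:

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch: a zero() that accepts a caller-provided stream
// instead of implicitly using the default stream (0). The buffer name
// and size are illustrative, not the real CUDAScanCompaction members.
void zero_async(void *d_buffer, size_t bytes, cudaStream_t stream) {
    // Enqueued on `stream`; does not serialise against work in other streams.
    cudaMemsetAsync(d_buffer, 0, bytes, stream);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    unsigned int *d_scan_flags = nullptr;
    cudaMalloc(&d_scan_flags, 1024 * sizeof(unsigned int));
    zero_async(d_scan_flags, 1024 * sizeof(unsigned int), stream);
    cudaStreamSynchronize(stream);  // the caller decides when to sync
    cudaFree(d_scan_flags);
    cudaStreamDestroy(stream);
    return 0;
}
```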
- [ ] Better ways of passing streams around, where the stream belongs to a simulation (or an ensemble?).
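One possible shape for this (an illustrative sketch only, none of these names exist in the codebase): the simulation owns its streams via RAII and hands out non-owning handles, so helpers never create streams or assume the default one.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical sketch: streams owned by the simulation object for its
// lifetime, handed out as non-owning handles.
class SimulationStreams {
    std::vector<cudaStream_t> streams;
 public:
    explicit SimulationStreams(size_t n) : streams(n) {
        for (auto &s : streams)
            cudaStreamCreate(&s);
    }
    ~SimulationStreams() {
        for (auto &s : streams)
            cudaStreamDestroy(s);
    }
    // Non-owning handle; the index could select a per-agent-function stream.
    cudaStream_t get(size_t i) const { return streams[i % streams.size()]; }
};
```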
- [ ] Memory Pinning
  - Async memcpys block unless the host memory is pinned.
  - Cannot pin everything, as pinning too much memory can cause systems to lock up (by preventing the OS from paging anything).
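A minimal sketch of the trade-off: `cudaMemcpyAsync` only overlaps with compute when the host buffer is page-locked, so pinning would be applied selectively (e.g. to small, frequently-copied staging buffers), not to every host allocation.

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *h_staging = nullptr;
    cudaMallocHost(&h_staging, bytes);  // pinned (page-locked) host memory
    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Truly asynchronous because h_staging is pinned; with pageable
    // memory this call would block the host.
    cudaMemcpyAsync(d_data, h_staging, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_staging);
    return 0;
}
```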
- [ ] `_async` variants of some methods (i.e. some `CUDAScatter` methods)
  - Allows these to be used without synchronisation when streams are passed. Fewer syncs are better (where possible), but this should be opt-in (and clear).
  - The non-async methods can just call the `_async` version and add a stream sync, so there is minimal overhead in maintaining this.
  - Some of these copy return values back, so require the sync. In that case, switching to a batch operation to process N reductions concurrently may be required.
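The wrapper pattern is roughly the following (hypothetical sketch; `scatter_async` stands in for a real `CUDAScatter` method and its signature is illustrative):

```cpp
#include <cuda_runtime.h>

// Hypothetical: enqueues the scatter on `stream` without synchronising.
void scatter_async(const unsigned int *d_in, unsigned int *d_out,
                   unsigned int count, cudaStream_t stream);

// The synchronous method is a thin wrapper over the _async variant plus
// a stream sync, so only one implementation needs maintaining.
void scatter(const unsigned int *d_in, unsigned int *d_out,
             unsigned int count, cudaStream_t stream) {
    scatter_async(d_in, d_out, count, stream);
    cudaStreamSynchronize(stream);  // the only extra cost of the wrapper
}
```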
- Expanded Testing
  - [ ] Test(s) for each communication strategy
  - [ ] Make the tests check for more than just performance
  - [ ] More RTC test coverage
  - [ ] Performance test(s) within an ensemble
  - [ ] Attempt to test the concurrency of pre/post processing (i.e. scatter), although this may be difficult to time accurately
- [ ] More refactoring of `stepLayer` - it's still a huge method.
  - Possibly use methods in an unnamed namespace to prevent them being called by users.
- [ ] Per-layer timing
  - Additional syncing/events might have a negative impact on performance, plus potentially high memory requirements (one element per layer per step, per simulation in an ensemble). May be inaccurate on WDDM devices?
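A sketch of what per-layer timing would look like with `cudaEvent_t` pairs, which also shows where the costs come from: one event pair per layer per step accumulates, and `cudaEventSynchronize` forces a sync that may hurt concurrency.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEventRecord(start, stream);
    // ... launch the layer's kernels on `stream` here ...
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop);  // forces a sync; may reduce overlap
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("layer took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    return 0;
}
```

On WDDM devices the driver batches submissions, which is one reason elapsed times measured this way can be unreliable there.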
- [ ] Timing within Ensembles (Logging)
  - Timing of individual parts of individual simulations is less important when part of an ensemble, but might still be useful.
  - It should be made accessible through logging (or as part of the ensemble object?).
- [ ] Use a dynamic number of per-stream elements, rather than a hard cap of 128. This was naively chosen as it is the limit on the number of streams that can execute concurrently, but models could have more than 128 individual kernels launched within a layer; they would just be serialised.
  - See the `cineca_concurrency` branch for some steps towards this, so far focussed on fixing the regression introduced by automatic IDs.
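One way to remove the hard cap (an illustrative sketch, not the `cineca_concurrency` implementation): grow the stream pool lazily, so more than 128 launches per layer remain correct and are simply serialised by the hardware.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical sketch: a pool that creates streams on demand instead of
// a fixed array of 128.
class StreamPool {
    std::vector<cudaStream_t> streams;
 public:
    // Return the i-th stream, creating streams lazily as needed.
    cudaStream_t get(size_t i) {
        while (streams.size() <= i) {
            cudaStream_t s;
            cudaStreamCreate(&s);
            streams.push_back(s);
        }
        return streams[i];
    }
    ~StreamPool() {
        for (auto &s : streams)
            cudaStreamDestroy(s);
    }
};
```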