Concurrency/Stream Improvements

Open ptheywood opened this issue 4 years ago • 1 comments

Once #379 is merged, there are still improvements that can be made to how streams are used to improve performance, both within a single simulation and when used as part of an ensemble.

Some, but not all, of the potential improvements / outstanding todos:

  • Use non-default streams in more places

    • [ ] RandomManager::resizeDeviceArray / RandomManager::resize
    • [ ] CUDAScanCompaction::zero
    • [ ] mapNewRuntimeVariables
    • [ ] The various CUDAScatter methods which are currently just passed the default stream (0).
  • [ ] Better ways of passing streams around, where the stream belongs to a simulation (or an ensemble?).

  • [ ] Memory Pinning

    • Async memcpys block unless the host memory is pinned.
    • Cannot pin everything, as pinning too much memory can cause systems to lock up (by preventing the OS from paging anything)
  • [ ] _async variants of some methods (e.g. some CUDAScatter methods)

    • Allows these to be used without synchronisation when streams are passed. Fewer syncs are better (where possible), but this should be opt-in (and clear).
    • The non-async methods can just call the _async version and then add a stream sync, so the overhead of maintaining this is minimal.
    • Some of these return values copied back from the device, so they require the sync. In that case, switching to a batch operation to process N reductions concurrently may be required.
  • Expanded Testing

    • [ ] Test(s) for each communication strategy
    • [ ] Make the tests check for more than just performance
    • [ ] More RTC test coverage
    • [ ] Performance test(s) within an ensemble
    • [ ] Attempt to test the concurrency of pre/post processing (e.g. scatter), although this may be difficult to time accurately
  • [ ] More refactoring of stepLayer - it's still a huge method.

    • Possibly use methods in an unnamed namespace to prevent them being called by users.
  • [ ] Per layer timing

    • Additional syncing/events might have a negative impact on performance, and memory requirements are potentially high (one element per layer per step (per simulation in an ensemble)). Timing may also be inaccurate on WDDM devices?
  • [ ] Timing within Ensembles (Logging)

    • Timing of individual parts of individual simulations is less important when part of an ensemble, but might still be useful.
    • It should be made accessible through logging (or as part of the ensemble object?)
  • [ ] Use a dynamic range of per-stream elements, rather than a hard cap of 128. The cap was chosen naively because it is the limit on the number of streams which can execute concurrently, but a model could launch more than 128 individual kernels within a layer; the extra kernels would just be serialised.
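
As a rough illustration of the `_async` and memory pinning items above, here is a minimal sketch using only the CUDA runtime API. The names `zero_async`, `zero` and `zero_two_buffers` are hypothetical and are not part of FLAMEGPU2; the sketch assumes a plain buffer of `unsigned int` rather than any real CUDAScatter or CUDAScanCompaction data structure.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper names for illustration only; not the FLAMEGPU2 API.

// Opt-in asynchronous variant: enqueues the work on `stream` and returns
// without waiting. The caller must synchronise before relying on the result.
cudaError_t zero_async(unsigned int *d_ptr, size_t count, cudaStream_t stream) {
    return cudaMemsetAsync(d_ptr, 0, count * sizeof(unsigned int), stream);
}

// Synchronous variant: just the async call followed by a stream sync.
cudaError_t zero(unsigned int *d_ptr, size_t count, cudaStream_t stream) {
    const cudaError_t err = zero_async(d_ptr, count, stream);
    if (err != cudaSuccess) return err;
    return cudaStreamSynchronize(stream);
}

// Zeroes two device buffers on two non-default streams, copies them back
// into one pinned host buffer, and returns true if every element is zero.
bool zero_two_buffers(size_t count) {
    const size_t bytes = count * sizeof(unsigned int);
    cudaStream_t streams[2] = {nullptr, nullptr};
    unsigned int *d_buf[2] = {nullptr, nullptr};
    unsigned int *h_pinned = nullptr;

    // Pinned (page-locked) host memory, so the async copies below are not
    // staged through a pageable buffer. Pin sparingly: it is a limited resource.
    bool ok = cudaMallocHost(reinterpret_cast<void **>(&h_pinned), 2 * bytes) == cudaSuccess;

    // Create the streams and fill each buffer with non-zero bytes first,
    // so the final check is meaningful.
    for (int i = 0; ok && i < 2; ++i) {
        ok = cudaStreamCreate(&streams[i]) == cudaSuccess
          && cudaMalloc(reinterpret_cast<void **>(&d_buf[i]), bytes) == cudaSuccess
          && cudaMemsetAsync(d_buf[i], 0xFF, bytes, streams[i]) == cudaSuccess;
    }
    // Enqueue the zeroing and the copy-back on both streams, with no sync in between.
    for (int i = 0; ok && i < 2; ++i) {
        ok = zero_async(d_buf[i], count, streams[i]) == cudaSuccess
          && cudaMemcpyAsync(h_pinned + i * count, d_buf[i], bytes,
                             cudaMemcpyDeviceToHost, streams[i]) == cudaSuccess;
    }
    // One sync per stream, only at the point where the host needs the values.
    for (int i = 0; ok && i < 2; ++i) {
        ok = cudaStreamSynchronize(streams[i]) == cudaSuccess;
    }
    for (size_t j = 0; ok && j < 2 * count; ++j) {
        ok = (h_pinned[j] == 0u);
    }

    // Release everything that was successfully created.
    for (int i = 0; i < 2; ++i) {
        if (d_buf[i]) cudaFree(d_buf[i]);
        if (streams[i]) cudaStreamDestroy(streams[i]);
    }
    if (h_pinned) cudaFreeHost(h_pinned);
    return ok;
}
```

Existing callers could keep using `zero` unchanged, while code that already owns a stream could call `zero_async` and defer the sync. Whether the two streams actually overlap depends on the device and the size of the work, so this sketch only checks correctness, not concurrency.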

ptheywood, Feb 17 '21 18:02

See the cineca_concurrency branch for some steps towards this, focussed so far on fixing the regression introduced by automatic IDs.

ptheywood, Jun 23 '21 14:06