Asynchronous execution

Open tmcdonell opened this issue 12 years ago • 4 comments

@rrnewton notes in #48 that the current (driver default) behaviour is to spin when waiting for GPU operations to complete, which is not friendly towards other Haskell threads that want to do useful work. We should change this to something that is gentler with CPU resources (CU_CTX_SCHED_BLOCKING_SYNC).
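
As a rough illustration of the proposed change, here is a minimal sketch that requests blocking synchronisation through the CUDA runtime API. The runtime flag `cudaDeviceScheduleBlockingSync` is the counterpart of the driver's `CU_CTX_SCHED_BLOCKING_SYNC`. The helper name `use_blocking_sync` is made up for this example, and this is plain CUDA C++, not what Accelerate does through its Haskell bindings.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper (not part of Accelerate or CUDA): select a device and
// ask the runtime to block the waiting host thread on a synchronisation
// primitive instead of spinning. Returns 0 on success, 1 on failure.
int use_blocking_sync(int device)
{
    // Select the device first so the flag applies to the intended device.
    cudaError_t err = cudaSetDevice(device);

    // Set the scheduling flag before launching work or allocating memory.
    // Assumption: older toolkits may reject this once the device has been
    // initialised, so call this helper as early as possible.
    if (err == cudaSuccess)
        err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

With the flag set, calls such as `cudaDeviceSynchronize()` or `cudaStreamSynchronize()` should let the OS deschedule the waiting thread, at the cost of somewhat higher wake-up latency than spinning.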

Tangentially related to #13.

tmcdonell, May 25 '12 02:05

Asynchronous execution entails using non-default stream(s) and waiting on events to express dependencies.

With support for streams and events, we should also (correctly) support asynchronous memory transfer. Two further points apply:

  • The host memory must be pinned (page-locked), so the CUDA driver can DMA directly from it. Currently Accelerate (base) allocates in pageable memory that is pinned only with respect to the Haskell RTS's GC, so internally the CUDA driver must first copy the data to a pinned staging region before performing the DMA.
  • If data transfers and kernels are issued in distinct non-default streams, they will also overlap on all devices which support the feature (almost all compute capability 1.1 and later devices).
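
The points above can be sketched in plain CUDA C++ as follows. This is an illustration under stated assumptions, not Accelerate's implementation: the kernel `scale`, the function `overlap_demo`, the `CHECK` macro and the chunk sizes are all invented for the example. Uploads go through one non-default stream from pinned host memory, kernels run in a second stream, and an event per chunk expresses the dependency between them.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical error-check macro; on failure it returns early without
// releasing resources, which is acceptable only in a sketch.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Illustrative kernel: multiply every element by a.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Returns 0 if every element was uploaded, scaled and downloaded correctly.
int overlap_demo(int n_chunks, int chunk)
{
    const size_t chunk_bytes = size_t(chunk) * sizeof(float);
    const size_t total = size_t(n_chunks) * chunk;
    float *h = nullptr, *d = nullptr;

    // Pinned host memory lets the driver DMA without a staging copy.
    CHECK(cudaMallocHost(&h, total * sizeof(float)));
    CHECK(cudaMalloc(&d, total * sizeof(float)));
    for (size_t i = 0; i < total; ++i) h[i] = 1.0f;

    cudaStream_t copy_stream, exec_stream;
    CHECK(cudaStreamCreate(&copy_stream));
    CHECK(cudaStreamCreate(&exec_stream));
    std::vector<cudaEvent_t> uploaded(n_chunks);

    for (int c = 0; c < n_chunks; ++c) {
        float *hc = h + size_t(c) * chunk;
        float *dc = d + size_t(c) * chunk;
        CHECK(cudaEventCreateWithFlags(&uploaded[c], cudaEventDisableTiming));

        // Upload chunk c in the copy stream and mark its completion.
        CHECK(cudaMemcpyAsync(dc, hc, chunk_bytes,
                              cudaMemcpyHostToDevice, copy_stream));
        CHECK(cudaEventRecord(uploaded[c], copy_stream));

        // The kernel for chunk c waits only for its own upload, so the
        // upload of chunk c+1 can overlap with this kernel.
        CHECK(cudaStreamWaitEvent(exec_stream, uploaded[c], 0));
        scale<<<(chunk + 255) / 256, 256, 0, exec_stream>>>(dc, chunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hc, dc, chunk_bytes,
                              cudaMemcpyDeviceToHost, exec_stream));
    }

    // exec_stream depends on every upload, so one wait covers both streams.
    CHECK(cudaStreamSynchronize(exec_stream));

    int bad = 0;
    for (size_t i = 0; i < total; ++i)
        if (h[i] != 2.0f) ++bad;

    for (int c = 0; c < n_chunks; ++c) CHECK(cudaEventDestroy(uploaded[c]));
    CHECK(cudaStreamDestroy(copy_stream));
    CHECK(cudaStreamDestroy(exec_stream));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return bad == 0 ? 0 : 1;
}
```

For example, `overlap_demo(4, 1 << 20)` processes four chunks of about a million floats each. The downloads here share the kernel stream for simplicity; a third stream would be needed to overlap them with compute as well.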

tmcdonell, Aug 21 '13 00:08

See also:

  • https://developer.nvidia.com/content/how-optimize-data-transfers-cuda-cc
  • https://developer.nvidia.com/content/how-overlap-data-transfers-cuda-cc

tmcdonell, Aug 21 '13 00:08

Note: this issue is further discussed in June/July 2014 on the accelerate mailing list here.

robstewart57, Jul 07 '14 14:07

This is all possible now, just not exposed very nicely yet. See this profiler output, where compute and data transfer overlap nicely, with full-speed DMA to pinned memory:

[screenshot 2016-02-10 15 55 31]

Note this example too, however, where the CUDA pinned memory allocator (a) is not concurrent, and (b) can be teeeerribly slow:

[screenshot 2016-02-10 15 56 31]

So we may want to do a nursery-style caching allocator. These screenshots are from different machines, and the latter is a 2-GPU box, so may have further strangeness going on...
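
One possible reading of a nursery-style caching allocator is sketched below in CUDA C++: released pinned blocks are kept in a size-keyed free list and handed back on later requests, so the slow `cudaMallocHost` path is hit only on a cache miss. The class name `PinnedCache`, its methods and the 4096-byte rounding granule are assumptions made for this example, not Accelerate's design.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <map>
#include <mutex>

// Hypothetical cache of pinned host blocks, keyed by rounded-up size.
class PinnedCache {
public:
    // Return all cached blocks to the driver. Blocks still held by callers
    // are not tracked here and must be released before destruction.
    ~PinnedCache()
    {
        for (auto &entry : free_) cudaFreeHost(entry.second);
    }

    // Return a pinned block of at least `bytes` bytes, or nullptr on failure.
    void *alloc(std::size_t bytes)
    {
        const std::size_t size = round_up(bytes);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = free_.find(size);
            if (it != free_.end()) {       // cache hit: reuse a block
                void *p = it->second;
                free_.erase(it);
                return p;
            }
        }
        // Cache miss: take the slow path outside the lock, so other
        // threads can still be served from the cache in the meantime.
        void *p = nullptr;
        if (cudaMallocHost(&p, size) != cudaSuccess) return nullptr;
        return p;
    }

    // Hand a block back to the cache instead of freeing it. `bytes` must
    // be the size that was originally passed to alloc().
    void release(void *p, std::size_t bytes)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.emplace(round_up(bytes), p);
    }

private:
    // Round sizes up to a fixed granule so nearby sizes share blocks.
    static std::size_t round_up(std::size_t bytes)
    {
        const std::size_t granule = 4096;  // assumed, not tuned
        if (bytes == 0) return granule;
        return (bytes + granule - 1) / granule * granule;
    }

    std::mutex mutex_;
    std::multimap<std::size_t, void *> free_;
};
```

A request for 1000 bytes followed by `release` and then a request for 2000 bytes would reuse the same block, since both round up to one granule. The cache never shrinks in this sketch; a real allocator would probably need an eviction policy, and the multi-GPU behaviour noted above is not addressed.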

tmcdonell, Feb 10 '16 21:02