
Async + Mitigate host-device memory transfer bottlenecks

Open jonysy opened this issue 8 years ago • 3 comments

An application is only as fast as its slowest part.

Taken from the SO question: mitigate host + device memory transfer bottlenecks in OpenCL/CUDA

There are a couple things you can try to mitigate the PCIe bottleneck:

  • Asynchronous transfers - permits overlapping computation and bulk transfer
  • Mapped memory - allows a kernel to stream data to/from the GPU during execution

Full answer.

jonysy, Feb 15 '17 19:02

L48:

Async operations: it looks like most time is currently spent waiting for in/out transfers, even on mid-range GPU hardware, so async may help a lot. Async can be implemented by having transfer_in/transfer_out return an object that can be waited on until the transfer completes when sync is required, e.g. CUDA -> Host. Tensor::get_memory() could block until the transfer completes.
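A rough sketch of that idea, to make the shape of the API concrete. Everything here is hypothetical: the `Transfer` handle, the `from_pending` constructor, and the signatures of `transfer_out` and `get_memory` are assumptions for illustration, not parenchyma's actual API, and a background thread stands in for a real non-blocking device-to-host copy.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical handle returned by an asynchronous transfer.
pub struct Transfer {
    done: Receiver<Vec<f32>>,
}

impl Transfer {
    /// Blocks until the transfer completes, then returns the host-side copy.
    pub fn wait(self) -> Vec<f32> {
        self.done
            .recv()
            .expect("transfer thread exited without sending data")
    }
}

/// Stand-in for `transfer_out`: starts the copy and returns immediately.
pub fn transfer_out(device_data: Vec<f32>) -> Transfer {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Assumption: a real backend would enqueue a non-blocking
        // device-to-host copy here instead of moving a Vec between threads.
        let _ = tx.send(device_data);
    });
    Transfer { done: rx }
}

/// Hypothetical tensor that may have a transfer still in flight.
pub struct Tensor {
    pending: Option<Transfer>,
    host: Vec<f32>,
}

impl Tensor {
    /// Wraps a transfer that has been started but not yet waited on.
    pub fn from_pending(transfer: Transfer) -> Tensor {
        Tensor {
            pending: Some(transfer),
            host: Vec::new(),
        }
    }

    /// Blocks only if a transfer is still pending, then exposes host memory.
    pub fn get_memory(&mut self) -> &[f32] {
        if let Some(transfer) = self.pending.take() {
            self.host = transfer.wait();
        }
        &self.host
    }
}
```

With this shape, the caller can keep issuing work after `transfer_out` returns and only pays for synchronization at the first `get_memory()` call that actually needs the data.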

jonysy, Mar 06 '17 01:03

@drahnr's point on OpenCL implementation via Gitter:

The way it is currently implemented lacks either a clFinish call or a wait on the last event in the chain (forward propagation).
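To illustrate the point, here is a small sketch of an in-order command queue in plain Rust. It does not use the OpenCL API: `Queue`, `Event`, `enqueue`, `finish`, and `forward` are all hypothetical names, and a worker thread stands in for the device. The part that matters is that results are read only after waiting on the last event in the chain (or after a `finish`, the analogue of clFinish).

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

type Command = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical completion event for one enqueued command.
pub struct Event(Receiver<()>);

impl Event {
    /// Blocks until the associated command has run.
    pub fn wait(self) {
        let _ = self.0.recv();
    }
}

/// Hypothetical in-order queue; a worker thread plays the role of the device.
pub struct Queue {
    commands: Sender<(Command, Sender<()>)>,
}

impl Queue {
    pub fn new() -> Queue {
        let (tx, rx) = mpsc::channel::<(Command, Sender<()>)>();
        thread::spawn(move || {
            // Commands run strictly in submission order.
            for (command, done) in rx {
                command();
                let _ = done.send(());
            }
        });
        Queue { commands: tx }
    }

    /// Submits a command and returns immediately with its event.
    pub fn enqueue<F: FnOnce() + Send + 'static>(&self, f: F) -> Event {
        let (done_tx, done_rx) = mpsc::channel();
        self.commands
            .send((Box::new(f), done_tx))
            .expect("worker thread has exited");
        Event(done_rx)
    }

    /// Analogue of clFinish: blocks until everything submitted so far has run.
    pub fn finish(&self) {
        self.enqueue(|| {}).wait();
    }
}

/// Two chained "layers"; the result is read only after the last event fires.
pub fn forward(queue: &Queue, input: Vec<f32>) -> Vec<f32> {
    let buffer = Arc::new(Mutex::new(input));

    let b = Arc::clone(&buffer);
    let _first = queue.enqueue(move || {
        let mut data = b.lock().unwrap();
        for x in data.iter_mut() {
            *x *= 2.0;
        }
    });

    let b = Arc::clone(&buffer);
    let last = queue.enqueue(move || {
        let mut data = b.lock().unwrap();
        for x in data.iter_mut() {
            *x += 1.0;
        }
    });

    // Without this wait, the read below could observe a partial result.
    last.wait();
    let result = buffer.lock().unwrap().clone();
    result
}
```

Because the queue is in-order, waiting on the last event is enough to know every earlier command in the chain has completed; an out-of-order queue would need explicit dependencies between events.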

jonysy, Mar 24 '17 18:03