parenchyma
Async + Mitigate host-device memory transfer bottlenecks
An application is only as fast as its slowest part.
Taken from the SO question: mitigate host + device memory transfer bottlenecks in OpenCL/CUDA
There are a couple of things you can try to mitigate the PCIe bottleneck:
- Asynchronous transfers - permit overlapping computation and bulk transfer (a sketch follows this list)
- Mapped memory - allows a kernel to stream data to/from the GPU during execution
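Below is a rough sketch of the first bullet in plain CUDA host code. This is illustrative only, not parenchyma's implementation, and `launch_kernel` is a hypothetical wrapper around a real kernel launch: the input is split into chunks and double-buffered across two streams, so the PCIe copy of one chunk can overlap with compute on the previous one.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical wrapper around a real kernel launch on the given stream.
extern void launch_kernel(float* d_chunk, size_t n, cudaStream_t stream);

int main() {
    const size_t n = 1 << 22;               // total elements
    const int    num_chunks = 4;
    const size_t chunk = n / num_chunks;
    const size_t chunk_bytes = chunk * sizeof(float);

    // Pinned (page-locked) host memory is required for truly asynchronous copies.
    float* h_data = nullptr;
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocDefault);

    float*       d_buf[2];
    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc((void**)&d_buf[i], chunk_bytes);
        cudaStreamCreate(&streams[i]);
    }

    for (int i = 0; i < num_chunks; ++i) {
        cudaStream_t s = streams[i % 2];
        float*       d = d_buf[i % 2];
        float*       h = h_data + i * chunk;

        // These calls only queue work and return immediately, so while one
        // stream computes on chunk i, the other stream can already be copying
        // chunk i + 1 across PCIe.
        cudaMemcpyAsync(d, h, chunk_bytes, cudaMemcpyHostToDevice, s);
        launch_kernel(d, chunk, s);
        cudaMemcpyAsync(h, d, chunk_bytes, cudaMemcpyDeviceToHost, s);
    }

    // Block only once, when all results are actually needed on the host.
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    cudaFreeHost(h_data);
    return 0;
}
```

For the second bullet, CUDA exposes mapped (zero-copy) host memory through `cudaHostAlloc` with the `cudaHostAllocMapped` flag plus `cudaHostGetDevicePointer`, so a kernel can read and write host memory directly while it runs; OpenCL has a comparable path via `CL_MEM_ALLOC_HOST_PTR` and `clEnqueueMapBuffer`.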
Async operations: it looks like most of the time is currently spent waiting for in/out transfers, even on mid-range GPU hardware, so async may help a lot. Async can be implemented by making `transfer_in`/`transfer_out` return an object that can be waited on, with the wait happening only when synchronization is actually required, e.g. CUDA -> Host. `Tensor::get_memory()` could then block until the transfer completes. A rough sketch of this shape follows.
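A minimal sketch of that idea using CUDA events, where the handle returned by the transfer call is the thing you wait on. The names `TransferHandle`, `transfer_in`, and `get_memory` are hypothetical stand-ins that only mirror the proposal, not parenchyma's API, and event cleanup is omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Waitable handle returned by an asynchronous transfer.
class TransferHandle {
public:
    explicit TransferHandle(cudaEvent_t ev) : ev_(ev) {}
    // Blocks the calling thread until the recorded transfer has completed.
    void wait() const { cudaEventSynchronize(ev_); }
private:
    cudaEvent_t ev_;
};

// Enqueues the host-to-device copy and returns immediately with a handle.
// (h_src should be pinned memory for the copy to be truly asynchronous.)
TransferHandle transfer_in(float* d_dst, const float* h_src, size_t bytes,
                           cudaStream_t stream) {
    cudaMemcpyAsync(d_dst, h_src, bytes, cudaMemcpyHostToDevice, stream);
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, stream);
    return TransferHandle(ev);
}

// The analogue of Tensor::get_memory(): synchronize lazily, only on access.
float* get_memory(float* d_ptr, const TransferHandle& pending) {
    pending.wait();
    return d_ptr;
}
```

The point of the design is that the caller pays for synchronization only at the access site, so unrelated host work and further enqueued device work can proceed between `transfer_in` and the eventual wait.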
@drahnr's point on the OpenCL implementation, via Gitter:
The way it is currently implemented lacks either a `clFinish` call or a wait on the last event in the chain (forward propagation).
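At the OpenCL host-API level, that fix would look roughly like the following. This is an illustrative sketch, not the actual parenchyma backend; the function and variable names are hypothetical.

```cpp
#include <CL/cl.h>
#include <cstddef>

// Runs the last kernel of the forward pass and reads back the result.
// Both enqueues are asynchronous, so the host must either wait on the last
// event in the chain or drain the whole queue with clFinish.
void finish_forward_pass(cl_command_queue queue, cl_kernel last_kernel,
                         cl_mem d_out, void* h_out, size_t bytes,
                         size_t global_size) {
    cl_event kernel_done, read_done;

    clEnqueueNDRangeKernel(queue, last_kernel, 1, NULL, &global_size, NULL,
                           0, NULL, &kernel_done);

    // Non-blocking read that depends on the kernel's completion event.
    clEnqueueReadBuffer(queue, d_out, CL_FALSE, 0, bytes, h_out,
                        1, &kernel_done, &read_done);

    // Option 1: wait only for the last event in the chain.
    clWaitForEvents(1, &read_done);

    // Option 2 (coarser): clFinish(queue); blocks until every command in the
    // queue has completed.

    clReleaseEvent(kernel_done);
    clReleaseEvent(read_done);
}
```

Either call gives the host a well-defined point at which the forward-propagation results are actually valid; without one, reading or timing results right after the enqueue is meaningless.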