parenchyma
Async + Mitigate host-device memory transfer bottlenecks
An application is only as fast as its slowest part.
Taken from the SO question: mitigate host + device memory transfer bottlenecks in OpenCL/CUDA
There are a couple of things you can try to mitigate the PCIe bottleneck:
- Asynchronous transfers - permit overlapping computation and bulk transfer (a sketch follows this list)
- Mapped memory - allows a kernel to stream data to/from the GPU during execution
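Below is a rough sketch of the first bullet in plain CUDA host code. This is illustrative only, not parenchyma's implementation, and `launch_kernel` is a hypothetical wrapper around a real kernel launch: the input is split into chunks and double-buffered across two streams, so the PCIe copy of one chunk can overlap with compute on the previous one.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical wrapper around a real kernel launch on the given stream.
extern void launch_kernel(float* d_chunk, size_t n, cudaStream_t stream);

int main() {
    const size_t n = 1 << 22;               // total elements
    const int    num_chunks = 4;
    const size_t chunk = n / num_chunks;
    const size_t chunk_bytes = chunk * sizeof(float);

    // Pinned (page-locked) host memory is required for truly asynchronous copies.
    float* h_data = nullptr;
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocDefault);

    float*       d_buf[2];
    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc((void**)&d_buf[i], chunk_bytes);
        cudaStreamCreate(&streams[i]);
    }

    for (int i = 0; i < num_chunks; ++i) {
        cudaStream_t s = streams[i % 2];
        float*       d = d_buf[i % 2];
        float*       h = h_data + i * chunk;

        // These calls only queue work and return immediately, so while one
        // stream computes on chunk i, the other stream can already be copying
        // chunk i + 1 across PCIe.
        cudaMemcpyAsync(d, h, chunk_bytes, cudaMemcpyHostToDevice, s);
        launch_kernel(d, chunk, s);
        cudaMemcpyAsync(h, d, chunk_bytes, cudaMemcpyDeviceToHost, s);
    }

    // Block only once, when all results are actually needed on the host.
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaFree(d_buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    cudaFreeHost(h_data);
    return 0;
}
```

For the second bullet, CUDA exposes mapped (zero-copy) host memory through `cudaHostAlloc` with the `cudaHostAllocMapped` flag plus `cudaHostGetDevicePointer`, so a kernel can read and write host memory directly while it runs; OpenCL has a comparable path via `CL_MEM_ALLOC_HOST_PTR` and `clEnqueueMapBuffer`.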
Async operations: it looks like most of the time is currently spent waiting for in/out transfers, even on mid-range GPU hardware, so async may help a lot. Async can be implemented by making `transfer_in`/`transfer_out` return an object that can be waited on, with the wait happening only when synchronization is actually required, e.g. CUDA -> Host. `Tensor::get_memory()` could then block until the transfer completes. A rough sketch of this shape follows.
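A minimal sketch of that idea using CUDA events, where the handle returned by the transfer call is the thing you wait on. The names `TransferHandle`, `transfer_in`, and `get_memory` are hypothetical stand-ins that only mirror the proposal, not parenchyma's API, and event cleanup is omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Waitable handle returned by an asynchronous transfer.
class TransferHandle {
public:
    explicit TransferHandle(cudaEvent_t ev) : ev_(ev) {}
    // Blocks the calling thread until the recorded transfer has completed.
    void wait() const { cudaEventSynchronize(ev_); }
private:
    cudaEvent_t ev_;
};

// Enqueues the host-to-device copy and returns immediately with a handle.
// (h_src should be pinned memory for the copy to be truly asynchronous.)
TransferHandle transfer_in(float* d_dst, const float* h_src, size_t bytes,
                           cudaStream_t stream) {
    cudaMemcpyAsync(d_dst, h_src, bytes, cudaMemcpyHostToDevice, stream);
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, stream);
    return TransferHandle(ev);
}

// The analogue of Tensor::get_memory(): synchronize lazily, only on access.
float* get_memory(float* d_ptr, const TransferHandle& pending) {
    pending.wait();
    return d_ptr;
}
```

The point of the design is that the caller pays for synchronization only at the access site, so unrelated host work and further enqueued device work can proceed between `transfer_in` and the eventual wait.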
@drahnr's point on the OpenCL implementation, via Gitter:
The way it is currently implemented lacks either a `clFinish` call or a wait on the last event in the chain (forward propagation).
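At the OpenCL host-API level, that fix would look roughly like the following. This is an illustrative sketch, not the actual parenchyma backend; the function and variable names are hypothetical.

```cpp
#include <CL/cl.h>
#include <cstddef>

// Runs the last kernel of the forward pass and reads back the result.
// Both enqueues are asynchronous, so the host must either wait on the last
// event in the chain or drain the whole queue with clFinish.
void finish_forward_pass(cl_command_queue queue, cl_kernel last_kernel,
                         cl_mem d_out, void* h_out, size_t bytes,
                         size_t global_size) {
    cl_event kernel_done, read_done;

    clEnqueueNDRangeKernel(queue, last_kernel, 1, NULL, &global_size, NULL,
                           0, NULL, &kernel_done);

    // Non-blocking read that depends on the kernel's completion event.
    clEnqueueReadBuffer(queue, d_out, CL_FALSE, 0, bytes, h_out,
                        1, &kernel_done, &read_done);

    // Option 1: wait only for the last event in the chain.
    clWaitForEvents(1, &read_done);

    // Option 2 (coarser): clFinish(queue); blocks until every command in the
    // queue has completed.

    clReleaseEvent(kernel_done);
    clReleaseEvent(read_done);
}
```

Either call gives the host a well-defined point at which the forward-propagation results are actually valid; without one, reading or timing results right after the enqueue is meaningless.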